Web Scraping Tools Free
Most Recent
 
June 7, 2017

Frontera

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which together allow building a large-scale online web crawler. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it can do so in a distributed manner. The frontier is initialized with a list of start URLs, called the seeds. Once the frontier is initialized, the crawler asks it which pages should be visited next. As the crawler starts to visit the pages [...]
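
The seed-then-ask loop described above can be illustrated with a small toy frontier. This is only a sketch of the general idea, not Frontera's API: the class `SimpleFrontier`, its methods `add` and `next_batch`, the numeric scores and the example.com URLs are all hypothetical names chosen for illustration.

```python
import heapq
from itertools import count


class SimpleFrontier:
    """Toy crawl frontier (hypothetical, not Frontera's API).

    Stores discovered links and hands them out highest score first.
    """

    def __init__(self, seeds):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal scores come out in insertion order
        for url in seeds:
            self.add(url, score=1.0)

    def add(self, url, score=0.5):
        """Queue a link unless it has already been seen."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_batch(self, max_n=2):
        """Return up to max_n URLs the crawler should visit next."""
        batch = []
        while self._heap and len(batch) < max_n:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


# The frontier is initialized with the seeds, then the crawler asks what to visit.
frontier = SimpleFrontier(["http://example.com/"])
first = frontier.next_batch()

# Pretend the crawler fetched the seed and extracted these links.
frontier.add("http://example.com/a", score=0.9)
frontier.add("http://example.com/b", score=0.2)
frontier.add("http://example.com/", score=0.9)  # already seen, so ignored
second = frontier.next_batch()
```

A real frontier would also persist its state and share the queue across workers; the sketch keeps everything in memory to show only the prioritisation step.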

8.75
 
June 7, 2017

Scrapy

Scrapy is an open source, collaborative framework for extracting the data users need from websites in a fast, simple, yet extensible way. It is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing or historical archival. Although Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is supported under Python 2.7 and Python 3.3+. Python 2.6 support was dropped starting at Scrapy 0.20. [...]

10.5
 
June 7, 2017

Portia

Portia is a tool that lets the user scrape websites visually, without any programming knowledge. With Portia the user annotates a web page to identify the data to be extracted, and Portia uses these annotations to work out how to scrape data from similar pages. Web scraping normally involves writing crawler code; for a non-coder, Portia makes it easy to extract web content. This Scrapinghub tool provides a point-and-click interface for annotating (selecting) web content to be scraped and stored. One can use Portia within a [...]

5.25
 
June 7, 2017

DEiXTo

DEiXTo is a powerful web data extraction tool based on the W3C Document Object Model (DOM). It allows users to create highly accurate extraction rules that describe which pieces of data to scrape from a website. DEiXTo consists of three separate components. GUI DEiXTo is an MS Windows application with a friendly graphical user interface for managing extraction rules (build, test, fine-tune, save and modify). This is all a user needs for small-scale extraction tasks. DEiXToBot is a Perl module implementing a flexible and efficient Mechanize agent capable of extracting data of interest using GUI DEiXTo generate [...]
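
To give a rough sense of what a tree-based extraction rule looks like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It does not use DEiXTo or its rule format: the XPath-like string in `rule`, the sample markup and the class names `title` and `price` are assumptions made for illustration, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment parsed into an element tree.
page = ET.fromstring(
    "<ul>"
    "<li class='title'>Alpha</li>"
    "<li class='price'>10</li>"
    "<li class='title'>Beta</li>"
    "</ul>"
)

# The "rule" describes which nodes to pull out: every <li> whose class is "title".
rule = ".//li[@class='title']"
titles = [node.text for node in page.findall(rule)]
```

Real pages are rarely well-formed XML, so a practical tool needs a forgiving HTML parser; the sketch only shows how a rule selects nodes from a tree.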

4.75
 
June 7, 2017

GNU Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs and terminals without X-Windows support. Recursive retrieval of HTML pages and FTP sites is supported: the user can use Wget to mirror archives and home pages, or traverse the web like a WWW robot (Wget understands /robots.txt). Wget works exceedingly well on slow or unstable connections, retrying until the document is fully retrieved. This allows freedom of movement as the user does not always need to be [...]
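
What "understanding /robots.txt" means for a crawler can be sketched with Python's standard `urllib.robotparser`. This is not Wget's code: the rules, the user agent name `example-bot` and the example.com URLs are made up for illustration, and the rules are parsed from an in-memory string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A polite robot checks each URL against the rules before fetching it.
allowed = parser.can_fetch("example-bot", "http://example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://example.com/private/data.html")
```

Here `allowed` is true and `blocked` is false, so a recursive download following these rules would skip everything under `/private/`.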

3.25