Scrapy is an open source and collaborative framework for extracting the data you need from websites.
Built-in support for selecting and extracting data from HTML/XML sources
Built-in support for generating feed exports in multiple formats
Robust encoding support and auto-detection
Strong extensibility support
Wide range of built-in extensions and middlewares
Scrapy extracts data in a fast, simple, yet extensible way. It is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data through APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is supported under Python 2.7 and Python 3.3+; Python 2.6 support was dropped in Scrapy 0.20, and Python 3 support was added in Scrapy 1.1.
Scrapy lets users write the rules for extracting the data and handles everything else for them. It is extensible by design: users can plug in new functionality without touching the core. Scrapy is written in Python and runs on Linux, Windows, macOS, and BSD, which makes it portable across systems.
Users can also deploy their spiders to Scrapy Cloud, a battle-tested platform for running web crawlers (a.k.a. spiders). These spiders run in the cloud and scale on demand, from thousands to billions of pages. With Portia, a point-and-click tool that is also open source and extensible, users can manage their data with little effort; they can also manage their spiders from a dashboard and schedule them to run automatically.
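Deployment to Scrapy Cloud is typically driven by the `shub` command-line client, which reads a small config file at the project root. A minimal sketch of that file is shown below; the project ID is a placeholder, not a real project.

```yaml
# scrapinghub.yml -- minimal Scrapy Cloud deploy config (hypothetical values)
project: 12345  # placeholder project ID from the Scrapy Cloud dashboard
```

With this file in place, `shub login` followed by `shub deploy` uploads the spiders, after which they can be scheduled from the dashboard as described above.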