Apache Nutch is a highly extensible and scalable open source web crawler. Its pluggable, modular design provides extension points such as Parse, Index and ScoringFilter for custom implementations.
• Fetching and parsing run as separate steps by default, which reduces the risk of errors.
• Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
• The set of plugins shipped with Nutch for processing various document types has been refined. Plain text, XML, OpenDocument
Custom implementations plug into these interfaces, for example Apache Tika for parsing. Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users. Nutch 1.0 requires Java 6 or later. nutch-default.xml is the out-of-the-box configuration for Nutch, and most of its settings can be left as they are. nutch-site.xml is where users make changes that override the default settings.
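The override relationship between nutch-default.xml and nutch-site.xml can be illustrated with a minimal Python sketch. It is not how Nutch itself loads configuration (Nutch does this in Java through Hadoop's configuration mechanism); it only models the idea that a site value replaces the default with the same name. The two inline documents, the property values and the helper names `read_properties` and `effective_config` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the two files; the values are illustrative assumptions.
DEFAULT_XML = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>db.fetch.interval.default</name><value>2592000</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>MyTestCrawler</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = (value or "").strip()
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values override them by name."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
```

Here `config` keeps the default fetch interval untouched, while the agent name comes from the site document, which mirrors the advice to leave nutch-default.xml alone and put changes in nutch-site.xml.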
Nutch maintains a crawldb of the URLs it has crawled, along with each URL's fetch status and fetch date. This data is kept after the fetch so that pages can be re-crawled once the re-crawl period has elapsed. At the same time, Solr maintains an inverted index of all the fetched pages. It might seem more efficient for Nutch to rely on that index rather than maintain its own crawldb, since otherwise each URL is stored twice. The difficulty is what Nutch should do if users want to change the Solr core they index to. This could be addressed in Nutch 2.0 by adding a Solr backend to Gora. Solr would then store the webtable and, provided that users set up the schema accordingly, they could index the appropriate fields for searching.
Nutch does not have built-in support for accessing files over SMB (Windows) shares. The only available method is for users to mount their shares and then index the contents as though they were local directories.
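The role of the crawldb in re-crawling can be sketched as a toy model in Python. This is not Nutch's storage format or its scheduling code; the `CrawlEntry` class, the status labels, the `due_for_recrawl` helper and the 30-day interval are all assumptions chosen for illustration. The sketch only shows why the URL, fetch status and fetch date have to outlive the fetch itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CrawlEntry:
    """One hypothetical crawldb record: URL, fetch status and fetch date."""
    url: str
    status: str                      # labels such as "fetched" are invented here
    fetched_at: Optional[datetime]   # None means the URL was never fetched


def due_for_recrawl(crawldb, now, interval):
    """Return the URLs that were never fetched or whose re-crawl period has elapsed."""
    due = []
    for entry in crawldb.values():
        if entry.fetched_at is None or now - entry.fetched_at >= interval:
            due.append(entry.url)
    return sorted(due)


now = datetime(2024, 1, 31)
crawldb = {
    "http://example.com/a": CrawlEntry("http://example.com/a", "fetched", datetime(2023, 12, 1)),
    "http://example.com/b": CrawlEntry("http://example.com/b", "fetched", datetime(2024, 1, 20)),
    "http://example.com/c": CrawlEntry("http://example.com/c", "unfetched", None),
}
due = due_for_recrawl(crawldb, now, timedelta(days=30))
```

In this example, `due` contains the `/a` and `/c` URLs: the first was fetched more than 30 days before `now`, and the second has never been fetched. A search index alone would not carry this bookkeeping unless its schema were extended to hold it, which is the trade-off the paragraph above describes.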