Nutch

PredictiveAnalyticsToday ReviewDesk

3 weeks ago

Apache Nutch is a highly extensible and scalable open source web crawler. Its pluggable, modular design exposes extension points such as Parse, Index and ScoringFilter for custom implementations, for example Apache Tika for parsing. Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users. Nutch 1.0 requires Java 6 or later. nutch-default.xml holds the out-of-the-box configuration, and most of its settings can be left as they are. nutch-site.xml is where users make the changes that override the default settings.
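To make the relationship between the two configuration files concrete, here is a minimal sketch of the override rule: a property set in nutch-site.xml replaces the value of the same name from nutch-default.xml. Nutch itself is written in Java and does not load its configuration this way; the Python below only illustrates the precedence. The XML snippets are illustrative stand-ins, the agent name `MyTestCrawler` is made up, and the helper names `read_properties` and `effective_config` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for nutch-default.xml (values are assumptions).
DEFAULT_XML = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>db.fetch.interval.default</name><value>2592000</value></property>
</configuration>"""

# Illustrative stand-in for nutch-site.xml: only the overrides are listed.
SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>MyTestCrawler</value></property>
</configuration>"""


def read_properties(xml_text):
    """Return a {name: value} dict from a <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value") or ""
        props[name] = value
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values win on name clashes."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name, value in sorted(config.items()):
    print(f"{name} = {value}")
```

Only the overridden property needs to appear in the site file; everything else falls through to the default.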

Nutch maintains a crawldb that records the URLs it has crawled, their fetch status, and the fetch date. This data is kept after the fetch so that pages can be re-crawled once the re-crawl interval has passed. At the same time, Solr maintains an inverted index of all the fetched pages. It might seem more efficient for Nutch to rely on that index rather than maintain its own crawldb and store the same URL twice. The difficulty is what Nutch would do if users wished to change the Solr core and the index they write to. With Nutch 2.0 this could be addressed by adding a Solr backend to Gora: Solr would store the webtable and, provided users set up the schema accordingly, they could index the appropriate fields for searching. Nutch does not have built-in support for accessing files over SMB (Windows) shares. The only available method is therefore to mount the shares and then index their contents as though they were local directories.
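The crawldb bookkeeping described above (URL, fetch status, fetch date, and re-crawling after an interval) can be sketched roughly as follows. This is not Nutch's actual data model or scheduling code; the `CrawlEntry` class, the status labels, the `select_due` helper, the example URLs and the 30-day interval are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CrawlEntry:
    """One hypothetical crawldb record: URL, status and last fetch date."""
    url: str
    status: str                      # illustrative labels: "fetched" / "unfetched"
    fetched_at: Optional[datetime]   # None if the page was never fetched


def select_due(entries: List[CrawlEntry], now: datetime,
               interval: timedelta) -> List[str]:
    """Return URLs that were never fetched or whose interval has elapsed."""
    due = []
    for entry in entries:
        if entry.fetched_at is None or now - entry.fetched_at >= interval:
            due.append(entry.url)
    return due


now = datetime(2024, 3, 1)
crawldb = [
    CrawlEntry("http://example.com/a", "fetched", datetime(2024, 1, 1)),
    CrawlEntry("http://example.com/b", "fetched", datetime(2024, 2, 20)),
    CrawlEntry("http://example.com/c", "unfetched", None),
]

# Assumed re-crawl interval of 30 days.
due_urls = select_due(crawldb, now, timedelta(days=30))
print(due_urls)
```

Keeping the fetch date next to each URL is what lets the crawler decide on re-crawling without consulting the search index at all.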