Nutch

Overview
Synopsis

Apache Nutch is a highly extensible and scalable open source web crawler. Because it is pluggable and modular, Nutch provides extension interfaces such as Parse, Index and ScoringFilter for custom implementations.

Category

Search Engine Server Free

Features

• MapReduce
• Distributed filesystem (via Hadoop)
• Link-graph database
• NTLM authentication

License

Open Source

Price

Free

Pricing

Subscription

Free Trial

Available

Users Size

Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)

Website
Company

Nutch

What is best?

• Fetching and parsing are done separately by default, which reduces the risk of errors.
• Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
• The set of plugins shipped with Nutch for processing various document types has been refined. Plain text, XML, OpenDocument

PAT Rating™
Editor Rating
Aggregated User Rating
Ease of use
7.6
4.7
Features & Functionality
7.6
6.3
Advanced Features
7.6
4.2
Integration
7.6
Performance
7.6
0.0
Training
Customer Support
7.6
Implementation
8.9
Renew & Recommend
10
Bottom Line

Nutch maintains a crawldb of the URLs it has crawled, along with each URL's fetch status and fetch date. This data is kept after the fetch so that pages can be re-crawled once the re-crawl period has elapsed.

7.6
Editor Rating
5.7
Aggregated User Rating
3 ratings

Apache Nutch is a highly extensible and scalable open source web crawler. Because it is pluggable and modular, Nutch provides extension interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster. Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users. Nutch 1.0 requires Java 6 or later.

nutch-default.xml is the out-of-the-box configuration for Nutch, and most of its settings can stay as they are. nutch-site.xml is where users make the changes that override the default settings.
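The override relationship between the two files can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nutch itself loads its configuration: the two XML strings stand in for the real files, the agent name `ExampleCrawler` is made up, and the helper names `load_properties` and `effective_config` are hypothetical. The sketch assumes the Hadoop-style `<configuration>/<property>/<name>/<value>` layout that the Nutch configuration files use.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (values here are illustrative).
DEFAULT_XML = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>db.fetch.interval.default</name><value>2592000</value></property>
</configuration>"""

# Stand-in for nutch-site.xml: only the settings the user wants to change.
SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>ExampleCrawler</value></property>
</configuration>"""


def load_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value") or ""
        props[name] = value
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site settings override them."""
    merged = load_properties(default_xml)
    merged.update(load_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name, value in sorted(config.items()):
    print(f"{name} = {value}")
```

Settings that appear only in the defaults keep their default value, so the site file can stay small and list just the overrides.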

Nutch maintains a crawldb of the URLs it has crawled, along with each URL's fetch status and fetch date. This data is kept after the fetch so that pages can be re-crawled once the re-crawl period has elapsed. At the same time, Solr maintains an inverted index of all the fetched pages. It might seem more efficient for Nutch to rely on that index instead of maintaining its own crawldb, since each URL ends up being stored twice. The difficulty is what Nutch would do if users wanted to change the Solr core they index to. With Nutch 2.0 this could be addressed by adding a Solr backend to Gora: Solr would store the webtable and, provided that users set up the schema accordingly, they could index the appropriate fields for searching.

Nutch does not have built-in support for accessing files over SMB (Windows) shares. The only available method is for users to mount their shares and then index the contents as though they were local directories.
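The crawldb bookkeeping described above (URL, fetch status, fetch date, and a re-crawl period) can be pictured with a small in-memory model. This is a toy sketch under stated assumptions, not Nutch's actual crawldb format or API: the class names `CrawlEntry` and `ToyCrawlDb`, the status labels, and the 30-day interval are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CrawlEntry:
    url: str
    status: str                           # illustrative labels: "unfetched", "fetched", "failed"
    fetched_at: Optional[datetime] = None  # None until the first fetch attempt


class ToyCrawlDb:
    """Hypothetical stand-in for a crawl database keyed by URL."""

    def __init__(self, recrawl_interval):
        self.recrawl_interval = recrawl_interval
        self.entries = {}

    def inject(self, url):
        # Add a URL once; re-injecting a known URL keeps its history.
        self.entries.setdefault(url, CrawlEntry(url, "unfetched"))

    def record_fetch(self, url, when, ok=True):
        # The entry is kept after the fetch so a later re-crawl can be scheduled.
        entry = self.entries[url]
        entry.status = "fetched" if ok else "failed"
        entry.fetched_at = when

    def due(self, now):
        # A URL is due if it was never fetched or its re-crawl period has elapsed.
        return sorted(
            url
            for url, entry in self.entries.items()
            if entry.fetched_at is None
            or now - entry.fetched_at >= self.recrawl_interval
        )


db = ToyCrawlDb(recrawl_interval=timedelta(days=30))
db.inject("http://example.com/a")
db.inject("http://example.com/b")

t0 = datetime(2024, 1, 1)
db.record_fetch("http://example.com/a", t0)

due_next_day = db.due(t0 + timedelta(days=1))     # only the never-fetched URL
due_after_period = db.due(t0 + timedelta(days=31))  # both URLs are due again
```

Keeping this record separate from the search index is what lets the crawler decide what to fetch next without consulting, or depending on, whichever index the pages were sent to.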

 
