
GigaBlast

Overview
Synopsis

Gigablast provides large-scale, high-performance, real-time information retrieval technology and services for partner sites.

Category

Search Engine Server (Free)

Features

• The only open source web search engine.
• 64-bit architecture.
• Scalable to thousands of servers.
• Has scaled to over 12 billion web pages on over 200 servers.
• A dual quad-core server with 32GB of RAM and two 160GB Intel SSDs, running 8 Gigablast instances, can serve about 8 queries per second.
• One million web pages require 28.6GB of drive space, which includes the index, meta information and the compressed HTML of all the web pages; that works out to 28.6KB of disk per HTML web page.
• Spider rate is around 1 page per second per core, so a dual quad-core can spider and index 8 pages per second, or 691,200 pages per day (the arithmetic is worked through in the sketch after this list).
• 4GB of RAM required per Gigablast instance (instance = process).
• Live demo at http://www.gigablast.com/
• Written in C/C++ for optimal performance.
• Over 500,000 lines of C/C++.
• 100% custom. A single binary. The web server, database and everything else is all contained in this source code in a highly efficient manner. Makes administration and troubleshooting easier.
• Reliable. Has been tested in live production since 2002 on billions of queries on an index of over 12 billion unique web pages, 24 billion mirrored.
• Super fast and efficient. One of a small handful of search engines that have hit such big numbers. The only open source search engine that has.
• Supports all languages. Can give results in specified languages a boost over others at query time. Uses UTF-8 representation internally.
• Track record. Has been used by many clients. Has been successfully used in distributed enterprise software.
• Cached web pages with query term highlighting.
• Shows popular topics of search results (Gigabits), like a faceted search on all the possible phrases.
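
The sizing bullets above reduce to simple arithmetic. A quick sketch in plain Python, using only the figures quoted in the list:

```python
# Capacity math from the feature list above; all inputs are the quoted figures.

PAGES = 1_000_000
DISK_GB = 28.6                              # index + meta info + compressed HTML
print(f"disk per page: {DISK_GB * 1e6 / PAGES:.1f} KB")        # ~28.6 KB

CORES = 8                                   # dual quad-core
PAGES_PER_SEC_PER_CORE = 1                  # quoted spider rate
print(f"spider rate: {CORES * PAGES_PER_SEC_PER_CORE * 86_400:,} pages/day")  # 691,200

INSTANCES = 8
RAM_PER_INSTANCE_GB = 4                     # quoted RAM requirement
print(f"RAM for {INSTANCES} instances: {INSTANCES * RAM_PER_INSTANCE_GB} GB")  # fits the 32GB box
```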

License

Open Source

Price

Free

Pricing

Subscription

Free Trial

Available

Users Size

Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)

Website
Company

GigaBlast

What is best?

• Email alert monitoring. Lets you know when all or part of the system is down, a server is overheating, a drive has failed, or a server is consistently running out of memory.
• "Synonyms" based on wiktionary data. Using query expansion method.
• Customizable "synonym" file: my-synonyms.txt
• No silly TF/IDF or cosine similarity. Stores position and format information (fancy bits) for each word in an indexed document and uses this to return results that contain the query terms in close proximity, rather than relying on the probabilistic TF/IDF approach of other search engines.
• Complete scoring details are displayed in the search results.

What are the benefits?

• Indexes anchor text of inlinks to a web page and uses many techniques to flag pages as link spam thereby discounting their link weights.
• Demotes web pages if they are spammy.
• Can cluster results from same site.
• Duplicate removal from search results.
• Distributed web crawler/spider. Supports crawl delay and robots.txt.
• Crawler/Spider is highly programmable and URLs are binned into priority queues. Each priority queue has several throttles and knobs.
• Spider status monitor to see the urls being spidered over the whole cluster in a real-time widget.
• Complete REST/XML and JSON API for doing queries as well as adding and deleting documents in real-time (see the query sketch after this list).
• Automated data corruption detection, fail-over and repair based on hardware failures.
• Custom Search (aka Custom Topic Search). Using a CGI parameter like &sites=abc.com+xyz.com you can restrict the search results to a list of up to 500 subdomains.
• DMOZ integration. Run DMOZ directory. Index and search over the pages in DMOZ. Tag all pages from all sites in DMOZ for searching and displaying of DMOZ topics under each search result.
• Collections. Build tens of thousands of different collections, each treated as a separate search engine. Each can spider and be searched independently.
• Federated search over multiple Gigablast collections using syntax like &c=mycoll1+mycoll2+mycoll3+...
• Plug-ins. For indexing any file format by calling Plug-ins to convert that format to HTML. Provided binary plug-ins: pdftohtml (PDF), ppthtml (PowerPoint), antiword (MS Word), pstotext (PostScript).
• Easy Scaling. Add new servers to the hosts.conf file then click 'rebalance shards' to automatically rebalance the sharded data.
• Using &stream=1 can stream back millions of search results for a query without running out of memory.
• Makes and displays thumbnail images in the search results.
• Nested boolean queries using AND, OR, NOT operators.
• Built-in support for diffbot.com's api, which extracts various entities from web sites, like products, articles, etc. But you will need to get a free token from them for access to their API.
• Facets over meta tags or X-Paths for HTML documents.
• Facets over JSON and XML fields.
• Indexes JSON and XML natively. Provides ability to search individual structured fields.
• Sorting. Sort the search results by meta tags or JSON fields that contain numbers, simply by adding something like gbsortby:price or gbrevsortby:price as a query term, assuming you have meta price tags or a JSON field called price.
• Constrain by numeric fields in JSON or XML using syntax like gbmaxint:trees:121 or gbminfloat:price:30.0.
• Built-in real-time profiler.
• Built-in QA tester.
• Can inject WARC and ARC archive files from the command line or web GUI.
• Can inject HTML, text, WARC or ARC documents from the command line, like ./gb inject foo.txt or ./gb inject test.warc (see the injection sketch after this list).
• The most starred C/C++ based search engine on GitHub.
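
Several bullets above are query-time CGI parameters (&sites=, &c=, &format=, gbsortby:) that ride along on an ordinary search request. Below is a minimal sketch, assuming a local instance listening on Gigablast's default port 8000, a /search endpoint accepting those parameters, and a JSON response containing a results array of title/url objects; verify the exact names against your instance's API page.

```python
# Hypothetical query against a local Gigablast instance. The port, endpoint
# and response field names are assumptions; the parameters themselves come
# from the feature list above.
import json
import urllib.parse
import urllib.request

base = "http://localhost:8000/search"
query = "&".join([
    "format=json",                                                 # JSON API response
    "q=" + urllib.parse.quote("sharded database gbsortby:price"),  # query terms + sort operator
    "c=mycoll1+mycoll2",                                           # federated search over two collections
    "sites=abc.com+xyz.com",                                       # custom topic search: restrict to these sites
])

with urllib.request.urlopen(f"{base}?{query}") as resp:
    data = json.load(resp)

for hit in data.get("results", []):    # field names are assumptions
    print(hit.get("title"), hit.get("url"))
```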
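Document injection from the command line uses the documented ./gb inject <file> form. A small wrapper sketch; the binary path and file list are illustrative:

```python
# Inject a batch of documents using the command-line form quoted above,
# i.e. "./gb inject foo.txt". Run from the Gigablast install directory.
import subprocess
from pathlib import Path

GB = "./gb"                          # path to the gb binary (assumed)
for doc in ["foo.txt", "test.warc"]:
    if Path(doc).exists():
        subprocess.run([GB, "inject", doc], check=True)
    else:
        print(f"skipping missing file: {doc}")
```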

PAT Rating™

                          Editor Rating   Aggregated User Rating
Ease of use               7.6             6.1
Features & Functionality  7.6             8.8
Advanced Features         7.6             9.2
Integration               7.6             9.6
Performance               7.6             8.9
Training                  n/a             5.7
Customer Support          7.6             n/a
Implementation            n/a             n/a
Renew & Recommend         n/a             n/a

Bottom Line
Bottom Line

Gigablast is smart enough to split the load evenly between mirrors when processing queries. Users can send their queries to any shard and it will communicate with all the other shards to aggregate the results.

Overall: Editor Rating 7.6, Aggregated User Rating 8.1 (2 ratings)

Gigablast provides large-scale, high-performance, real-time information retrieval technology and services for partner sites. Gigablast offers a variety of features including topic generation and the ability to index multiple document formats. This search delivery mechanism gives a partner "turn key" search capability and the capacity to instantly offer search at maximum scalability with minimum cost. Clients range from NASDAQ 100 listed corporations to boutique companies. Gigablast is one of a handful of search engines in the United States that maintains its own searchable index of over a billion pages. Gigablast can store 100,000 web pages (each around 25KB in size) per gigabyte of disk storage. A typical single-CPU Pentium 4 machine can index one to two million web pages per day, even when Gigablast is near its maximum document capacity for the hardware, and a cluster of N such machines can index at N times that rate. Because Gigablast reads and writes a lot of data at the same time under heavy spider and query loads, disk will probably be the major bottleneck.
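
The scaling claims in this paragraph reduce to multiplication. A rough sketch using the quoted figures; the machine count and target index size are illustrative:

```python
# Rough cluster sizing from the figures quoted above.
PAGES_PER_GB = 100_000           # ~25KB pages per GB of disk
PAGES_PER_DAY = 1_500_000        # midpoint of the 1-2 million pages/day figure

n_machines = 8                   # illustrative cluster size
index_pages = 1_000_000_000      # the billion-page index mentioned above

print(f"disk needed: {index_pages / PAGES_PER_GB:,.0f} GB")                # 10,000 GB
print(f"days to index: {index_pages / (n_machines * PAGES_PER_DAY):.0f}")  # ~83 days
```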

If one shard fails and the index is not mirrored, that part of the index is lost. The name of Gigablast's spider is Gigabot, but by default it uses GigablastOpenSource as the name of the User-Agent when downloading web pages. Gigabot respects the robots.txt convention (robot exclusion) and supports the noindex, noarchive and nofollow meta tags. Users can tell Gigabot to ignore robots.txt files on the Spider Controls page.
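
The robots.txt behavior described above can be reproduced with Python's standard-library robot parser. This checks what a site's robots.txt allows the GigablastOpenSource user agent to fetch; the target URL is illustrative:

```python
# Check whether a robots.txt permits Gigablast's default user agent to
# fetch a page; this mirrors the exclusion rule the spider applies.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("GigablastOpenSource", "https://example.com/some/page.html"))
```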

 
