Apache Crunch

Overview

Synopsis

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Category

ETL Software Free

Features

Multi-faceted
Easy to use
Integrated
Supports various WriteModes

License

Apache License 2.0 (open source)

Price

Contact for Pricing

Pricing

Subscription

Free Trial

Available

Users Size

Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)

Website

Apache Crunch

Company

Apache Crunch

PAT Rating™ (Editor Rating / Aggregated User Rating)

Ease of use: 8.4 / 8.5
Features & Functionality: 8.4 / 7.1
Advanced Features: 8.5 / 8.4
Integration: 8.5
Performance: 8.6 / 5.0
Customer Support: 8.6 / 8.0
Implementation: 8.4
Renew & Recommend: —

Bottom Line

The Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns.

Editor Rating: 8.5

Aggregated User Rating: 7.7 (2 ratings)

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. It runs on top of Hadoop MapReduce and Apache Spark, and its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Crunch supports different output options via the WriteMode enum, which can be passed along with a Target to the write method on either PCollection or Pipeline.
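To make the role of a write mode concrete, here is a small plain-Java model of how a mode might change what happens when the target path already exists. It does not call Crunch. The enum constants mirror the names of Crunch's WriteMode values as I understand them; `WriteModeSketch`, `Action`, and `decide` are hypothetical names, and the behaviour shown is a simplified reading of the Crunch documentation, not its implementation.

```java
// Plain-Java model of write-mode behaviour; this is NOT the Crunch API.
final class WriteModeSketch {
    // Names mirror Crunch's WriteMode constants (semantics assumed and simplified).
    enum WriteMode { DEFAULT, OVERWRITE, APPEND, CHECKPOINT }

    // Hypothetical outcomes of a write call.
    enum Action { WRITE, FAIL, REPLACE_EXISTING, ADD_ALONGSIDE, REUSE_EXISTING }

    static Action decide(WriteMode mode, boolean targetExists, boolean targetNewerThanInputs) {
        if (!targetExists) {
            return Action.WRITE; // nothing to conflict with, so every mode just writes
        }
        switch (mode) {
            case OVERWRITE:
                return Action.REPLACE_EXISTING;
            case APPEND:
                return Action.ADD_ALONGSIDE;
            case CHECKPOINT:
                // Assumption: an up-to-date checkpoint is reused, a stale one is rewritten.
                return targetNewerThanInputs ? Action.REUSE_EXISTING : Action.REPLACE_EXISTING;
            default:
                return Action.FAIL; // DEFAULT refuses to touch an existing target
        }
    }

    public static void main(String[] args) {
        for (WriteMode mode : WriteMode.values()) {
            System.out.println(mode + " -> " + decide(mode, true, true));
        }
    }
}
```

In a real pipeline the mode would be passed to the write method together with a Target, as described above; the model only isolates the decision the mode controls.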

The supported WriteModes include DEFAULT, OVERWRITE, APPEND, and CHECKPOINT.

Many of the most common aggregation patterns in Crunch are provided as methods on the PCollection interface, including count, max, min, and length.
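As a rough illustration of what these four aggregations compute, the sketch below applies the same ideas to an ordinary in-memory collection. It is plain Java, not Crunch: `AggregateSketch` and its methods are hypothetical stand-ins that only borrow the method names mentioned above, and nothing here runs in a distributed fashion.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory analogue of the aggregation methods; this is NOT the Crunch API.
final class AggregateSketch {
    // Number of occurrences of each distinct element.
    static <T> Map<T, Long> count(Collection<T> items) {
        return items.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Largest element by natural ordering.
    static <T extends Comparable<T>> T max(Collection<T> items) {
        return Collections.max(items);
    }

    // Smallest element by natural ordering.
    static <T extends Comparable<T>> T min(Collection<T> items) {
        return Collections.min(items);
    }

    // Total number of elements, duplicates included.
    static long length(Collection<?> items) {
        return items.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pig", "hive", "pig", "crunch");
        System.out.println(count(words));
        System.out.println(max(words) + " " + min(words) + " " + length(words));
    }
}
```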

The implementations of these methods, however, are in the Aggregate library class. The methods in the Aggregate class expose additional options for performing aggregations, such as controlling the level of parallelism for count operations.

Joins in Crunch are based on equal-valued keys in different PTables.
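The idea of joining on equal keys can be sketched with two in-memory lists of key/value pairs. This is a single-JVM hash join written in plain Java, not Crunch code: `JoinSketch`, `pair`, and `join` are hypothetical names, and the string output format is chosen only for readability.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory equijoin on equal keys; this is NOT the Crunch API.
final class JoinSketch {
    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<String, String>(key, value);
    }

    // keepUnmatchedLeft=false gives an inner join; true gives a left join.
    static List<String> join(List<Map.Entry<String, String>> left,
                             List<Map.Entry<String, String>> right,
                             boolean keepUnmatchedLeft) {
        // Index the right side by key so matching is a lookup on equal keys.
        Map<String, List<String>> index = new HashMap<>();
        for (Map.Entry<String, String> r : right) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> l : left) {
            List<String> matches = index.get(l.getKey());
            if (matches != null) {
                for (String rightValue : matches) {
                    out.add(l.getKey() + "=(" + l.getValue() + "," + rightValue + ")");
                }
            } else if (keepUnmatchedLeft) {
                out.add(l.getKey() + "=(" + l.getValue() + ",null)");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> users = Arrays.asList(pair("u1", "alice"), pair("u2", "bob"));
        List<Map.Entry<String, String>> orders = Arrays.asList(pair("u1", "book"), pair("u1", "pen"));
        System.out.println(join(users, orders, false));
        System.out.println(join(users, orders, true));
    }
}
```

A right join or full join follows the same pattern with the unmatched rows of the other side kept as well.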

Joins have also evolved a great deal over the lifetime of the project. The Join API provides simple methods for performing equijoins, left joins, right joins, and full joins. However, modern Crunch joins are usually performed using an explicit implementation of the JoinStrategy interface, which has support for the same rich set of joins that you can use in tools like Apache Hive and Apache Pig.

After joins and cogroups, sorting data is the most common distributed computing pattern.
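For readers unfamiliar with the terminology, the following plain-Java sketch contrasts a total sort with a secondary sort, where records are grouped by key and the values inside each group are ordered by a second field. It runs in memory and does not use Crunch; `SortSketch`, `event`, `totalSort`, and `secondarySort` are hypothetical names, and the timestamp field is an assumed example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of total and secondary sorts; this is NOT the Crunch API.
final class SortSketch {
    static Map.Entry<String, Long> event(String key, long timestamp) {
        return new SimpleEntry<String, Long>(key, timestamp);
    }

    // Total sort: every record ends up in one global order.
    static List<String> totalSort(List<String> records) {
        List<String> copy = new ArrayList<>(records);
        Collections.sort(copy);
        return copy;
    }

    // Secondary sort: group by key, then order each group's values (here, timestamps).
    static Map<String, List<Long>> secondarySort(List<Map.Entry<String, Long>> events) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> e : events) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (List<Long> values : grouped.values()) {
            Collections.sort(values);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(totalSort(Arrays.asList("b", "c", "a")));
        System.out.println(secondarySort(Arrays.asList(event("a", 3L), event("b", 1L), event("a", 1L))));
    }
}
```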

The Crunch APIs have several utilities for performing fully distributed sorts and more advanced patterns such as secondary sorts.

Many MapReduce jobs can generate a large number of small files that could be used more effectively by clients if they were merged into a small number of large files. The Shard API allows users to coalesce a given PCollection into a small number of partitions.
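The coalescing idea can be pictured with a short plain-Java sketch that redistributes the records of many small parts into a chosen number of partitions. It is only a model: `ShardSketch` and `shard` are hypothetical names, and the round-robin assignment is a simplification picked for determinism, not a description of how Crunch distributes records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory model of coalescing many small parts; this is NOT the Crunch API.
final class ShardSketch {
    static <T> List<List<T>> shard(List<List<T>> smallParts, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        int next = 0;
        // Round-robin assignment keeps the partitions close to equal in size.
        for (List<T> part : smallParts) {
            for (T record : part) {
                partitions.get(next).add(record);
                next = (next + 1) % numPartitions;
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> small = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        System.out.println(shard(small, 2));
    }
}
```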
