The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
ETL Software Free
Easy to use
Supports various WriteModes
Contact for Pricing
Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. It runs on top of Hadoop MapReduce and Apache Spark, and its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Crunch supports different output options via the WriteMode enum, which can be passed along with a Target to the write method on either PCollection or Pipeline.
The supported WriteModes include DEFAULT, OVERWRITE, APPEND, and CHECKPOINT.

Many of the most common aggregation patterns in Crunch are provided as methods on the PCollection interface, including count, max, min, and length.
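To make the difference between these aggregations concrete, here is a minimal plain-Java sketch of what each one computes on an in-memory list. It does not use Crunch at all: the class AggregationSketch and its methods are hypothetical names chosen for illustration, and a real pipeline would run these as distributed operations on a PCollection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; AggregationSketch is a hypothetical name
// and none of this is Crunch code.
final class AggregationSketch {

    // Analogous to count: occurrences of each distinct element.
    static <T extends Comparable<T>> Map<T, Long> count(List<T> items) {
        Map<T, Long> counts = new TreeMap<>();
        for (T item : items) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }

    // Analogous to max: the largest element by natural ordering.
    static <T extends Comparable<T>> T max(List<T> items) {
        return Collections.max(items);
    }

    // Analogous to min: the smallest element by natural ordering.
    static <T extends Comparable<T>> T min(List<T> items) {
        return Collections.min(items);
    }

    // Analogous to length: total number of elements, duplicates included.
    static long length(List<?> items) {
        return items.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pig", "hive", "pig", "crunch");
        System.out.println(count(words));
        System.out.println(max(words));
        System.out.println(min(words));
        System.out.println(length(words));
    }
}
```

Note that count yields one entry per distinct element, while length yields a single total.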
The implementations of these methods, however, are in the Aggregate library class. The methods in the Aggregate class expose additional options for performing aggregations, such as controlling the level of parallelism for count operations.

Joins in Crunch are based on equal-valued keys in different PTables.
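As a rough illustration of joining on equal-valued keys, the following plain-Java sketch matches two small keyed tables and shows two common variants, an inner join and a left outer join. It is not Crunch code: JoinSketch and its methods are hypothetical names, and the sketch assumes at most one value per key, whereas a PTable can hold several values for the same key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; JoinSketch and its methods are hypothetical
// names and not part of Crunch. Each table holds at most one value per key.
final class JoinSketch {

    // Inner equijoin: emit a row only for keys present in both tables.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> rows = new ArrayList<>();
        // Copying into a TreeMap gives a deterministic, key-sorted order.
        for (Map.Entry<String, String> entry : new TreeMap<>(left).entrySet()) {
            if (right.containsKey(entry.getKey())) {
                rows.add(entry.getKey() + ": " + entry.getValue()
                        + ", " + right.get(entry.getKey()));
            }
        }
        return rows;
    }

    // Left outer join: emit a row for every left key; a key with no match
    // on the right is paired with null.
    static List<String> leftJoin(Map<String, String> left, Map<String, String> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(left).entrySet()) {
            rows.add(entry.getKey() + ": " + entry.getValue()
                    + ", " + right.get(entry.getKey()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> users = new TreeMap<>();
        users.put("u1", "alice");
        users.put("u2", "bob");
        Map<String, String> orders = new TreeMap<>();
        orders.put("u2", "books");
        orders.put("u3", "toys");
        System.out.println(innerJoin(users, orders));
        System.out.println(leftJoin(users, orders));
    }
}
```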
Joins have also evolved a great deal in Crunch over the lifetime of the project. The Join API provides simple methods for performing equijoins, left joins, right joins, and full joins. However, modern Crunch joins are usually performed using an explicit implementation of the JoinStrategy interface, which supports the same rich set of joins available in tools like Apache Hive and Apache Pig.

After joins and cogroups, sorting data is the most common distributed computing pattern.
The Crunch APIs have several utilities for performing fully distributed sorts and more advanced patterns such as secondary sorts.

Many MapReduce jobs generate a large number of small files that clients could use more effectively if they were merged into a small number of large files. The Shard API allows users to coalesce a given PCollection into a fixed number of partitions.
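The idea of coalescing many small outputs into a handful of larger partitions can be sketched in plain Java as follows. This is only an in-memory analogy, not the Shard API itself: ShardSketch is a hypothetical name, and the round-robin assignment of records to partitions is an assumption made for the example rather than a description of how Crunch distributes records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; ShardSketch is a hypothetical name and
// round-robin placement is an assumption of this sketch.
final class ShardSketch {

    // Merge the records of many small chunks into numPartitions larger lists.
    static <T> List<List<T>> shard(List<List<T>> smallChunks, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        int next = 0;
        for (List<T> chunk : smallChunks) {
            for (T record : chunk) {
                // Hand each record to the next partition in turn.
                partitions.get(next).add(record);
                next = (next + 1) % numPartitions;
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> smallFiles = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4), Arrays.asList(5, 6),
                Arrays.asList(7, 8), Arrays.asList(9, 10));
        // Five small chunks are coalesced into two partitions.
        System.out.println(shard(smallFiles, 2));
    }
}
```

Round-robin keeps the partitions close to equal in size, which is the property that matters when the goal is fewer, larger files.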