The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
ETL Software Free
Multi-faceted, easy to use, integrated, supports various WriteModes
Contact for Pricing
Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)
The Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns.
Crunch runs on top of Hadoop MapReduce and Apache Spark.

Crunch supports different output options via the WriteMode enum, which can be passed along with a Target to the write method on either PCollection or Pipeline.

Many of the most common aggregation patterns in Crunch are provided as methods on the PCollection interface, including count, max, min, and length.
The implementations of these methods, however, are in the Aggregate library class. The methods in the Aggregate class expose additional options for performing aggregations, such as controlling the level of parallelism for count operations.

Joins in Crunch are based on equal-valued keys in different PTables.
They have also evolved a great deal in Crunch over the lifetime of the project. The Join API provides simple methods for performing equijoins, left joins, right joins, and full joins. However, modern Crunch joins are usually performed using an explicit implementation of the JoinStrategy interface, which has support for the same rich set of joins that you can use in tools like Apache Hive and Apache Pig.

After joins and cogroups, sorting data is the most common distributed computing pattern.
The Crunch APIs have several utilities for performing fully distributed sorts and more advanced patterns such as secondary sorts.

Many MapReduce jobs generate a large number of small files that clients could use more effectively if they were merged into a small number of large files. The Shard API allows users to coalesce a given PCollection into a few partitions.
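
To make the join types mentioned above concrete, here is a minimal sketch of what an equijoin (inner join), left join, right join, and full join return when two keyed tables are matched on equal keys. It is plain Java over in-memory maps and does not use the Crunch Join or JoinStrategy API; the class name `JoinSketch`, the `JoinType` enum, and the `key=left|right` row format are hypothetical names chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustration only: in-memory join semantics, NOT the Crunch API.
class JoinSketch {

    // Hypothetical enum naming the four join types discussed in the text.
    enum JoinType { INNER, LEFT, RIGHT, FULL }

    // Joins two keyed tables on equal keys. Each result row is formatted as
    // "key=leftValue|rightValue"; "null" stands in for a missing side.
    static List<String> join(Map<String, List<String>> left,
                             Map<String, List<String>> right,
                             JoinType type) {
        // Visit every key that appears on either side, in sorted order.
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());

        List<String> rows = new ArrayList<>();
        for (String key : keys) {
            List<String> ls = left.getOrDefault(key, Collections.<String>emptyList());
            List<String> rs = right.getOrDefault(key, Collections.<String>emptyList());

            if (!ls.isEmpty() && !rs.isEmpty()) {
                // Key present on both sides: every join type emits all pairs.
                for (String l : ls) {
                    for (String r : rs) {
                        rows.add(key + "=" + l + "|" + r);
                    }
                }
            } else if (!ls.isEmpty()
                    && (type == JoinType.LEFT || type == JoinType.FULL)) {
                // Key only on the left: kept by left and full joins.
                for (String l : ls) {
                    rows.add(key + "=" + l + "|null");
                }
            } else if (!rs.isEmpty()
                    && (type == JoinType.RIGHT || type == JoinType.FULL)) {
                // Key only on the right: kept by right and full joins.
                for (String r : rs) {
                    rows.add(key + "=null|" + r);
                }
            }
        }
        return rows;
    }
}
```

For example, with a left table holding keys `u1` and `u2` and a right table holding keys `u1` and `u3`, an inner join keeps only the `u1` rows, a left join also keeps `u2` with a missing right side, a right join also keeps `u3` with a missing left side, and a full join keeps all three keys. In a real pipeline the same matching happens across distributed PTables rather than in-memory maps.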