Bigdata

Now Reading

Apache Parquet

Next
Prev

Review

Apache Parquet

Overview

Synopsis

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Category

Column-oriented DBMS

Features

• Automatic dictionary encoding enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance
• Queries that fetch specific column values need not read the entire row data thus improving performance
• Dictionaries/hash tables, indexes, bit vectors
• Repetition levels

License

Proprietary

Price

Contact for Pricing

Pricing

Subscription

Free Trial

Available

Users Size

Small (<50 employees), Medium (50 to 1000 Enterprise (>1001 employees)

Company

Apache Parquet

What is best?

• Automatic dictionary encoding enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance

What are the benefits?

• Storage efficiency: Parquet files only have a fifth of the size of the original (UTF-8 encoded) CSVs
• Checksumming: Allows disabling of checksums at the HDFS file level, to better support single row lookups
• Different encoding techniques can be applied to different columns
• Bit packing: Storage of integers is usually done with dedicated 32 or 64 bits per integer

PAT Rating™

Editor Rating

Aggregated User Rating

Rate Here

Ease of use

7.6

8.1

Features & Functionality

7.6

5.9

Advanced Features

7.6

7.5

Integration

7.6

8.3

Performance

7.6

Customer Support

7.6

Implementation

6.9

Renew & Recommend

8.8

Bottom Line

Apache Parquet, which provides columnar storage in Hadoop, is a top-level Apache Software Foundation (ASF)-sponsored project, paving the way for its more advanced use in the Hadoop ecosystem.

7.6

Editor Rating

8.2

Aggregated User Rating

6 ratings

You have rated this

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. Parquet is a self-describing data format that embeds the schema or structure within the data itself. This results in a file that is optimized for query performance and minimizing I/O. Parquet also supports very efficient compression and encoding schemes. Apache Parquet is designed to bring efficient columnar storage of data compared to row-based files like CSV. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. Parquet is built to be used by anyone. An efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies. To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at what repeated field in the path was the value repeated. Apache Parquet allows for lower data storage costs and maximized effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, and Google Dataproc. Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store the data.

Filter reviews