Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Category
Columnar storage format
Features
• Automatic dictionary encoding, enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance
• Queries that fetch specific column values need not read the entire row, improving performance (see the sketch after this list)
• Dictionaries/hash tables, indexes, bit vectors
• Repetition levels
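A minimal sketch of two of these features, per-column encoding and column pruning, assuming the pyarrow library as the client (the review names no specific implementation; file and column names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# 'country' has few unique values, a good fit for automatic dictionary
# encoding; 'amount' is a plain integer column.
table = pa.table({
    "country": ["US", "US", "DE", "DE", "US"],
    "amount": [10, 25, 7, 3, 42],
})

# Dictionary encoding is enabled by default in pyarrow (use_dictionary=True);
# compression is applied to the column chunks inside the file.
pq.write_table(table, "sales.parquet", use_dictionary=True, compression="snappy")

# Column pruning: fetching only 'amount' never reads the 'country' pages.
print(pq.read_table("sales.parquet", columns=["amount"]))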
License
Apache License 2.0
Price
Free and open source
Pricing
Open source (no subscription)
Free Trial
Not applicable (open source)
Users Size
Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)
Company
Apache Software Foundation
What is best?
• Automatic dictionary encoding, enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance
What are the benefits?
• Storage efficiency: Parquet files are typically only a fifth the size of the original (UTF-8 encoded) CSVs (see the size comparison below)
• Checksumming: checksums can be disabled at the HDFS file level to better support single-row lookups
• Different encoding techniques can be applied to different columns
• Bit packing: instead of dedicating a full 32 or 64 bits to every integer, small integers are packed into the minimum number of bits required
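The storage-efficiency claim can be checked directly. A hedged sketch, assuming pandas with a Parquet engine such as pyarrow installed; the exact ratio depends on the data:

import os
import pandas as pd

# Highly repetitive data, the case where columnar encoding pays off most.
df = pd.DataFrame({
    "status": ["ok"] * 50_000 + ["error"] * 50_000,
    "value": range(100_000),
})

df.to_csv("data.csv", index=False)
# snappy is a common default codec; gzip trades CPU time for smaller files.
df.to_parquet("data.parquet", compression="snappy")

print("CSV bytes:    ", os.path.getsize("data.csv"))
print("Parquet bytes:", os.path.getsize("data.parquet"))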
PAT Rating™

Category                   Editor Rating   Aggregated User Rating
Ease of use                7.6             8.1
Features & Functionality   7.6             5.9
Advanced Features          7.6             7.5
Integration                7.6             8.3
Performance                7.6             10
Customer Support           7.6             10
Implementation             N/A             6.9
Renew & Recommend          N/A             8.8
Bottom Line
Apache Parquet, which provides columnar storage in Hadoop, is a top-level Apache Software Foundation (ASF)-sponsored project, paving the way for its more advanced use in the Hadoop ecosystem.
Editor Rating: 7.6
Aggregated User Rating: 8.2 (6 ratings)
Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. The result is a file optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes, and is designed to bring efficient columnar storage of data compared to row-based files like CSV. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone: an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies. To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined; repetition levels specify at which repeated field in the path the value was repeated. Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store the data. This lowers data storage costs and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, and Google Dataproc.
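A short sketch of two claims above, that the schema travels inside the file (self-describing) and that nested columns are stored via the Dremel definition/repetition-level model, again assuming pyarrow as the client; names are illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    # A nested (list-of-string) column; Parquet shreds it into definition and
    # repetition levels rather than requiring manual flattening.
    "tags": pa.array([["a", "b"], [], ["c"]], type=pa.list_(pa.string())),
})

# Compression specified per column, as described above.
pq.write_table(table, "events.parquet",
               compression={"user_id": "snappy", "tags": "gzip"})

# The schema is embedded in the file footer; no external metadata is needed.
print(pq.read_schema("events.parquet"))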