Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
• Automatic dictionary encoding, enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance
• Queries that fetch specific column values need not read the entire row data, thus improving performance
• Dictionaries/hash tables, indexes, bit vectors
• Repetition levels
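To illustrate the idea behind run-length encoding, the sketch below collapses runs of identical values into (value, count) pairs. This is a minimal pure-Python illustration of the concept only — Parquet's actual implementation is a hybrid RLE/bit-packed encoding, not this code.

```python
def rle_encode(values):
    """Collapse consecutive runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A column with long runs compresses well:
# rle_encode(["US", "US", "US", "UK", "UK"]) -> [("US", 3), ("UK", 2)]
```

Columnar layouts make runs like this common, because values of one column (e.g. a country code) are stored contiguously rather than interleaved with other fields.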
Contact for Pricing
Small (<50 employees), Medium (50 to 1000 employees), Enterprise (>1001 employees)
What is best?
• Automatic dictionary encoding, enabled dynamically for data with a small number of unique values
• Run-length encoding (RLE)
• Per-column data compression accelerates performance
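Dictionary encoding replaces each distinct value with a small integer code, so a low-cardinality column stores each long value only once. A simplified pure-Python sketch of the technique (Parquet's real dictionary pages add size limits, fallback logic, and a compact binary layout):

```python
def dict_encode(column):
    """Map each distinct value to an integer code; return (dictionary, codes)."""
    dictionary = []   # distinct values, in first-seen order
    index = {}        # value -> code
    codes = []
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and codes."""
    return [dictionary[c] for c in codes]
```

The integer codes are themselves highly compressible (e.g. with RLE or bit packing), which is why these encodings compose well in a columnar format.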
What are the benefits?
• Storage efficiency: Parquet files can be as little as a fifth of the size of the original (UTF-8 encoded) CSVs
• Checksumming: allows disabling of checksums at the HDFS file level, to better support single-row lookups
• Different encoding techniques can be applied to different columns
• Bit packing: storage of integers is usually done with a dedicated 32 or 64 bits per integer, but bit packing stores small integers using only as many bits as their values actually need
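The bit-packing benefit can be sketched in pure Python: values that fit in 3 bits are packed contiguously instead of occupying 32 bits each. This is an illustrative sketch of the general idea, not Parquet's exact byte layout.

```python
def bit_pack(values, width):
    """Pack small non-negative ints into bytes, `width` bits each."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)  # place each value at its bit offset
    nbytes = (len(values) * width + 7) // 8  # round up to whole bytes
    return packed.to_bytes(nbytes, "little")

def bit_unpack(data, width, count):
    """Recover `count` values of `width` bits each from packed bytes."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

# Five values in 0..7 need only 3 bits each: 15 bits -> 2 bytes,
# versus 20 bytes at a dedicated 32 bits per integer.
```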
Apache Parquet, which provides columnar storage in Hadoop, is a top-level Apache Software Foundation (ASF)-sponsored project, paving the way for its more advanced use in the Hadoop ecosystem.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. This results in a file that is optimized for query performance and minimal I/O. Parquet also supports very efficient compression and encoding schemes.

Apache Parquet is designed to bring efficient columnar storage of data compared to row-based files like CSV. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. Parquet is built to be used by anyone: an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.

To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at which repeated field in the path the value was repeated. Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store the data.

Apache Parquet allows for lower data storage costs and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, and Google Dataproc.
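The record-shredding idea with definition and repetition levels can be sketched for the simplest nested case: a single leaf that is an optional, repeated field (an optional list of strings). This is a simplified pure-Python illustration, not Parquet's general multi-level algorithm; the level numbering below is an assumption for this one-leaf schema.

```python
def shred(records):
    """Shred one optional, repeated leaf into (value, repetition_level,
    definition_level) triples, Dremel-style.

    definition level: 0 = list missing (None), 1 = list present but empty,
                      2 = value present.
    repetition level: 0 = first value of a record, 1 = subsequent value
                      within the same list.
    """
    triples = []
    for rec in records:
        if rec is None:
            triples.append((None, 0, 0))   # whole list absent
        elif len(rec) == 0:
            triples.append((None, 0, 1))   # list present, no values
        else:
            for i, v in enumerate(rec):
                triples.append((v, 0 if i == 0 else 1, 2))
    return triples
```

Note that the levels alone distinguish a missing list from an empty one and mark record boundaries, which is what lets the assembly step reconstruct the original nested records from a flat column of values.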
PAT RESEARCH is a B2B discovery platform which provides Best Practices, Buying Guides, Reviews, Ratings, Comparison, Research, Commentary, and Analysis for Enterprise Software and Services. We provide Best Practices, PAT Index™ enabled product reviews, and user review comparisons to help IT decision makers such as CEOs, CIOs, Directors, and Executives identify technologies, software, services, and strategies.