Cloudera Enterprise Improves Data Processing with Hive-on-Spark Support

The Cloudera Enterprise 5.7 release provides leading performance across key workloads - including an average 3x improvement for data processing with the addition of Hive-on-Spark support, and an average 2x improvement for business intelligence analytics with updates to Apache Impala (incubating). Additionally, this release adds visibility into multi-tenant usage across these workloads for management efficiency and optimal resourcing. Cloudera Enterprise 5.7 is another leap forward for Hadoop as it grows to support new and changing use cases, and is indicative of Cloudera’s leadership in ensuring that modern enterprises can fully embrace the platform across the business.

“Hadoop has evolved significantly in the past ten years, and with every advancement, we see the potential for new applications and use cases, while improving what’s already being done,” said Charles Zedlewski, vice president, Products at Cloudera. “The advancement of data engineering and ETL development with Hive-on-Spark marks a critical milestone in this evolution - further solidifying Spark’s status as the standard data processing engine in Hadoop. Data engineering is only a part of the story in today’s business though and, with the 5.7 release, our customers can better enable a wide range of users across the platform, all while maintaining fast performance, easy management, and compliance-ready security.”

ETL development and batch processing remain among the most common use cases for Hadoop. Apache Hive has long played a key role in these workloads, traditionally with MapReduce as the underlying execution engine. However, with its easier development and faster performance compared to MapReduce, Apache Spark is playing an increasingly important role and is primed to replace MapReduce for these workloads. Last year Cloudera launched the One Platform Initiative as the roadmap to complete the transition from MapReduce to Spark, and it is leading development to better integrate Spark with Hadoop - ensuring it meets the enterprise requirements for even the largest-scale production workloads. The release of Hive-on-Spark in Cloudera 5.7 brings Spark one step closer to that goal: developers can now leverage the powerful data processing capabilities of Spark while continuing to use familiar Hive, with a 3x performance improvement on average. Hive-on-Spark is a community-driven initiative launched by Cloudera, IBM, Intel, MapR, and others, and involved customers across a range of industries - including advertising, financial services, and insurance - as part of an early access program for further development.
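For readers wondering what "continuing to use familiar Hive" looks like in practice, the following is a minimal sketch rather than Cloudera's documented procedure. It relies on Hive's `hive.execution.engine` session property, which accepts `spark` and `mr`. The helper `with_engine`, the table `page_views`, and its columns are hypothetical names introduced only for illustration.

```python
# Hive selects its execution engine per session through the
# hive.execution.engine property. This set is deliberately limited
# to the two engines discussed in the article.
VALID_ENGINES = {"mr", "spark"}


def with_engine(hiveql: str, engine: str = "spark") -> str:
    """Prefix a HiveQL script with a session-level engine setting.

    Hypothetical helper: it only builds the script text and does not
    connect to or run anything against a Hive server.
    """
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    return f"SET hive.execution.engine={engine};\n{hiveql.strip()}\n"


# Hypothetical ETL-style query; the table and column names are made up.
query = "SELECT region, COUNT(*) FROM page_views GROUP BY region;"
script = with_engine(query)
print(script)
```

The query text itself is unchanged; only the session setting differs, which is why existing Hive scripts can in principle be moved to Spark without being rewritten. Whether a given cluster has the Spark engine available depends on how it is configured.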

For further consistency, Cloudera has worked with its ecosystem of more than 2,300 partners to ensure customers can continue to use the leading data integration and preparation tools with Hive-on-Spark, without disrupting the business. Partners such as BMC, ClearStory Data, Elastic, NGDATA, Solix, Trillium Software, Zementis, and others are working with Cloudera to certify their technologies for a seamless transition. (See below for their supporting statements.)

Being able to support multiple use cases across the same shared data within a single cluster is a key benefit of Hadoop. With Cloudera Enterprise, administrators can easily provide users and applications with the right resources to run and meet critical Service Level Agreements (SLAs). With this release, administrators get full visibility into historical usage and efficiency across users, tenants, and applications. The new Cluster Utilization Reporting feature, built into Cloudera Manager, ensures efficient operations and proper resource allocation between groups and workload types; helps guarantee SLAs are being met; and provides simple troubleshooting of job and query performance issues.

Additional features in Cloudera 5.7 include:

2x performance improvements for BI analytics: Impala continues to maintain its performance lead as the fastest analytic SQL engine for Hadoop through dynamic partition pruning, faster query startup, runtime filters, and more.

Simplified path to production: Cloudera Manager includes cluster templates that provide a simple workflow for replicating configuration settings to new clusters - making it easy to move from a well-tuned test environment to production, scale out across regions, or quickly revert to a known good configuration when problems occur.

Optimized data governance: Cloudera Navigator opens up data management and governance to the business user with simplified lineage for establishing trust and provenance of data, and adds managed metadata for improved discoverability and consistency across systems.