The Optimized Row Columnar (ORC) file format is a columnar storage format for Hive. Specific Hive configuration settings for ORC-formatted tables can improve query performance, resulting in faster execution and reduced use of computing resources. Some of these settings may already be turned on by default, whereas others require some educated guesswork.
The table below compares Tez job statistics for the same Hive query submitted first without and then with certain configuration settings. Notice the performance gains with optimization. This article explains how the performance improvements were achieved.
Source Data:
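The sample data itself is not reproduced in this section. As a rough sketch only, the source table might be declared along the following lines; the column names and types are assumptions and are not taken from the original article (only the table name, page_views_orc, appears later in the settings):

CREATE TABLE page_views_orc (
  user_id   STRING,   -- hypothetical: who viewed the page
  page_id   STRING,   -- hypothetical: which page was viewed
  view_date STRING,   -- hypothetical: date of the view
  country   STRING    -- hypothetical: country of the viewer
)
STORED AS ORC;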
The HiveQL ranks each page per user by how many times the user viewed that page, for a specific date and within the United States. Breakdown of the query:
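The full query text is not reproduced in this section. A minimal sketch of such a ranking query, using the hypothetical columns above, might look like this:

SELECT user_id,
       page_id,
       view_count,
       RANK() OVER (PARTITION BY user_id ORDER BY view_count DESC) AS page_rank
FROM (
  -- count how many times each user viewed each page on one date in the US
  SELECT user_id, page_id, COUNT(*) AS view_count
  FROM page_views_orc
  WHERE view_date = '2016-01-01'   -- hypothetical date value
    AND country = 'US'
  GROUP BY user_id, page_id
) per_page;

The statistics collection and session-level settings below were then applied before running the query.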
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS;
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
Let’s look at each of the Hive settings.
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
These two settings enable predicate pushdown and allow the predicates to be pushed down to the storage layer, so filters from the WHERE clause are applied as early as possible and less data is read and passed between operators.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
Vectorized execution processes rows in batches of 1024 rather than one row at a time, which cuts CPU overhead considerably; the second setting extends vectorization to the reduce side of the job.
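On Hive 2.3 and later (worth verifying for your distribution), EXPLAIN VECTORIZATION can be used to confirm that a query actually runs vectorized; the query shown below is just a hypothetical filter over the sketched table:

EXPLAIN VECTORIZATION
SELECT COUNT(*) FROM page_views_orc WHERE country = 'US';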
SET hive.cbo.enable=true;
This enables the cost-based optimizer (Calcite), which uses table and column statistics to pick a cheaper execution plan, for example a better join order.
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
With hive.compute.query.using.stats enabled, Hive can answer queries such as count(*), min, and max directly from statistics rather than running a job. Partition and column statistics are fetched from the metastore. Use this with caution: if you have too many partitions and/or columns, fetching the statistics can itself degrade performance.
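As an illustration (using the page_views_orc table from above), once basic and column statistics have been gathered, a simple count can be answered from metastore statistics without launching a Tez job, provided the statistics are up to date:

-- gather basic table statistics (row count, size)
ANALYZE TABLE page_views_orc COMPUTE STATISTICS;
-- gather per-column statistics for the CBO and stats-based answers
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS;
-- with hive.compute.query.using.stats=true this is served from the metastore
SELECT COUNT(*) FROM page_views_orc;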
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.max.partition.factor=20;
SET hive.exec.reducers.bytes.per.reducer=128000000;
This last set of settings is important. The first run produced 73 output files, with each file being around 12.5 MB in size. This is inefficient, as I explained earlier. With the above settings, we are essentially telling Hive an approximate maximum number of reducers to run, with the caveat that each reducer's output should be restricted to roughly 128 MB. Let's examine this:
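Working through the numbers from the first run (the file count and size come from the run described above; the reducer count is an approximation):

73 files x 12.5 MB per file ≈ 912 MB of total reduce output
912 MB / 128 MB per reducer ≈ 7 reducers

So instead of 73 small files, the optimized run should write on the order of 7 output files of roughly 128 MB each, which is far friendlier to HDFS and to downstream readers.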