HCC Hortonworks Community Connection

Hive table compression: bz2 vs Text vs Orc vs Parquet

Article by Willie Engelbrecht · Jan 16 at 01:00 PM

Note: this article compares only the on-disk footprint of each format, not query performance.

When using Hive as your engine for SQL queries, you might want to consider using the ORC or Parquet file format for your data. Both offer several advantages over plain text. Firstly, both apply columnar compression to the data within, whereas a plain text file has no compression at all. Secondly, both store lightweight indexes and basic statistics inside the file, such as min/max values and row counts, which help with query speed. Both formats can also express complex data structures (such as hierarchies), which a plain CSV file cannot. The combination of these features can speed up your queries many times over. For more on the underlying mechanics of both formats, have a look at https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

However, how much disk space do these formats actually use in Hive? Saving disk space (and, by extension, IO) is always a good thing, but it can be hard to calculate exactly how much space compression will save you. Every file and data set is different, and the data itself is the main factor in the compression ratio you will get: text compresses better than binary data, and repeating values and strings compress better than purely random data, and so forth.
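The data-dependence of compression is easy to demonstrate. The following sketch (using Python's standard bz2 module; the sample bytes are made up for illustration) compresses a CSV-like string of repeating values and an equally sized block of random bytes:

```python
# Sketch: why data content matters for compression.
# Repetitive, CSV-like text compresses dramatically; random bytes barely at all.
import bz2
import os

repetitive = b"SFO,LAX,2008,1,3," * 10_000   # repeating values, like flight data
random_data = os.urandom(len(repetitive))    # same size, but pure noise

print(len(repetitive))                  # original size of both inputs
print(len(bz2.compress(repetitive)))    # a tiny fraction of the input
print(len(bz2.compress(random_data)))   # close to (or above) the input size
```

The repetitive input shrinks by orders of magnitude, while the random input stays roughly the same size, which is why any compression figure is only meaningful for a specific data set.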

As a simple test, I took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html

The compressed bz2 download measures 108.5 MB, and uncompresses to 657.5 MB.

I then uploaded the data to HDFS, and created an external table on top of the uncompressed data set:

CREATE EXTERNAL TABLE flight_arrivals (
year               int,
month              int,
DayofMonth         int,
DayOfWeek          int,
DepTime            int,
CRSDepTime         int,
ArrTime            int,
CRSArrTime         int,
UniqueCarrier      string,
FlightNum          int,
TailNum            string,
ActualElapsedTime  int,
CRSElapsedTime     int,
AirTime            int,
ArrDelay           int,
DepDelay           int,
Origin             string,
Dest               string,
Distance           int,
TaxiIn             int,
TaxiOut            int,
Cancelled          int,
CancellationCode   int,
Diverted           int,
CarrierDelay       string,
WeatherDelay       string,
NASDelay           string,
SecurityDelay      string,
LateAircraftDelay  string
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/flight_arrivals';

Counting the rows:

select count(*) from flight_arrivals;
+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
1 row selected (3.341 seconds)

Next, I converted the data to an ORC and a Parquet table, using the default settings for both formats:

CREATE EXTERNAL TABLE flight_arrivals_external_orc STORED AS ORC AS SELECT * FROM flight_arrivals;
CREATE EXTERNAL TABLE flight_arrivals_external_parquet STORED AS PARQUET AS SELECT * FROM flight_arrivals;

Let's have a look at the disk usage for both (e.g. via hdfs dfs -du -h on the warehouse directory):

93.5 M   /warehouse/tablespace/external/hive/flight_arrivals_external_orc
146.6 M  /warehouse/tablespace/external/hive/flight_arrivals_external_parquet

In summary:

  Format       Size       Size as % of CSV
  -----------  ---------  ----------------
  bz2          108.5 MB   16.5%
  CSV (text)   657.5 MB   100%
  ORC           93.5 MB   14.2%
  Parquet      146.6 MB   22.3%
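The percentage column is simply each format's size divided by the uncompressed CSV size; a minimal sketch to recompute it from the measurements above:

```python
# Sketch: recompute the "Size as % of CSV" column from the measured sizes.
uncompressed_mb = 657.5  # plain-text CSV size measured above

sizes_mb = {"bz2": 108.5, "ORC": 93.5, "Parquet": 146.6}

for fmt, mb in sizes_mb.items():
    pct = round(mb / uncompressed_mb * 100, 1)
    print(f"{fmt:8s} {mb:6.1f} MB -> {pct}% of the original size")
```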

Keep in mind that the ORC and Parquet tables were created with default settings, and that no further optimizations were applied to either format. Both ORC and Parquet ship with many options and optimizations for compressing your data; only the defaults that ship with HDP 3.1 were used here. Consider taking a sample or subset of your own data set and experimenting with the various compression algorithms and options for each format to find the one that best suits your use case.
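As a hedged sketch of such experimentation: orc.compress and parquet.compression are standard Hive table properties, though the available codecs depend on your Hive and HDP versions, and the table names below are made up for illustration.

```sql
-- Sketch: overriding the default codecs via TBLPROPERTIES.
-- SNAPPY and GZIP are illustrative choices; check which codecs your version supports.
CREATE EXTERNAL TABLE flight_arrivals_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM flight_arrivals;

CREATE EXTERNAL TABLE flight_arrivals_parquet_gzip
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "GZIP")
AS SELECT * FROM flight_arrivals;
```

Re-running hdfs dfs -du -h against each resulting table directory makes it easy to compare codecs on your own data.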

