I had a couple of questions on file compression. We plan on using the ORC format for a data zone that will be heavily accessed by end users via Hive/JDBC.
What is the recommendation when it comes to compressing ORC files?
Do you think Snappy is a better option (over ZLIB) given Snappy's better read performance? (Snappy is more performant in a read-often scenario, which is usually the case for Hive data.) When would you choose ZLIB instead?
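For context, this is roughly how we set the codec per table today (table and column names are just placeholders):

    -- hypothetical table; the codec is picked via the orc.compress table property
    CREATE TABLE user_events (
      user_id    BIGINT,
      event_type STRING,
      event_ts   TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY");  -- or "ZLIB" (the default), or "NONE"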
As a side note: compression is a double-edged sword. Shrinking files can itself cause performance issues, because the smaller files interact badly with the HDFS block size once the data is spread across multiple nodes. You can blunt this with a deliberate compression strategy (see the snippet below).
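For reference, I believe the knob for this in recent Hive versions is hive.exec.orc.compression.strategy; a rough sketch of how we would use it:

    -- trade compression ratio for write speed (SPEED is the default)
    SET hive.exec.orc.compression.strategy=SPEED;
    -- or favor smaller files at the cost of slower writes
    SET hive.exec.orc.compression.strategy=COMPRESSION;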
Answer by gopal · Nov 18, 2015 at 06:00 AM
David's post is from 2014. Since then we switched away from standard Zlib in ORC.
See the slides from ORC 2015: Faster, Better, Smaller
Each column type (string, int, etc.) gets a different Zlib-compatible algorithm for compression (i.e. different trade-offs between RLE, Huffman, and LZ77).
After the columnar improvements, ORC+Zlib no longer has the historic weaknesses of Zlib: it is faster than SNAPPY to read, smaller than SNAPPY on disk, and only ~10% slower than SNAPPY to write.
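If you want to verify this on your own data, a quick sketch (table names are placeholders) is to write the same source table out with both codecs and compare:

    -- same data, two codecs
    CREATE TABLE events_zlib
    STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB")
    AS SELECT * FROM events_source;

    CREATE TABLE events_snappy
    STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
    AS SELECT * FROM events_source;

Then compare the on-disk footprint of the two table directories with hdfs dfs -du -h, and inspect stripe/codec details with hive --orcfiledump <path>.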
Answer by Jonas Straub · Nov 16, 2015 at 09:15 PM
ORC+ZLib seems to have the better overall performance. ZLib is also the default compression option; however, there are definitely valid cases for Snappy.
I like the comment from David (2014, before the ZLib update): "SNAPPY for time based performance, ZLIB for resource performance (Drive Space)." Make sure you check out David's post: https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
As @gopal pointed out in his comment, ORC has switched to a new ZLib algorithm, hence the combination ORC + (new) ZLib is the way to go. The performance difference between ZLib and Snappy for disk writes is rather small.
Btw., ZLib is not always the better option: when it comes to HBase, Snappy is usually better :)