Snappy vs. Zlib - Pros and Cons for each compression in Hive/ORC files

Question by Ancil McBarnett · Nov 16, 2015 at 08:32 PM · Tags: hive, hiveserver2, orc, compression

I had a couple of questions on file compression. We plan on using the ORC format for a data zone that will be heavily accessed by end users via Hive/JDBC.

What is the recommendation when it comes to compressing ORC files?

Do you think Snappy is a better option (over ZLIB) given Snappy's better read performance? (Snappy is more performant in a read-often scenario, which is usually the case for Hive data.) When would you choose ZLIB?
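
For reference, the knob involved is the table-level orc.compress property; a minimal sketch, with hypothetical table and column names:

-- Hypothetical table; orc.compress selects the codec, and ZLIB is the default.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");  -- valid values: NONE, ZLIB, SNAPPY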

As a side note: compression is a double-edged sword. You can also run into performance issues when moving from larger files spread among multiple nodes to smaller files that interact badly with the HDFS block size. You can blunt this by using a compression strategy (see the sketch below).
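
By "compression strategy" I mean the session-level hint Hive exposes (Hive 0.14+) for how aggressively ORC's codec should work, independent of which codec is chosen:

-- Trade write speed against file size without changing the codec itself.
SET hive.exec.orc.compression.strategy=SPEED;        -- favor faster writes (default)
SET hive.exec.orc.compression.strategy=COMPRESSION;  -- favor smaller files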


4 Replies

Best Answer

Answer by gopal · Nov 18, 2015 at 06:00 AM

David's post is from 2014. Since then we switched away from standard Zlib in ORC.

See the slides from ORC 2015: Faster, Better, Smaller

Each column type (string, int, etc.) gets a different Zlib-compatible algorithm for compression (i.e., different trade-offs of RLE/Huffman/LZ77).

After the columnar improvements, ORC+Zlib no longer has the historic weaknesses of Zlib, so it is faster than Snappy to read, smaller than Snappy on disk, and only ~10% slower than Snappy to write out.
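
In practice that just means leaving the default alone. A minimal sketch of rewriting an existing table with the (new) Zlib; the table names here are hypothetical:

-- ZLIB is already the ORC default; spelled out here only for clarity.
SET hive.exec.orc.default.compress=ZLIB;

CREATE TABLE events_zlib
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB")
AS SELECT * FROM events_orc;  -- events_orc: hypothetical source table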

Comment by Jonas Straub · Nov 18, 2015 at 06:03 AM

Thanks @gopal. In this case we should definitely use ORC+(new)Zlib. I'll edit my answer :)

Comment by Tom Benton · Nov 24, 2015 at 04:42 PM

@gopal just to confirm, these improvements would require HDP 2.3.x or later, correct?


Answer by Jonas Straub · Nov 16, 2015 at 09:15 PM

ORC+ZLib seems to have the better performance. ZLib is also the default compression option; however, there are definitely valid cases for Snappy.

I like the comment from David (2014, before the ZLib update): "SNAPPY for time based performance, ZLIB for resource performance (Drive Space)." Make sure you check out David's post: https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance

As @gopal pointed out, we have switched to a new ZLib algorithm, hence the combination ORC + (new) ZLib is the way to go. The performance difference between ZLib and Snappy on disk writes is rather small.
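
One hedged note if you already created tables with Snappy: switching the table property only affects files written afterwards, and existing ORC files keep the codec they were written with. Table name below is hypothetical:

-- Future writes use ZLIB; already-written ORC files remain Snappy-compressed.
ALTER TABLE events_orc SET TBLPROPERTIES ("orc.compress" = "ZLIB");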

Btw, ZLib is not always the better option; when it comes to HBase, Snappy is usually better :)


Answer by Neeraj Sabharwal · Nov 16, 2015 at 08:47 PM

@Ancil McBarnett Performance! Performance! and performance! :)

ORC + Zlib is the way to go.

Here are the details, based on a test done in my environment.

run 1 vs. run 2

[Attached screenshot: screen-shot-2015-11-16-at-34624-pm.png (102.3 kB), benchmark results for the two runs]
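
For anyone wanting to reproduce a comparison like this, a minimal sketch; my actual table definitions aren't shown above, so the column names and join key below are assumptions:

-- Write the same data twice, once per codec, then time identical queries.
CREATE TABLE abc_zlib   STORED AS ORC TBLPROPERTIES ("orc.compress" = "ZLIB")   AS SELECT * FROM abc;
CREATE TABLE abc_snappy STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY") AS SELECT * FROM abc;

-- run 1 vs. run 2: the same query against each copy (join key assumed to be id)
SELECT COUNT(*) FROM abc_zlib   a JOIN links l ON a.id = l.id;
SELECT COUNT(*) FROM abc_snappy a JOIN links l ON a.id = l.id;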
Comment by Jonas Straub · Nov 16, 2015 at 09:18 PM

Thanks for sharing! How many records were in the Links table? Is the dataset in Links a subset of the ABC dataset?

Comment by Neeraj Sabharwal (replying to Jonas Straub) · Nov 16, 2015 at 11:26 PM

ABC and Links were separate tables. @Jonas Straub


Answer by Timothy Spann · Jun 04, 2016 at 05:07 AM

Any updates for 2016?

Comment by gopal · Jun 04, 2016 at 05:34 AM

ORC is considering adding a faster decompression codec in 2016: zstd (ZStandard). The enum value for it has already been reserved, but we still need to work through the trade-offs involved in ZStd; more on that sometime later this year.

https://issues.apache.org/jira/browse/ORC-46

But bigger wins are in motion for ORC with LLAP: the in-memory format for LLAP isn't compressed at all, so it performs like ORC without compression overhead, while letting the cold data on disk sit around in Zlib.
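
If/when ORC-46 lands, selecting it would presumably use the same table-level knob; the ZSTD value below is an assumption based on that JIRA and is not a valid option as of this thread:

-- Assumed future syntax once zstd ships in ORC; not usable today.
CREATE TABLE events_zstd (event_id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZSTD");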
