How to create a Hive table (text/SequenceFile) with compression enabled?
While creating an ORC table I set orc.compress = SNAPPY, but is there a similar option for SequenceFile/text tables? Or do I need to enable compression with the SET parameters below before the insert statement?
I set the above properties, created a table stored as SequenceFile, and inserted data into it. But when I check the table using "describe formatted tablename",
I still see Compressed under Storage Information as "No". Why is this? Is this the right way to check table compression? And is it possible to store a text file in compressed format too?
Also, even though I created the ORC table with SNAPPY compression, I am able to insert data into it regardless of whether I set the above three SET parameters or not. Even if I set compression to false, the data still lands in the SNAPPY-compressed ORC table. So I am a little confused about what these three parameters actually do. Can someone please explain the usage of the above three settings?
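For context, the three SET parameters referred to above are not shown in the question; they are most likely the standard output-compression settings used before an INSERT. A hedged sketch (these exact lines are an assumption, not quoted from the original post):

```sql
-- Assumed: the usual trio for compressing SequenceFile/text output
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```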
Answer by Ravi Mutyala · May 18, 2016 at 06:23 PM
1. With the above parameters, you will get a compressed SequenceFile. But since this is file-level compression, you won't see it in the Hive metadata (with describe formatted tablename).
2. To get compression with text files, you can directly put compressed files (like file1.gz and file2.gz) into the external table's folder and Hive can use them. However, if these files are big, they will not be split, which ends up as one mapper per file.
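A minimal sketch of this approach (table name, columns, and path are made up for illustration):

```sql
-- Hypothetical external table over a folder containing file1.gz, file2.gz;
-- Hive decompresses gzipped text files transparently on read
CREATE EXTERNAL TABLE logs_gz (
  ts  STRING,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs';
```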
3. When you use ORC, you don't need to explicitly set hive.exec.compress.output or the mapred compression settings, as the ORC SerDe takes care of compression based on table properties.
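In other words, ORC compression is declared on the table itself, which is why the SET parameters make no difference there. A sketch (table name and columns are illustrative):

```sql
-- ORC compression comes from table properties, not session settings
CREATE TABLE events_orc (
  id      INT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```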
Answer by Benjamin Leonhardi · May 18, 2016 at 06:39 PM
So ORC is completely on its own. ORC files follow their own logic and compress data internally, so the files themselves are not externally compressed. The parameters you used don't matter for them; they are intended for SequenceFiles and delimited files.
For SequenceFiles, I think you did everything correctly:
For SequenceFiles (and delimited files), Hive essentially just uses the Hadoop input formats to read them. These natively support compressed files: the compression is signalled either in the SequenceFile header or by the filename of the delimited file (.gz will be unpacked with gzip, for example). So MapReduce/Tez knows whether the data is compressed and will just unpack it. I actually have no idea what the Compressed field in the describe output means; perhaps someone else has an idea there.
However, in general my tip: use ORC for any final table format. You do not need to worry about compression; just set the ORC parameters, and please use zlib, as it is slightly faster now than snappy and compresses much better. Hive has been heavily optimized for it.
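Following that tip, the zlib variant is the same table property with a different codec value (table name and columns are again illustrative):

```sql
-- Sketch: final table stored as ORC with zlib compression, as recommended
CREATE TABLE final_events (
  id  INT,
  val STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```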
Delimited files should mainly be used for import/export, so normally you will either get the files already compressed or use a tool like Pig/Spark that compresses the output you then import into Hadoop.
SequenceFiles can be a good way of storing temp tables etc. because they are very fast to read/write. What you did should be correct. I would just check the folder size with hadoop fs -du /apps/hive/warehouse/mydb/mytable to see if it gets smaller when you enable compression. I am not sure what the describe formatted output refers to; I don't think Hive actually knows whether the data is compressed without checking the data itself, but it might be worthwhile to check the code.