HCC Hortonworks Community Connection

Hive Multiple Small Files

Question by Nikkie Thomas · Jun 09, 2017 at 05:33 AM · Hive

Hi,

I have large CSV files (about 10 GB each) that arrive in Hadoop daily, one file per day. I have a Hive external table, Table1, pointing at the files (no partitions, no ORC). A second table, Table2 (external table, ORC with ZLIB compression), is partitioned by date (yyyy-mm-dd) and loaded from Table1 using insert into Table2 partition(columnname) select * from Table1 with hive.exec.dynamic.partition = true. Once compressed via ORC, each daily file comes to less than 10 MB (the compression ratio was a surprise to me). I have read about the small-files problem in Hadoop on the HW community.

Are there any additional Hive settings or considerations we should put in place so that we don't run into performance issues caused by many small files?
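
For readers following along, the daily load described above might look roughly like the sketch below. It is not the poster's actual script: the nonstrict mode line is an assumption that the post does not mention (Hive needs it when every partition column is dynamic), and select * only works here if the partition column is the last column of Table1.

```sql
-- Sketch of the daily load described in the question.
SET hive.exec.dynamic.partition = true;
-- Assumption: not stated in the post, but needed when all
-- partition columns are resolved dynamically.
SET hive.exec.dynamic.partition.mode = nonstrict;

-- "columnname" is the placeholder used in the question for the
-- date partition column; it must come last in the select list.
INSERT INTO TABLE Table2 PARTITION (columnname)
SELECT * FROM Table1;
```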

Thanks

Nikkie


5 Replies


Answer by Bala Vignesh N V · Jun 09, 2017 at 11:31 AM

Hi Nikkie Thomas

To control the number of files written to a Hive table, you can set the number of mappers/reducers to 1, depending on the need, so that the final output is always a single file. Otherwise, enable one of the settings below to merge the reducer output when its size is less than a block size.

  • hive.merge.mapfiles -- Merge small files at the end of a map-only job.
  • hive.merge.mapredfiles -- Merge small files at the end of a map-reduce job.
  • hive.merge.size.per.task -- Size of merged files at the end of the job.
  • hive.merge.smallfiles.avgsize -- When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
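
As a rough sketch, the four settings above could be applied per session as follows. The byte values are illustrative assumptions only, not recommendations from this thread; pick them relative to your own block size.

```sql
-- Merge small files at the end of map-only and map-reduce jobs.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;

-- Illustrative threshold (128 MB): trigger the extra merge job when
-- the average output file size of a job falls below this value.
SET hive.merge.smallfiles.avgsize = 134217728;

-- Illustrative target size (256 MB) for the merged files.
SET hive.merge.size.per.task = 268435456;
```

Note that merging only combines files produced by the same job, so a partition that receives a single small file per load has nothing to merge.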
Comment by Smart Solutions · Jun 22, 2017 at 06:11 AM

Hi @Bala Vignesh N V, I have a similar issue. I applied the settings above, but they did not help. I have posted a question on HCC: https://community.hortonworks.com/questions/109365/controlling-number-of-small-files-while-inserting.html


Answer by Nikkie Thomas · Jun 09, 2017 at 11:52 AM

Thanks for your response. If I partition the data by the yyyy-mm-dd field and receive only one file per day, I assume I will always have one file per partition, irrespective of these settings?

If that assumption is correct (please correct me if it is wrong), will my select queries run slower if I store files for, say, 6 years?

That is, I will have 6 * 365 files, each around 8-9 MB in size (which is smaller than the default block size).

I was hoping to consolidate the files on a weekly basis, but I need the data to be available to users daily, so I don't think I can do that.

Let me know your suggestions.

Thanks

Nikkie


Answer by Bala Vignesh N V · Jun 09, 2017 at 02:49 PM

Nikkie Thomas

"If I partition the data by yyyy-mm-dd field and I receive only one file per day. I assume , I will always have one file per partition irrespective of this setting?" --> It's not that simple, because it depends on the size of your input file, the block size, the mapper/reducer sizes, and other variables. If your input file is smaller than the block size, it should create only one file.

If you partition the table daily with so little data per partition, it will cause performance issues as the table grows, and partitioning alone will not help much. In that situation I would partition the table on a yearly basis and bucket it on a frequently used filter column. In your case the partitioning can be daily, weekly, or yearly. Even so, each file in a bucketed folder will still be small if the data size is small.
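
A minimal sketch of the yearly-partition-plus-buckets layout suggested above might look like this. The table name, column names, and bucket count are all hypothetical placeholders, since the thread does not give the actual schema.

```sql
-- Hypothetical schema: every name below is a placeholder.
CREATE EXTERNAL TABLE table2_yearly (
  event_date  STRING,   -- the yyyy-mm-dd value from the source file
  customer_id BIGINT,   -- stand-in for a frequently filtered column
  amount      DOUBLE
)
PARTITIONED BY (event_year INT)
-- Bucket count of 8 is an arbitrary illustrative choice.
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```

With this layout each daily load still adds new files to the current year's partition, so the file count per partition keeps growing unless the output is merged or compacted periodically.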


Answer by Laurent Edel · Jun 13, 2017 at 03:09 PM

@Nikkie Thomas you can specify the number of reducers for a query:

hive> set mapreduce.job.reduces=1;
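
One caveat, offered as an assumption rather than something stated in this thread: a plain insert ... select is often a map-only job, in which case the reducer count has no effect on the number of output files. A commonly used workaround is to add DISTRIBUTE BY on the partition column to force a reduce stage, so that all rows for a given partition go to the same reducer. A sketch, reusing the table and column names from the question:

```sql
SET hive.exec.dynamic.partition = true;
-- Assumption: needed when all partition columns are dynamic.
SET hive.exec.dynamic.partition.mode = nonstrict;
SET mapreduce.job.reduces = 1;

-- DISTRIBUTE BY forces a shuffle, so the reducer setting applies.
INSERT INTO TABLE Table2 PARTITION (columnname)
SELECT * FROM Table1
DISTRIBUTE BY columnname;
```

A single reducer serializes the write, which may be acceptable for an output of around 10 MB but would likely be slow for larger loads.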

Answer by Michał Baran · Jan 10 at 11:14 PM

Hi,

I found that when using Tez (an execution engine for Hive), you need a different parameter to get a single output file:

SET hive.merge.tezfiles=true;

Tez is in many cases faster than the MR2 engine. To check which execution engine you are using, run this in Hive:

SET hive.execution.engine;

If you want to switch to Tez, set it either in hive-site.xml or for each Hive session:

SET hive.execution.engine=tez;

Best regards,

Michał

Hortonworks - Develops, Distributes and Supports Open Enterprise Hadoop.

© 2011-2017 Hortonworks Inc. All Rights Reserved.
Hadoop, Falcon, Atlas, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie and the Hadoop elephant logo are trademarks of the Apache Software Foundation.