Performance issues with small files on Hive

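I am reading an article about how small files degrade Hive query performance: https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1

I understand the first part, about overloading the NameNode.

However, what the article says about MapReduce does not seem to happen, on either MapReduce or Tez:

"When a MapReduce job launches, it schedules one map task per block of data being processed"

I do not see one mapper task created per file. A possible reason is that the article refers to version 1 of MapReduce, and a lot has changed since then.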
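To make the article's claim concrete, here is a minimal sketch of the mapper count it would predict if every block of every file got its own map task. The 128 MB block size and the 2 KB file size are assumptions for illustration; neither value is stated in the question, and the function name is hypothetical.

```python
import math

# Assumed HDFS block size (128 MB); not stated in the question.
BLOCK_SIZE = 128 * 1024 * 1024


def map_tasks_one_per_block(file_sizes, block_size=BLOCK_SIZE):
    """Count map tasks when every block of every file gets its own task."""
    # A non-empty file always occupies at least one block, however small it is.
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# Hypothetical input: 100 files of about 2 KB each, like the table in this question.
small_files = [2 * 1024] * 100
print(map_tasks_one_per_block(small_files))  # 100
```

Under that model, 100 tiny files would mean 100 mappers, which is what I expected to see.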

Hive version: Hive 1.2.1000.2.6.4.0-91

My table:

create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;

Data: the following command creates 100 small files, each containing only a few KB of data.

for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', $(shuf -i 1000-5000 -n 1));"; done

But I see only one mapper and one reducer task created for the following query.

[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)

The result is the same with MapReduce.

hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job  -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%,  reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.36 sec   HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
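Answer

This is because the following configuration is in effect:

hive.hadoop.supports.splittable.combineinputformat

From the documentation:

"Whether to combine small input files so that fewer mappers are spawned."

So Hive can essentially infer that the input is a set of small files, each smaller than the block size, and combine them, which reduces the number of mappers required.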
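
The effect of combining can be sketched with a simplified model: pack files greedily into splits up to a size limit, with one mapper per split. This is only an illustration under stated assumptions, not Hive's actual combining logic, which takes more into account (data locality, for example). The 256 MB limit, the 2 KB file size, and the function name are all hypothetical.

```python
# Assumed upper bound for one combined split (256 MB); an illustrative value only.
MAX_SPLIT_SIZE = 256 * 1024 * 1024


def combine_into_splits(file_sizes, max_split_size=MAX_SPLIT_SIZE):
    """Greedily pack files into splits; each split would be read by one mapper."""
    splits = []
    current = []
    current_size = 0
    for size in file_sizes:
        # Close the current split once adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: 100 files of about 2 KB each, as in the question.
small_files = [2 * 1024] * 100
print(len(combine_into_splits(small_files)))  # 1
```

With the whole table adding up to a few hundred KB, everything fits in a single split, which matches the single mapper shown in the question's output.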
