Hive's MR execution and basic MapReduce design patterns

Posted 2020-10-04

(Original article; please do not repost.)

In Hive, prefixing a select query with explain or explain extended prints a brief description of how the query will be executed as MapReduce. The explain output looks something like this:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:    --map tree
          TableScan
            alias: testtb
            Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: zd1 (type: string), zd2 (type: string), zd3 (type: string)
              outputColumnNames: zd1, zd2, zd3
              Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(zd3)
                keys: zd1 (type: string), zd2 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                  Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col2 (type: double)
      Reduce Operator Tree:    --reduce tree
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: double)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

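This output can be used to analyze the MapReduce process. The plan above is the MR process for grouping by zd1 and zd2 and computing sum(zd3), i.e. for a query of the form select zd1, zd2, sum(zd3) from testtb group by zd1, zd2: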

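Here the group by columns are used directly as the key. By default Hive first aggregates once on the map side (set hive.map.aggr=true), where the Group By Operator runs in mode hash. The reduce side then aggregates again, merging the partial results, so its mode is mergepartial. If map-side aggregation is turned off (set hive.map.aggr=false), the reduce-side mode is complete.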
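The difference between the two settings can be illustrated with a small simulation. This is only a sketch in plain Python, not Hive's actual implementation: the sample rows, the split into two map tasks, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (zd1, zd2, zd3) rows, split across two map tasks.
SPLITS = [
    [("a", "x", 1.0), ("a", "x", 2.0), ("b", "y", 5.0)],
    [("a", "x", 4.0), ("b", "y", 0.5)],
]

def map_hash_aggregate(rows):
    """Map side as with hive.map.aggr=true: keep partial sums in a hash table."""
    partial = defaultdict(float)
    for zd1, zd2, zd3 in rows:
        partial[(zd1, zd2)] += zd3
    return list(partial.items())

def map_passthrough(rows):
    """Map side as with hive.map.aggr=false: emit every row unaggregated."""
    return [((zd1, zd2), zd3) for zd1, zd2, zd3 in rows]

def reduce_sum(map_outputs):
    """Reduce side: sum everything that arrives for each key.

    Fed partial sums, this plays the role of mode mergepartial;
    fed raw rows, it plays the role of mode complete.
    """
    totals = defaultdict(float)
    for output in map_outputs:
        for key, value in output:
            totals[key] += value
    return dict(totals)

with_map_aggr = [map_hash_aggregate(split) for split in SPLITS]
without_map_aggr = [map_passthrough(split) for split in SPLITS]

result_partial = reduce_sum(with_map_aggr)
result_complete = reduce_sum(without_map_aggr)

# Number of key/value pairs that would be sent from map to reduce.
shuffled_with = sum(len(output) for output in with_map_aggr)
shuffled_without = sum(len(output) for output in without_map_aggr)
```

Both paths produce the same sums. In this toy data the map-side aggregation sends fewer pairs to the reduce side, which is the usual motivation for aggregating on the map side first.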

Basic MapReduce design patterns (reference: MapReduce Design Patterns, by Donald Miner and Adam Shook):

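1. Grouped numerical aggregation. In this pattern the map side uses the fields to group by (the group by columns) directly as the keys, and the values carry the data that is needed. The reduce side applies f(values) to each group of keys to get the required result.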
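The pattern in item 1 can be sketched as three small functions. This is a minimal in-memory sketch in plain Python rather than a real MapReduce job; the function names and sample records are hypothetical, and f is whatever aggregate is required.

```python
from collections import defaultdict

def map_phase(records, key_fields, value_field):
    """Emit (keys, value) pairs: the group-by fields form the key."""
    for record in records:
        key = tuple(record[field] for field in key_fields)
        yield key, record[value_field]

def shuffle(pairs):
    """Collect all values that share the same key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, f):
    """Apply the aggregate f to the values of each key group."""
    return {key: f(values) for key, values in groups.items()}

def group_aggregate(records, key_fields, value_field, f):
    pairs = map_phase(records, key_fields, value_field)
    return reduce_phase(shuffle(pairs), f)

# Hypothetical sample data using the column names from the plan above.
records = [
    {"zd1": "a", "zd2": "x", "zd3": 1.0},
    {"zd1": "a", "zd2": "x", "zd3": 4.0},
    {"zd1": "b", "zd2": "y", "zd3": 5.0},
    {"zd1": "a", "zd2": "x", "zd3": 2.0},
]

sums = group_aggregate(records, ["zd1", "zd2"], "zd3", sum)
maxes = group_aggregate(records, ["zd1", "zd2"], "zd3", max)
```

Only f changes between the two calls; the keys and the grouping stay the same, which is what makes this a reusable pattern.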

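2. Join. The map side uses the join fields as the keys and emits each record as the value, tagging records from different tables with a flag. On the reduce side, for each group of keys, the records are placed into one list per flag, and the records in the different lists are then joined with each other and output. For an inner join, every list must be non-empty; for left, right and similar joins, a group is output even when a list is empty.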
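The join described in item 2 can be sketched in the same style. Again this is a plain Python sketch under stated assumptions: two tables only, the flags "L" and "R", the sample tables, and the function names are all hypothetical, and a missing side is represented by None.

```python
from collections import defaultdict
from itertools import product

def map_phase(rows, key_field, flag):
    """Emit (join key, (flag, record)) so the reducer knows the source table."""
    for row in rows:
        yield row[key_field], (flag, row)

def reduce_join(pairs, how="inner"):
    """Per key, put records into one list per flag, then combine the lists."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (flag, row) in pairs:
        groups[key][flag].append(row)

    joined = []
    for key in sorted(groups):
        left, right = groups[key]["L"], groups[key]["R"]
        if how == "left":
            right = right or [None]   # keep left rows with no match
        elif how == "right":
            left = left or [None]     # keep right rows with no match
        elif how != "inner":
            raise ValueError(f"unsupported join type: {how}")
        # With an empty list on either side the product is empty,
        # which is exactly the inner join rule.
        joined.extend((key, l, r) for l, r in product(left, right))
    return joined

# Hypothetical sample tables joined on "id".
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [
    {"id": 1, "item": "pen"},
    {"id": 1, "item": "ink"},
    {"id": 3, "item": "cup"},
]

pairs = list(map_phase(users, "id", "L")) + list(map_phase(orders, "id", "R"))
inner = reduce_join(pairs, "inner")
left = reduce_join(pairs, "left")
right = reduce_join(pairs, "right")
```

Here the inner join keeps only id 1, the left join additionally keeps bob with None on the order side, and the right join additionally keeps the order for id 3 with None on the user side.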

(To be continued...)