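HIVE: Map Joins in partitioned tables
Posted: 2017-06-12 11:20:03
Question:
Consider a typical data warehouse scenario in Hive with fact and dimension tables, where the fact table is partitioned and its data is spread across multiple data nodes. When joining the partitioned fact table with the non-partitioned dimension tables, a map join seems the logical choice: the dimension tables are small, so they can be held in memory and joined efficiently against the fact data on every node.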
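However, a few online resources suggest that, for a map join on partitioned tables, the partition key of both tables must be the same as the join key.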
So here is the question I am looking for an answer to: can a partitioned table (fact) be map-joined with a non-partitioned table (dimension)?
Answer 1:
The answer is: yes.
Note the Map Join Operator in the execution plan of the demo.
Demo
create table fact (rec_id int,dim_id int) partitioned by (dt date);
create table dim (dim_id int,descr string);
explain
select *
from fact f join dim d
on d.dim_id = f.dim_id;
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: dim_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 dim_id (type: int)
                  1 dim_id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: f
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: dim_id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 dim_id (type: int)
                  1 dim_id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col6, _col7
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: date), _col6 (type: int), _col7 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
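In the plan above, dim is loaded into a hash table in the local-work stage (Stage-4) and joined against fact inside the mapper (Stage-3), even though fact is partitioned by dt, which is not the join key. As a rough sketch of how this could be tried on a real cluster, the settings below are the ones that usually decide whether Hive chooses a map join. The threshold value and the dt literal are illustrative assumptions, not values from the question, and defaults differ between Hive versions and execution engines. The requirement that keys line up on both tables is possibly a description of bucketed variants such as bucket map join rather than of a plain map join.

```sql
-- Sketch only: verify these settings against your own Hive version.

-- Allow the optimizer to convert a common join into a map join
-- when one side is small enough to be held in memory.
set hive.auto.convert.join=true;

-- Size threshold in bytes below which a table counts as "small".
-- The value here is illustrative.
set hive.mapjoin.smalltable.filesize=25000000;

-- Same join as in the demo, restricted to a single fact partition.
-- The dt filter only prunes partitions of fact; dim remains the
-- in-memory side, so a Map Join Operator should still be expected.
explain
select *
from fact f join dim d
on d.dim_id = f.dim_id
where f.dt = date '2017-06-12';
```

If the resulting plan shows a Map Join Operator in the map stage and no reduce-side Join Operator, the dimension table was broadcast as expected.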