A Spark SQL join pitfall: a LEFT JOIN that ends up executed as an INNER JOIN
Posted by javartisan
SQL1:
SELECT lower(trim(user_log_acct)) as user_log_acct,
after_prefr_amount_1,
sale_qtty,
free_goods_sale_qtty,
parent_sale_ord_id,
ord_flag,
shop_id,
main_sku_id,
item_id,
if(b.brand_cd is null, o.brand_cd, b.main_brand_cd) as main_brand_cd,
item_third_cate_cd
FROM adm.adm_d04_trade_ord_det_sku_snapshot o
LEFT JOIN app.app_dm_dim_4a_main_brand b on o.brand_cd = b.brand_cd
WHERE o.dt >= '2019-11-25'
and o.dt <= '2020-11-23'
and b.dt = '2020-11-23'
and intraday_ord_deal_flag = 1
and item_second_cate_cd in ('6145', '6160', '12040', '16753', '16751', '1387', '5026', '2615', '1381')
Physical plan 1:
== Physical Plan ==
*(2) Project [lower(trim(user_log_acct#603, None)) AS user_log_acct#395, after_prefr_amount_1#733, sale_qtty#730L, free_goods_sale_qtty#732L, parent_sale_ord_id#602, ord_flag#647, shop_id#720, main_sku_id#685, item_id#687, if (isnull(brand_cd#786)) brand_cd#690 else main_brand_cd#788 AS main_brand_cd#396, item_third_cate_cd#696]
+- *(2) BroadcastHashJoin [cast(brand_cd#690 as double)], [cast(brand_cd#786 as double)], Inner, BuildRight
:- *(2) Project [parent_sale_ord_id#602, user_log_acct#603, ord_flag#647, main_sku_id#685, item_id#687, brand_cd#690, item_third_cate_cd#696, shop_id#720, sale_qtty#730L, free_goods_sale_qtty#732L, after_prefr_amount_1#733]
: +- *(2) Filter (((isnotnull(intraday_ord_deal_flag#659) && (cast(intraday_ord_deal_flag#659 as double) = 1.0)) && item_second_cate_cd#694 IN (6145,6160,12040,16753,16751,1387,5026,2615,1381)) && isnotnull(brand_cd#690))
: +- *(2) FileScan orc adm.adm_d04_trade_ord_det_sku_snapshot[parent_sale_ord_id#602,user_log_acct#603,ord_flag#647,intraday_ord_deal_flag#659,main_sku_id#685,item_id#687,brand_cd#690,item_second_cate_cd#694,item_third_cate_cd#696,shop_id#720,sale_qtty#730L,free_goods_sale_qtty#732L,after_prefr_amount_1#733,dt#784,tp#785] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://ns1/user/dd_edw/adm.db/adm_d04_trade_ord_det_sku_snapshot/dt=2019-..., PartitionFilters: [isnotnull(dt#784), (dt#784 >= 2019-11-25), (dt#784 <= 2020-11-23)], PushedFilters: [IsNotNull(intraday_ord_deal_flag), In(item_second_cate_cd, [6145,6160,12040,16753,16751,1387,502..., ReadSchema: struct<parent_sale_ord_id:string,user_log_acct:string,ord_flag:string,intraday_ord_deal_flag:stri...
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as double)))
+- *(1) Project [brand_cd#786, main_brand_cd#788]
+- *(1) Filter isnotnull(brand_cd#786)
+- *(1) FileScan orc app.app_dm_dim_4a_main_brand[brand_cd#786,main_brand_cd#788,dt#790] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://ns1016/user/xx_ad/ads_dm/app.db/app_dm_dim_4a_main_brand/dt=2020-1..., PartitionFilters: [isnotnull(dt#790), (dt#790 = 2020-11-23)], PushedFilters: [IsNotNull(brand_cd)], ReadSchema: struct<brand_cd:int,main_brand_cd:string>
The physical plan shows that the LEFT JOIN was rewritten as an INNER JOIN: note `BroadcastHashJoin ..., Inner, BuildRight` on the second line. The cause is the WHERE predicate `b.dt = '2020-11-23'` on the right-hand table. A NULL-extended row produced by the left join has `b.dt = NULL` and can never satisfy that predicate, so the optimizer treats it as null-rejecting and legally converts the outer join to an inner join, silently dropping every left row without a match.
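The rewrite is easy to reproduce outside Spark. A minimal Python sketch (toy data invented for illustration, not the tables above) shows why a WHERE predicate on the right table is null-rejecting: it can never pass for the NULL-extended rows a LEFT JOIN produces, so filtering after the join yields exactly the INNER JOIN result.

```python
# Toy rows: orders o keyed by brand_cd, dim table b with a partition column dt.
orders = [{"brand_cd": 1}, {"brand_cd": 2}, {"brand_cd": 3}]
dim = [{"brand_cd": 1, "dt": "2020-11-23"},
       {"brand_cd": 2, "dt": "2020-11-22"}]  # brand 3 has no dim row at all

def left_join(left, right, key):
    """LEFT JOIN: unmatched left rows get None (SQL NULL) for right columns."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, "dt": None})  # NULL-extended row
    return out

def inner_join(left, right, key):
    return [{**l, **r} for l in left for r in right if r[key] == l[key]]

# SQL1 pattern: LEFT JOIN ... WHERE b.dt = '2020-11-23'
sql1 = [row for row in left_join(orders, dim, "brand_cd")
        if row["dt"] == "2020-11-23"]  # drops every NULL-extended row

# The equivalent INNER JOIN with the filter applied to the right side first
inner = inner_join(orders,
                   [r for r in dim if r["dt"] == "2020-11-23"],
                   "brand_cd")

print(sql1 == inner)  # True: the WHERE clause made the LEFT JOIN an INNER JOIN
```

Brands 2 and 3 vanish from `sql1` even though a LEFT JOIN was requested, which is exactly the silent row loss the optimizer's rewrite produces at scale.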
--------
SQL2:
SELECT lower(trim(user_log_acct)) as user_log_acct,
after_prefr_amount_1,
sale_qtty,
free_goods_sale_qtty,
parent_sale_ord_id,
ord_flag,
shop_id,
main_sku_id,
item_id,
if(b.brand_cd is null, o.brand_cd, b.main_brand_cd) as main_brand_cd,
item_third_cate_cd
FROM adm.adm_d04_trade_ord_det_sku_snapshot o
LEFT JOIN (SELECT brand_cd, main_brand_cd FROM app.app_dm_dim_4a_main_brand WHERE dt = '2020-11-23') b
ON o.brand_cd = b.brand_cd
WHERE o.dt >= '2019-11-25'
AND o.dt <= '2020-11-23'
AND intraday_ord_deal_flag = 1
AND item_second_cate_cd IN (6145, 6160, 12040, 16753, 16751, 1387, 5026, 2615, 1381)
Physical plan 2:
== Physical Plan ==
*(2) Project [lower(trim(user_log_acct#1203, None)) AS user_log_acct#995, after_prefr_amount_1#1333, sale_qtty#1330L, free_goods_sale_qtty#1332L, parent_sale_ord_id#1202, ord_flag#1247, shop_id#1320, main_sku_id#1285, item_id#1287, if (isnull(brand_cd#1386)) brand_cd#1290 else main_brand_cd#1388 AS main_brand_cd#996, item_third_cate_cd#1296]
+- *(2) BroadcastHashJoin [cast(brand_cd#1290 as double)], [cast(brand_cd#1386 as double)], LeftOuter, BuildRight
:- *(2) Project [parent_sale_ord_id#1202, user_log_acct#1203, ord_flag#1247, main_sku_id#1285, item_id#1287, brand_cd#1290, item_third_cate_cd#1296, shop_id#1320, sale_qtty#1330L, free_goods_sale_qtty#1332L, after_prefr_amount_1#1333]
: +- *(2) Filter ((isnotnull(intraday_ord_deal_flag#1259) && (cast(intraday_ord_deal_flag#1259 as double) = 1.0)) && item_second_cate_cd#1294 IN (6145,6160,12040,16753,16751,1387,5026,2615,1381))
: +- *(2) FileScan orc adm.adm_d04_trade_ord_det_sku_snapshot[parent_sale_ord_id#1202,user_log_acct#1203,ord_flag#1247,intraday_ord_deal_flag#1259,main_sku_id#1285,item_id#1287,brand_cd#1290,item_second_cate_cd#1294,item_third_cate_cd#1296,shop_id#1320,sale_qtty#1330L,free_goods_sale_qtty#1332L,after_prefr_amount_1#1333,dt#1384,tp#1385] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://ns1/user/dd_edw/adm.db/adm_d04_trade_ord_det_sku_snapshot/dt=2019-..., PartitionFilters: [isnotnull(dt#1384), (dt#1384 >= 2019-11-25), (dt#1384 <= 2020-11-23)], PushedFilters: [IsNotNull(intraday_ord_deal_flag), In(item_second_cate_cd, [6145,6160,12040,16753,16751,1387,502..., ReadSchema: struct<parent_sale_ord_id:string,user_log_acct:string,ord_flag:string,intraday_ord_deal_flag:stri...
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as double)))
+- *(1) Project [brand_cd#1386, main_brand_cd#1388]
+- *(1) Filter isnotnull(brand_cd#1386)
+- *(1) FileScan orc app.app_dm_dim_4a_main_brand[brand_cd#1386,main_brand_cd#1388,dt#1390] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://ns1016/user/xx_ad/ads_dm/app.db/app_dm_dim_4a_main_brand/dt=2020-1..., PartitionFilters: [isnotnull(dt#1390), (dt#1390 = 2020-11-23)], PushedFilters: [IsNotNull(brand_cd)], ReadSchema: struct<brand_cd:int,main_brand_cd:string>
Physical plan 2 keeps the join as `LeftOuter`: moving the partition filter `dt = '2020-11-23'` into a subquery applies it before the join, so no predicate rejects the NULL-extended rows and the LEFT JOIN semantics survive. The takeaway: with an outer join, filters on the null-supplying side belong in the join condition or a subquery, never in the WHERE clause. A second detail worth noticing in both plans is that the join keys are compared as `cast(brand_cd as double)` because the two `brand_cd` columns have different types (string vs int); aligning the column types avoids the implicit cast and its matching surprises.
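The fix in SQL2 can be sketched the same way (toy data again, not the real tables): filter the dim table first, then left-join, and let unmatched rows fall back to their own `brand_cd`, mirroring the query's `if(b.brand_cd is null, o.brand_cd, b.main_brand_cd)` expression.

```python
orders = [{"brand_cd": 1}, {"brand_cd": 3}]  # brand 3 has no dim row
dim = [{"brand_cd": 1, "main_brand_cd": "A",   "dt": "2020-11-23"},
       {"brand_cd": 1, "main_brand_cd": "old", "dt": "2020-11-22"}]

# SQL2 pattern: push the dt filter into a subquery, THEN left join.
b = [r for r in dim if r["dt"] == "2020-11-23"]

result = []
for o in orders:
    matches = [r for r in b if r["brand_cd"] == o["brand_cd"]]
    if matches:
        for r in matches:
            # matched: take the dim table's main brand
            result.append({**o, "main_brand_cd": r["main_brand_cd"]})
    else:
        # unmatched left row survives; fall back to the order's own brand_cd,
        # like if(b.brand_cd is null, o.brand_cd, b.main_brand_cd)
        result.append({**o, "main_brand_cd": o["brand_cd"]})

print(result)  # brand 1 resolves to "A"; brand 3 is kept with its own code
```

Unlike the SQL1 pattern, brand 3 is preserved, which is what a LEFT JOIN is supposed to guarantee.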