Generate leading values in hive table
Posted: 2021-03-25 00:41:29
Question: I have two tables, cust_msg and cust_audit, shown below.
cust_msg
+-----------+----------+------------+----------+
|cust_id |first_name|progress |process_dt|
+-----------+----------+------------+----------+
|106674 |Charley |Initiate |20210202 |
|106674 |Charley |Review |20210203 |
|106674 |Charley |Realign |20210204 |
|106674 |Charley |Approved |20210211 |
|106674 |Charley |Installation|20210216 |
|106674 |Charley |Survey |20210323 |
+-----------+----------+------------+----------+
cust_audit
+-----------+----------+----------------------------+----------+
|cust_id |agent_id |adt_log |process_dt|
+-----------+----------+----------------------------+----------+
|106674 |602 |Promotional Offer sent |20210112 |
|106674 |602 |Click Open Promo |20210113 |
|106674 |100 |Promo Inquiry |20210114 |
|106674 |100 |Cust Waiting |20210118 |
|106674 |100 |Customer Application |20210119 |
|106674 |602 |Appl Approved |20210122 |
|106674 |602 |Initiate Appl |20210201 |
|106674 |602 |Sale Initiated |20210202 |
|106674 |602 |sale Rv Pending |20210203 |
|106674 |602 |sales-cust Realign |20210203 |
|106674 |602 |cust in aggrement |20210204 |
|106674 |602 |Sales Dep Approve |20210208 |
|106674 |602 |mgt Approved |20210211 |
|106674 |602 |Installation pending |20210216 |
|106674 |602 |Cust Survey |20210323 |
+-----------+----------+----------------------------+----------+
I need to build another table, cust_del_detail, as shown below. The progress column should be Initiate for every audit row up to and including process_dt = 20210202, because progress in cust_msg is Initiate on 20210202. After that, progress should be Review on 20210203, Realign from 20210204 to 20210208, and likewise Approved on 20210211, Installation on 20210216, and Survey on 20210323.
cust_del_detail
+-----------+--------------+----------+---------------------------+---------------+----------+
|cust_id |first_name |agent_id |adt_log |progress |process_dt|
+-----------+--------------+----------+---------------------------+---------------+----------+
|106674 |Charley |602 |Promotional Offer sent |Initiate |20210112 |
|106674 |Charley |602 |Click Open Promo |Initiate |20210113 |
|106674 |Charley |100 |Promo Inquiry |Initiate |20210114 |
|106674 |Charley |100 |Cust Waiting |Initiate |20210118 |
|106674 |Charley |100 |Customer Application |Initiate |20210119 |
|106674 |Charley |602 |Appl Approved |Initiate |20210122 |
|106674 |Charley |602 |Initiate Appl |Initiate |20210201 |
|106674 |Charley |602 |Sale Initiated |Initiate |20210202 |
|106674 |Charley |602 |sale Rv Pending |Review |20210203 |
|106674 |Charley |602 |sales-cust Realign |Review |20210203 |
|106674 |Charley |602 |cust in aggrement |Realign |20210204 |
|106674 |Charley |602 |Sales Dep Approve |Realign |20210208 |
|106674 |Charley |602 |mgt Approved |Approved |20210211 |
|106674 |Charley |602 |Installation pending |Installation |20210216 |
|106674 |Charley |602 |Cust Survey |Survey |20210323 |
+-----------+--------------+----------+---------------------------+---------------+----------+
I tried the lead window function in Hive but could not get this result. What is the best way to achieve it in Hive or PySpark?
Comments:
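The logic is a bit confusing. Why does Realign run from 20210204 to 20210208? I am trying to understand why a given progress should be tied to a given adt_log. You could say it is based on the date, but how? Can you provide more details? If it helps, maybe you can join the two tables and add a join condition that puts process_dt in a between test against audit_log.process_dt.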
Realign is joined for process_dt 20210204 through 20210208 because, in table cust_msg, Realign starts on 20210204 and lasts until 20210211.
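Yes, I need to join tables cust_audit and cust_msg so that the cust_audit data carries the first_name and progress fields. The progress in the result table should be derived from process_dt in cust_audit and cust_msg. In cust_msg, progress Initiate occurs on 20210202, so for every cust_audit row with process_dt up to 20210202 I want to show progress = Initiate.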
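The rule described in these comments can be sketched in plain Python before moving to Hive or PySpark. This is only an illustration of the intended mapping, under the assumption that each audit row takes the latest cust_msg stage dated on or before it, and that audit rows earlier than the first stage fall back to that first stage. The names `stages`, `stage_dates`, and `progress_on` are hypothetical and not part of the question.

```python
from bisect import bisect_right

# Stage changes from cust_msg, sorted by process_dt (yyyymmdd as int).
stages = [
    (20210202, "Initiate"),
    (20210203, "Review"),
    (20210204, "Realign"),
    (20210211, "Approved"),
    (20210216, "Installation"),
    (20210323, "Survey"),
]
stage_dates = [d for d, _ in stages]


def progress_on(audit_dt):
    """Return the latest stage whose process_dt <= audit_dt.

    Assumption: audit rows dated before the first stage are assigned
    that first stage (here, Initiate).
    """
    i = bisect_right(stage_dates, audit_dt) - 1
    return stages[max(i, 0)][1]
```

For example, `progress_on(20210208)` returns `"Realign"`, because the Approved stage only starts on 20210211.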
Answer 1:
You can use lead and a join:
from pyspark.sql import functions as F, Window

# lead_dt is the date of the customer's next stage (null for the last stage)
cust_del_detail = cust_msg.withColumn(
    'lead_dt',
    F.lead('process_dt').over(Window.partitionBy('cust_id').orderBy('process_dt'))
).alias('cust_msg').join(
    cust_audit.alias('cust_audit'),
    # cust_id equality keeps customers from matching each other's rows
    F.expr('''
        cust_msg.cust_id = cust_audit.cust_id
        and (
            (progress != "Initiate"
                and (cust_audit.process_dt < cust_msg.lead_dt or cust_msg.lead_dt is null)
                and (cust_audit.process_dt >= cust_msg.process_dt)
            )
            or
            (progress = "Initiate"
                and (cust_audit.process_dt <= cust_msg.process_dt)
            )
        )
    '''),
    'right'  # keep every audit row
).selectExpr(
    'cust_msg.cust_id',
    'cust_msg.first_name',
    'cust_audit.agent_id',
    'cust_audit.adt_log',
    'cust_msg.progress',
    'cust_audit.process_dt'
)
Result:
cust_del_detail.show(truncate=False)
+-------+----------+--------+----------------------+------------+----------+
|cust_id|first_name|agent_id|adt_log |progress |process_dt|
+-------+----------+--------+----------------------+------------+----------+
|106674 |Charley |602 |Promotional Offer sent|Initiate |20210112 |
|106674 |Charley |602 |Click Open Promo |Initiate |20210113 |
|106674 |Charley |100 |Promo Inquiry |Initiate |20210114 |
|106674 |Charley |100 |Cust Waiting |Initiate |20210118 |
|106674 |Charley |100 |Customer Application |Initiate |20210119 |
|106674 |Charley |602 |Appl Approved |Initiate |20210122 |
|106674 |Charley |602 |Initiate Appl |Initiate |20210201 |
|106674 |Charley |602 |Sale Initiated |Initiate |20210202 |
|106674 |Charley |602 |sale Rv Pending |Review |20210203 |
|106674 |Charley |602 |sales-cust Realign |Review |20210203 |
|106674 |Charley |602 |cust in aggrement |Realign |20210204 |
|106674 |Charley |602 |Sales Dep Approve |Realign |20210208 |
|106674 |Charley |602 |mgt Approved |Approved |20210211 |
|106674 |Charley |602 |Installation pending |Installation|20210216 |
|106674 |Charley |602 |Cust Survey |Survey |20210323 |
+-------+----------+--------+----------------------+------------+----------+
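To see what the lead step contributes, the following plain-Python sketch mimics a lead over a window partitioned by cust_id and ordered by process_dt. It is not the PySpark implementation, only an illustration on the sample data. The helper `with_lead` and the tuple layout (cust_id, progress, process_dt) are assumptions made for this example.

```python
from itertools import groupby
from operator import itemgetter


def with_lead(rows):
    """Append lead_dt: the next process_dt for the same cust_id, else None."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 2))  # by cust_id, then process_dt
    for _, grp in groupby(ordered, key=itemgetter(0)):
        grp = list(grp)
        next_dts = [r[2] for r in grp[1:]] + [None]  # last stage has no successor
        out.extend(r + (nxt,) for r, nxt in zip(grp, next_dts))
    return out


# Hypothetical in-memory copy of cust_msg: (cust_id, progress, process_dt)
cust_msg_rows = [
    (106674, "Initiate", 20210202),
    (106674, "Review", 20210203),
    (106674, "Realign", 20210204),
    (106674, "Approved", 20210211),
    (106674, "Installation", 20210216),
    (106674, "Survey", 20210323),
]

intervals = with_lead(cust_msg_rows)
```

Each stage then covers audit dates from its own process_dt up to, but not including, lead_dt. Realign, for instance, gets lead_dt 20210211, which is why the audit rows dated 20210204 and 20210208 are both labelled Realign. The last stage has no lead_dt, which corresponds to the `lead_dt is null` branch of the join condition.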