Generate leading values in hive table

Posted: 2021-03-25 00:41:29

Question:

I have two tables, cust_msg and cust_audit, shown below.

cust_msg
+-----------+----------+------------+----------+                                        
|cust_id    |first_name|progress    |process_dt|
+-----------+----------+------------+----------+
|106674     |Charley   |Initiate    |20210202  |
|106674     |Charley   |Review      |20210203  |
|106674     |Charley   |Realign     |20210204  |
|106674     |Charley   |Approved    |20210211  |
|106674     |Charley   |Installation|20210216  |
|106674     |Charley   |Survey      |20210323  |
+-----------+----------+------------+----------+


cust_audit
+-----------+----------+----------------------------+----------+                                        
|cust_id    |agent_id  |adt_log                     |process_dt|
+-----------+----------+----------------------------+----------+
|106674     |602       |Promotional Offer sent      |20210112  |
|106674     |602       |Click Open Promo            |20210113  |
|106674     |100       |Promo Inquiry               |20210114  |
|106674     |100       |Cust Waiting                |20210118  |
|106674     |100       |Customer Application        |20210119  |
|106674     |602       |Appl Approved               |20210122  |
|106674     |602       |Initiate Appl               |20210201  |
|106674     |602       |Sale Initiated              |20210202  |
|106674     |602       |sale Rv Pending             |20210203  |
|106674     |602       |sales-cust Realign          |20210203  |
|106674     |602       |cust in aggrement           |20210204  |
|106674     |602       |Sales Dep Approve           |20210208  |
|106674     |602       |mgt Approved                |20210211  |
|106674     |602       |Installation pending        |20210216  |
|106674     |602       |Cust Survey                 |20210323  |
+-----------+----------+----------------------------+----------+

I need to build another table, cust_del_detail, as shown below. progress should be Initiate up to and including process_dt = 20210202, because in table cust_msg the progress Initiate falls on 20210202. progress should then be Review on 20210203, Realign from 20210204 through 20210208, and likewise Approved on 20210211, Installation on 20210216, and Survey on 20210323.

cust_del_detail
+-----------+--------------+----------+---------------------------+---------------+----------+                                        
|cust_id    |first_name    |agent_id  |adt_log                    |progress       |process_dt|
+-----------+--------------+----------+---------------------------+---------------+----------+
|106674     |Charley       |602       |Promotional Offer sent     |Initiate       |20210112  |
|106674     |Charley       |602       |Click Open Promo           |Initiate       |20210113  |
|106674     |Charley       |100       |Promo Inquiry              |Initiate       |20210114  |
|106674     |Charley       |100       |Cust Waiting               |Initiate       |20210118  |
|106674     |Charley       |100       |Customer Application       |Initiate       |20210119  |
|106674     |Charley       |602       |Appl Approved              |Initiate       |20210122  |
|106674     |Charley       |602       |Initiate Appl              |Initiate       |20210201  |
|106674     |Charley       |602       |Sale Initiated             |Initiate       |20210202  |
|106674     |Charley       |602       |sale Rv Pending            |Review         |20210203  |
|106674     |Charley       |602       |sales-cust Realign         |Review         |20210203  |
|106674     |Charley       |602       |cust in aggrement          |Realign        |20210204  |
|106674     |Charley       |602       |Sales Dep Approve          |Realign        |20210208  |
|106674     |Charley       |602       |mgt Approved               |Approved       |20210211  |
|106674     |Charley       |602       |Installation pending       |Installation   |20210216  |
|106674     |Charley       |602       |Cust Survey                |Survey         |20210323  |
+-----------+--------------+----------+---------------------------+---------------+----------+

I tried using the lead window function in Hive but could not get it to work. What is the best way to achieve this in Hive or PySpark?

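Comments:

The logic is a bit confusing. Why does Realign run from 20210204 to 20210208? I'm trying to understand why a given progress should be tied to an adt_log. You could say it is based on the date, but how? Can you provide more detail?

If it helps, maybe you could join the two tables and add a join condition such as a between on audit_log.process_dt.

Realign is joined to process_dt 20210204 through 20210208 because, in table cust_msg, Realign starts on 20210204 and lasts until 20210211.

Yes, I need to join tables cust_audit and cust_msg, with the first_name and progress fields in the cust_audit data. The progress in the result table should be based on process_dt in cust_audit and cust_msg. In table cust_msg, progress Initiate appears on 20210202, so up to process_dt 20210202 in cust_audit I want to show Initiate as the progress.

Answer 1: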

You can use lead and a join:

from pyspark.sql import functions as F, Window

# lead_dt = process_dt of the customer's next progress stage (null for the last stage)
cust_del_detail = cust_msg.withColumn(
    'lead_dt', 
    F.lead('process_dt').over(Window.partitionBy('cust_id').orderBy('process_dt'))
).alias('cust_msg').join(
    cust_audit.alias('cust_audit'), 
    F.expr('''
        (progress != "Initiate" 
         and (cust_audit.process_dt < cust_msg.lead_dt or cust_msg.lead_dt is null) 
         and (cust_audit.process_dt >= cust_msg.process_dt)
        ) 
        or 
        (progress = "Initiate" 
         and (cust_audit.process_dt <= cust_msg.process_dt)
        )
    '''),
    'right'  # right join: keep every cust_audit row
).selectExpr(
    'cust_msg.cust_id', 
    'cust_msg.first_name', 
    'cust_audit.agent_id', 
    'cust_audit.adt_log', 
    'cust_msg.progress', 
    'cust_audit.process_dt'
)
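
To make the lead step easier to follow, here is a minimal plain-Python sketch (no Spark needed) of what lead('process_dt') over a window partitioned by cust_id and ordered by process_dt computes for the sample cust_msg rows. The helper name add_lead_dt and the tuple layout are assumptions made for illustration, not part of the answer's code.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from cust_msg: (cust_id, progress, process_dt)
cust_msg_rows = [
    (106674, "Initiate", 20210202),
    (106674, "Review", 20210203),
    (106674, "Realign", 20210204),
    (106674, "Approved", 20210211),
    (106674, "Installation", 20210216),
    (106674, "Survey", 20210323),
]


def add_lead_dt(rows):
    """Hypothetical helper: append the next process_dt within each cust_id (None for the last row)."""
    out = []
    # Sort by the partition key (cust_id), then by the ordering key (process_dt).
    ordered = sorted(rows, key=itemgetter(0, 2))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        next_dates = [r[2] for r in group[1:]] + [None]
        out.extend(r + (nxt,) for r, nxt in zip(group, next_dates))
    return out


with_lead = add_lead_dt(cust_msg_rows)
for row in with_lead:
    print(row)
```

Each row then covers the half-open date range from its own process_dt up to, but not including, lead_dt, which is the range the join condition tests for every stage other than Initiate.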

Result:

cust_del_detail.show(truncate=False)
+-------+----------+--------+----------------------+------------+----------+
|cust_id|first_name|agent_id|adt_log               |progress    |process_dt|
+-------+----------+--------+----------------------+------------+----------+
|106674 |Charley   |602     |Promotional Offer sent|Initiate    |20210112  |
|106674 |Charley   |602     |Click Open Promo      |Initiate    |20210113  |
|106674 |Charley   |100     |Promo Inquiry         |Initiate    |20210114  |
|106674 |Charley   |100     |Cust Waiting          |Initiate    |20210118  |
|106674 |Charley   |100     |Customer Application  |Initiate    |20210119  |
|106674 |Charley   |602     |Appl Approved         |Initiate    |20210122  |
|106674 |Charley   |602     |Initiate Appl         |Initiate    |20210201  |
|106674 |Charley   |602     |Sale Initiated        |Initiate    |20210202  |
|106674 |Charley   |602     |sale Rv Pending       |Review      |20210203  |
|106674 |Charley   |602     |sales-cust Realign    |Review      |20210203  |
|106674 |Charley   |602     |cust in aggrement     |Realign     |20210204  |
|106674 |Charley   |602     |Sales Dep Approve     |Realign     |20210208  |
|106674 |Charley   |602     |mgt Approved          |Approved    |20210211  |
|106674 |Charley   |602     |Installation pending  |Installation|20210216  |
|106674 |Charley   |602     |Cust Survey           |Survey      |20210323  |
+-------+----------+--------+----------------------+------------+----------+
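
The rule behind this result can also be restated without Spark: each audit date takes the latest progress whose start date is on or before it, and dates earlier than the first stage fall back to that first stage. The sketch below is only a plain-Python restatement for checking the expected mapping on the sample data, under the assumption that the earliest stage (Initiate here) should absorb all earlier audit rows. The name progress_for is hypothetical.

```python
from bisect import bisect_right

# (start date, progress) pairs from cust_msg, sorted by start date
stages = [
    (20210202, "Initiate"),
    (20210203, "Review"),
    (20210204, "Realign"),
    (20210211, "Approved"),
    (20210216, "Installation"),
    (20210323, "Survey"),
]

# process_dt values from cust_audit, in table order
audit_dates = [
    20210112, 20210113, 20210114, 20210118, 20210119, 20210122,
    20210201, 20210202, 20210203, 20210203, 20210204, 20210208,
    20210211, 20210216, 20210323,
]


def progress_for(audit_dt, stages):
    """Hypothetical helper: latest stage starting on or before audit_dt, else the first stage."""
    starts = [start for start, _ in stages]
    idx = bisect_right(starts, audit_dt) - 1
    # idx is -1 when audit_dt precedes every stage; clamp to the first stage.
    return stages[max(idx, 0)][1]


assigned = [progress_for(dt, stages) for dt in audit_dates]
for dt, progress in zip(audit_dates, assigned):
    print(dt, progress)
```

Unlike the join condition in the answer, this version does not hard-code the name of the first stage; it simply treats whichever stage starts earliest as the fallback.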
