Big Query: Join single latest row from second table

【Posted】: 2020-08-10 20:51:53 【Question】:

I have two tables. One is a list of Orders, the other is a list of Events.

For every Order, I'd like to join the last Event (using clicked_at) that happened before the Order's created_at.

I've tried a number of ways to get this working, including several other answers on Stack Overflow, but I'm struggling to get the right data returned.

The pseudo logic for the subquery I have in mind is something like this:

-- Pseudocode: "order" stands for the current row of Orders
SELECT campaign, keyword, user_id, clicked_at
FROM `Events`
WHERE order.user_id = user_id AND clicked_at < order.created_at
ORDER BY clicked_at DESC
LIMIT 1

Please see the example data below:

# Orders

| order_id | user_id | created_at |
-----------------------------------
| 123      | abc     | 2020-07-04 |
| 456      | abc     | 2020-05-01 |


# Events

| campaign | keyword  | user_id | clicked_at |
----------------------------------------------
| facebook | shoes    | abc     | 2020-07-03 |
| google   | hair     | abc     | 2020-07-01 |

My desired result:

# Orders with campaign attribution

| order_id | user_id | created_at | campaign | keyword  |
---------------------------------------------------------
| 123      | abc     | 2020-07-04 | facebook | shoes    |
| 456      | abc     | 2020-05-01 | null     | null     |

Thanks! Alex


【Answer 1】:
-- Sample data from the question, inlined so the query runs on its own
with orders as (
  select 123 as order_id, 'abc' as user_id, date '2020-07-04' as created_at union all
  select 456, 'abc', date '2020-05-01'
),
events as (
  select 'facebook' as campaign, 'shoes' as keyword, 'abc' as user_id, date '2020-07-03' as clicked_at union all
  select 'google', 'hair', 'abc', date '2020-07-01'
),
-- Pair each order with every earlier event of the same user and rank
-- those events from newest to oldest within the order
logic as (
  select
    orders.order_id,
    orders.user_id,
    orders.created_at,
    events.clicked_at,
    events.campaign,
    events.keyword,
    row_number() over (partition by orders.order_id order by events.clicked_at desc) as rn
  from orders
  left join events
  on orders.user_id = events.user_id and events.clicked_at < orders.created_at
)
-- rn = 1 is the latest earlier event; an order with no earlier event
-- survives the left join as a single row with null event columns
select * except(rn)
from logic
where rn = 1
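
If you don't have a BigQuery project handy, the join-then-rank logic above can be sanity-checked locally. The following is only a sketch using Python's built-in sqlite3 module, not BigQuery itself: SQLite has no `select * except(...)`, so the columns are listed explicitly, and dates are stored as ISO strings (which compare correctly as text). It assumes a SQLite build with window-function support (3.25 or newer). The names `ORDERS`, `EVENTS`, `QUERY` and `latest_event_per_order` are made up for this illustration.

```python
import sqlite3

# Sample rows copied from the question. Dates are ISO-8601 strings,
# so plain text comparison orders them chronologically.
ORDERS = [(123, "abc", "2020-07-04"), (456, "abc", "2020-05-01")]
EVENTS = [
    ("facebook", "shoes", "abc", "2020-07-03"),
    ("google", "hair", "abc", "2020-07-01"),
]

# Same idea as the query above: left join each order to all earlier
# events of the same user, rank them newest first, keep rank 1.
QUERY = """
WITH logic AS (
  SELECT
    o.order_id, o.user_id, o.created_at, e.campaign, e.keyword,
    ROW_NUMBER() OVER (
      PARTITION BY o.order_id ORDER BY e.clicked_at DESC
    ) AS rn
  FROM orders AS o
  LEFT JOIN events AS e
    ON o.user_id = e.user_id AND e.clicked_at < o.created_at
)
SELECT order_id, user_id, created_at, campaign, keyword
FROM logic
WHERE rn = 1
ORDER BY order_id
"""


def latest_event_per_order():
    """Load the sample data into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, user_id TEXT, created_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE events "
        "(campaign TEXT, keyword TEXT, user_id TEXT, clicked_at TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", EVENTS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_event_per_order():
        print(row)
```

Order 456 was created before either click, so it comes back with `None` in the campaign and keyword columns, matching the null row in the desired result.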


【Answer 2】:

Below is for BigQuery Standard SQL

#standardSQL
SELECT a.*, campaign, keyword
FROM  `project.dataset.orders` a
LEFT JOIN (
  SELECT  
    ANY_VALUE(o).*, 
    ARRAY_AGG(STRUCT(campaign, keyword) ORDER BY clicked_at DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.orders` o
  JOIN `project.dataset.events` e
  ON o.user_id = e.user_id
  AND clicked_at < created_at
  GROUP BY FORMAT('%t', o)
)
USING(order_id)   

If applied to the sample data from your question, the result is:

Row order_id    user_id created_at  campaign    keyword  
1   123         abc     2020-07-04  facebook    shoes    
2   456         abc     2020-05-01  null        null     
