Big Query: Join single latest row from second table
【Title】Big Query: Join single latest row from second table 【Posted】2020-08-10 20:51:53 【Question】: I have two tables. One is a list of Orders and the other is a list of Events.
For each Order, I want to join the last Event that occurred (using clicked_at) before the Order's created_at.
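I have tried multiple ways to get this working, including several other answers on Stack Overflow, but I am struggling to get the correct data back.
The pseudo logic I have in mind for the subquery is something like this: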
SELECT campaign, user_id, clicked_at
FROM `Events`
WHERE order.user_id = user_id AND clicked_at < order.created_at
ORDER BY clicked_at DESC
LIMIT 1
Please see the sample data below:
# Orders
| order_id | user_id | created_at |
-----------------------------------
| 123 | abc | 2020-07-04 |
| 456 | abc | 2020-05-01 |
# Events
| campaign | keyword | user_id | clicked_at |
----------------------------------------------
| facebook | shoes | abc | 2020-07-03 |
| google | hair | abc | 2020-07-01 |
Desired result
# Orders with campaign attribution
| order_id | user_id | created_at | campaign | keyword |
---------------------------------------------------------
| 123 | abc | 2020-07-04 | facebook | shoes |
| 456 | abc | 2020-05-01 | null | null |
Thanks! Alex
【Comments】:
【Answer 1】:
with orders as (
select 123 as order_id, 'abc' as user_id, cast('2020-07-04' as date) as created_at union all
select 456, 'abc', '2020-05-01'
),
events as (
select 'facebook' as campaign, 'shoes' as keyword, 'abc' as user_id, cast('2020-07-03' as date) as clicked_at union all
select 'google', 'hair', 'abc', '2020-07-01'
),
logic as (
select
orders.order_id,
orders.user_id,
orders.created_at,
events.clicked_at,
events.campaign,
events.keyword,
row_number() over (partition by orders.order_id order by events.clicked_at desc) as rn
from orders
left join events
on orders.user_id = events.user_id and events.clicked_at < orders.created_at
)
select * except(rn)
from logic
where rn = 1
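For anyone who wants to check the ROW_NUMBER logic above without a BigQuery project, here is a minimal sketch that runs the same idea locally through Python's built-in sqlite3 module. Some assumptions to be aware of: this is SQLite, not BigQuery, so `select * except(rn)` is replaced by an explicit column list; dates are stored as ISO-8601 text, which compares correctly as strings; and window functions need SQLite 3.25 or newer. The table names and sample rows are taken from the question.

```python
import sqlite3

# In-memory database loaded with the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, user_id TEXT, created_at TEXT);
CREATE TABLE events (campaign TEXT, keyword TEXT, user_id TEXT, clicked_at TEXT);
INSERT INTO orders VALUES (123, 'abc', '2020-07-04'),
                          (456, 'abc', '2020-05-01');
INSERT INTO events VALUES ('facebook', 'shoes', 'abc', '2020-07-03'),
                          ('google',   'hair',  'abc', '2020-07-01');
""")

# Same shape as the answer: LEFT JOIN every earlier event to each order,
# number the candidates per order from newest to oldest, keep number 1.
QUERY = """
WITH logic AS (
  SELECT o.order_id, o.user_id, o.created_at, e.campaign, e.keyword,
         ROW_NUMBER() OVER (
           PARTITION BY o.order_id
           ORDER BY e.clicked_at DESC
         ) AS rn
  FROM orders AS o
  LEFT JOIN events AS e
    ON o.user_id = e.user_id
   AND e.clicked_at < o.created_at
)
SELECT order_id, user_id, created_at, campaign, keyword
FROM logic
WHERE rn = 1
ORDER BY order_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Because the join is a LEFT JOIN, an order with no earlier event still produces one row (with NULL event columns), and that row gets rn = 1, so order 456 is kept rather than dropped.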
【Comments】:
【Answer 2】: Below is for BigQuery Standard SQL
#standardSQL
SELECT a.*, campaign, keyword
FROM `project.dataset.orders` a
LEFT JOIN (
SELECT
ANY_VALUE(o).*,
ARRAY_AGG(STRUCT(campaign, keyword) ORDER BY clicked_at DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.orders` o
JOIN `project.dataset.events` e
ON o.user_id = e.user_id
AND clicked_at < created_at
GROUP BY FORMAT('%t', o)
)
USING(order_id)
If applied to the sample data from the question, the result is
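| Row | order_id | user_id | created_at | campaign | keyword |
---------------------------------------------------------------
| 1 | 123 | abc | 2020-07-04 | facebook | shoes |
| 2 | 456 | abc | 2020-05-01 | null | null |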
【Comments】: