Operation timed out - optimizing a BigQuery window function
[Posted]: 2021-12-18 16:57:48

[Question]: I would greatly appreciate any suggestions for optimizing the query below, since it times out in its current state. BikeLogsTable contains roughly 10,000 to 1 million rows that can join against each row, so the query times out after 6 hours.

Note that the core goal here is to associate each row of BikesTable with the related rows in States. In this step I am trying to identify the associated rows of BikeLogsTable and retrieve a row number so I can determine the last associated entry.
SELECT
  Bikes.vehicleId,
  Bikes.timestamp_field_24,
  States.to,
  States.vehiclestatechangeid,
  ROW_NUMBER() OVER (
    PARTITION BY Bikes.timestamp_field_24, Bikes.vehicleId
    ORDER BY States.timestamp
  ) AS rn
FROM `BikesTable` AS Bikes
RIGHT JOIN `BikeLogsTable` AS States
  ON Bikes.vehicleId = CAST(States.vehicleid AS STRING)
WHERE
  DATE_SUB(PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")), INTERVAL 1 DAY) < States.timestamp
  AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) > States.timestamp
  AND States.to IS NOT NULL
  AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) > "2021-11-22"
  AND PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) < "2021-11-24"
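The chain of four nested REGEXP_REPLACE calls can be replayed outside BigQuery to see that it collapses into a single substitution. A minimal Python sketch, assuming timestamp_field_24 holds ISO-8601 strings such as "2021-11-22T10:15:30.123+01:00" (an assumption; the post never shows a sample value):

```python
import re
from datetime import datetime

def normalize_ts(raw: str) -> datetime:
    """Mimics the query's four chained REGEXP_REPLACE calls with one
    regex: drop the fractional seconds and UTC offset, turn the 'T'
    and 'Z' separators into spaces, then parse the result."""
    cleaned = re.sub(r"\..*|\+.*|[TZ]", " ", raw).strip()
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")

print(normalize_ts("2021-11-22T10:15:30.123+01:00"))  # 2021-11-22 10:15:30
print(normalize_ts("2021-11-23T08:00:00Z"))
```

Translated back to BigQuery, the same idea reduces the four nested REGEXP_REPLACE calls to one, and computing it once per row (as the answer below does) avoids evaluating the same expression five times in the WHERE clause.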
Execution details: (screenshot omitted)

Since this is a one-off task, I am also happy to receive suggestions for quirky workarounds.
[Comments]:

For one thing, you have a lot of nested regular expressions. It seems you could speed this up with better regex logic that matches several patterns at once, or do the parsing once in a temporary table instead of computing the same column multiple times in the SELECT.

Unfortunately that is not the reduction needed; the operation still times out. However, I am now trying a clustered-table approach and performing the join via a partitioned Python function.

[Answer 1]: Two suggestions:
1. Rewrite your query so that the largest table, in this case BikeLogsTable, is on the left.
2. Do as much transformation and filtering as early as possible, including before the join.
Rewriting your query a bit...
with
Bikes as (
  select
    *,
    -- handle the casting/parsing here instead of multiple times in the where clause
    DATE_SUB(PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")), INTERVAL 1 DAY) as ts1,
    PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*", ""), "\\..*", " "), "Z", " "), "T", " ")) as ts2
  from `BikesTable`
),
States as (
  select
    * except(vehicleid),
    -- cast here instead of in the join clause
    cast(vehicleid as string) as vehicleId
  from `BikeLogsTable`
  where to is not null  -- filter this out early too!
),
joined as (
  select
    Bikes.vehicleId, Bikes.timestamp_field_24, States.to, States.vehiclestatechangeid, States.timestamp as states_ts
  from States
  left join Bikes using(vehicleId)  -- maybe inner join depending on your actual data
  where Bikes.ts1 < States.timestamp
    and Bikes.ts2 > States.timestamp
    and Bikes.ts2 > '2021-11-22'
    and Bikes.ts2 < '2021-11-24'
)
select
  *,
  row_number() over(partition by timestamp_field_24, vehicleId order by states_ts) as rn  -- do ordering last!
from joined
This may need a few additional edits to get exactly the result you want, but the general idea should speed things up!
[Discussion]:

That's a good suggestion, thank you. However, I ended up using a table clustered by date and BikeId.
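The clustered-table / partitioned-Python workaround the asker settled on can be approximated outside BigQuery: group the (much larger) log table by vehicle, sort each group by timestamp, and binary-search for the last log entry inside each bike row's one-day window. A minimal sketch with hypothetical in-memory stand-ins for the two tables (the column values are invented for illustration):

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical stand-ins for BikeLogsTable and BikesTable rows.
logs = [
    ("42", datetime(2021, 11, 22, 9, 0), "RESERVED"),
    ("42", datetime(2021, 11, 22, 18, 30), "RIDING"),
    ("42", datetime(2021, 11, 23, 7, 45), "PARKED"),
]
bikes = [("42", datetime(2021, 11, 23, 8, 0))]

# Index logs per vehicle, sorted by timestamp (the "clustering" step).
by_vehicle = defaultdict(list)
for vehicle_id, ts, state in sorted(logs, key=lambda r: r[1]):
    by_vehicle[vehicle_id].append((ts, state))

def last_state_in_window(vehicle_id, bike_ts):
    """Last log entry with bike_ts - 1 day < log_ts < bike_ts,
    i.e. the row the original ROW_NUMBER() query was after."""
    entries = by_vehicle.get(vehicle_id, [])
    i = bisect_left(entries, (bike_ts,)) - 1  # last entry strictly before bike_ts
    if i >= 0 and entries[i][0] > bike_ts - timedelta(days=1):
        return entries[i]
    return None

for vehicle_id, bike_ts in bikes:
    print(vehicle_id, last_state_in_window(vehicle_id, bike_ts))
```

Each bike row then costs O(log n) in its vehicle's log partition instead of being compared against every log row, which is the same effect clustering by date and BikeId has on the BigQuery side.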