Operation timed out - optimizing a BigQuery window function


【Posted】:2021-12-18 16:57:48 【Question】:

I would really appreciate any suggestions for optimizing the query below, since it times out in its current state. BikeLogsTable contains roughly 10,000 to 1 million rows that can join to each row, and as a result the query times out after 6 hours.

Note that the core goal here is to associate every row of BikesTable with its related rows in States. In this step I am trying to identify the associated rows of BikeLogsTable and retrieve a row number so I can determine the last associated entry.

Select Bikes.vehicleId, Bikes.timestamp_field_24, States.`to`, States.vehiclestatechangeid,
ROW_NUMBER() OVER (
    PARTITION BY Bikes.timestamp_field_24, Bikes.vehicleId 
    ORDER BY States.timestamp
) as rn

from `BikesTable` as Bikes
Right Join `BikeLogsTable` as States
on Bikes.vehicleId = CAST(States.vehicleid as String) 

WHERE 
DATE_SUB( PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) , INTERVAL 1 DAY) < States.timestamp AND
PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) > States.timestamp AND 
States.`to` is not null and
 PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) > "2021-11-22" and
  PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(Bikes.timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) < "2021-11-24"
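To make it easier to see what the four nested REGEXP_REPLACE calls are doing before looking at optimizations, here is a rough Python equivalent of that string normalization. The function name and the sample timestamps are invented for illustration; the substitutions mirror the SQL one-for-one (so, like the original, a negative UTC offset would not be stripped):

```python
import re

def normalize_ts(raw: str) -> str:
    """Replicate the four nested REGEXP_REPLACE calls from the query:
    strip a '+HH:MM' offset, drop fractional seconds, and turn the
    'Z' and 'T' separators into spaces, yielding a "%F %T" string."""
    s = re.sub(r"\+.*", "", raw)   # drop "+02:00"-style offsets
    s = re.sub(r"\..*", " ", s)    # drop ".123456" fractional seconds
    s = s.replace("Z", " ")        # drop a trailing "Z"
    s = s.replace("T", " ")        # date/time separator -> space
    return s.strip()

print(normalize_ts("2021-11-23T10:15:30.123456+02:00"))  # 2021-11-23 10:15:30
```

In other words, the expression just coerces an ISO-8601-ish string into something PARSE_TIMESTAMP("%F %T", ...) accepts, and it is recomputed four times per row in the WHERE clause.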

Execution details:

Since this is a one-off task, I am happy to hear suggestions for quirky workarounds as well.

【Comments】:

For one thing, you have a lot of nested regular expressions. It seems you could speed this up with better regex logic that matches several patterns at once, or by performing the parsing once in a temporary table instead of computing the same column multiple times in the select.

Unfortunately that was not the reduction needed - the operation still times out. However, I am now trying a clustered-table approach and performing the join via a partitioned Python function.

【Answer 1】:

Two suggestions:

    Rewrite your query so that the largest table is on the left side - in this case BikeLogsTable.

    Do as much casting and filtering as early as possible, including before the join.

Rewriting your query slightly...

with
Bikes as (
    select 
        *,
        -- handle your casting/parsing here instead of multiple times in your where clause 
        DATE_SUB( PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) , INTERVAL 1 DAY) as ts1, 
        PARSE_TIMESTAMP("%F %T", REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(timestamp_field_24, "\\+.*",""), "\\..*", " "), "Z", " "), "T", " ")) as ts2
    from `BikesTable`
),
States as (
    select 
        * except(vehicleid), 
        -- let's cast here instead of in the join clause
        cast(vehicleid as string) as vehicleId 
    from `BikeLogsTable` 
    where `to` is not null -- filter this out early too!
),
joined as (

    select
        Bikes.vehicleId, Bikes.timestamp_field_24, States.`to`, States.vehiclestatechangeid, States.timestamp as states_ts
    from States
    left join Bikes using(vehicleId) -- maybe inner join depending on your actual data
    where Bikes.ts1 < States.timestamp
      and Bikes.ts2 > States.timestamp
      and Bikes.ts2 > '2021-11-22'
      and Bikes.ts2 < '2021-11-24'
)
select 
    *,
    row_number() over(partition by timestamp_field_24, vehicleId order by states_ts) as rn -- do ordering last!
from joined

This may need a few additional edits to get exactly the result you want, but the general idea should speed things up!
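As a side note on the final step: the window function is just a per-partition ranking, so it is cheap once the join output is small. A minimal Python sketch of what ROW_NUMBER() OVER (PARTITION BY timestamp_field_24, vehicleId ORDER BY states_ts) computes, using made-up toy rows:

```python
from itertools import groupby
from operator import itemgetter

# Toy rows: (vehicleId, timestamp_field_24, states_ts) -- invented sample data
rows = [
    ("bike-1", "2021-11-23 10:00:00", "2021-11-23 08:00:00"),
    ("bike-1", "2021-11-23 10:00:00", "2021-11-22 09:00:00"),
    ("bike-2", "2021-11-23 11:00:00", "2021-11-22 12:00:00"),
]

part = itemgetter(1, 0)  # PARTITION BY timestamp_field_24, vehicleId
numbered = []
for _, group in groupby(sorted(rows, key=part), key=part):
    # ORDER BY states_ts within each partition, then number from 1
    for rn, row in enumerate(sorted(group, key=itemgetter(2)), start=1):
        numbered.append(row + (rn,))

for row in numbered:
    print(row)
```

The row with the highest rn in each partition is the "last associated entry" the question is after, which you could then select with a `WHERE rn = ...` style filter or by ordering descending and taking rn = 1.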

【Comments】:

That is a great suggestion, thank you. In the end, though, I went with a table clustered by date and BikeId.
