In BigQuery, "where" clause with array-column of null values causing issue

Posted: 2021-08-02 15:43:26

Question:

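This is a follow-up to "BigQuery - Compute 0 - 100 percentiles for multiple columns, over multiple groups", posted last year. That question was about computing the 0 - 100 percentiles for multiple columns in a table. Below is a reproducible example. The post looks long, but it is mostly the reproducible example plus output screenshots, included to help with the problem: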

with
    raw_data as (
        select 24997 as competitionId, 0.9167 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7778 as ft2Pct, 0.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8125 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.5625 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.6842 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7317 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8333 as ft2Pct, 0.5 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, 1.0 as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.6944 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.9091 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.6667 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8261 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8108 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7895 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7727 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8333 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.6923 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.9268 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.7660 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, 0.8333 as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8636 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8036 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.9000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
        select 24997 as competitionId, 0.8108 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct
    ),


  -- A) Positive Percentiles
  -- A1) compute quantiles: will be saved in messy arrays
  positive_pctile_arrays as (
    select
      competitionId
      ,approx_quantiles(ft2Pct, 10) as ft2Pct
      ,approx_quantiles(ft3Pct, 10) as ft3Pct
      ,approx_quantiles(ftTechPct, 10) as ftTechPct
      ,approx_quantiles(ftFlagPct, 10) as ftFlagPct
    from raw_data
    group by 1
  ),

  -- A2) and unnest arrays
  positive_pctiles as (
    select
      competitionId
      ,pctile
      ,ft2Pct
      ,ft3Pct
      ,ftTechPct
      ,ftFlagPct
    from positive_pctile_arrays as a
      ,a.ft2Pct with offset as pctile
      ,a.ft3Pct with offset as ft3PctPctile
      ,a.ftTechPct with offset as ftTechPctPctile
      ,a.ftFlagPct with offset as ftFlagPctPctile
    where
      pctile = ft3PctPctile  and 
      pctile = ftTechPctPctile  and 
      pctile = ftFlagPctPctile
  )

-- select * from raw_data
select * from positive_pctile_arrays
-- select * from positive_pctiles

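A few comments:

We group by competitionId because our full data has more than one competitionId, even though the example has only one.

We want to compute the 0 - 100 percentiles for these values, but in this example we use approx_quantiles(., 10) instead of approx_quantiles(., 100) for brevity.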
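As a small aside on array sizes, approx_quantiles(x, n) returns n + 1 boundaries, so the arrays in this example hold 11 elements at offsets 0 through 10. A minimal sketch, using a hypothetical generated column x rather than the question's data:

```sql
-- x is a hypothetical column of the integers 1..1000, used only for illustration.
select
  array_length(approx_quantiles(x, 10)) as n_elements      -- 11 boundaries for 10 buckets
  ,approx_quantiles(x, 10)[offset(0)] as approx_min        -- first boundary
  ,approx_quantiles(x, 10)[offset(10)] as approx_max       -- last boundary
from unnest(generate_array(1, 1000)) as x
```

This is why the offsets produced by "with offset" in A2 can be read directly as the pctile index.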

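In our data, every value of ftFlagPct is null. As a result, in A1 (positive_pctile_arrays) the ftFlagPct column is blank.

So when we try to unnest these arrays in A2, it looks like the where clause filters out all of the rows. If you uncomment select * from positive_pctiles, the final output table is empty.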
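One possible reading of this behaviour is that the rows are lost in the join itself rather than in the where clause: a comma join against an array is a cross join, and unnesting a NULL (or empty) array yields zero rows. The sketch below uses a hypothetical two-row table named demo (not part of the question's data) to illustrate that, with no where clause at all:

```sql
-- Hypothetical table: id 1 has a populated array, id 2 has a NULL array.
with demo as (
  select 1 as id, [0.1, 0.5, 0.9] as vals union all
  select 2 as id, cast(null as array<float64>) as vals
)
select
  id
  ,pos
  ,val
from demo as d
  ,d.vals as val with offset as pos   -- comma = cross join against the array
-- There is no where clause, yet id = 2 never appears in the output:
-- its NULL array unnests to zero rows, so the cross join drops the source row.
order by id, pos
```

If that reading is right, a single NULL array in A2 is enough to empty the whole result, whatever the where clause says.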

If we comment out ftFlagPct in A1 and A2, we mostly get the unnested table we want (shown as a screenshot in the original post).

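The output we want is that table with an additional ftFlagPct column containing all nulls. It seems we need the query to detect whether the ftFlagPct array column in positive_pctile_arrays is null/empty, and then somehow handle it with a left join?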
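The left join idea could be sketched roughly as follows. This is a hedged sketch, not a tested fix for the full query: the CTE below is a shortened stand-in for positive_pctile_arrays with made-up values, and the aliases ft2PctValue and ftFlagPctValue are hypothetical names introduced here to keep array columns and their elements apart.

```sql
-- Stand-in for positive_pctile_arrays: one populated array and one NULL array.
with positive_pctile_arrays as (
  select
    24997 as competitionId
    ,[0.56, 0.75, 0.93] as ft2Pct
    ,cast(null as array<float64>) as ftFlagPct
)
select
  a.competitionId
  ,pctile
  ,ft2PctValue as ft2Pct
  ,ftFlagPctValue as ftFlagPct      -- stays NULL when the array has no elements
from positive_pctile_arrays as a
cross join unnest(a.ft2Pct) as ft2PctValue with offset as pctile
-- left join keeps every pctile row even if the second array unnests to nothing
left join unnest(a.ftFlagPct) as ftFlagPctValue with offset as ftFlagPctPctile
  on pctile = ftFlagPctPctile
order by pctile
```

The offset equality moves from the where clause into the on clause, so a missing match leaves a NULL instead of removing the row. This assumes at least one column (ft2Pct here) is always populated and can drive the pctile index.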

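Edit: we are working on a solution where we identify empty arrays and replace them with an array of dummy values (e.g. all 999999), and then replace 999999 with null in the final output. If we get this working, we will post an answer.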


Answer 1:

So, replacing

,approx_quantiles(ftFlagPct, 10) as ftFlagPct

with

,case
    when array_length(approx_quantiles(ftFlagPct, 10)) is null then generate_array(999990, 1000000, 1)
    else approx_quantiles(ftFlagPct, 10)
end as ftFlagPct

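...works insofar as the final output table is preserved (it is not filtered down to 0 rows), with the 11 values 999990, 999991, ..., 1000000 in the ftFlagPct column. We do not like this solution at all, but it gives us something to work with, and we can now easily replace these values with nulls. Very open to a cleaner answer!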

We can then simply add ,case when ftFlagPct > 999989 then null else ftFlagPct end as ftFlagPct to the select statement of the last query...

Edit: this raises an error for type == float, and our data contains both floats and integers, so we are still working on it.
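A plausible cause of the float error is that generate_array with integer arguments builds an ARRAY<INT64>, while approx_quantiles over a float column returns an ARRAY<FLOAT64>, and the two case branches then have no common type. One way to sidestep sentinel values entirely, sketched here under assumptions, is to generate the pctile index separately and look each array up by position with safe_offset. The CTE is a shortened stand-in for positive_pctile_arrays with made-up values; with approx_quantiles(., 10) the index would run over generate_array(0, 10).

```sql
-- Stand-in for positive_pctile_arrays: one populated array and one NULL array.
with positive_pctile_arrays as (
  select
    24997 as competitionId
    ,[0.56, 0.75, 0.93] as ft2Pct
    ,cast(null as array<float64>) as ftFlagPct
)
select
  a.competitionId
  ,pctile
  -- safe_offset returns NULL instead of raising an error for an out-of-range index
  ,a.ft2Pct[safe_offset(pctile)] as ft2Pct
  -- for a NULL array the lookup is assumed to evaluate to NULL as well
  ,a.ftFlagPct[safe_offset(pctile)] as ftFlagPct
from positive_pctile_arrays as a
cross join unnest(generate_array(0, 2)) as pctile   -- index 0..2 for three elements
order by pctile
```

Because no dummy array is substituted, the float and integer columns keep their own types and no later pass to turn sentinels back into nulls is needed.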
