BigQuery Count Two Date Ranges Overlap

Posted: 2020-04-08 05:23:14

【Question】:

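I ended up solving this by downloading all the data and looping over it in Python, but I'd like to know whether there is a way to do it in BigQuery.

We have a table with begin and end dates: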

begin_date, end_date
'2016-02-19', '2016-02-19'
'2016-02-20', '2016-02-25'
'2016-02-21', '2016-02-25'
'2016-02-22', NULL

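For a given date, we want the number of rows where begin_date is on or before that date and end_date is on or after it (or is NULL), for example: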

SELECT COUNT(*) FROM `table` WHERE begin_date <= '2016-12-19' AND (end_date >= '2016-12-19' OR end_date IS NULL)

So if I did this by hand for every value of interest, the desired output would look like this:

begin_date, count
2016-02-19, 1
2016-02-20, 1
2016-02-21, 2
2016-02-22, 3
2016-02-23, 3
2016-02-24, 3
2016-02-25, 3
2016-02-26, 1
etc.

Creating the list of dates to iterate over is easy:

WITH dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example)

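Now I'm struggling to apply the WHERE clause above across all of those dates. I can see how partitioning over a range works when matching a single column (as in a linked example), but I need to match on both begin_date and end_date.

I thought I could do something like this: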

SELECT
  status_begin_date,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE (e >= status_begin_date OR e IS NULL)) AS cnt
FROM (
  SELECT
    status_begin_date,
    ARRAY_AGG(status_end_date) OVER(ORDER BY status_begin_date) AS ends
  FROM `table`
)
ORDER BY status_begin_date

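This is taken from another answer. It works on the small example given in that answer, but I get a resource error when I run it on a table with a few hundred million rows. Is there a scalable solution in BigQuery?

【Comments】: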

Could you try this code? Let me know if this is what you're looking for.

SELECT begin_date, COUNT(*)
FROM `table` CROSS JOIN dates
WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL)
GROUP BY begin_date
ORDER BY begin_date

@rmesteves Thanks, but this doesn't give the same result. I'm not sure what the difference is. Sometimes the values are higher than expected, sometimes lower.

@rmesteves You can see the difference in the results:

WITH data AS (
  SELECT DATE('2016-02-19') AS begin_date, DATE('2016-02-19') AS end_date UNION ALL
  SELECT '2016-02-20', '2016-02-25' UNION ALL
  SELECT '2016-02-21', '2016-02-25' UNION ALL
  SELECT '2016-02-22', NULL
), dates AS (
  SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2016-02-19', '2016-02-26', INTERVAL 1 DAY)) AS example
)
SELECT begin_date, COUNT(*)
FROM data CROSS JOIN dates
WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL)
GROUP BY begin_date
ORDER BY begin_date

Could you try my code with this small change? It might improve the results:

SELECT example, COUNT(*)
FROM `table` CROSS JOIN dates
WHERE begin_date <= example AND (end_date >= example OR end_date IS NULL)
GROUP BY example
ORDER BY example

【Answer 1】:

Below is for BigQuery Standard SQL. It avoids the inefficient cursor-style approach and uses a classic set-based SQL approach instead:

#standardSQL
WITH dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, CURRENT_DATE(), INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day

You can test it with the sample data from your question, as in the example below:

#standardSQL
WITH `table` AS (
  SELECT DATE '2016-02-19' begin_date, DATE '2016-02-19' end_date UNION ALL
  SELECT '2016-02-20', '2016-02-25' UNION ALL
  SELECT '2016-02-21', '2016-02-25' UNION ALL
  SELECT '2016-02-22', NULL
), dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, max_date, INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day
-- ORDER BY day  

Result:

Row day         f0_  
1   2016-02-19  1    
2   2016-02-20  1    
3   2016-02-21  2    
4   2016-02-22  3    
5   2016-02-23  3    
6   2016-02-24  3    
7   2016-02-25  3    
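
On a table with hundreds of millions of rows, the range join above may still be costly, because every row is matched against each day it spans. One possible alternative, offered as a sketch and not as part of the original answer, is to turn each range into a +1 event on begin_date and a -1 event on the day after end_date, sum the events per day, and take a running total over the calendar. The names events, daily, calendar, delta and active_count are made up for illustration, and the sketch assumes end_date is inclusive and that NULL means the range is still open.

```sql
#standardSQL
WITH `table` AS (
  SELECT DATE '2016-02-19' AS begin_date, DATE '2016-02-19' AS end_date UNION ALL
  SELECT DATE '2016-02-20', DATE '2016-02-25' UNION ALL
  SELECT DATE '2016-02-21', DATE '2016-02-25' UNION ALL
  SELECT DATE '2016-02-22', CAST(NULL AS DATE)
), events AS (
  -- +1 on the day a range opens
  SELECT begin_date AS day, 1 AS delta FROM `table`
  UNION ALL
  -- -1 on the day after a range closes; open ranges (NULL end_date) never close
  SELECT DATE_ADD(end_date, INTERVAL 1 DAY), -1 FROM `table` WHERE end_date IS NOT NULL
), daily AS (
  -- collapse to at most one row per day, so the window below stays small
  SELECT day, SUM(delta) AS delta FROM events GROUP BY day
), calendar AS (
  -- start at the earliest begin_date so that no opening event is missed
  SELECT day
  FROM (SELECT MIN(begin_date) AS min_date, MAX(end_date) AS max_date FROM `table`),
  UNNEST(GENERATE_DATE_ARRAY(min_date, DATE_ADD(max_date, INTERVAL 1 DAY), INTERVAL 1 DAY)) AS day
)
SELECT
  c.day,
  -- running total of net openings = number of ranges active on that day
  SUM(IFNULL(d.delta, 0)) OVER (
    ORDER BY c.day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS active_count
FROM calendar AS c
LEFT JOIN daily AS d ON d.day = c.day
ORDER BY c.day
```

On the sample data this should reproduce the counts listed in the question, including 1 for 2016-02-26. The heavy step is a plain GROUP BY over the events, and the window function only sees one row per calendar day, so the unpartitioned running sum should stay cheap even when the source table is large.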

【Answer 2】:

This ugly code did the job:

DECLARE dates ARRAY<DATE>;
DECLARE x INT64 DEFAULT 0;
-- Start from empty arrays: ARRAY_CONCAT returns NULL if any argument is NULL
DECLARE results ARRAY<INT64> DEFAULT [];
DECLARE results_dates ARRAY<DATE> DEFAULT [];
DECLARE result INT64;
DECLARE date DATE;
SET dates = GENERATE_DATE_ARRAY('2016-02-17', '2019-05-13', INTERVAL 1 DAY);
LOOP
  -- Run one counting query per date and append the result
  SET date = dates[OFFSET(x)];
  SET result = (SELECT COUNT(*) FROM `table` WHERE begin_date <= date AND (end_date >= date OR end_date IS NULL));
  SET results = ARRAY_CONCAT(results, [result]);
  SET results_dates = ARRAY_CONCAT(results_dates, [date]);
  SET x = x + 1;
  IF x >= ARRAY_LENGTH(dates) THEN
    LEAVE;
  END IF;
END LOOP;
-- Zip the two arrays back together by position
SELECT date, count_subscribers
FROM UNNEST(results_dates) AS date WITH OFFSET
JOIN UNNEST(results) AS count_subscribers WITH OFFSET
USING(OFFSET)

It ran in 1.5 hours, which beats my Python code (7 hours), but the BigQuery code can't be parallelized, whereas the Python code can.
