BigQuery - Combine three tables based on matching value or timestamp

Posted: 2016-11-30 19:57:01

Question:

This is similar to the question "BigQuery combine tables based on closest timestamp and matching value".

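I have three tables. For each row of table numberTwo, I need to get the hint from the row of table numberOne that has the same cod value and the closest time when comparing time1 and time2. If the cod does not appear in table numberOne, the query should instead take the hint from the row of table numberThree with the matching cod.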

To make it easier to understand, this is what I need to do:

Table numberOne:

|  id |  cod  |   hint  |           time1         |
---------------------------------------------------
|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |
|  3  |  CDE  |    X    | 2016-11-03 19:00:00 UTC |
|  4  |  CDE  |    Y    | 2016-11-03 19:30:00 UTC |
|  5  |  EFG  |    Z    | 2016-11-03 18:00:00 UTC |

Table numberTwo:

|  id |  cod  |   value  |         time2           |
----------------------------------------------------
|  1  |  ABC  |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  FHK  |   h323   | 2016-11-03 11:30:00 UTC |
|  3  |  ABC  |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  IJK  |   abce   | 2016-11-03 19:10:00 UTC |

Table numberThree:

|  id |  cod  |   hint   |
--------------------------
|  1  |  FHK  |   tes1   |
|  2  |  IJK  |   tes2   |
|  3  |  MNK  |   tes3   |
|  4  |  MOP  |   tes4   |

So, for row #1 of table numberTwo, I would take all rows of table numberOne with cod: ABC

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |

Of these two, I would take the one whose timestamp is closest to time2:

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
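
The "closest timestamp" step for this one row can be sketched in Python. This only illustrates the selection rule and is not BigQuery code; the names `rows_abc`, `time2` and `closest` are invented for the example.

```python
from datetime import datetime

# Rows of table numberOne with cod = 'ABC' (values copied from the question).
rows_abc = [
    {"id": 1, "cod": "ABC", "hint": "V", "time1": datetime(2016, 11, 3, 18, 0, 0)},
    {"id": 2, "cod": "ABC", "hint": "W", "time1": datetime(2016, 11, 3, 12, 0, 0)},
]

# time2 of row #1 in table numberTwo.
time2 = datetime(2016, 11, 3, 18, 20, 0)

# Pick the row whose time1 has the smallest absolute distance to time2.
closest = min(rows_abc, key=lambda r: abs((r["time1"] - time2).total_seconds()))

print(closest["id"], closest["hint"])  # prints: 1 V
```

Row 1 is 20 minutes away from time2 and row 2 is more than six hours away, so row 1 wins.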

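If the cod does not appear in table numberOne, it is matched against table numberThree. A cod is found in either numberOne or numberThree, never in both, so the same cod will not show up in both tables. The query could therefore also try to match table numberThree first.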

After processing every row, I would end up with a table like this:

Desired table:

|  id |  cod  |   hint  |   value  |         time2           |
--------------------------------------------------------------
|  1  |  ABC  |    V    |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  FHK  |   tes1  |   h323   |                         |
|  3  |  ABC  |    W    |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  IJK  |   tes2  |   abce   |                         |
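
The whole procedure described above can be written out as a small Python sketch to check that the rules really produce the desired table. This is an illustration under the assumptions stated in the question (a cod appears in numberOne or numberThree, never both); the function name `combine` and the tuple layout are made up for the example and are not part of any BigQuery API.

```python
from datetime import datetime


def ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' into a datetime (all sample values are UTC)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Sample data copied from the three tables in the question.
number_one = [  # (id, cod, hint, time1)
    (1, "ABC", "V", ts("2016-11-03 18:00:00")),
    (2, "ABC", "W", ts("2016-11-03 12:00:00")),
    (3, "CDE", "X", ts("2016-11-03 19:00:00")),
    (4, "CDE", "Y", ts("2016-11-03 19:30:00")),
    (5, "EFG", "Z", ts("2016-11-03 18:00:00")),
]
number_two = [  # (id, cod, value, time2)
    (1, "ABC", "xyz2", ts("2016-11-03 18:20:00")),
    (2, "FHK", "h323", ts("2016-11-03 11:30:00")),
    (3, "ABC", "rewq", ts("2016-11-03 09:00:00")),
    (4, "IJK", "abce", ts("2016-11-03 19:10:00")),
]
number_three = [  # (id, cod, hint)
    (1, "FHK", "tes1"),
    (2, "IJK", "tes2"),
    (3, "MNK", "tes3"),
    (4, "MOP", "tes4"),
]


def combine(number_one, number_two, number_three):
    """Return (id, cod, hint, value, time2) rows following the question's rules."""
    fallback = {cod: hint for _, cod, hint in number_three}
    result = []
    for row_id, cod, value, time2 in number_two:
        # All numberOne rows that share the cod of the current numberTwo row.
        candidates = [(hint, time1) for _, c, hint, time1 in number_one if c == cod]
        if candidates:
            # Closest time1 to time2 wins.
            hint, _ = min(candidates, key=lambda hc: abs((hc[1] - time2).total_seconds()))
            result.append((row_id, cod, hint, value, time2))
        else:
            # No match in numberOne: use numberThree and leave time2 empty,
            # as in the desired table.
            result.append((row_id, cod, fallback.get(cod), value, None))
    return result


desired = combine(number_one, number_two, number_three)
for row in desired:
    print(row)
```

For row 3 (cod ABC, time2 09:00) the 12:00 row of numberOne is three hours away and the 18:00 row is nine hours away, which is why the hint is W.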

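Comments:

What is the schema of numberThree? What is the expected output schema? Also, please give a simple example!
@MikhailBerlyant I have updated the question.
Did this work for you?
Did you see ***.com/a/40941368/5221944?

Answer 1: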

Try the query below (BigQuery Standard SQL):

WITH
/* Sample data from the question. Uncomment these three CTEs to run the query without real tables.
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'FHK' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'IJK' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2 
),
TableNumberThree AS (
  SELECT 1 AS id, 'FHK' AS cod, 'tes1' AS hint UNION ALL
  SELECT 2 AS id, 'IJK' AS cod, 'tes2' AS hint UNION ALL
  SELECT 3 AS id, 'MNK' AS cod, 'tes3' AS hint UNION ALL
  SELECT 4 AS id, 'MOP' AS cod, 'tes4' AS hint
),
*/
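tempTable AS (
  -- For every row of TableNumberTwo, rank the TableNumberOne rows with the same cod
  -- by absolute time distance; win = 1 marks the closest one.
  -- The LEFT JOIN keeps rows whose cod is missing from TableNumberOne (hint stays NULL).
  SELECT
    t2.id, t2.cod, t2.value, t2.time2, t1.hint,
    ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value
                      ORDER BY ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS win
  FROM TableNumberTwo AS t2
  LEFT JOIN TableNumberOne AS t1
  ON t1.cod = t2.cod
)
-- Keep only the closest match; when no hint was found, fall back to TableNumberThree
-- and leave time2 empty.
SELECT
  t1.id, t1.cod, IFNULL(t1.hint, t2.hint) AS hint, value,
  IF(t1.hint IS NULL, NULL, time2) AS time2
FROM tempTable AS t1
LEFT JOIN TableNumberThree AS t2
ON t1.cod = t2.cod AND t1.hint IS NULL
WHERE win = 1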
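
To see why `WHERE win = 1` still keeps the rows that have no match in TableNumberOne, here is a rough Python imitation of the two steps of the query: the LEFT JOIN with ROW_NUMBER, then the fallback join. It is only a sketch of the logic, not how BigQuery executes the query; the helper name `left_join_ranked` is invented for illustration, and the data is trimmed to two rows of TableNumberTwo.

```python
from datetime import datetime

table_one = [
    {"cod": "ABC", "hint": "V", "time1": datetime(2016, 11, 3, 18, 0)},
    {"cod": "ABC", "hint": "W", "time1": datetime(2016, 11, 3, 12, 0)},
]
table_two = [
    {"id": 1, "cod": "ABC", "value": "xyz2", "time2": datetime(2016, 11, 3, 18, 20)},
    {"id": 2, "cod": "FHK", "value": "h323", "time2": datetime(2016, 11, 3, 11, 30)},
]
table_three = [{"cod": "FHK", "hint": "tes1"}]


def left_join_ranked(table_two, table_one):
    """Imitate tempTable: LEFT JOIN on cod plus a row number per TableNumberTwo row."""
    out = []
    for t2 in table_two:
        matches = [t1 for t1 in table_one if t1["cod"] == t2["cod"]]
        if not matches:
            # LEFT JOIN: the unmatched row survives once with hint = NULL (None here).
            # It is alone in its partition, so it gets win = 1.
            out.append({**t2, "hint": None, "win": 1})
            continue
        # ORDER BY ABS(TIMESTAMP_DIFF(time2, time1, SECOND)): closest first.
        matches.sort(key=lambda t1: abs((t2["time2"] - t1["time1"]).total_seconds()))
        for win, t1 in enumerate(matches, start=1):
            out.append({**t2, "hint": t1["hint"], "win": win})
    return out


temp_table = left_join_ranked(table_two, table_one)

# Outer query: WHERE win = 1, then IFNULL / IF with TableNumberThree as fallback.
fallback = {t3["cod"]: t3["hint"] for t3 in table_three}
final = []
for row in temp_table:
    if row["win"] != 1:
        continue
    matched = row["hint"] is not None
    hint = row["hint"] if matched else fallback.get(row["cod"])
    time2 = row["time2"] if matched else None
    final.append((row["id"], row["cod"], hint, row["value"], time2))

for row in final:
    print(row)
```

The intermediate `temp_table` has three rows: two ranked candidates for cod ABC and one unmatched row for cod FHK. Only the rows with win = 1 reach the final result.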
