Performance enhancement in Google BigQuery SQL

【Posted】: 2017-03-10 18:28:26

【Question】:

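In the Google BigQuery query below, I join two tables, `Data` and `Location`, on Id, StartTime and StopTime.

Since `Data` is partitioned by date, I have a condition on _PARTITIONTIME in the WHERE clause.

The query runs for a long time (about 20 minutes), and I am wondering whether I am missing some performance technique that would make it more efficient.

Any help would be appreciated. Thanks!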

  SELECT
    *
  FROM (
      SELECT
          A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
      FROM
          `Data` AS A
      JOIN
        (SELECT * FROM `Location` WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
        "19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )) AS B
      ON
        A.StartTime < B.DateTime
        AND A.StopTime >= B.DateTime
        AND A.Id = B.Id
  WHERE
    (A._PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01')
      AND TIMESTAMP('2016-11-30'))
  ORDER BY
    B.Id,
    A.Id1,
    B.DateTime )
ORDER BY
  Id,
  Id1,
  DateTime

【Comments】:

【Answer 1】:

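Some ideas:

- The inner ORDER BY is not needed, since only the outermost ORDER BY has an effect on the result of the query.
- If you want to query all suffixes except "25", you can use _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25".
- Depending on the type of JOIN, the filter on _PARTITIONTIME may not be "pushed down" automatically to avoid reading extra data, for example if you are actually using a RIGHT JOIN. If that is the case, use a subquery instead, such as (SELECT * FROM YourTable WHERE _PARTITIONTIME BETWEEN ...) AS A RIGHT JOIN ...

If you would like a BigQuery engineer to take a closer look at where the time is going, you can include a sample job ID in your question and someone may be able to help.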
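The subquery form mentioned in the last point can be sketched as follows. This is only an illustration: it assumes the join really is a RIGHT JOIN (the question itself uses a plain JOIN), it reuses the table and column names from the question, and `Location` stands in for whatever wildcard table the asker actually queries.

```sql
#standardSQL
-- Sketch: filter the partitioned table inside its own subquery, so the
-- partition filter is applied before the join regardless of the join type.
SELECT
  A.Id1, B.Id, A.StartTime, A.StopTime, B.Latitude, B.Longitude, B.DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime
  FROM `Data`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
RIGHT JOIN (
  SELECT Id, Latitude, Longitude, DateTime
  FROM `Location`
  -- Range form of the suffix filter; "25" is excluded as in the question.
  WHERE _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25"
) AS B
ON A.Id = B.Id
  AND A.StartTime < B.DateTime
  AND A.StopTime >= B.DateTime
```

Comparing the bytes processed by this form and by the original query is a simple way to check whether the filter was already being pushed down.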

【Comments】:

【Answer 2】:

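- I would also remove the outer ORDER BY, as I think it is the main killer of your query's performance.
- Moving the _PARTITIONTIME filter into the subquery on the respective table is another thing to consider.
- Using SELECT * in sub-selects does not affect performance or cost (it is the final outer SELECT that determines which columns are read, in addition to those used in WHERE and other clauses), but as a good practice I think it is better to list the needed columns/fields explicitly.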

#standardSQL
SELECT
  A.Id AS Id, A.Id1 AS Id1, StartTime, StopTime, Latitude, Longitude, DateTime
FROM (
  -- Partition filter applied directly to the partitioned table
  SELECT Id, Id1, StartTime, StopTime
  FROM `Data`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  -- Id must be selected here because the join condition uses B.Id
  SELECT Id, Latitude, Longitude, DateTime
  FROM `Location`
  WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )
) AS B
ON  A.StartTime < B.DateTime
AND A.StopTime >= B.DateTime
AND A.Id = B.Id

You can also consider "compressing" the statement below, as Elliott suggested,

WHERE _TABLE_SUFFIX IN ("01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18",
"19","20","21", "22", "23","24", "26", "27", "28","29","30","31" )  

but be careful, because the range form may pull in unwanted tables (if your dataset contains any), for example those with a suffix such as "011" or "046".
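One way to keep the compact range form while avoiding such tables is to also require a two-character suffix. The snippet below is a sketch under the assumption that the wanted suffixes are exactly the two-digit ones; `Location` again stands in for the asker's wildcard table.

```sql
#standardSQL
-- Sketch: string comparison alone would accept "011" or "046", because both
-- sort between "01" and "31"; the LENGTH check rules them out.
SELECT Id, Latitude, Longitude, DateTime
FROM `Location`
WHERE _TABLE_SUFFIX BETWEEN "01" AND "31"
  AND LENGTH(_TABLE_SUFFIX) = 2
  AND _TABLE_SUFFIX != "25"
```

The BETWEEN condition uses constant expressions, which is what allows BigQuery to limit the tables scanned. Whether the additional LENGTH predicate narrows the scan any further is worth verifying against the bytes processed for your own dataset.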

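Another option: you may have some logical relationship between the partitions in `Data` and the suffixes in `Location`. If so, you can use it to narrow down the JOIN and thus improve performance.

【Comments】:

@user3447653 - if my answer helped you and you accepted it, please consider voting it up as well :o) Doing so keeps me, and others who are ready to answer your questions, motivated. Thanks for considering it.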
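As a purely hypothetical illustration of that idea, suppose the two-digit suffix of `Location` is the day of the month, and rows in a given `Data` partition only ever match locations from that same day. Neither assumption is stated in the question, and the aliases PartitionDay and SuffixDay are made up for this sketch.

```sql
#standardSQL
-- ASSUMPTION (not from the question): the Location suffix is the day of the
-- month, and a Data partition only matches Location rows from the same day.
SELECT
  A.Id, A.Id1, A.StartTime, A.StopTime, B.Latitude, B.Longitude, B.DateTime
FROM (
  SELECT Id, Id1, StartTime, StopTime,
    FORMAT_TIMESTAMP('%d', _PARTITIONTIME) AS PartitionDay  -- e.g. "07"
  FROM `Data`
  WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-30')
) AS A
JOIN (
  SELECT Id, Latitude, Longitude, DateTime,
    _TABLE_SUFFIX AS SuffixDay                              -- e.g. "07"
  FROM `Location`
  WHERE _TABLE_SUFFIX BETWEEN "01" AND "31" AND _TABLE_SUFFIX != "25"
) AS B
ON A.Id = B.Id
  AND A.PartitionDay = B.SuffixDay  -- extra equality key narrows the join
  AND A.StartTime < B.DateTime
  AND A.StopTime >= B.DateTime
```

If an interval from StartTime to StopTime can span midnight, this extra condition would drop valid matches, so it only applies when the relationship between partitions and suffixes really holds.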
