Remove duplicate rows according to the attribute in Google BigQuery SQL

【Title】: Remove duplicate rows according to the attribute in Google BigQuery SQL 【Posted】: 2017-05-09 09:04:20 【Question】:

I have a table called result. I am using BigQuery to select data from GA (Google Analytics):

SELECT
  Date,
  totals.pageviews,
  h.transaction.transactionId,
  h.item.itemQuantity,
  h.transaction.transactionRevenue,
  totals.bounces,
  fullvisitorid,
  totals.timeOnSite,
  device.browser,
  device.deviceCategory,
  trafficSource.source,
  channelGrouping,
  h.page.pagePath,
  h.eventInfo.eventCategory,
  device.operatingSystem
FROM
  `atomic-life-148403.126959513.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
  AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
ORDER BY
  date DESC

Some records are duplicated. How can I remove the duplicate records from the table?

I would like to get the following result.

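【Comments】:

Do you actually want to find and delete the rows, or just hide them from the query result? If the latter, use DISTINCT. If the former, it gets a bit more complicated.

How can I select only the distinct rows? The itemQuantity and revenue values are separate.

【Answer 1】:

You can use ROW_NUMBER: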

-- SQL Server (T-SQL) style: number the rows inside each transactionid group.
WITH CTE AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY transactionid) AS rn
  FROM [YourTable]
)
-- Deleting through the CTE keeps row 1 of each group and removes the rest.
DELETE FROM CTE
WHERE rn > 1;
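The idea of deleting every row of a group except one can be tried out locally. The following is a minimal sketch only: it uses Python's built-in sqlite3 module as a stand-in for the real database, the table and column names are made up for illustration, and `rowid` is a SQLite-specific hidden column with no direct BigQuery equivalent.

```python
import sqlite3

# SQLite stands in for the real database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (transaction_id TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO result VALUES (?, ?)",
    [("T1", 100), ("T1", 100), ("T2", 250), ("T3", 75), ("T3", 75)],
)

# Keep the row with the lowest rowid in each transaction_id group, delete the rest.
conn.execute(
    """
    DELETE FROM result
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM result GROUP BY transaction_id
    )
    """
)

remaining = conn.execute(
    "SELECT transaction_id, revenue FROM result ORDER BY transaction_id"
).fetchall()
print(remaining)  # [('T1', 100), ('T2', 250), ('T3', 75)]
```

Exactly one row per transaction_id survives, which is the same outcome the ROW_NUMBER approach aims for.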

【Comments】:

【Answer 2】:

You can use an analytic function such as ROW_NUMBER():

SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY transactionid) AS rownum
  FROM result
) xxx
WHERE rownum = 1;
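Ordering by the same column that defines the partition makes every row in a group tie, so which row receives number 1 is arbitrary. If it matters which duplicate is kept, order by something meaningful. Here is a small sketch of that variation, assuming a hypothetical `hit_time` column and using Python's sqlite3 module (SQLite 3.25 or later, for window functions) in place of BigQuery:

```python
import sqlite3

# SQLite stands in for BigQuery; the table layout and hit_time column are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (transaction_id TEXT, page_path TEXT, hit_time INTEGER)"
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?)",
    [
        ("T1", "/checkout", 10),
        ("T1", "/thanks", 20),
        ("T2", "/thanks", 5),
        ("T2", "/thanks", 5),
    ],
)

# Number the rows of each transaction from latest to earliest, then keep number 1.
rows = conn.execute(
    """
    SELECT transaction_id, page_path, hit_time
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY transaction_id
                   ORDER BY hit_time DESC
               ) AS rn
        FROM result
    )
    WHERE rn = 1
    ORDER BY transaction_id
    """
).fetchall()
print(rows)  # [('T1', '/thanks', 20), ('T2', '/thanks', 5)]
```

For T1 the latest hit is kept; for T2 the two rows are identical, so it does not matter which one wins.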

【Comments】:

【Answer 3】:

The following is for BigQuery Standard SQL:

#standardSQL
SELECT DISTINCT
  Date,
  totals.pageviews,
  h.transaction.transactionId,
  h.item.itemQuantity,
  h.transaction.transactionRevenue,
  totals.bounces,
  fullvisitorid,
  totals.timeOnSite,
  device.browser,
  device.deviceCategory,
  trafficSource.source,
  channelGrouping,
  h.page.pagePath,
  h.eventInfo.eventCategory,
  device.operatingSystem
FROM
  `atomic-life-148403.126959513.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
  AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
ORDER BY
  date DESC

As you can see, I just added DISTINCT to your SELECT. See the BigQuery Standard SQL documentation on SELECT and its modifiers for more details.
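One thing to keep in mind is that DISTINCT only collapses rows that match in every selected column. Rows for the same transaction that differ in, say, item quantity are all kept, which is the concern raised in the comments on the question. A minimal sketch of that behaviour, using Python's sqlite3 module as a stand-in and made-up table and column names:

```python
import sqlite3

# SQLite stands in for BigQuery; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (transaction_id TEXT, item_quantity INTEGER, revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("T1", 2, 100),
        ("T1", 2, 100),  # exact copy of the row above: removed by DISTINCT
        ("T1", 1, 100),  # same transaction, different quantity: kept
        ("T2", 1, 40),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT transaction_id, item_quantity, revenue FROM hits "
    "ORDER BY transaction_id, item_quantity"
).fetchall()
print(distinct_rows)  # [('T1', 1, 100), ('T1', 2, 100), ('T2', 1, 40)]
```

If the goal is one row per transaction regardless of the other columns, DISTINCT alone is not enough and a ROW_NUMBER or aggregation approach is needed.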

【Comments】:

【Answer 4】:

You can select the unique rows and delete the others:

-- BigQuery's DELETE does not take a JOIN clause, so this sketch rebuilds the
-- table from its distinct rows instead. MyTable stands for a dataset-qualified
-- table name.
CREATE OR REPLACE TABLE MyTable AS
SELECT DISTINCT *
FROM MyTable;

【Comments】:

【Answer 5】:

Using GROUP BY on all of the selected columns should eliminate any truly duplicate rows from the result:

SELECT
  Date,
  totals.pageviews,
  h.transaction.transactionId,
  h.item.itemQuantity,
  h.transaction.transactionRevenue,
  totals.bounces,
  fullvisitorid,
  totals.timeOnSite,
  device.browser,
  device.deviceCategory,
  trafficSource.source,
  channelGrouping,
  h.page.pagePath,
  h.eventInfo.eventCategory,
  device.operatingSystem
FROM
  `atomic-life-148403.126959513.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
  AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
GROUP BY
  Date,
  totals.pageviews,
  h.transaction.transactionId,
  h.item.itemQuantity,
  h.transaction.transactionRevenue,
  totals.bounces,
  fullvisitorid,
  totals.timeOnSite,
  device.browser,
  device.deviceCategory,
  trafficSource.source,
  channelGrouping,
  h.page.pagePath,
  h.eventInfo.eventCategory,
  device.operatingSystem
ORDER BY
  date  DESC;
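A side benefit of GROUP BY over DISTINCT is that an aggregate such as COUNT(*) can be added to see how many copies of each row existed. The sketch below illustrates this with Python's sqlite3 module standing in for BigQuery; the table and its three columns are made up and far smaller than the GA export.

```python
import sqlite3

# SQLite stands in for BigQuery; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (visit_date TEXT, browser TEXT, pageviews INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("20170508", "Chrome", 3),
        ("20170508", "Chrome", 3),  # exact duplicate of the row above
        ("20170508", "Safari", 1),
        ("20170509", "Chrome", 3),
    ],
)

# Grouping by every selected column yields one row per distinct combination;
# COUNT(*) reports how many source rows were folded into each one.
grouped = conn.execute(
    """
    SELECT visit_date, browser, pageviews, COUNT(*) AS copies
    FROM hits
    GROUP BY visit_date, browser, pageviews
    ORDER BY visit_date, browser
    """
).fetchall()
print(grouped)
# [('20170508', 'Chrome', 3, 2), ('20170508', 'Safari', 1, 1), ('20170509', 'Chrome', 3, 1)]
```

Adding `HAVING COUNT(*) > 1` to the same query would list only the combinations that were actually duplicated.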

【Comments】:
