Remove duplicate rows according to an attribute in Google BigQuery SQL
【Posted】: 2017-05-09 09:04:20
【Question】: I have a table called result. I am using BigQuery to select data from GA:
SELECT
Date,
totals.pageviews,
h.transaction.transactionId,
h.item.itemQuantity,
h.transaction.transactionRevenue,
totals.bounces,
fullvisitorid,
totals.timeOnSite,
device.browser,
device.deviceCategory,
trafficSource.source,
channelGrouping,
h.page.pagePath,
h.eventInfo.eventCategory,
device.operatingSystem
FROM
`atomic-life-148403.126959513.ga_sessions_*`,
UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
ORDER BY
date DESC
Some records are duplicated. How can I remove the duplicate records from the table?
I would like to get the following result.
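【Comments】:
Do you actually want to find and delete the rows, or just hide them from the query result? If the latter, use DISTINCT. If the former, it gets a bit more complicated.
How do I select only the distinct rows? The itemQuantity and the revenue are on separate rows.
【Answer 1】: You can use ROW_NUMBER: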
-- Note: this is SQL Server (T-SQL) style syntax, not BigQuery Standard SQL.
WITH CTE AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY transactionid) AS rn
  FROM [YourTable]
)
-- Deleting through the CTE removes every row after the first in each transactionid group.
DELETE FROM CTE
WHERE rn > 1;
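For BigQuery itself, one way to get the same effect is to rewrite the table and keep only the first row per key. The following is a minimal sketch in BigQuery Standard SQL, not a definitive implementation. It assumes an ordinary, non-partitioned table, and `mydataset.result` is a hypothetical table name used only for illustration.

```sql
#standardSQL
-- Assumption: `mydataset.result` is a placeholder for your own table.
-- Rewrites the table, keeping one row per transactionId.
CREATE OR REPLACE TABLE `mydataset.result` AS
SELECT * EXCEPT(rn)            -- drop the helper column again
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY transactionId ORDER BY transactionId) AS rn
  FROM `mydataset.result`
)
WHERE rn = 1;
```

Because the ORDER BY inside the window uses the partition key itself, which row survives in each group is arbitrary. If that matters, order by a column that distinguishes the rows.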
【Answer 2】: You can use an analytic function such as ROW_NUMBER():
select * from (
select *,
ROW_NUMBER() OVER(PARTITION BY transactionid ORDER BY transactionid) rownum
from result ) xxx
where rownum = 1;
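Applied to the query from the question, the same idea might look like the sketch below. Only a few of the original columns are shown to keep it short. It assumes that the GA export fields visitStartTime and h.hitNumber are an acceptable way to decide which hit counts as the first one for a transaction; that ordering is an assumption, not part of the original answer. Note that PARTITION BY places every row with a NULL transactionId into a single partition, so filtering on rownum = 1 alone would keep just one of them. The extra condition keeps the non-transaction rows.

```sql
#standardSQL
SELECT * EXCEPT(rownum)
FROM (
  SELECT
    date,
    fullVisitorId,
    h.transaction.transactionId AS transactionId,
    h.transaction.transactionRevenue AS transactionRevenue,
    -- Assumption: the earliest hit of a transaction is the one to keep.
    ROW_NUMBER() OVER (
      PARTITION BY h.transaction.transactionId
      ORDER BY visitStartTime, h.hitNumber
    ) AS rownum
  FROM
    `atomic-life-148403.126959513.ga_sessions_*`,
    UNNEST(hits) AS h
  WHERE
    _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
    AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
)
-- Keep the first row per transaction, plus all rows that carry no transaction.
WHERE rownum = 1 OR transactionId IS NULL
ORDER BY date DESC;
```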
【Answer 3】: The following is for BigQuery Standard SQL:
#standardSQL
SELECT DISTINCT
Date,
totals.pageviews,
h.transaction.transactionId,
h.item.itemQuantity,
h.transaction.transactionRevenue,
totals.bounces,
fullvisitorid,
totals.timeOnSite,
device.browser,
device.deviceCategory,
trafficSource.source,
channelGrouping,
h.page.pagePath,
h.eventInfo.eventCategory,
device.operatingSystem
FROM
`atomic-life-148403.126959513.ga_sessions_*`,
UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
ORDER BY
date DESC
As you can see, I just added DISTINCT to your SELECT. See the BigQuery Standard SQL documentation for more about SELECT and its modifiers.
【Answer 4】: You can select the unique rows and drop the others:
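-- BigQuery Standard SQL: rewrite the table, keeping one copy of each fully identical row.
-- In BigQuery, qualify MyTable with its dataset. SELECT DISTINCT * does not work
-- if the table has STRUCT or ARRAY columns.
CREATE OR REPLACE TABLE MyTable AS
SELECT DISTINCT * FROM MyTable;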
【Answer 5】: Using GROUP BY on all of the selected columns should eliminate any truly duplicate rows from the result:
SELECT
Date,
totals.pageviews,
h.transaction.transactionId,
h.item.itemQuantity,
h.transaction.transactionRevenue,
totals.bounces,
fullvisitorid,
totals.timeOnSite,
device.browser,
device.deviceCategory,
trafficSource.source,
channelGrouping,
h.page.pagePath,
h.eventInfo.eventCategory,
device.operatingSystem
FROM
`atomic-life-148403.126959513.ga_sessions_*`,
UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
GROUP BY
Date,
totals.pageviews,
h.transaction.transactionId,
h.item.itemQuantity,
h.transaction.transactionRevenue,
totals.bounces,
fullvisitorid,
totals.timeOnSite,
device.browser,
device.deviceCategory,
trafficSource.source,
channelGrouping,
h.page.pagePath,
h.eventInfo.eventCategory,
device.operatingSystem
ORDER BY
date DESC;
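Grouping by every column only removes rows that are identical in all columns. If, as the comment on the question suggests, the item quantity and the revenue of one transaction sit on different rows, grouping by the transaction alone and aggregating the rest may be closer to what is wanted. The sketch below assumes that the revenue is repeated on each hit of a transaction (hence MAX) and that quantities should be added up (hence SUM). Both choices are assumptions about how the hits were recorded and should be checked against the data.

```sql
#standardSQL
SELECT
  h.transaction.transactionId AS transactionId,
  -- Assumption: revenue is the same on every hit of the transaction.
  MAX(h.transaction.transactionRevenue) AS transactionRevenue,
  -- Assumption: item quantities on separate hits should be summed.
  SUM(h.item.itemQuantity) AS itemQuantity
FROM
  `atomic-life-148403.126959513.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  h.transaction.transactionId IS NOT NULL
  AND _TABLE_SUFFIX BETWEEN REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL -1 YEAR) AS STRING), '-','')
  AND CONCAT('intraday_', REPLACE(CAST(DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY) AS STRING), '-',''))
GROUP BY
  transactionId
ORDER BY
  transactionId;
```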