BigQuery UNNEST(hits) - duplicating values
Posted: 2017-06-23 14:28:33
Question:
I am trying to create a master view of a group of properties imported into BigQuery, but it seems that by using UNNEST(hits) the SQL is duplicating the data, which makes values such as revenue inaccurate.
I have tried to understand why UNNEST causes this, but I cannot figure it out.
SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
FROM
(SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
UNION ALL
SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
Group By Date, hostname, channelGrouping
Order by Date
Answer 1:
This might work:
SELECT
  date,
  channelGrouping,
  SUM(Revenue) Revenue,
  SUM(Shipping) Shipping,
  SUM(bounces) bounces,
  SUM(transactions) transactions,
  hostname,
  COUNT(date) sessions
FROM (
  SELECT
    date,
    channelGrouping,
    totals.totaltransactionrevenue / 1e6 Revenue,
    ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
    (SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
    totals.bounces bounces,
    totals.transactions transactions
  FROM `project_id.dataset_id.ga_sessions_*`
  WHERE 1 = 1
    AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
    AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'
  UNION ALL
  (...)
),
UNNEST(hostnames) hostname
GROUP BY
  date, channelGrouping, hostname
Note that in this query I avoid applying the UNNEST operation to the hits field at the top level, and only do so inside subselects.
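To understand why the duplication happens, you have to understand how GA data is aggregated into BigQuery. There are basically two types of data: session-level data and hit-level data. Each visit (session) by a customer to your website ends up generating one row in BigQuery, like this: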
{fullvisitorid: 1, visitid: 1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 0, bounces: 0}}
If the same customer comes back a day later, another row is generated in BQ, for example:
{fullvisitorid: 1, visitid: 2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}], totals: {totalTransactionRevenue: 50000000, bounces: 2}}
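As you can see, the fields outside the hits key relate to the session level, while each hit (that is, each interaction the customer makes on your website) adds another entry inside hits. When you apply UNNEST, you are basically applying a cross join between all the values inside the array and the outer fields.
This is where the duplication happens!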
Given the example above, if we apply UNNEST to the hits field, you end up with something like:
fullvisitorid  visitid  totals.totalTransactionRevenue  hits.hitNumber
1              1        0                               1
1              1        0                               2
1              2        50000000                        1
1              2        50000000                        2
Note that each hit inside the hits field causes the outer fields (such as totals.totalTransactionRevenue) to be repeated for every hitNumber that occurs in the hits ARRAY.
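So if you later apply an operation such as SUM(totals.totalTransactionRevenue), you end up counting this field once for every hit the customer made in that visitid, which multiplies it by the number of hits in the session.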
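To make the arithmetic concrete, here is a minimal sketch in plain Python (not BigQuery code) that models the two example rows above. The names sessions, unnest_hits, and revenue are assumptions made for illustration; revenue stands in for totals.totalTransactionRevenue, and each hit is reduced to its hitNumber.

```python
# Plain-Python model of the two example session rows (illustrative only).
sessions = [
    {"fullvisitorid": 1, "visitid": 1, "revenue": 0, "hits": [1, 2]},
    {"fullvisitorid": 1, "visitid": 2, "revenue": 50000000, "hits": [1, 2]},
]


def unnest_hits(rows):
    """Mimic `FROM t, UNNEST(hits) AS h`: one output row per (session, hit) pair."""
    return [
        {"visitid": row["visitid"], "revenue": row["revenue"], "hitNumber": hit}
        for row in rows
        for hit in row["hits"]
    ]


flattened = unnest_hits(sessions)

# Summing the session-level field after flattening counts it once per hit.
inflated_revenue = sum(r["revenue"] for r in flattened)
# Summing over the original rows counts it once per session.
true_revenue = sum(s["revenue"] for s in sessions)

print(inflated_revenue, true_revenue)  # prints: 100000000 50000000
```

With two hits per session, the flattened sum is exactly twice the real revenue; in real data the factor varies per session, which is why the totals look arbitrarily inflated.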
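What I tend to do is avoid the UNNEST operation on the hits field at the top level (it can also be costly, depending on the data volume) and only unnest inside subqueries, where the unnesting happens within each row and therefore does not duplicate data.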
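The same idea can be sketched in plain Python as a rough model of the subquery approach: each session's hits are first collapsed to session-level values, and only the short list of distinct hostnames is expanded afterwards. The shipping figures and the helper name summarize are hypothetical, invented for this illustration.

```python
from collections import defaultdict

# Hypothetical sessions; the shipping values are invented for illustration.
sessions = [
    {"visitid": 1, "revenue": 0,
     "hits": [{"hostname": "yourserverhostname", "shipping": 0},
              {"hostname": "yourserverhostname", "shipping": 0}]},
    {"visitid": 2, "revenue": 50000000,
     "hits": [{"hostname": "yourserverhostname", "shipping": 0},
              {"hostname": "yourserverhostname", "shipping": 5000000}]},
]


def summarize(session):
    """Collapse the hits array to session-level values, as the subselects do."""
    hostnames = sorted(
        {h["hostname"] for h in session["hits"] if h["hostname"] is not None}
    )
    shipping = sum(h["shipping"] or 0 for h in session["hits"])
    return {"revenue": session["revenue"], "shipping": shipping, "hostnames": hostnames}


totals = defaultdict(lambda: {"revenue": 0, "shipping": 0, "sessions": 0})
for session in sessions:
    row = summarize(session)
    # Comparable to UNNEST(hostnames) in the outer query: one row per distinct hostname.
    for hostname in row["hostnames"]:
        totals[hostname]["revenue"] += row["revenue"]
        totals[hostname]["shipping"] += row["shipping"]
        totals[hostname]["sessions"] += 1

print(dict(totals))
```

Revenue is now counted once per session and distinct hostname rather than once per hit. A session that touches several hostnames would still be counted under each of them, so totals summed across hostnames can exceed the true figure in that case.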
Comments:
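I was about to post my own reply, but you beat me to it :) Nice answer! One thing I was trying to figure out from the original question is what the intent of grouping by hostname is. Since hostnames are spread across the hits, it doesn't seem to make sense to try to associate session totals with them.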
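Thanks for the comment, @ElliottBrossard :)! I agree about hostname; I'm not sure why the OP wants this field. What I have noticed on our website is that for almost 20% of revenue, customers open their sessions already on the cart/checkout page, so I thought that grouping by hostname might give some insight into this type of traffic (but I'm not sure whether that is the OP's real intent).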
Thanks for the help. These properties handle multiple websites, so I was hoping for a simple way to filter by website hostname. I am trying to get some of this data working with Data Studio.