Cohort/retention query in BigQuery using Google Analytics exported data
【Posted】: 2017-08-01 15:16:31
【Question】: I need help formulating a cohort/retention query.
I am trying to build a query that looks at visitors who performed Action X on their first visit (within a time frame), and then at how many days later they returned to perform Action X again.
The output I (eventually) need looks like this...
The table I am working with is the Google Analytics export to BigQuery.
Could anyone help me with this, or has anyone written a similar query that I could adapt?
Thanks
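【Comments】:
Hi, and welcome to Stack Overflow. Please take a moment to go through the welcome tour to learn your way around here (and earn your first badge), read how to create a minimal reproducible example, and check How to Ask, so that you increase your chances of getting feedback and useful answers.
【Answer 1】: Just to give you a simple idea/direction.
The following is for BigQuery Standard SQL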
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with the dummy data below, taken from your question
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as follows:
Date_of_action_first_taken Visits later_1_day later_2_days later_3_days
01.07.17 1000 800 400 300
02.07.17 1000 860 780 860
29.07.17 1000 780 120 0
30.07.17 1000 710 0 0
The final output is:
Date_of_action_first_taken later_1_day later_2_days later_3_days
01.07.17 80.0 40.0 30.0
02.07.17 86.0 78.0 86.0
29.07.17 78.0 12.0 0.0
30.07.17 71.0 0.0 0.0
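If a cohort row could ever have zero visits, the plain division in the query above would fail with a division-by-zero error. A minimal sketch, assuming the same `OutputFromQuery` shape, is to use BigQuery's SAFE_DIVIDE, which returns NULL instead of raising an error. The zero-visit row below is made up purely for illustration.

```sql
#standardSQL
WITH `OutputFromQuery` AS (
  SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day UNION ALL
  SELECT '31.07.17', 0, 0  -- hypothetical cohort with no visits (not in the question's data)
)
SELECT
  Date_of_action_first_taken,
  -- SAFE_DIVIDE yields NULL when Visits = 0, so the whole expression becomes NULL
  ROUND(100 * SAFE_DIVIDE(later_1_day, Visits)) AS later_1_day
FROM `OutputFromQuery`
```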
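【Comments】:
Thanks Mikhail! This helps give me a flavour. I have posted my query (or where I have got to) above, if you are able to check it and let me know what you think? Thanks for your response!
【Answer 2】: I found this query in Turn Your App Data into Answers with Firebase and BigQuery (Google I/O'19)
It should work :)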
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
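The query above follows a single cohort (users who fired the chosen event on Dec 24). A rough sketch of a multi-cohort variant is shown below. It assumes that a user's cohort day is the first day within the queried window on which they fired the chosen event, which is not necessarily their first-ever event. The CTE names `cohort_users` and `cohort_sizes` and the column `day_number` are introduced here for illustration, and the table and event placeholders are kept exactly as in the query above.

```sql
#standardSQL
WITH
  cohort_users AS (
    -- One row per user: the first day (within the window) they fired the chosen event
    SELECT
      user_pseudo_id AS new_user_id,
      MIN(DATE(TIMESTAMP_MICROS(event_timestamp), "Etc/GMT+1")) AS cohort_day
    FROM
      `[your_project].[your_firebase_table].events_*`
    WHERE
      event_name = '[chosen_event]' AND
      _TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
    GROUP BY
      user_pseudo_id
  ),
  cohort_sizes AS (
    -- Number of users in each daily cohort
    SELECT cohort_day, COUNT(*) AS num_users_in_cohort
    FROM cohort_users
    GROUP BY cohort_day
  ),
  engaged_users_by_day AS (
    -- Engaged users per cohort and per number of days since the cohort day
    SELECT
      cohort_day,
      DATE_DIFF(DATE(TIMESTAMP_MICROS(event_timestamp), "Etc/GMT+1"), cohort_day, DAY) AS day_number,
      COUNT(DISTINCT user_pseudo_id) AS num_engaged_users
    FROM
      `[your_project].[your_firebase_table].events_*`
    INNER JOIN
      cohort_users ON new_user_id = user_pseudo_id
    WHERE
      event_name = 'user_engagement' AND
      _TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
    GROUP BY
      cohort_day, day_number
  )
SELECT
  cohort_day,
  day_number,
  num_engaged_users,
  num_users_in_cohort,
  ROUND(num_engaged_users / num_users_in_cohort, 3) AS retention_rate
FROM
  engaged_users_by_day
INNER JOIN
  cohort_sizes USING (cohort_day)
WHERE
  day_number >= 0  -- ignore engagement that happened before the cohort day
ORDER BY
  cohort_day, day_number
```

The result is in long form, one row per cohort day and day offset, so it would still need a pivot step to match the wide layout asked for in the question.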
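【Comments】:
This query is correct, but it only gives you the analysis for the cohort that took the action on the 24th. In the question, the user asked for something that can analyze multiple cohorts at once. You can improve this query by adding the date to the subquery in step 1 and, when joining to the engaged users, including the cohort date (from new_user_cohort).
【Answer 3】: So I think I may have cracked it... I would then need to manipulate this output (pivot it) to make it look like the desired output.
Could anyone review this for me and let me know what you think?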
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
`Table123` AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN' -- placeholder: replace with your own filter condition
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN' -- placeholder: replace with your own filter condition
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
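For the pivot step mentioned above, one possible approach is conditional aggregation. The sketch below is only an illustration: the `retention` CTE is a dummy stand-in for the output of the final SELECT above (it uses DATE values for brevity, whereas the query above produces a TIMESTAMP `cohort_day`), and its values are derived from the dummy numbers in the question.

```sql
#standardSQL
WITH retention AS (
  -- Dummy rows in the shape (cohort_day, day_number, percentage); illustrative only
  SELECT DATE '2017-07-01' AS cohort_day, 1 AS day_number, 0.80 AS percentage UNION ALL
  SELECT DATE '2017-07-01', 2, 0.40 UNION ALL
  SELECT DATE '2017-07-01', 3, 0.30 UNION ALL
  SELECT DATE '2017-07-02', 1, 0.86 UNION ALL
  SELECT DATE '2017-07-02', 2, 0.78
)
SELECT
  cohort_day,
  -- Each column keeps only the rows for one day offset; other rows contribute NULL
  ROUND(100 * MAX(IF(day_number = 1, percentage, NULL))) AS later_1_day,
  ROUND(100 * MAX(IF(day_number = 2, percentage, NULL))) AS later_2_days,
  ROUND(100 * MAX(IF(day_number = 3, percentage, NULL))) AS later_3_days
FROM retention
GROUP BY cohort_day
ORDER BY cohort_day
```

A cohort with no row for a given day offset (here, day 3 of the second cohort) comes out as NULL rather than 0. The number of day columns has to be written out by hand with this approach.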
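【Comments】:
【Answer 4】: If you use some of the techniques available in BigQuery, you can solve this type of problem in a way that is very efficient in both cost and performance. As an example: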
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
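I tested this query against our G.A. data, selecting customers who interacted with our recommendation system (as you can see in the filter WHERE EXISTS...). Example of the results (absolute frequency values omitted for privacy reasons):
As you can see, on day 28 for instance, 8% of customers came back 1 day later and interacted with the system again.
I suggest you play around with this query and see if it works for you. It is simpler, cheaper, faster, and hopefully easier to maintain.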
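The query above returns one row per `init_date` with a nested array of structs. If a flat result is easier to work with, the array can be unnested again. The sketch below assumes the percentage field has been given a name (`pct` is a hypothetical alias; the query above leaves that field unnamed), and the `nested` CTE holds made-up numbers purely to show the shape.

```sql
#standardSQL
WITH nested AS (
  -- Dummy stand-in for the nested result; all values are illustrative
  SELECT
    '20170728' AS init_date,
    ARRAY<STRUCT<days INT64, freq INT64, pct FLOAT64>>[(0, 100, 100.0), (1, 8, 8.0)] AS data
)
SELECT
  init_date,
  d.days,
  d.freq,
  d.pct
FROM nested, UNNEST(data) AS d  -- one output row per array element
ORDER BY init_date, d.days
```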
【Comments】: