Big Query landing page figures not consistent with Google Analytics interface

【Posted】: 2017-04-20 13:26:50 【Question】:

I am using BigQuery to report on Google Analytics data, and I am trying to recreate the landing page figures in BigQuery.

The following query reports 18% fewer sessions than the Google Analytics interface:

SELECT DISTINCT
  fullVisitorId,
  visitID,
  h.page.pagePath AS LandingPage
FROM
  `project-name.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE 
  hitNumber = 1
AND h.type = 'PAGE'
AND _TABLE_SUFFIX BETWEEN '20170331' AND '20170331'
ORDER BY fullVisitorId DESC

What is wrong with my approach? Why can't I get to within a small margin of the figures reported in the GA interface?

【Answer 1】:

There are several reasons:

1. The BigQuery equivalent of the landing page report (legacy SQL):

SELECT
  LandingPage,
  COUNT(sessionId) AS Sessions,
  100 * SUM(totals.bounces)/COUNT(sessionId) AS BounceRate,
  AVG(totals.pageviews) AS AvgPageviews,
  SUM(totals.timeOnSite)/COUNT(sessionId) AS AvgTimeOnSite,
from(
  SELECT
    CONCAT(fullVisitorId,STRING(visitId)) AS sessionID,
    totals.bounces,
    totals.pageviews,
    totals.timeOnSite,
    hits.page.pagePath AS landingPage
  FROM (
    SELECT
      fullVisitorId,
      visitId,
      hits.page.pagePath,
      totals.bounces,
      totals.pageviews,
      totals.timeOnSite,
      MIN(hits.hitNumber) WITHIN RECORD AS firstHit,
      hits.hitNumber AS hitNumber
    FROM (TABLE_DATE_RANGE ([XXXYYYZZZ.ga_sessions_],TIMESTAMP('2016-08-01'), TIMESTAMP ('2016-08-31')))
    WHERE
      hits.type = 'PAGE'
      AND hits.page.pagePath <> '')
  WHERE
    hitNumber = firstHit)
GROUP BY
  LandingPage
ORDER BY
  Sessions DESC,
  LandingPage
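
To illustrate why filtering on `hitNumber = 1` (as in the question) can drop sessions, while taking the smallest hit number among PAGE hits (as in the query above) does not, here is a minimal Python sketch on made-up data. The session keys, paths, and helper names are hypothetical, and the code only mimics the filtering logic; it is not how GA itself computes the report.

```python
# Hypothetical sessions: each is a list of (hitNumber, type, pagePath) tuples.
sessions = {
    "A": [(1, "PAGE", "/home"), (2, "PAGE", "/pricing")],
    "B": [(1, "EVENT", None), (2, "PAGE", "/blog")],  # first hit is an event
    "C": [(1, "EVENT", None), (2, "EVENT", None)],    # no pageview at all
}


def landing_by_first_hit(hits):
    """Mimics the filter `hitNumber = 1 AND type = 'PAGE'`."""
    for number, hit_type, path in hits:
        if number == 1 and hit_type == "PAGE":
            return path
    return None


def landing_by_min_page_hit(hits):
    """Mimics picking the PAGE hit with the smallest hitNumber."""
    page_hits = [(number, path) for number, hit_type, path in hits if hit_type == "PAGE"]
    if not page_hits:
        return None
    return min(page_hits, key=lambda item: item[0])[1]


strict = {sid: landing_by_first_hit(hits) for sid, hits in sessions.items()}
relaxed = {sid: landing_by_min_page_hit(hits) for sid, hits in sessions.items()}

# Count the sessions that end up with a landing page under each rule.
strict_count = sum(1 for path in strict.values() if path is not None)
relaxed_count = sum(1 for path in relaxed.values() if path is not None)
print(strict_count, relaxed_count)
```

In this toy data, session B is lost under the strict rule because its first hit is an event, and session C has no landing page under either rule.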

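Next:

Pre-computed data: pre-aggregated tables

This is data that Google pre-computes to speed up the UI. Google does not specify when this happens, but it can happen at any time. These are known as pre-aggregated tables.

So if you compare the numbers in the GA UI with your BigQuery output, you will always find a discrepancy. Go ahead and rely on your BigQuery data.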

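【Discussion】:

Thanks for the reply @Tushar. If I understand correctly, my query only looked at hitNumber = 1, which is why it under-reported by 18%. It needs to account for cases where the first hit is not numbered 1, hence the MIN function. Also, even then, the GA interface sacrifices some accuracy in order to scale. Based on my site's data, running the query above suggests the inaccuracy can be as high as 6%. Does that sound right? No, I have to figure out how to rewrite your query in standard SQL, but that might be another question!

@goose I'm an analyst myself and work closely with Google. The acceptable discrepancy is 5-10%. But I won't stop you from writing and checking it yourself. If you have any concerns, let me know. Maybe I can help :)

Thanks again @Tushar - I'm not saying no, I just wanted to check that I had understood correctly. That's useful to know.

Oki, you wrote "No" after the question mark; it should be "Now" ;) Sorry for the confusion :D

Ah, my bad. Yes, a typo. Doh!

【Answer 2】: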

You can achieve the same thing simply by adding the following to your SELECT statement:

,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage

When I run something like the following, I get a 1-to-1 match with the GA UI, and it is more concise than the original answer:

SELECT DISTINCT
   a.landingpage
  ,COUNT(DISTINCT(a.sessionId)) sessions
  ,SUM(a.bounces) bounces
  ,AVG(a.avg_pages) avg_pages
  ,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
FROM
(
    SELECT DISTINCT 
       CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
      ,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
      ,totals.bounces bounces
      ,totals.timeonsite tos
      ,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
    FROM `tablename_*`
      WHERE _TABLE_SUFFIX >= '20180801'
       AND _TABLE_SUFFIX <= '20180808'
        AND totals.visits = 1   
) a
GROUP BY 1
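
As a rough check of the arithmetic in the outer query, the following Python sketch recomputes the same aggregates for a single landing page from made-up per-session rows. The rows and values are hypothetical, and `None` stands in for SQL NULL.

```python
# Hypothetical per-session rows for one landing page (None stands in for NULL).
rows = [
    {"session_id": "u1-1001", "bounces": 1, "tos": None, "pages": 1},
    {"session_id": "u2-1002", "bounces": None, "tos": 180, "pages": 3},
    {"session_id": "u3-1003", "bounces": None, "tos": 60, "pages": 2},
]

# COUNT(DISTINCT sessionId)
sessions = len({row["session_id"] for row in rows})

# SUM(bounces): SQL SUM skips NULLs, so None is treated as contributing nothing.
bounces = sum(row["bounces"] or 0 for row in rows)

# AVG(avg_pages): SQL AVG also skips NULLs.
page_values = [row["pages"] for row in rows if row["pages"] is not None]
avg_pages = sum(page_values) / len(page_values)

# (SUM(tos) / COUNT(DISTINCT sessionId)) / 60: seconds per session, in minutes.
total_tos = sum(row["tos"] or 0 for row in rows)
session_duration = (total_tos / sessions) / 60

print(sessions, bounces, avg_pages, session_duration)
```

Dividing the summed time on site by the session count, rather than using AVG on the column, keeps sessions with a NULL time on site in the denominator.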

【Answer 3】:

Here is another way! You can get the same numbers:

SELECT
  LandingPage,
  COUNT(DISTINCT sessionID) AS sessions
FROM (
  SELECT
    CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS sessionID,
    FIRST_VALUE(hits.page.pagePath) OVER (
      PARTITION BY CONCAT(fullVisitorId, CAST(visitId AS STRING))
      ORDER BY hits.hitNumber ASC
    ) AS LandingPage
  FROM
    `xxxxxxxx1.ga_sessions_*`,
    UNNEST(hits) AS hits
  WHERE
    _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND hits.type = 'PAGE'
  GROUP BY fullVisitorId, visitId, sessionID, hits.page.pagePath, hits.hitNumber
)
GROUP BY LandingPage
ORDER BY sessions DESC

【Answer 4】:

There is a hits.isEntrance field in the schema that can be used for this purpose. The example below shows yesterday's landing pages:

#standardSQL
select
  date,
  hits.page.pagePath as landingPage,
  sum(totals.visits) as visits,
  sum(totals.bounces) as bounces,
  sum(totals.transactions) as transactions
from
  `project.dataset.ga_sessions_*`,
  unnest(hits) as hits
where
  (_table_suffix
    between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
    and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
  and hits.isEntrance = True
  and totals.visits = 1 #avoid counting midnight-split sessions
group by
  1, 2
order by 3 desc

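However, there is still one source of discrepancy: sessions that have no landing page. If you check the Landing Pages report in GA, you will sometimes see a (not set) value.

To include these as well, you can do the following: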

with
landing_pages_set as (
  select
    concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
    hits.page.pagePath as virtualPagePath
  from
    `project.dataset.ga_sessions_*`,
    unnest(hits) as hits
  where
    (_table_suffix
      between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
      and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
    and totals.visits = 1 #avoid counting midnight-split sessions
    and hits.isEntrance = TRUE
  group by 1, 2
),

landing_pages_not_set as (
  select
    concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
    date,
    "(not set)" as virtualPagePath,
    count(distinct concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string))) as visits,
    sum(totals.bounces) as bounces,
    sum(totals.transactions) as transactions
  from
    `project.dataset.ga_sessions_*`
  where
    (_table_suffix
      between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
      and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
    and totals.visits = 1 #avoid counting midnight-split sessions
  group by 1, 2, 3
),

landing_pages as (
  select
    l.fullVisitId as fullVisitId,
    date,
    coalesce(r.virtualPagePath, l.virtualPagePath) as virtualPagePath,
    visits,
    bounces,
    transactions
  from
    landing_pages_not_set l left join landing_pages_set r on l.fullVisitId = r.fullVisitId
)

select virtualPagePath, sum(visits) from landing_pages group by 1 order by 2 desc
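
The join logic in the query above can be sketched in a few lines of Python: start from every session, look up its entrance page, and fall back to "(not set)" when there is none. The session keys and paths below are made up for illustration.

```python
from collections import Counter

# Hypothetical data: every session key, and the subset that has an entrance page.
all_sessions = ["s1", "s2", "s3", "s4"]
entrance_page = {"s1": "/home", "s2": "/home", "s4": "/blog"}  # s3 has none

# LEFT JOIN plus COALESCE: keep every session, defaulting to "(not set)".
landing = {sid: entrance_page.get(sid, "(not set)") for sid in all_sessions}

# GROUP BY landing page, ordered by session count descending.
visits_per_page = Counter(landing.values())
for page, visits in visits_per_page.most_common():
    print(page, visits)
```

Because the lookup starts from the full session list, the per-page counts always add up to the total number of sessions.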
