Flattening event data in BigQuery with one query

[Title]: Flattening event data in big query with one query [Posted]: 2013-06-28 12:08:40 [Question]:

We have over 100 million rows of analytics data in BigQuery. Each record is an event attached to an ID.

Simplified:

ID  EventId  Timestamp

Is it possible to flatten this into a table with rows like the following:

ID timestamp-period event1 event2 event3 event4

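where each event column holds the number of times that event occurred for that ID within the time period?

So far I have managed to do this on a small data set with 2 queries: one that creates rows holding the count for a single event ID, and another that flattens those rows into one row afterwards. The reason I haven't been able to do this across the whole data set is that BigQuery runs out of resources, and I'm not entirely sure why.

The first of the two queries looks like this: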

SELECT 
VideoId,
date_1,
IF(EventId = 1, INTEGER(count), 0) AS user_play,
IF(EventId = 2, INTEGER(count), 0) AS auto_play,
IF(EventId = 3, INTEGER(count), 0) AS pause,
IF(EventId = 4, INTEGER(count), 0) AS replay,
IF(EventId = 5, INTEGER(count), 0) AS stop,
IF(EventId = 6, INTEGER(count), 0) AS seek,
IF(EventId = 7, INTEGER(count), 0) AS resume,
IF(EventId = 11, INTEGER(count), 0) AS progress_25,
IF(EventId = 12, INTEGER(count), 0) AS progress_50,
IF(EventId = 13, INTEGER(count), 0) AS progress_75,
IF(EventId = 14, INTEGER(count), 0) AS progress_90,
IF(EventId = 15, INTEGER(count), 0) AS data_loaded,
IF(EventId = 16, INTEGER(count), 0) AS playback_complete,
IF(EventId = 30, INTEGER(count), 0) AS object_click,
IF(EventId = 31, INTEGER(count), 0) AS object_rollover,
IF(EventId = 32, INTEGER(count), 0) AS object_clickthrough,
IF(EventId = 33, INTEGER(count), 0) AS object_shown,
IF(EventId = 34, INTEGER(count), 0) AS object_close,
IF(EventId = 40, INTEGER(count), 0) AS logo_clickthrough,
IF(EventId = 41, INTEGER(count), 0) AS endframe_clickthrough,
IF(EventId = 42, INTEGER(count), 0) AS startframe_clickthrough,
IF(EventId = 61, INTEGER(count), 0) AS share_facebook,
IF(EventId = 62, INTEGER(count), 0) AS share_twitter,
IF(EventId = 63, INTEGER(count), 0) AS open_social_panel,
IF(EventId = 70, INTEGER(count), 0) AS embed_code_requested,
IF(EventId = 80, INTEGER(count), 0) AS player_impression,
IF(EventId = 81, INTEGER(count), 0) AS player_loaded,
IF(EventId = 90, INTEGER(count), 0) AS html5_impression,
IF(EventId = 91, INTEGER(count), 0) AS html5_load,
IF(EventId = 95, INTEGER(count), 0) AS fallback_impression,
IF(EventId = 96, INTEGER(count), 0) AS fallback_load,
IF(EventId = 152, INTEGER(count), 0) AS object_impression,
IF(EventId = 200, INTEGER(count), 0) AS ping,
IF(EventId = 250, INTEGER(count), 0) AS facebook_clickthrough,
IF(EventId = 251, INTEGER(count), 0) AS twitter_clickthrough,
IF(EventId = 252, INTEGER(count), 0) AS other_clickthrough,
IF(EventId = 253, INTEGER(count), 0) AS qr_clickthrough,
IF(EventId = 254, INTEGER(count), 0) AS banner_clickthrough,
IF(EventId = 280, INTEGER(count), 0) AS banner_impression,
IF(EventId = 281, INTEGER(count), 0) AS banner_loaded,
IF(EventId = 282, INTEGER(count), 0) AS banner_data_loaded,
IF(EventId = 284, INTEGER(count), 0) AS banner_forward,
IF(EventId = 285, INTEGER(count), 0) AS banner_back,
IF(EventId = 300, INTEGER(count), 0) AS mobile_preview_loaded,
IF(EventId = 301, INTEGER(count), 0) AS mobile_preview_clickthrough,
IF(EventId = 302, INTEGER(count), 0) AS mobile_preview_clickthrough_back,
IF(EventId = 310, INTEGER(count), 0) AS product_search_click,
IF(EventId = 311, INTEGER(count), 0) AS promo_code_click,
IF(EventId = 320, INTEGER(count), 0) AS player_share_facebook,
IF(EventId = 321, INTEGER(count), 0) AS player_share_twitter,
IF(EventId = 322, INTEGER(count), 0) AS player_share_googleplus,
IF(EventId = 323, INTEGER(count), 0) AS player_share_email,
IF(EventId = 324, INTEGER(count), 0) AS player_share_embed,
IF(EventId = 401, INTEGER(count), 0) AS youtube_error_2,
IF(EventId = 402, INTEGER(count), 0) AS youtube_error_100,
IF(EventId = 403, INTEGER(count), 0) AS youtube_error_101,
FROM
(
SELECT 
  VideoId, EventId, count(*) as count, Date(timestamp) as date_1  
FROM [data.data_1]
GROUP EACH BY VideoId, EventId, date_1
)
ORDER BY data_loaded DESC;

Then a simple GROUP BY on the ID and the date creates the fully aggregated table.
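The two steps described above can be sketched locally. This is only an illustration, not the asker's actual setup: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, a hypothetical `events` table with made-up rows, and just two of the event columns (`user_play` and `pause`). SQLite has no `GROUP EACH BY` or `IF()`, so plain `GROUP BY` and `CASE` are used instead.

```python
import sqlite3

# Hypothetical sample rows: (VideoId, EventId, timestamp)
rows = [
    ("v1", 1, "2013-06-27 10:00:00"),
    ("v1", 1, "2013-06-27 11:30:00"),
    ("v1", 3, "2013-06-27 11:31:00"),
    ("v1", 1, "2013-06-28 09:00:00"),
    ("v2", 3, "2013-06-27 12:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (VideoId TEXT, EventId INTEGER, timestamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Step 1: one row per (VideoId, EventId, day); the count goes into the
# column that matches the event and every other column is 0.
conn.execute("""
    CREATE TABLE per_event AS
    SELECT VideoId, date_1,
           CASE WHEN EventId = 1 THEN cnt ELSE 0 END AS user_play,
           CASE WHEN EventId = 3 THEN cnt ELSE 0 END AS pause
    FROM (
        SELECT VideoId, EventId, DATE(timestamp) AS date_1, COUNT(*) AS cnt
        FROM events
        GROUP BY VideoId, EventId, DATE(timestamp)
    )
""")

# Step 2: collapse the per-event rows into one row per (VideoId, day).
flattened = conn.execute("""
    SELECT VideoId, date_1, SUM(user_play), SUM(pause)
    FROM per_event
    GROUP BY VideoId, date_1
""").fetchall()

for row in sorted(flattened):
    print(row)
```

Each output row has the shape `ID, day, event1, event2`, which is the flattened layout asked about at the top of the question.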

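Am I going about this the right way and simply need to run it on a smaller partition of the data set, or is there a better, more efficient way to do this kind of aggregation with BigQuery?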

Thanks in advance, Mats

[Comments]:

[Answer 1]:

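My guess is that you're running out of resources because of the ORDER BY at the end. Everything else should be able to run in parallel. Also note that if you remove the ORDER BY, you'll be able to use the "allow large results" flag and write out a large result table (if the results are larger than 128MB).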
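To make the advice concrete, here is a minimal sketch that leaves the ORDER BY out of the aggregation and sorts the much smaller result afterwards. It also folds the two steps into a single pass with conditional sums, which is a variation of my own and not something the question or this answer spells out. As assumptions: it runs on Python's sqlite3 rather than BigQuery, the `events` table and its rows are hypothetical, and only two event columns are shown. In the legacy BigQuery syntax used in the question, the analogous expression would presumably be `SUM(IF(EventId = 1, 1, 0))`, but that is untested here.

```python
import sqlite3

# Hypothetical sample rows: (VideoId, EventId, timestamp)
rows = [
    ("v1", 1, "2013-06-27 10:00:00"),
    ("v1", 1, "2013-06-27 11:30:00"),
    ("v1", 3, "2013-06-27 11:31:00"),
    ("v1", 1, "2013-06-28 09:00:00"),
    ("v2", 3, "2013-06-27 12:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (VideoId TEXT, EventId INTEGER, timestamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Single pass: each event column is a conditional sum, and there is
# no ORDER BY in the aggregation query itself.
one_pass = conn.execute("""
    SELECT VideoId,
           DATE(timestamp) AS date_1,
           SUM(CASE WHEN EventId = 1 THEN 1 ELSE 0 END) AS user_play,
           SUM(CASE WHEN EventId = 3 THEN 1 ELSE 0 END) AS pause
    FROM events
    GROUP BY VideoId, DATE(timestamp)
""").fetchall()

# If an ordering is still wanted, apply it to the aggregated rows,
# which are far fewer than the raw events.
by_plays = sorted(one_pass, key=lambda r: r[2], reverse=True)
print(by_plays[0])
```

Sorting the aggregated rows in a later step keeps the expensive part of the job free of a global sort, which is the point made above about the final ORDER BY.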

[Comments]:

Worked like a charm. Thanks a lot.
