Aggregation on a JSON column
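【Title】: aggregation on json column 【Posted】: 2015-12-18 13:17:02 【Question】: I have a table with a string column that holds a collection of JSON objects. Assume the objects are words.
I want to aggregate and select the most popular words (as in the classic map-reduce word count example). The data is not stored as a nested record in BigQuery. I know I need to use JSON_EXTRACT.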
For example:
userid  words
123     {"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}
456     {"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}
123     {"totalItems":1,"items":[{"word":"drink"}]}
The result should be:
3  drink
2  food
1  dog
If I group by user, it would be:
userid  count  word
123     2      drink
123     1      food
456     1      food
... and so on.
Thanks in advance.
【Comments on the question】:
【Answer 1】: By user and word:
SELECT id, word, COUNT(1) AS cnt FROM (
  SELECT id, REGEXP_EXTRACT(item, r':"(\w+)"') AS word
  FROM (
    SELECT id, SPLIT(JSON_EXTRACT(items, '$.items')) AS item
    FROM
      (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items),
      (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items),
      (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items)
  )
)
GROUP BY id, word
By word only:
SELECT word, COUNT(1) AS cnt FROM (
  SELECT REGEXP_EXTRACT(item, r':"(\w+)"') AS word
  FROM (
    SELECT SPLIT(JSON_EXTRACT(items, '$.items')) AS item
    FROM
      (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items),
      (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items),
      (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items)
  )
)
GROUP BY word
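To see what the SPLIT and REGEXP_EXTRACT steps do to the array string that JSON_EXTRACT returns, the same pipeline can be mimicked outside BigQuery. The following is only a minimal JavaScript sketch, not BigQuery code: the helper names `extractWords` and `countWords` are hypothetical, and it assumes, as the query does, that every word matches `\w+` and contains no commas.

```javascript
// Sample rows taken from the question.
const rows = [
  { id: 123, items: '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' },
  { id: 456, items: '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' },
  { id: 123, items: '{"totalItems":1,"items":[{"word":"drink"}]}' },
];

// Hypothetical helper: imitates SPLIT(JSON_EXTRACT(items, '$.items'))
// followed by REGEXP_EXTRACT(item, r':"(\w+)"').
function extractWords(json) {
  // Stand-in for JSON_EXTRACT: the "items" array serialized back to a string.
  const arrayText = JSON.stringify(JSON.parse(json).items);
  return arrayText
    .split(",")                              // SPLIT on the default comma delimiter
    .map((piece) => piece.match(/:"(\w+)"/)) // REGEXP_EXTRACT with one capture group
    .filter((m) => m !== null)
    .map((m) => m[1]);
}

// Hypothetical helper: imitates GROUP BY word with COUNT(1).
function countWords(inputRows) {
  const counts = new Map();
  for (const row of inputRows) {
    for (const word of extractWords(row.items)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

const wordCounts = countWords(rows);
console.log([...wordCounts.entries()]);
```

For the sample rows this counts drink three times, food twice and dog once, which matches the result the question asks for.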
【Comments】:
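【Answer 2】: Mikhail's answer is good! Note that because the JSON_EXTRACT function doesn't handle arrays well, some adjustments were needed, and those were done with SPLIT and REGEXP_EXTRACT.
If you want an alternative, here is one that uses a BigQuery JavaScript UDF: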
SELECT userid, word, COUNT(*) c
FROM (
  SELECT * FROM
  js(
    // I wish you had given me a sample table instead when asking the question
    (SELECT * FROM
      (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items),
      (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items),
      (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items)
    ),
    // Input columns.
    id, items,
    // Output schema.
    "[{name: 'word', type:'string'},
      {name: 'userid', type:'integer'}]",
    // The function.
    "function(r, emit) {
      var x = JSON.parse(r.items);
      x.items.forEach(function(entry) {
        emit({word: entry.word, userid: r.id});
      });
    }"
  )
)
GROUP BY 1,2
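The function passed to js() can be tried out locally before running the query. The sketch below is an assumption-laden stand-in, not part of BigQuery: `wordsPerUser` mirrors the UDF's function, while `runLocally` and its grouping step are hypothetical helpers that imitate the emit callback and the outer GROUP BY 1,2.

```javascript
// Sample rows taken from the question.
const sampleRows = [
  { id: 123, items: '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' },
  { id: 456, items: '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' },
  { id: 123, items: '{"totalItems":1,"items":[{"word":"drink"}]}' },
];

// Mirrors the UDF: parse the JSON string and emit one output row per word.
function wordsPerUser(r, emit) {
  const x = JSON.parse(r.items);
  x.items.forEach(function (entry) {
    emit({ word: entry.word, userid: r.id });
  });
}

// Hypothetical local driver: collects the emitted rows, then counts them
// per (userid, word) pair, like COUNT(*) ... GROUP BY 1,2 in the query.
function runLocally(inputRows) {
  const emitted = [];
  inputRows.forEach((r) => wordsPerUser(r, (out) => emitted.push(out)));

  const groups = new Map();
  for (const { userid, word } of emitted) {
    const key = `${userid}|${word}`;
    groups.set(key, (groups.get(key) || 0) + 1);
  }
  return groups;
}

const perUserCounts = runLocally(sampleRows);
console.log([...perUserCounts.entries()]);
```

With the sample rows, user 123 ends up with drink counted twice and food once, and user 456 with food, dog and drink once each.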
【Comments】:
Both look good. Felipe, just to be clear, you wrote that REGEXP_EXTRACT and SPLIT required adjustments? What are they? And in any case, would a JS UDF work better? Thanks a lot.
I mean, look at the REGEXP_EXTRACT(item, r':"(\w+)"') that Mikhail's query needs. I would probably do the same, unless the strings in "word" are more than simple words. Then I would use a UDF.
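The trade-off mentioned in the last comment can be illustrated with words that are not simple `\w+` tokens. This is a rough JavaScript sketch run outside BigQuery; the input string and the helper names `wordsByRegex` and `wordsByParse` are made up for illustration, and a failed regex match is represented as null, as a stand-in for REGEXP_EXTRACT finding nothing.

```javascript
// Hypothetical input: words containing a space and a hyphen.
const tricky = '{"totalItems":2,"items":[{"word":"ice cream"},{"word":"hot-dog"}]}';

// Split-and-regex approach, as in the first answer.
function wordsByRegex(json) {
  const arrayText = JSON.stringify(JSON.parse(json).items);
  return arrayText.split(",").map((piece) => {
    const m = piece.match(/:"(\w+)"/);
    return m ? m[1] : null; // null when the pattern does not match
  });
}

// Parse-based approach, as in the UDF from the second answer.
function wordsByParse(json) {
  return JSON.parse(json).items.map((entry) => entry.word);
}

const regexResult = wordsByRegex(tricky);
const parseResult = wordsByParse(tricky);
console.log(regexResult, parseResult);
```

Here the regex finds no match for either word because `\w+` stops at the space and at the hyphen, while the parse-based version returns both strings intact.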