How to UNNEST multiple arrays in BigQuery?
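Posted: 2018-05-31 04:32:53
Question:
I have JSON data stored in a BigQuery table in three fields: token, questions, and answers.
token: STRING, questions: STRING, answers: STRING
questions and answers are stored as STRING because their contents are dynamic.
The token field holds a single value.
The questions field holds a dictionary object whose "fields" key is a list of 3 questions.
The answers field holds a list with the answers to those 3 questions; the id is used to match each question to its answer. Below is the JSON file downloaded from BigQuery: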
token questions answers
18e6d8e445 {"fields": [{"id": "L39FyvUohKDV", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}, {"id": "krs82KgxHwGb", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "lBzHtCuzHFM4", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}], "id": "SdzXVn", "title": "Google Shopping 5/4/18"} [{"field": {"id": "L39FyvUohKDV", "type": "short_text"}, "text": "t", "type": "text"}, {"field": {"id": "krs82KgxHwGb", "type": "opinion_scale"}, "number": 10, "type": "number"}, {"email": "t@t.com", "field": {"id": "lBzHtCuzHFM4", "type": "email"}, "type": "email"}]
949b2c57e3 {"fields": [{"id": "krs82KgxHwGb", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "lBzHtCuzHFM4", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}, {"id": "L39FyvUohKDV", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}], "id": "SdzXVn", "title": "Google Shopping 5/4/18"} [{"field": {"id": "krs82KgxHwGb", "type": "opinion_scale"}, "number": 10, "type": "number"}, {"email": "someone@mail.com", "field": {"id": "lBzHtCuzHFM4", "type": "email"}, "type": "email"}, {"field": {"id": "L39FyvUohKDV", "type": "short_text"}, "text": "they were awesome", "type": "text"}]
146c49cdd6 {"fields": [{"id": "CxhfK22a3XWE", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}, {"id": "oUZxPRaKjmFr", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "zUIP73oXpLD6", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}], "id": "kaiAsx", "title": "a - b"} [{"field": {"id": "CxhfK22a3XWE", "type": "short_text"}, "text": "nice", "type": "text"}, {"field": {"id": "oUZxPRaKjmFr", "type": "opinion_scale"}, "number": 2, "type": "number"}, {"email": "foo@bar.com", "field": {"id": "zUIP73oXpLD6", "type": "email"}, "type": "email"}]
@mikhail-berlyant provided the query below, which gets me very close to what I expect. The only problem is that I cannot get the answer itself.
SELECT distinct token, id, title AS question,
JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.type') answer_type
--REPLACE(REGEXP_EXTRACT(b, r'"type":".+?"\s*,\s*".+?":(.+)'), '"', '') answer
FROM `v1-dev-main.typeform.responses`,
UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(definition, '$.fields'), r'"title":"(.+?)"')) title WITH OFFSET pos1,
UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(definition, '$.fields'), r'"id":"(.+?)"')) id WITH OFFSET pos2,
UNNEST(REGEXP_EXTRACT_ALL(answers, r'"field": {(.+?)}')) a WITH OFFSET pos3
--UNNEST(REGEXP_EXTRACT_ALL(answers, r'{(.+?),\s*"field":{.+?}')) b WITH OFFSET pos4
WHERE pos1 = pos2
--AND pos3 = pos4
AND id = JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.id')
Here is the result of the query above:
token id question answer_type
146c43c81cd5780839d3cdd6 zUIP73oXpLD6 Your email address email
146c493c1cd5780839d3cdd6 oUZxPRaKjmFr How useful is this feature turning out to be for you? opinion_scale
146c493c05d5780839d3cdd6 CxhfK22a3XWE What did you think about the changes? short_text
18e6d8e33df44a1aa451b445 lBzHtCuzHFM4 Your email address email
18e6d8e33df44a1aa451b445 L39FyvUohKDV What did you think about the changes? short_text
18e6d0fa014bfa1aa451b445 krs82KgxHwGb How useful is this feature turning out to be for you? opinion_scale
a63b20df691c9a949b2c57e3 krs82KgxHwGb How useful is this feature turning out to be for you? opinion_scale
a63b20df691c9a949b2c57e3 lBzHtCuzHFM4 Your email address email
a63b258ce0339a949b2c57e3 L39FyvUohKDV What did you think about the changes? short_text
Now the only thing I am missing is the answer.
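Answer 1:
The example below is for BigQuery Standard SQL. It makes some assumptions about your data based on how those JSON strings are formatted, so the regular expressions will most likely need some adjustment. It does work with the dummy data below: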
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 12345 token,
  '''{"fields": [
    {"id":"1","title":"Question 1?"},
    {"id":"2","title":"Questions 2?"},
    {"id":"3","title":"Question 3?"}
  ]}''' questions,
  '''[
    {"type":"text", "text":"answer 1", "field":{"id":"1", "type":"short_text"}},
    {"type":"number", "number":42, "field":{"id":"2", "type":"opinion_scale"}},
    {"type":"email", "email":"an_account@example.com", "field":{"id":"3", "type":"email"}}
  ]''' answers
)
SELECT token, id, title AS question,
  -- a holds the inside of each "field" object; re-wrap it in braces to parse it as JSON
  JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.type') answer_type,
  -- b holds the keys that precede "field"; take the value of the key after "type"
  REPLACE(REGEXP_EXTRACT(b, r'"type":".+?"\s*,\s*".+?":(.+)'), '"', '') answer
FROM `project.dataset.table`,
  UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(questions, '$.fields'), r'"title":"(.+?)"')) title WITH OFFSET pos1,
  UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(questions, '$.fields'), r'"id":"(.+?)"')) id WITH OFFSET pos2,
  UNNEST(REGEXP_EXTRACT_ALL(answers, r'"field":{(.+?)}')) a WITH OFFSET pos3,
  UNNEST(REGEXP_EXTRACT_ALL(answers, r'{(.+?),\s*"field":{.+?}')) b WITH OFFSET pos4
WHERE pos1 = pos2   -- pair each title with the id at the same position
  AND pos3 = pos4   -- pair each "field" object with the rest of its answer element
  AND id = JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.id')   -- match question to answer by id
The result is:
Row token id question answer_type answer
1 12345 1 Question 1? short_text answer 1
2 12345 2 Questions 2? opinion_scale 42
3 12345 3 Question 3? email an_account@example.com
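The query above relies on `WITH OFFSET` to line up elements that were extracted into separate arrays. As a minimal sketch of just that pairing step, assuming two hypothetical literal arrays (not taken from the question's table), the same idea in isolation looks like this:

```sql
#standardSQL
-- Illustrative arrays only; in the answer they come from REGEXP_EXTRACT_ALL.
SELECT id, title
FROM UNNEST(['1', '2', '3']) AS id WITH OFFSET pos1,
     UNNEST(['Question 1?', 'Questions 2?', 'Question 3?']) AS title WITH OFFSET pos2
-- Without this filter the two UNNESTs form a cross join (3 x 3 rows);
-- keeping equal offsets leaves one row per array position.
WHERE pos1 = pos2
ORDER BY pos1
```

This pairing is only meaningful when both arrays list their elements in the same order, which is why the answer matches questions to answers by id rather than by position.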
Update based on the comments below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "12345" token, '{"fields": [{"id":"1","title":"Question 1?"},{"id":"2","title":"Questions 2?"},{"id":"3","title":"Question 3?"}]}' questions,'[ {"type":"text", "text":"answer 1", "field":{"id":"1", "type":"short_text"}},{"type":"number", "number":42, "field":{"id":"2", "type":"opinion_scale"}},{"type":"email", "email":"an_account@example.com", "field":{"id":"3", "type":"email"}}]' answers UNION ALL
  SELECT "18e6d8e33df440fa014bfa1aa451b445", '{"fields": [{"id": "L39FyvUohKDV", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}, {"id": "krs82KgxHwGb", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "lBzHtCuzHFM4", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}], "id": "SdzXVn", "title": "Google Shopping 5/4/18"}', '[{"field": {"id": "L39FyvUohKDV", "type": "short_text"}, "text": "t", "type": "text"}, {"field": {"id": "krs82KgxHwGb", "type": "opinion_scale"}, "number": 10, "type": "number"}, {"email": "t@t.com", "field": {"id": "lBzHtCuzHFM4", "type": "email"}, "type": "email"}]"' UNION ALL
  SELECT "a63b258ce03360df691c9a949b2c57e3", '{"fields": [{"id": "krs82KgxHwGb", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "lBzHtCuzHFM4", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}, {"id": "L39FyvUohKDV", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}], "id": "SdzXVn", "title": "Google Shopping 5/4/18"}', '[{"field": {"id": "krs82KgxHwGb", "type": "opinion_scale"}, "number": 10, "type": "number"}, {"email": "someone@mail.com", "field": {"id": "lBzHtCuzHFM4", "type": "email"}, "type": "email"}, {"field": {"id": "L39FyvUohKDV", "type": "short_text"}, "text": "they were awesome", "type": "text"}]"' UNION ALL
  SELECT "146c493c051a0a481cd5780839d3cdd6", '{"fields": [{"id": "CxhfK22a3XWE", "properties": {}, "ref": "d8834652-3acf-4541-8354-1e3dcd716667", "title": "What did you think about the changes?", "type": "short_text"}, {"id": "oUZxPRaKjmFr", "properties": {}, "ref": "5b6e6796-635b-4595-9404-e81617d4540b", "title": "How useful is this feature turning out to be for you?", "type": "opinion_scale"}, {"id": "zUIP73oXpLD6", "properties": {}, "ref": "b76be913-19b9-4b8a-b2ac-3fb645a65a5c", "title": "Your email address", "type": "email"}], "id": "kaiAsx", "title": "a - b"}', '[{"field": {"id": "CxhfK22a3XWE", "type": "short_text"}, "text": "nice", "type": "text"}, {"field": {"id": "oUZxPRaKjmFr", "type": "opinion_scale"}, "number": 2, "type": "number"}, {"email": "foo@bar.com", "field": {"id": "zUIP73oXpLD6", "type": "email"}, "type": "email"}]"'
)
SELECT token, id, title AS question,
  JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.type') answer_type,
  -- the answer key depends on the answer type, so take whichever one is present
  COALESCE(JSON_EXTRACT_SCALAR(b, '$.text'),JSON_EXTRACT_SCALAR(b, '$.number'),JSON_EXTRACT_SCALAR(b, '$.email')) AS answer
FROM `project.dataset.table`,
  UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(questions, '$.fields'), r'"title":\s*"(.+?)"')) title WITH OFFSET pos1,
  UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(questions, '$.fields'), r'"id":\s*"(.+?)"')) id WITH OFFSET pos2,
  UNNEST(REGEXP_EXTRACT_ALL(answers, r'"field":\s*{(.+?)}')) a WITH OFFSET pos3,
  -- blank out the nested "field" object so each answer element is a flat JSON object, whatever the key order
  UNNEST(REGEXP_EXTRACT_ALL(REGEXP_REPLACE(answers, r'"field":\s*{.+?}', '"field": ""'), r'{.+?}')) b WITH OFFSET pos4
WHERE pos1 = pos2
  AND pos3 = pos4
  AND id = JSON_EXTRACT_SCALAR(CONCAT('{',a,'}'), '$.id')
The output is:
Row token id question answer_type answer
1 12345 1 Question 1? short_text answer 1
2 12345 2 Questions 2? opinion_scale 42
3 12345 3 Question 3? email an_account@example.com
4 18e6d8e33df440fa014bfa1aa451b445 L39FyvUohKDV What did you think about the changes? short_text t
5 18e6d8e33df440fa014bfa1aa451b445 krs82KgxHwGb How useful is this feature turning out to be for you? opinion_scale 10
6 18e6d8e33df440fa014bfa1aa451b445 lBzHtCuzHFM4 Your email address email t@t.com
7 a63b258ce03360df691c9a949b2c57e3 krs82KgxHwGb How useful is this feature turning out to be for you? opinion_scale 10
8 a63b258ce03360df691c9a949b2c57e3 lBzHtCuzHFM4 Your email address email someone@mail.com
9 a63b258ce03360df691c9a949b2c57e3 L39FyvUohKDV What did you think about the changes? short_text they were awesome
10 146c493c051a0a481cd5780839d3cdd6 CxhfK22a3XWE What did you think about the changes? short_text nice
11 146c493c051a0a481cd5780839d3cdd6 oUZxPRaKjmFr How useful is this feature turning out to be for you? opinion_scale 2
12 146c493c051a0a481cd5780839d3cdd6 zUIP73oXpLD6 Your email address email foo@bar.com
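As an alternative to the regular expressions, the sketch below uses `JSON_EXTRACT_ARRAY` to split each JSON array into its elements before unnesting. This is not part of the original answer. It assumes that the stored strings are valid JSON, that `JSON_EXTRACT_ARRAY` is available in your BigQuery environment, and that an answer value always sits under one of the keys `text`, `number`, or `email`. The `sample` CTE and its rows are made up for illustration.

```sql
#standardSQL
-- Illustrative data only; the second row of answers lists its keys in a different order.
WITH sample AS (
  SELECT
    '12345' AS token,
    '{"fields": [{"id": "1", "title": "Question 1?"}, {"id": "2", "title": "Questions 2?"}]}' AS questions,
    '[{"type": "number", "number": 42, "field": {"id": "2", "type": "opinion_scale"}}, {"field": {"id": "1", "type": "short_text"}, "text": "answer 1", "type": "text"}]' AS answers
)
SELECT
  token,
  JSON_EXTRACT_SCALAR(q, '$.id') AS id,
  JSON_EXTRACT_SCALAR(q, '$.title') AS question,
  JSON_EXTRACT_SCALAR(a, '$.field.type') AS answer_type,
  -- Only one of these keys is present per answer element.
  COALESCE(
    JSON_EXTRACT_SCALAR(a, '$.text'),
    JSON_EXTRACT_SCALAR(a, '$.number'),
    JSON_EXTRACT_SCALAR(a, '$.email')) AS answer
FROM sample,
  UNNEST(JSON_EXTRACT_ARRAY(questions, '$.fields')) AS q,  -- one row per question object
  UNNEST(JSON_EXTRACT_ARRAY(answers)) AS a                 -- one row per answer object
-- Join on the id itself, so neither array order nor key order matters.
WHERE JSON_EXTRACT_SCALAR(q, '$.id') = JSON_EXTRACT_SCALAR(a, '$.field.id')
```

Because each element is parsed as JSON rather than matched by pattern, this form does not depend on spacing or on where `field` appears inside an answer element.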
Comments:
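This is very helpful. I can see that it works for the sample JSON I added, but I have some problems with my real data. It works up to answer_type; I cannot get the answer. I believe that is because the key-value order in the answers list is not always the same. Is there a way to work around this? Thanks.
Please vote up if it helped. Meanwhile, to continue with this question, you need to provide a better example of your input data. As I mentioned in my answer: `the regular expressions will most likely need some adjustment. ... It does work with the dummy data below`
I did upvote right away when I first saw the reply. I will provide more data shortly.
So if that is the case, there should be extra logic that allows picking the needed key. For example, even if the order is random, the number of keys is always three: field, type, and a third one that depends on the type. If so, that would allow picking that key. Is that what you have in mind? Otherwise I see no way to extract the exact answer, as opposed to extracting the whole element with the corresponding id.
Sure. Glad I could help. That is what we are here for :o) One important thing to learn is how to ask a question properly so that you get answers faster and better and, most importantly, attract more users to answer, not only users like me who can read between the lines :o) See you in the next post.
Answer 2:
If you are sure of the lengths of the arrays, you can first ARRAY_CONCAT them and then UNNEST the concatenated version. It worked for me.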
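A minimal sketch of that suggestion, assuming two hypothetical arrays `first_arr` and `second_arr` (illustrative names and values, not columns from the question):

```sql
#standardSQL
-- Illustrative arrays; both must have the same element type for ARRAY_CONCAT.
WITH sample AS (
  SELECT ['a', 'b'] AS first_arr, ['c', 'd', 'e'] AS second_arr
)
SELECT
  item,
  pos,
  -- Knowing the length of the first array tells you which array an element came from.
  IF(pos < ARRAY_LENGTH(first_arr), 'first', 'second') AS origin
FROM sample,
  UNNEST(ARRAY_CONCAT(first_arr, second_arr)) AS item WITH OFFSET pos
ORDER BY pos
```

A single UNNEST over the concatenated array yields one row per element, so there is no cross join between the arrays to filter out afterwards.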