Google BigQuery SQL: Extract data from JSON (list and array) into columns
Posted: 2021-03-14 12:58:44
Question: I have a table with a JSON string column:
UserID json_string
100 [{"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}]
100 [{"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}]
100 [{"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}]
200 [{"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}]
200 [{"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}]
200 [{"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}]
In the end, I need to convert the strings into columns:
UserID ID value os_type amount created_at updated_at Type_name
100 77379513 35.4566 null 200 2020-08-16T14:48:27.611-04:00 2020-08-16T14:48:27.611-04:00 same
100 77379514 38.658 null 100 2020-08-16T14:48:27.611-04:00 2020-08-16T14:48:27.611-04:01 niko
100 77379515 40.569 null 150 2020-08-16T14:48:27.611-04:00 2020-08-16T14:48:27.611-04:02 koko
200 77378899 25.365 null 100 2020-09-16T14:48:27.611-04:01 2020-08-17T14:48:27.611-04:03 same
200 77378900 35.898 null 500 2020-09-16T14:48:27.611-04:02 2020-08-17T14:48:27.611-04:04 niko
200 77378901 41.258 null 400 2020-09-16T14:48:27.611-04:03 2020-08-17T14:48:27.611-04:05 koko
First I tried to extract the JSON from the list:
SELECT UserID, json_extract_array(json_string) as json_array
FROM `project.dataset.table`
Then I get a table like this:
UserID json_array
100 {"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}
100 {"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}
100 {"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}
200 {"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "same'}
200 {"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "niko'}
200 {"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "koko'}
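From this step I tried to use the JSON_EXTRACT_SCALAR function, but I get an error saying that this function does not work on arrays. So what is the correct way to extract the data into columns?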
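Comments:
Out of curiosity, why store this data as JSON? It looks like every entry has the same fields. Why not just create a table with real columns named after those fields? By the way, I wondered why the syntax highlighting made the line colors alternate, and I noticed that in one place you used ' instead of ". Keep in mind that these quote characters are not interchangeable in JSON. You must always use ".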
Answer 1:
The following should work for you:
select UserID,
json_extract_scalar(json, '$.id') as id,
json_extract_scalar(json, '$.value') as value,
json_extract_scalar(json, '$.os_type') as os_type,
json_extract_scalar(json, '$.amount') as amount,
json_extract_scalar(json, '$.created_at') as created_at,
json_extract_scalar(json, '$.updated_at') as updated_at,
json_extract_scalar(json, '$.Type_name') as Type_name
from `project.dataset.table`,
unnest(json_extract_array(json_string, '$')) json
If applied to the sample data from your question:
with `project.dataset.table` as (
select 100 UserID, '[{"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same"}]' json_string union all
select 100, '[{"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko"}]' union all
select 100, '[{"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko"}]' union all
select 200, '[{"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same"}]' union all
select 200, '[{"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko"}]' union all
select 200, '[{"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko"}]'
)
-- followed by the select statement shown above
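The output is one row per JSON object, with the columns UserID, id, value, os_type, amount, created_at, updated_at and Type_name.
Note: you used ' instead of " in a few places, so this is "fixed" in the sample data used above.
If you have no control over the values in this table and cannot fix ' to ", you can use the following instead: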
select UserID,
json_extract_scalar(json, '$.id') as id,
json_extract_scalar(json, '$.value') as value,
json_extract_scalar(json, '$.os_type') as os_type,
json_extract_scalar(json, '$.amount') as amount,
json_extract_scalar(json, '$.created_at') as created_at,
json_extract_scalar(json, '$.updated_at') as updated_at,
json_extract_scalar(json, '$.Type_name') as Type_name
from `project.dataset.table`,
unnest(json_extract_array(replace(json_string, "'", '"'), '$')) json
Note the change inside unnest, which takes care of the ' problem.