Google BigQuery SQL: Extract data from JSON (list and array) into columns

Posted: 2021-03-14 12:58:44

Question:

I have a table with a JSON string column:

UserID  json_string
100     [{"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}]
100     [{"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}]
100     [{"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}]
200     [{"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}]
200     [{"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}]
200     [{"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}]

In the end, I need to turn the strings into columns:

UserID  ID         value    os_type   amount    created_at                  updated_at                  Type_name
100    77379513    35.4566  null    200    2020-08-16T14:48:27.611-04:00    2020-08-16T14:48:27.611-04:00   same
100    77379514    38.658   null    100    2020-08-16T14:48:27.611-04:00    2020-08-16T14:48:27.611-04:01   niko
100    77379515    40.569   null    150    2020-08-16T14:48:27.611-04:00    2020-08-16T14:48:27.611-04:02   koko
200    77378899   25.365    null    100    2020-09-16T14:48:27.611-04:01    2020-08-17T14:48:27.611-04:03   same
200    77378900   35.898    null    500    2020-09-16T14:48:27.611-04:02    2020-08-17T14:48:27.611-04:04   niko
200    77378901   41.258    null    400    2020-09-16T14:48:27.611-04:03    2020-08-17T14:48:27.611-04:05   koko

First I tried to extract the JSON from the list:

SELECT UserID, json_extract_array(json_string) AS json_array
FROM `project.dataset.table`

That gives me a table like this:

UserID  json_array
100     {"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same'}
100     {"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko'}
100     {"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko'}
200     {"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "same'}
200     {"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "niko'}
200     {"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-09-16T14:48:27.611-04:00", "updated_at": "2020-08-17T14:48:27.836-04:00", "Type_name": "koko'}

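From this step I tried the JSON_EXTRACT_SCALAR function, but I got an error saying that the function does not work on arrays. So what is the correct way to extract the data into columns?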

Comments:

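Out of curiosity, why store this data as JSON? It looks like every entry has the same fields. Why not just create a table with real columns named after those fields?

By the way, I wondered why the row colors alternate under syntax highlighting, and noticed that in one place you used ' instead of ". Keep in mind that these quote characters are not interchangeable in JSON. You must always use ".

Answer 1: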

The following should work for you:

select UserID, 
  json_extract_scalar(json, '$.id') as id,
  json_extract_scalar(json, '$.value') as value,
  json_extract_scalar(json, '$.os_type') as os_type,
  json_extract_scalar(json, '$.amount') as amount,
  json_extract_scalar(json, '$.created_at') as created_at,
  json_extract_scalar(json, '$.updated_at') as updated_at,
  json_extract_scalar(json, '$.Type_name') as Type_name
from `project.dataset.table`,
unnest(json_extract_array(json_string, '$')) json       
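
JSON_EXTRACT_SCALAR returns every field as a STRING. If typed columns are wanted, the same extraction can be wrapped in casts. The following is only a sketch, not part of the original answer: the target types (INT64, NUMERIC, TIMESTAMP) are assumptions about what the fields represent, and the alias `item` is an illustrative name. SAFE_CAST is used so that a value that does not parse becomes NULL instead of failing the whole query.

```sql
-- Sketch: same extraction, with assumed target types per field.
SELECT
  UserID,
  SAFE_CAST(JSON_EXTRACT_SCALAR(item, '$.id') AS INT64) AS id,
  SAFE_CAST(JSON_EXTRACT_SCALAR(item, '$.value') AS NUMERIC) AS value,
  -- A JSON null comes back as SQL NULL, so no cast is needed here.
  JSON_EXTRACT_SCALAR(item, '$.os_type') AS os_type,
  SAFE_CAST(JSON_EXTRACT_SCALAR(item, '$.amount') AS INT64) AS amount,
  -- The ISO 8601 strings with a UTC offset cast directly to TIMESTAMP.
  SAFE_CAST(JSON_EXTRACT_SCALAR(item, '$.created_at') AS TIMESTAMP) AS created_at,
  SAFE_CAST(JSON_EXTRACT_SCALAR(item, '$.updated_at') AS TIMESTAMP) AS updated_at,
  JSON_EXTRACT_SCALAR(item, '$.Type_name') AS Type_name
FROM `project.dataset.table`,
UNNEST(JSON_EXTRACT_ARRAY(json_string, '$')) AS item
```

Note that TIMESTAMP values are stored in UTC, so the original -04:00 offset is not kept; leave those two columns as STRING if the offset itself matters.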

To try it on the sample data from your question, put this CTE in front of the same select:

with `project.dataset.table` as (
  select 100 UserID, '[{"id": 77379513, "value": "35.4566", "os_type": null, "amount": "200", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same"}]' json_string union all
  select 100, '[{"id": 77379514, "value": "38.658", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko"}]' union all
  select 100, '[{"id": 77379515, "value": "40.569", "os_type": null, "amount": "150", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko"}]' union all
  select 200, '[{"id": 77378899, "value": "25.365", "os_type": null, "amount": "100", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "same"}]' union all
  select 200, '[{"id": 77378900, "value": "35.898", "os_type": null, "amount": "500", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "niko"}]' union all
  select 200, '[{"id": 77378901, "value": "41.258", "os_type": null, "amount": "400", "created_at": "2020-08-16T14:48:27.611-04:00", "updated_at": "2020-08-16T14:48:27.836-04:00", "Type_name": "koko"}]'
)

The output has one row per JSON object, with each field in its own column.

Note: you used ' instead of " in a few places, so this has been "fixed" in the sample data used above.

If you have no control over the values in this table and cannot fix ' to ", you can use the following instead:

select UserID, 
  json_extract_scalar(json, '$.id') as id,
  json_extract_scalar(json, '$.value') as value,
  json_extract_scalar(json, '$.os_type') as os_type,
  json_extract_scalar(json, '$.amount') as amount,
  json_extract_scalar(json, '$.created_at') as created_at,
  json_extract_scalar(json, '$.updated_at') as updated_at,
  json_extract_scalar(json, '$.Type_name') as Type_name
from `project.dataset.table`,
unnest(json_extract_array(replace(json_string, "'", '"'), '$')) json 

Note the change inside unnest, which takes care of the ' problem.
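
The quote fix can be checked without touching the real table. The sketch below is an illustration rather than part of the original answer: it runs the same REPLACE on a single abbreviated row that still contains the stray ', with `sample` and `item` as made-up names. It also uses JSON_QUERY_ARRAY and JSON_VALUE, the newer BigQuery functions that, as far as I know, behave the same as JSON_EXTRACT_ARRAY and JSON_EXTRACT_SCALAR for simple paths like these.

```sql
-- Sketch: self-contained check of the quote fix on one abbreviated row.
-- `sample` and `item` are illustrative names, not from the original answer.
WITH sample AS (
  SELECT
    100 AS UserID,
    -- Triple-quoted literal, so the stray ' can appear unescaped.
    '''[{"id": 77379513, "value": "35.4566", "os_type": null, "Type_name": "same'}]''' AS json_string
)
SELECT
  UserID,
  JSON_VALUE(item, '$.id') AS id,
  JSON_VALUE(item, '$.value') AS value,
  JSON_VALUE(item, '$.Type_name') AS Type_name
FROM sample,
-- REPLACE turns every ' into " before the string is parsed as JSON.
UNNEST(JSON_QUERY_ARRAY(REPLACE(json_string, "'", '"'), '$')) AS item
```

One caveat with this approach: REPLACE rewrites every apostrophe, so a value that legitimately contains one (a name such as O'Brien, say) would end up with an extra " and make that row's JSON invalid.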
