Unnest a JSON stringified array in BigQuery

Posted: 2020-07-26 12:36:32

Question:

I have the following table in Google BigQuery:

+-------+---------+-------
| Name  | City    | items
+-------+---------+-------
| James | Dallas  | [{'text': 'pear', 'line_total_excl_vat': '24','product_id': 100}]
| John  | Chicago | [{'text': 'apple', 'line_total_excl_vat': '29','product_id': 200},{'text': 'banana', 'line_total_excl_vat': '34','product_id': 300}]
+-------+---------+-------

I am trying to achieve something like this:

+-------+---------+--------+---------------------+------------+
| Name  | City    | text   | line_total_excl_vat | product_id |
+-------+---------+--------+---------------------+------------+
| James | Dallas  | pear   | 24                  | 100        |
| John  | Chicago | apple  | 29                  | 200        |
| John  | Chicago | banana | 34                  | 300        |
+-------+---------+--------+---------------------+------------+

The "items" column is actually a string. Is there a way to unnest this data format and get the view I want in BigQuery? Thanks!

Comments:

Do you know the column names? If not, you can't do this with a simple select.

Yes, I know the column names.

Answer 1:

Below is for BigQuery Standard SQL

#standardSQL
SELECT Name, City, 
  JSON_EXTRACT_SCALAR(json, '$.text') AS text,
  JSON_EXTRACT_SCALAR(json, '$.line_total_excl_vat') AS line_total_excl_vat,
  JSON_EXTRACT_SCALAR(json, '$.product_id') AS product_id
FROM `project.dataset.table`,
UNNEST(JSON_EXTRACT_ARRAY(items,'$')) json   

You can test this with the sample data from your question, as in the example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'James' AS Name, 'Dallas' AS City, "[{'text': 'pear', 'line_total_excl_vat': '24','product_id': 100}]" AS items UNION ALL
  SELECT 'John', 'Chicago', "[{'text': 'apple', 'line_total_excl_vat': '29','product_id': 200},{'text': 'banana', 'line_total_excl_vat': '34','product_id': 300}]"
)
SELECT Name, City, 
  JSON_EXTRACT_SCALAR(json, '$.text') AS text,
  JSON_EXTRACT_SCALAR(json, '$.line_total_excl_vat') AS line_total_excl_vat,
  JSON_EXTRACT_SCALAR(json, '$.product_id') AS product_id
FROM `project.dataset.table`,
UNNEST(JSON_EXTRACT_ARRAY(items,'$')) json   

The output is

Row Name    City    text    line_total_excl_vat product_id   
1   James   Dallas  pear    24                  100  
2   John    Chicago apple   29                  200  
3   John    Chicago banana  34                  300  
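
One thing to keep in mind: JSON_EXTRACT_SCALAR returns STRING, so all three extracted columns in the query above are strings. If you need numbers, you can add a cast. Below is a minimal sketch, assuming line_total_excl_vat fits NUMERIC and product_id fits INT64 (both type choices are my assumptions, the question does not state them). SAFE_CAST is used so that a value that does not parse becomes NULL instead of failing the query.

```sql
#standardSQL
SELECT Name, City,
  JSON_EXTRACT_SCALAR(json, '$.text') AS text,
  # target types below are assumptions; adjust them to your data
  SAFE_CAST(JSON_EXTRACT_SCALAR(json, '$.line_total_excl_vat') AS NUMERIC) AS line_total_excl_vat,
  SAFE_CAST(JSON_EXTRACT_SCALAR(json, '$.product_id') AS INT64) AS product_id
FROM `project.dataset.table`,
UNNEST(JSON_EXTRACT_ARRAY(items, '$')) json
```

With typed columns you can aggregate or compare numerically without repeating the cast in every downstream query.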

Answer 2:

It takes a bit of fiddling with json_extract and json_extract_array combined with unnest()...

WITH t AS (
  SELECT 'James' as Name, 'Dallas' AS City, "[{'text': 'pear', 'line_total_excl_vat': '24','product_id': 100}]" AS items
  UNION ALL
  SELECT 'John', 'Chicago', "[{'text': 'apple', 'line_total_excl_vat': '29','product_id': 200},{'text': 'banana', 'line_total_excl_vat': '34','product_id': 300}]"
)

SELECT 
  # we'll unnest this array in the next statement and grab its elements
  JSON_EXTRACT_ARRAY(items,'$') as arr
  
  # unnest() turns the array into table format - the JSON function extracts fields from each row
  ,ARRAY(SELECT AS STRUCT
  
      JSON_EXTRACT_SCALAR(i,'$.text') as text,
      JSON_EXTRACT_SCALAR(i,'$.line_total_excl_vat') as line_total_excl_vat,
      JSON_EXTRACT_SCALAR(i,'$.product_id') as product_id
   
   FROM UNNEST(JSON_EXTRACT_ARRAY(items,'$')) as i 
   ) AS unnested_items
   ,* # original fields for reference
FROM t

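This creates a nested output that you can work with later (see the JSON representation of the output, it is clearer there). If you want to flatten the table, you can cross join this resulting array...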

WITH t AS (
#     Name    |  City  | items   |
  SELECT 'James' as Name, 'Dallas' AS City, "[{'text': 'pear', 'line_total_excl_vat': '24','product_id': 100}]" AS items
  UNION ALL
  SELECT 'John', 'Chicago', "[{'text': 'apple', 'line_total_excl_vat': '29','product_id': 200},{'text': 'banana', 'line_total_excl_vat': '34','product_id': 300}]"
)

SELECT 
   * 
FROM t CROSS JOIN UNNEST(ARRAY((SELECT AS STRUCT
  
      JSON_EXTRACT_SCALAR(i,'$.text') as text,
      JSON_EXTRACT_SCALAR(i,'$.line_total_excl_vat') as line_total_excl_vat,
      JSON_EXTRACT_SCALAR(i,'$.product_id') as product_id
   
   FROM UNNEST(JSON_EXTRACT_ARRAY(items,'$')) as i 
   )))
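
A caveat with the cross join: rows whose items array is empty or NULL drop out of the result entirely. If you want to keep them, a LEFT JOIN against the unnested array is one option. The sketch below is only an illustration: the 'Ann' row is a hypothetical addition that is not in the question, and the sample strings are written as strict double-quoted JSON. WITH OFFSET additionally exposes each item's position in the array.

```sql
#standardSQL
WITH t AS (
  SELECT 'James' AS Name, 'Dallas' AS City, '[{"text": "pear", "line_total_excl_vat": "24", "product_id": 100}]' AS items
  UNION ALL
  # hypothetical row with an empty array, added only to show the difference
  SELECT 'Ann', 'Boston', '[]'
)

SELECT
  Name, City,
  pos,  # zero-based position of the item within the array
  JSON_EXTRACT_SCALAR(i, '$.text') AS text,
  JSON_EXTRACT_SCALAR(i, '$.line_total_excl_vat') AS line_total_excl_vat,
  JSON_EXTRACT_SCALAR(i, '$.product_id') AS product_id
FROM t
# LEFT JOIN keeps rows of t even when the array has no elements
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(items, '$')) AS i WITH OFFSET AS pos
```

With the left join, the row for the empty array should still appear, with NULL in pos and in the extracted columns, whereas the cross join version would skip it.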
