How to unnest multiple arrays in a BigQuery field that is stored as a string?

【Posted】: 2019-01-21 19:46:52

【Question】:

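I need help unnesting a field that contains multiple arrays and in which the same field name is repeated (quantity appears both outside and inside the nested objects).

The dataset has 2 fields: order_id and cart. cart is a dictionary object that contains several lists, including lists inside the "items" list, but its data type is string. I would like the output to have a separate row for each product and category.

Sample data with a partially working query: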

#standardSQL
WITH t AS (
    SELECT "order1234" as order_id, ' "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [ "quantity" : 1 , "product" :  "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "pricing" :  "listPrice" :  "value" : 899 , "salePrice" :  "value" : 725, "imagedata" :  "imageLink" :  "_id" : "img1" , "createdOn" :  "$date" : "2019-01-19T19:55:19.782Z" , "revision" : 1 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : "var1" , "sku" :  "value" : "sku1" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ] , "Shipping" : true ,  "quantity" : 2 , "product" :  "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "pricing" :  "listPrice" :  "value" : 199 , "salePrice" :  "value" : 150, "imagedata" :  "imageLink" :  "_id" : "img2" , "createdOn" :  "$date" : "2019-01-19T19:58:11.484Z" ,  "revision" : 1 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : "var2" , "sku" :  "value" : "sku2" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ] , "Shipping" : true ,  "quantity" : 3 , "product" :  "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "pricing" :  "listPrice" :  "value" : 499 , "salePrice" :  "value" : 325, "imagedata" :  "imageLink" :  "_id" : "img3" , "createdOn" :  "$date" : "2019-01-15T05:34:17.556Z" , "revision" : 3 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : 
"var3" , "sku" :  "value" : "sku3" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ], "Shipping" : true ]' as cart
)
select order_id, quantity, product, JSON_EXTRACT_SCALAR(product,'$._id') as product_id, REPLACE(category_id, '"', '') category_id, 
JSON_EXTRACT_SCALAR(product,'$.pricing.listPrice.value') as product_list_price,
JSON_EXTRACT_SCALAR(product,'$.pricing.salePrice.value') as product_sale_price
from t,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"categoryIds" : \[(.+?)]')) categoryIds WITH OFFSET pos1,
UNNEST(SPLIT(categoryIds)) category_id,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"product" : (.*?)\}')) product WITH OFFSET pos2,
UNNEST(REGEXP_EXTRACT_ALL(cart, r'"quantity" : (.+?)')) quantity WITH OFFSET pos3
where pos1= pos2 and pos1 = pos3

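In the query above, the quantity field is incorrect, and product_list_price and product_sale_price are not displayed. Keep in mind that quantity is repeated inside the nested elements. I think my regexes are wrong: I somehow need to pick the first "quantity" within each of the "items", and for the prices, my product regex does not return the full product dictionary, which is why they come back as null. What is the correct regex to get the complete value of the product key, knowing that there may be multiple } inside it?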

Expected result:

order_id   quantity  product_id  category_id  product_list_price  product_sale_price
order1234  1         prod1       cat1         899                 725
order1234  1         prod1       cat2         899                 725
order1234  1         prod1       cat3         899                 725
order1234  2         prod2       cat2         199                 150
order1234  2         prod2       cat4         199                 150
order1234  3         prod3       cat2         499                 325
order1234  3         prod3       cat4         499                 325
order1234  3         prod3       cat5         499                 325

【Answer 1】:

"What is the correct regex to get the complete value of the product key, knowing that there may be multiple } inside it?"

Ideally, you should use JSON_EXTRACT here rather than REGEXP_EXTRACT, which overcomplicates things. Unfortunately, BigQuery's JSON_EXTRACT has some limitations that prevent it from processing JSON arrays.
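
To illustrate why the regex route gets complicated, here is a small sketch in plain JavaScript (run outside BigQuery). It applies a lazy pattern of the same shape as the one in the question, `"product" : (.*?)\}`, to a shortened, hypothetical cart item. The item string and variable names are assumptions made for the illustration, not the full sample data.

```javascript
// Hypothetical, shortened cart item with nested objects (not the full sample data).
const item = '{ "quantity" : 1 , "product" : { "_id" : "prod1" , "pricing" : { "listPrice" : { "value" : 899 } , "salePrice" : { "value" : 725 } } } }';

// Lazy match: the capture stops right before the FIRST closing brace it meets.
const lazy = /"product" : (.*?)\}/;
const captured = item.match(lazy)[1];
console.log(captured);

// The capture ends inside "listPrice", so it is not a complete JSON object.
let parsed = null;
try {
  parsed = JSON.parse(captured);
} catch (e) {
  console.log('capture is not valid JSON:', e.name);
}

// Parsing the whole document first keeps the nesting intact.
const product = JSON.parse(item).product;
console.log(product.pricing.salePrice.value);
```

Because a regular expression of this kind cannot count matching braces, the captured text is cut off at the first nested object, which is consistent with the prices coming back as null in the question.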

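To overcome this BigQuery JsonPath "limitation", you can use a custom function, as in the example below. It uses jsonpath-0.8.0.js, which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage, for example to gs://your_bucket/jsonpath-0.8.0.js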

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
        return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
SELECT order_id, quantity, product_id, category_id
FROM `project.dataset.table`,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3
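
The query above applies SPLIT to each category_ids element, so each matched categoryIds array has to reach SQL as a single comma-separated string. The sketch below shows, in plain JavaScript, the ordinary array-to-string conversion that would produce such a string. That BigQuery performs exactly this conversion when the function returns ARRAY<STRING> is an assumption inferred from the result, not something stated in the documentation quoted here.

```javascript
// Values that a path like $.items[*].product.categoryIds would match
// in the sample cart: one array per item.
const matches = [["cat1", "cat2", "cat3"], ["cat2", "cat4"], ["cat2", "cat4", "cat5"]];

// Plain JavaScript string conversion joins array elements with commas.
const asStrings = matches.map(String);
console.log(asStrings);

// Splitting on "," mirrors SPLIT(category_ids) with its default delimiter.
const perItem = asStrings.map((s) => s.split(","));
console.log(perItem[1]);
```

If a category id could itself contain a comma, this round trip would split it incorrectly, so the approach fits simple ids like the ones in the sample.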

You can test the above with the sample data you provided:

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
        return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
    SELECT "order1234" AS order_id, ''' "_id" : "cart1234" , "taxRate" : 0.0 , "items" : [
       "quantity" : 1 , "product" :  "_id" : "prod1" , "categoryIds" : [ "cat1", "cat2", "cat3"] , "name" : "Product 1" , "imagedata" :  "imageLink" :  "_id" : "img1" , "createdOn" :  "$date" : "2019-01-19T19:55:19.782Z" , "revision" : 1 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : "var1" , "sku" :  "value" : "sku1" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ] , "Shipping" : true , 
       "quantity" : 2 , "product" :  "_id" : "prod2" , "categoryIds" : [ "cat2", "cat4"] , "name" : "Product 2" , "imagedata" :  "imageLink" :  "_id" : "img2" , "createdOn" :  "$date" : "2019-01-19T19:58:11.484Z" ,  "revision" : 1 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : "var2" , "sku" :  "value" : "sku2" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ] , "Shipping" : true , 
       "quantity" : 3 , "product" :  "_id" : "prod3" , "categoryIds" : [ "cat2","cat4","cat5"] , "name" : "Product 3" , "imagedata" :  "imageLink" :  "_id" : "img3" , "createdOn" :  "$date" : "2019-01-15T05:34:17.556Z" , "revision" : 3 , "title" : "" , "description" : "" , "altText" : "" , "variants" : [ ] , "productVariants" : [  "_id" : "var3" , "sku" :  "value" : "sku3" , "modifier" : 0 , "variants" : [ ] , "quantity" : 0 , "imageLinkIds" : [ ] , "skuImageLinkIds" : [ ] , "fulfillmentData" :  "sourceName" :  null  , "sourceId" :  null  , "sourceSku" :  null  , "sourceMethod" :  null  , "sourceRedirectUrl" :  null  , "sourceRedirectAppKey" :  null ], "Shipping" : true 
    ]''' AS cart    
)
SELECT order_id, quantity, product_id, category_id
FROM t,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].quantity')) quantity WITH OFFSET pos1,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product._id')) product_id WITH OFFSET pos2,
UNNEST(CUSTOM_JSON_EXTRACT(cart, '$.items[*].product.categoryIds')) category_ids WITH OFFSET pos3,
UNNEST(SPLIT(category_ids)) category_id
WHERE pos1 = pos2 AND pos1 = pos3

Result:

Row order_id    quantity    product_id  category_id  
1   order1234   1           prod1       cat1     
2   order1234   1           prod1       cat2     
3   order1234   1           prod1       cat3     
4   order1234   2           prod2       cat2     
5   order1234   2           prod2       cat4     
6   order1234   3           prod3       cat2     
7   order1234   3           prod3       cat4     
8   order1234   3           prod3       cat5     

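Note: product_list_price and product_sale_price are not present in your sample data, so they are not in the result above. The query is now clean and simple, though, so hopefully you will be able to add them easily.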
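
As a rough sketch of where the prices would come from, the plain JavaScript below flattens one cart into one row per item and category, without the jsonpath library. The function name flattenCart, the helper priceValue, and the shortened cart are hypothetical. The field paths (pricing.listPrice.value, pricing.salePrice.value) are the ones used in the question's query. Wiring this into a BigQuery UDF is not shown and would need a matching return type.

```javascript
// Hypothetical helper: returns the "value" of a price object, or null if absent.
function priceValue(price) {
  return price && price.value !== undefined ? price.value : null;
}

// Hypothetical helper: turns one cart JSON string into flat rows,
// one row per (item, category) pair.
function flattenCart(cartJson) {
  const cart = JSON.parse(cartJson);
  const rows = [];
  for (const item of cart.items || []) {
    const product = item.product || {};
    const pricing = product.pricing || {};
    for (const categoryId of product.categoryIds || []) {
      rows.push({
        quantity: item.quantity, // the item's own quantity, not the nested variant one
        product_id: product._id,
        category_id: categoryId,
        product_list_price: priceValue(pricing.listPrice),
        product_sale_price: priceValue(pricing.salePrice),
      });
    }
  }
  return rows;
}

// Shortened, hypothetical cart in the same shape as the question's data.
const cart = JSON.stringify({
  _id: "cart1234",
  items: [
    {
      quantity: 1,
      product: {
        _id: "prod1",
        categoryIds: ["cat1", "cat2"],
        pricing: { listPrice: { value: 899 }, salePrice: { value: 725 } },
        productVariants: [{ _id: "var1", quantity: 0 }],
      },
    },
    { quantity: 2, product: { _id: "prod2", categoryIds: ["cat4"] } },
  ],
});

const rows = flattenCart(cart);
console.log(rows);
```

Walking the parsed object directly also sidesteps the repeated quantity key, because only the quantity at the item level is read.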
