How to import nested JSON into Google BigQuery
Posted: 2019-09-30 18:28:13
Question:
I am loading JSON into Google BigQuery. The schema I am using is at the bottom of the question.
Here is a sample of the JSON:
{
  "_index":"data",
  "_type":"collection_v1",
  "_id":"548d035f23r8987b768a5e60",
  "_score":1,
  "_source":{
    "fullName":"Mike Smith",
    "networks":[
      {
        "id":[
          "12923449"
        ],
        "network":"facebook",
        "link":"https://www.facebook.com/127654449"
      }
    ],
    "sex":{
      "network":"facebook",
      "value":"male"
    },
    "interests":{
    },
    "score":1.045,
    "merged_by":"548f899444v5t4v45te9a4cc"
  }
}
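As you can see, there is a "_source.fullName" field with the value "Mike Smith". When I try to create a table from this data, the load fails with:
Array specified for non-repeated field: _source.fullName.
I believe this field occurs only once inside _source. How can I get past this error?
Here is the schema: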
[
  {"name": "_index", "type": "STRING", "mode": "NULLABLE"},
  {"name": "_id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "_type", "type": "STRING", "mode": "NULLABLE"},
  {"name": "score", "type": "STRING", "mode": "NULLABLE"},
  {"name": "header", "type": "STRING", "mode": "NULLABLE"},
  {"name": "fullName", "type": "STRING", "mode": "NULLABLE"},
  {"name": "src", "type": "STRING", "mode": "NULLABLE"},
  {"name": "avatar", "type": "STRING", "mode": "NULLABLE"},
  {"name": "merged_by", "type": "STRING", "mode": "NULLABLE"},
  {"name": "cover", "type": "STRING", "mode": "NULLABLE"},
  {"name": "sex", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "network", "type": "STRING", "mode": "NULLABLE"},
    {"name": "value", "type": "STRING", "mode": "NULLABLE"}
  ]},
  {"name": "_source", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "fullName", "type": "STRING", "mode": "NULLABLE"},
    {"name": "links", "type": "STRING", "mode": "REPEATED"},
    {"name": "birthday", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "value", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "phones", "type": "STRING", "mode": "REPEATED"},
    {"name": "pictures", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "url", "type": "STRING", "mode": "NULLABLE"},
      {"name": "tab", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "contacts", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "id", "type": "STRING", "mode": "NULLABLE"},
      {"name": "fullName", "type": "STRING", "mode": "NULLABLE"},
      {"name": "tag", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "groups", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "id", "type": "STRING", "mode": "NULLABLE"},
      {"name": "Name", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "skills", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "value", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "relations", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "value", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "about", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "value", "type": "STRING", "mode": "NULLABLE"},
      {"name": "network", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "emails", "type": "STRING", "mode": "REPEATED"},
    {"name": "languages", "type": "STRING", "mode": "REPEATED"},
    {"name": "places", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "network", "type": "STRING", "mode": "NULLABLE"},
      {"name": "value", "type": "STRING", "mode": "NULLABLE"},
      {"name": "type", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "education", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "network", "type": "STRING", "mode": "NULLABLE"},
      {"name": "school", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "experience", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "network", "type": "STRING", "mode": "NULLABLE"},
      {"name": "start", "type": "NUMERIC", "mode": "NULLABLE"},
      {"name": "company", "type": "STRING", "mode": "NULLABLE"},
      {"name": "title", "type": "STRING", "mode": "NULLABLE"}
    ]},
    {"name": "networks", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "network", "type": "STRING", "mode": "NULLABLE"},
      {"name": "link", "type": "STRING", "mode": "NULLABLE"},
      {"name": "id", "type": "STRING", "mode": "REPEATED"}
    ]},
    {"name": "network", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "others", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "network", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        {"name": "tag", "type": "STRING", "mode": "NULLABLE"}
      ]},
      {"name": "books", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "network", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        {"name": "tag", "type": "STRING", "mode": "NULLABLE"}
      ]},
      {"name": "music", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "network", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        {"name": "tag", "type": "STRING", "mode": "NULLABLE"}
      ]},
      {"name": "games", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "network", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        {"name": "tag", "type": "STRING", "mode": "NULLABLE"}
      ]},
      {"name": "spotify", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "network", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        {"name": "tag", "type": "STRING", "mode": "NULLABLE"}
      ]}
    ]}
  ]}
]
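The error message suggests that at least one record in the file carries an array in _source.fullName, even though the sample record does not. One way to check is to scan the newline-delimited file and report every line where that path holds a JSON array. The following is a minimal Python sketch under that assumption; the function name find_array_values and the two sample records are made up for illustration and are not part of the original question.

```python
import json
from typing import Iterable, Iterator, Tuple


def find_array_values(lines: Iterable[str], path: str) -> Iterator[Tuple[int, object]]:
    """Yield (line_number, value) for records where the dotted `path` holds a JSON array."""
    keys = path.split(".")
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        value = json.loads(line)
        for key in keys:
            # Stop descending as soon as the path does not exist in this record.
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        if isinstance(value, list):
            yield line_number, value


# Hypothetical sample: the second record has an array where a scalar is expected.
sample = [
    '{"_id": "a", "_source": {"fullName": "Mike Smith"}}',
    '{"_id": "b", "_source": {"fullName": ["Mike Smith", "M. Smith"]}}',
]
offenders = list(find_array_values(sample, "_source.fullName"))
print(offenders)
```

To run it against a real file, pass an open file object instead of the list, since a file iterates line by line. If offending lines turn up, either the data needs normalizing or the field's mode in the schema needs to be REPEATED.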
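Comments:
First, a question: is this already newline-delimited JSON, and you only pretty-printed it here for readability?
Yes, sir. One JSON object per line. I only pretty-printed it for this post.
Answer 1:
You can load the complete JSON lines as if they were CSV: basically a single-column BigQuery table in which each row holds one JSON object. You can then parse the JSON inside BigQuery however you like, with a query such as: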
-- Sample row; in practice j would be the single-column table with one JSON object per row.
WITH j AS (
  SELECT """{"_index":"data","_type":"collection_v1","_id":"548d035f23r8987b768a5e60","_score":1,"_source":{"fullName":"Mike Smith","networks":[{"id":["12923449"],"network":"facebook","link":"https://www.facebook.com/127654449"}],"sex":{"network":"facebook","value":"male"},"interests":{},"score":1.045,"merged_by":"548f899444v5t4v45te9a4cc"}}""" j
)
SELECT index
  , STRUCT(
      JSON_EXTRACT_SCALAR(source, '$.fullName') AS fullName
      , [
        STRUCT(
          JSON_EXTRACT_SCALAR(source, '$.networks[0].id[0]') AS id
          , JSON_EXTRACT_SCALAR(source, '$.networks[0].network') AS network
          , JSON_EXTRACT_SCALAR(source, '$.networks[0].link') AS link)
      ] AS networks
  ) source
FROM (
  -- Pull the scalar _index and keep _source as a JSON string for the outer query.
  SELECT JSON_EXTRACT_SCALAR(j.j, '$._index') index
    , JSON_EXTRACT(j.j, '$._source') source
  FROM j
)
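Before running a query like this on the full table, it can help to check on one sample line that the paths resolve as expected. The following is a rough local approximation in Python, not a replacement for the query; the shortened sample line and the variable names are chosen for illustration. Note one deliberate difference: the query reads only networks[0], while this sketch walks every element of networks.

```python
import json

# One shortened line of the newline-delimited input (assumed shape, based on the sample record).
line = (
    '{"_index":"data","_source":{"fullName":"Mike Smith","networks":'
    '[{"id":["12923449"],"network":"facebook","link":"https://www.facebook.com/127654449"}]}}'
)

record = json.loads(line)
source = record["_source"]

row = {
    "index": record["_index"],  # same value as '$._index' in the query
    "source": {
        "fullName": source["fullName"],  # same value as '$.fullName'
        "networks": [
            {
                # First id only, mirroring '$.networks[0].id[0]'; None when no id is present.
                "id": n["id"][0] if n.get("id") else None,
                "network": n.get("network"),
                "link": n.get("link"),
            }
            for n in source.get("networks", [])
        ],
    },
}
print(row)
```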
See:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
Comments:
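Good idea, but the file is too large to import as CSV. It fails with: Array specified for non-repeated field: _source.fullName.
That error occurs when loading as JSON. What error do you get when you load as CSV?
The file is too large. I am trying to cut it down now.
How would you get that error with CSV rather than JSON? Both formats have the same limits: cloud.google.com/bigquery/quotas#load_jobs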
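If the file really does exceed a load limit, one option is to cut it at line boundaries so that every part is still valid newline-delimited JSON. The following is a minimal Python sketch of that idea. The function name split_ndjson, the part-NNNNN.json naming, and the choice to split by line count rather than by bytes are all assumptions for illustration; the actual limits are the ones listed on the quotas page linked above.

```python
import os
import tempfile
from typing import List


def split_ndjson(path: str, out_dir: str, max_lines: int) -> List[str]:
    """Split a newline-delimited JSON file into parts of at most `max_lines` lines each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    parts: List[str] = []
    out = None
    try:
        with open(path, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                if i % max_lines == 0:
                    # Start a new part; records are never cut in the middle of a line.
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"part-{len(parts):05d}.json")
                    parts.append(part_path)
                    out = open(part_path, "w", encoding="utf-8")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts


def count_lines(path: str) -> int:
    """Count the lines in a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Small demonstration on a temporary file with five hypothetical records.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.json")
    with open(src_path, "w", encoding="utf-8") as f:
        for i in range(5):
            f.write('{"_id": "%d"}\n' % i)
    parts = split_ndjson(src_path, tmp, max_lines=2)
    part_names = [os.path.basename(p) for p in parts]
    line_counts = [count_lines(p) for p in parts]

print(part_names, line_counts)
```

Splitting only changes the file size. It does not address the "Array specified for non-repeated field" error, which comes from the data not matching the schema.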