Generate Faker data from Python and load it into a BigQuery nested table
[Posted]: 2020-02-20 09:12:06
[Question]: I want to create dummy data for my tests. I used faker to generate some dummy data and then loaded the resulting objects into a pandas DataFrame, but my target BigQuery table has nested arrays.
I also need to derive some fields from the generated values. For example, if destination is 'sometext', route should be empty; otherwise route should be origin and destination joined together.
Below is my existing code.
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
import pandas_gbq
from faker import Factory
import random
import uuid
import string
import datetime
from datetime import date

# `airport` is not defined in the posted snippet; presumably a list of airport codes.

def test():
    return {
        'user_uuid': uuid.uuid4(),
        'origin': random.choice(airport),
        'destination': random.choice(airport),
        # Placeholder: this joins the literal strings, not the generated values.
        'route': 'origin' + 'destination',
        'app': {'version': '', 'model': {'name': ''}, 'id': '', 'type': ''},
        'passenger': {'title': '', 'firstname': '', 'lastname': ''},
        'datetime': '',
    }

example_dummy_data = pd.DataFrame([test() for _ in range(2)])
pandas_gbq.to_gbq(example_dummy_data, 'dataset.table', project_id='project', if_exists='append')
My table schema:
[
  {
    "name": "user_uuid",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "origin",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "destination",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "route",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "app",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "version",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "model",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "name",
            "type": "STRING",
            "mode": "NULLABLE"
          }
        ]
      },
      {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "type",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  },
  {
    "name": "customer",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "title",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "firstname",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "lastname",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  },
  {
    "name": "datetime",
    "type": "TIMESTAMP",
    "mode": "NULLABLE"
  }
]
I would appreciate any suggestions on a better way to achieve this.
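[Answer 1]: The pandas_gbq library does not support nested or repeated fields.
You can try the google-cloud-bigquery library instead, which does support them. Its documentation explains how to load a DataFrame into BigQuery with this library.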
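To make the nested shape concrete, here is a minimal sketch of a row builder that follows the table schema from the question: a RECORD column becomes a dict, and a REPEATED RECORD column becomes a list of dicts. It is an illustration under assumptions, not the asker's or the library's method. The `AIRPORTS` list, the `app` values, and the choice of `None` for an "empty" route are all hypothetical, and `fake` is assumed to be a `faker.Faker()` instance.

```python
import random
import uuid
from datetime import datetime, timezone

# Hypothetical sample values; the question does not show its airport list.
AIRPORTS = ["AMS", "CDG", "JFK", "SIN"]


def make_row(fake, airports=AIRPORTS, no_route_destination="sometext"):
    """Build one JSON-serializable row shaped like the question's schema.

    `fake` is assumed to be a faker.Faker() instance (or any object that
    provides prefix(), first_name() and last_name()).
    """
    origin = random.choice(airports)
    destination = random.choice(airports)
    # Rule from the question: no route for the special destination,
    # otherwise origin and destination joined together.
    # Assumption: "empty" is represented as None (NULL in BigQuery).
    route = None if destination == no_route_destination else origin + destination
    return {
        "user_uuid": str(uuid.uuid4()),      # STRING column, so stringify the UUID
        "origin": origin,
        "destination": destination,
        "route": route,
        "app": {                             # RECORD -> dict
            "version": "1.0.0",              # hypothetical value
            "model": [{"name": "model-a"}],  # REPEATED RECORD -> list of dicts
            "id": str(uuid.uuid4()),
            "type": random.choice(["ios", "android"]),  # hypothetical values
        },
        "customer": [                        # REPEATED RECORD -> list of dicts
            {
                "title": fake.prefix(),
                "firstname": fake.first_name(),
                "lastname": fake.last_name(),
            }
        ],
        # ISO 8601 string in UTC for the TIMESTAMP column.
        "datetime": datetime.now(timezone.utc).isoformat(),
    }
```

With Faker installed, `rows = [make_row(Faker()) for _ in range(2)]` (after `from faker import Faker`) would produce plain dicts that can be passed to the google-cloud-bigquery client as JSON rows, without going through a DataFrame.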
[Comments]:
Thanks for the reply. I have replaced pandas with the GCP library:
dataset_ref = client.dataset('DATASET', 'project')
table_ref = dataset_ref.table('TABLE')
table = client.get_table(table_ref)  # Make an API request.
rows_to_insert = [func() for _ in range(num)]
errors = client.insert_rows(table, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
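As a possible variation on the snippet in this comment, the sketch below streams rows in batches and collects the errors instead of only printing on success. It assumes `client` is a `google.cloud.bigquery.Client` and that the rows are already JSON-serializable dicts; the helper name `insert_in_batches` and the default batch size are made up for illustration.

```python
def insert_in_batches(client, table_id, rows, batch_size=500):
    """Stream `rows` into `table_id` in batches and return all reported errors.

    `client` is assumed to be a google.cloud.bigquery.Client; `table_id` is a
    "project.dataset.table" string; `rows` is a list of JSON-serializable dicts.
    """
    all_errors = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # insert_rows_json returns a list with one entry per failed row;
        # an empty list means every row in the batch was accepted.
        errors = client.insert_rows_json(table_id, batch)
        all_errors.extend(errors)
    return all_errors


# Example usage (requires credentials and an existing table):
# from google.cloud import bigquery
# client = bigquery.Client()
# errors = insert_in_batches(client, "project.dataset.table", rows)
# if not errors:
#     print("New rows have been added.")
```

Returning the error list rather than printing keeps the decision about retries or logging with the caller.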
So, does using google-cloud-bigquery work? If so, you can indicate that by accepting the answer. If not, what problem are you running into?