Split a large JSON file into multiple smaller files

Posted: 2017-08-21 19:09:18

Question:

I have a large JSON file, about 5 million records and about 32GB in size, that I need to get loaded into our Snowflake data warehouse. I need to break this file down into chunks of about 200k records each (roughly 1.25GB). I'd like to do this in either Node.JS or Python, for deployment to an AWS Lambda function; unfortunately I haven't written any code in either yet. I have C# and a lot of SQL experience, and learning both node and python are on my to-do list, so why not just dive right in!?
My first question is: which language would better serve this function, Python or Node.JS?
I know I don't want to read the entire JSON file into memory (nor even the smaller output files). I need to be able to "stream" it in and out into new files based on a record count (200k), closing the JSON objects properly, and then continue into a new file for the next 200k, and so on. I know Node can do this, but if Python can as well, I feel it would make it easier to ramp up quickly on the other ETL work I'll be doing soon.
My second question is: based on your recommendation above, which modules should I require/import to help me get started, particularly as it relates to not pulling the entire JSON file into memory? Maybe some tips, tricks, or "how would you do it"s? And if you're feeling really generous, a code sample to push me into the deep end?
I can't include a sample of the JSON data because it contains personal information, but I can provide the JSON schema:
"$schema": "http://json-schema.org/draft-04/schema#",
"items":
"properties":
"activities":
"properties":
"activity_id":
"items":
"type": "integer"
,
"type": "array"
,
"frontlineorg_id":
"items":
"type": "integer"
,
"type": "array"
,
"import_id":
"items":
"type": "integer"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"is_source":
"items":
"type": "boolean"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"address":
"properties":
"city":
"items":
"type": "string"
,
"type": "array"
,
"congress_dist_name":
"items":
"type": "string"
,
"type": "array"
,
"congress_dist_number":
"items":
"type": "integer"
,
"type": "array"
,
"congress_end_yr":
"items":
"type": "integer"
,
"type": "array"
,
"congress_number":
"items":
"type": "integer"
,
"type": "array"
,
"congress_start_yr":
"items":
"type": "integer"
,
"type": "array"
,
"county":
"items":
"type": "string"
,
"type": "array"
,
"formatted":
"items":
"type": "string"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"latitude":
"items":
"type": "number"
,
"type": "array"
,
"longitude":
"items":
"type": "number"
,
"type": "array"
,
"number":
"items":
"type": "string"
,
"type": "array"
,
"observes_dst":
"items":
"type": "boolean"
,
"type": "array"
,
"post_directional":
"items":
"type": "null"
,
"type": "array"
,
"pre_directional":
"items":
"type": "null"
,
"type": "array"
,
"school_district":
"items":
"properties":
"school_dist_name":
"items":
"type": "string"
,
"type": "array"
,
"school_dist_type":
"items":
"type": "string"
,
"type": "array"
,
"school_grade_high":
"items":
"type": "string"
,
"type": "array"
,
"school_grade_low":
"items":
"type": "string"
,
"type": "array"
,
"school_lea_code":
"items":
"type": "integer"
,
"type": "array"
,
"type": "object"
,
"type": "array"
,
"secondary_number":
"items":
"type": "null"
,
"type": "array"
,
"secondary_unit":
"items":
"type": "null"
,
"type": "array"
,
"state":
"items":
"type": "string"
,
"type": "array"
,
"state_house_dist_name":
"items":
"type": "string"
,
"type": "array"
,
"state_house_dist_number":
"items":
"type": "integer"
,
"type": "array"
,
"state_senate_dist_name":
"items":
"type": "string"
,
"type": "array"
,
"state_senate_dist_number":
"items":
"type": "integer"
,
"type": "array"
,
"street":
"items":
"type": "string"
,
"type": "array"
,
"suffix":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"timezone":
"items":
"type": "string"
,
"type": "array"
,
"utc_offset":
"items":
"type": "integer"
,
"type": "array"
,
"zip":
"items":
"type": "integer"
,
"type": "array"
,
"type": "object"
,
"age":
"type": "integer"
,
"anniversary":
"properties":
"date":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"baptism":
"properties":
"church_id":
"type": "null"
,
"date":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"birth_dd":
"type": "integer"
,
"birth_mm":
"type": "integer"
,
"birth_yyyy":
"type": "integer"
,
"church_attendance":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"likelihood":
"items":
"type": "integer"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"cohabiting":
"properties":
"confidence":
"items":
"type": "string"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"likelihood":
"items":
"type": "null"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"dating":
"properties":
"bool":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"divorced":
"properties":
"bool":
"items":
"type": "null"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"likelihood_considering":
"items":
"type": "integer"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"education":
"properties":
"est_level":
"items":
"type": "string"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"email":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"is_work_school":
"items":
"type": "boolean"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"engaged":
"properties":
"insert_datetime_utc":
"type": "null"
,
"likelihood":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"est_income":
"properties":
"est_level":
"items":
"type": "string"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"ethnicity":
"type": "string"
,
"first_name":
"type": "string"
,
"formatted_birthdate":
"type": "string"
,
"gender":
"type": "string"
,
"head_of_household":
"properties":
"bool":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"home_church":
"properties":
"church_id":
"type": "null"
,
"group_participant":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"is_coaching":
"type": "null"
,
"is_giving":
"type": "null"
,
"is_serving":
"type": "null"
,
"membership_date":
"type": "null"
,
"regular_attendee":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"hub_poid":
"type": "integer"
,
"insert_datetime_utc":
"type": "string"
,
"ip_address":
"properties":
"insert_datetime_utc":
"type": "null"
,
"string":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"last_name":
"type": "string"
,
"marriage_segment":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"married":
"properties":
"bool":
"items":
"type": "boolean"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"middle_name":
"type": "string"
,
"miscellaneous":
"properties":
"attribute":
"items":
"type": "string"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"value":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"name_suffix":
"type": "null"
,
"name_title":
"type": "null"
,
"newlywed":
"properties":
"bool":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"parent":
"properties":
"bool":
"items":
"type": "boolean"
,
"type": "array"
,
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"likelihood_expecting":
"items":
"type": "integer"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"person_id":
"type": "integer"
,
"phone":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"number":
"items":
"type": "integer"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"property_rights":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"psychographic_cluster":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"religion":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"religious_segment":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"separated":
"properties":
"bool":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"significant_other":
"properties":
"first_name":
"type": "null"
,
"insert_datetime_utc":
"type": "null"
,
"last_name":
"type": "null"
,
"middle_name":
"type": "null"
,
"name_suffix":
"type": "null"
,
"name_title":
"type": "null"
,
"suppressed_datetime_utc":
"type": "null"
,
"type": "object"
,
"suppressed_datetime_utc":
"type": "string"
,
"target_group":
"properties":
"insert_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"string":
"items":
"type": "string"
,
"type": "array"
,
"suppressed_datetime_utc":
"items":
"type": "string"
,
"type": "array"
,
"type": "object"
,
"type": "object"
,
"type": "array"
Question comments:
Is there anything special about your JSON formatting? For example, is every record on a new line, or does every record start with a line containing only { and end with a line containing only }, with indentation in between? If so, a simple file-parsing script might help :)
The code I use to split JSON on each balanced group is csplit -n 6 -f <FILE_NAME>_ <FILE> '/\{(?:[^{}]|(?R))*\}/'. The -f option just adds a prefix to the output file names.
See also ***.com/questions/68718175/…, which splits JSON/CSV and compresses at the same time.
Answer 1:
Use this at the Linux command prompt:
split -b 53750k <your-file>
cat xa* > <your-file>
See this link for reference:
https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
Comments:

What is xa* supposed to be?
xa* is the default naming pattern for the newly generated split files. You can do an ls -lrt to see them.
Only useful if you want to take a peek at your JSON structure without processing it further, because you lose the file structure.

Answer 2:
Answering whether Python or Node would be better for the task would be an opinion, and we're not allowed to voice our opinions on Stack Overflow. You have to decide for yourself which one you have more experience with and which you want to use, Python or Node.
If you go with Node, there are modules that can help you with that task and that do streaming JSON parsing, e.g. these:
https://www.npmjs.com/package/JSONStream
https://www.npmjs.com/package/stream-json
https://www.npmjs.com/package/json-stream

If you go with Python, there are streaming JSON parsers here as well:
https://github.com/kashifrazzaqui/json-streamer
https://github.com/danielyule/naya
http://www.enricozini.org/blog/2011/tips/python-stream-json/
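As a hedged sketch of what the streaming approach looks like in Python, here is a version using ijson, a further widely used streaming parser that is not on the list above; the input filename and chunk size are placeholders:

import json
import ijson  # streaming JSON parser: pip install ijson

CHUNK_SIZE = 200_000
chunk, n = [], 0
with open("big.json", "rb") as src:  # placeholder filename
    # ijson.items(..., "item") yields one element of the top-level array
    # at a time instead of loading the whole document into memory.
    # use_float=True makes numbers plain floats so json.dump can emit them.
    for record in ijson.items(src, "item", use_float=True):
        chunk.append(record)
        if len(chunk) == CHUNK_SIZE:
            with open(f"chunk_{n:04d}.json", "w") as out:
                json.dump(chunk, out)
            chunk, n = [], n + 1
if chunk:  # flush the final partial chunk
    with open(f"chunk_{n:04d}.json", "w") as out:
        json.dump(chunk, out)

Each output file is then itself a valid JSON array of up to 200k records, and only one chunk at a time (roughly 1.25GB here) is ever held in memory.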
Answer 3:

Consider using jq to preprocess your JSON file. It can split and stream your large JSON file:
jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
See the official documentation and this question for more information.
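As a hedged sketch (the filenames are placeholders, and it assumes the input is a single top-level JSON array): jq's streaming mode can emit one record per line without holding the whole file in memory, and split can then cut the stream every 200k records:

# stream the top-level array elements out one per line, then split by line count
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' big.json \
  | split -l 200000 - chunk_

Note that the chunks come out as newline-delimited JSON rather than JSON arrays, which is also a convenient shape for most bulk loaders.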
Bonus, regarding your first question: jq is written in C, so it's faster than python/node, isn't it?
Answer 4:

Snowflake has a very special treatment for JSON; once you understand it, the design is easy to sketch.

- JSON/Parquet/Avro/XML are treated as semi-structured data.
- They are stored in Snowflake as the VARIANT data type.
- When loading the JSON data into a stage location, set strip_outer_array=true:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
When loading into Snowflake, each row cannot exceed 16MB compressed.

Snowflake data loading works well when files are split into the 10MB-100MB size range. Use utilities that split the file on row boundaries whenever it exceeds 100MB; that brings parallelism and accuracy to your data loads.

For your dataset size, you would end up with roughly 31K small files (of 100MB each).

That would suggest 31K parallel processes running, which is not possible. So choose an x-large warehouse (16 v-cores & 32 threads): 31K/32 = roughly 1000 rounds. Depending on your network bandwidth, loading the data should not take more than a few minutes; even at 3 seconds per round, it could load the data in about 50 minutes. Check the warehouse configuration & throughput details, and refer to the semi-structured data loading best practices.
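A hedged sketch of that splitting step with GNU coreutils, assuming the data has already been flattened to one record per line (the filenames are placeholders):

# split at line boundaries into chunks of at most 100MB each,
# then gzip each chunk to speed up staging
split -C 100M -d -a 4 records.ndjson chunk_ --additional-suffix=.json
gzip chunk_*.json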
Answer 5:

The simplest way, for me, would be:
import json

with open("your_file.json") as f:  # caution: this reads the entire file into memory
    records = json.load(f)

chunks = 200  # records per slice (the OP would want 200_000)
for i in range(0, len(records), chunks):
    print(records[i:i + chunks])  # or write each slice out to its own file
Comments:

Your answer could be improved with additional information about what the code does and how it helps the OP.

Answer 6:

Splitting and compressing at the same time with bash, producing files of ~100MB each:
cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'
See more: https://***.com/a/68718176/132438