Split a large JSON file into multiple smaller files

【Posted】: 2017-08-21 19:09:18 【Question】:

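I have a large JSON file, about 5 million records and about 32GB in size, that I need to load into our Snowflake data warehouse. I need to break this file up into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function; unfortunately I haven't coded in either yet. I have C# and a lot of SQL experience, and learning both Node and Python is on my to-do list, so why not dive right in, right!?

My first question is: "Which language would better serve this function, Python or Node.JS?"

I know I don't want to read the entire JSON file into memory (or even the smaller output files). I need to be able to "stream" it in and out into new files based on a record count (200k), properly close up the JSON objects, and then continue into another new file for another 200k, and so on. I know Node can do this, but if Python can do it too, I feel it would be easier to get started quickly with the other ETL work I'll be doing soon.

My second question is: "Based on your recommendation above, can you also recommend which modules I should require/import to help me get started, primarily as it relates to not pulling the entire JSON file into memory?" Maybe some tips, tricks, or "how would you do it"? And if you're feeling really generous, a code example to help me get into the deep end on this?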

I can't include a sample of the JSON data, because it contains personal information. But I can provide the JSON schema:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "items": {
    "properties": {
      "activities": {
        "properties": {
          "activity_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "frontlineorg_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "import_id": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "is_source": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "address": {
        "properties": {
          "city": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "congress_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "congress_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_end_yr": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "congress_start_yr": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "county": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "formatted": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "latitude": {
            "items": {
              "type": "number"
            },
            "type": "array"
          },
          "longitude": {
            "items": {
              "type": "number"
            },
            "type": "array"
          },
          "number": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "observes_dst": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "post_directional": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "pre_directional": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "school_district": {
            "items": {
              "properties": {
                "school_dist_name": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_dist_type": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_grade_high": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_grade_low": {
                  "items": {
                    "type": "string"
                  },
                  "type": "array"
                },
                "school_lea_code": {
                  "items": {
                    "type": "integer"
                  },
                  "type": "array"
                }
              },
              "type": "object"
            },
            "type": "array"
          },
          "secondary_number": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "secondary_unit": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "state": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_house_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_house_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "state_senate_dist_name": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "state_senate_dist_number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "street": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suffix": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "timezone": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "utc_offset": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "zip": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "age": {
        "type": "integer"
      },
      "anniversary": {
        "properties": {
          "date": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "baptism": {
        "properties": {
          "church_id": {
            "type": "null"
          },
          "date": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "birth_dd": {
        "type": "integer"
      },
      "birth_mm": {
        "type": "integer"
      },
      "birth_yyyy": {
        "type": "integer"
      },
      "church_attendance": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "cohabiting": {
        "properties": {
          "confidence": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "dating": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "divorced": {
        "properties": {
          "bool": {
            "items": {
              "type": "null"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood_considering": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "education": {
        "properties": {
          "est_level": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "email": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "is_work_school": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "engaged": {
        "properties": {
          "insert_datetime_utc": {
            "type": "null"
          },
          "likelihood": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "est_income": {
        "properties": {
          "est_level": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "ethnicity": {
        "type": "string"
      },
      "first_name": {
        "type": "string"
      },
      "formatted_birthdate": {
        "type": "string"
      },
      "gender": {
        "type": "string"
      },
      "head_of_household": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "home_church": {
        "properties": {
          "church_id": {
            "type": "null"
          },
          "group_participant": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "is_coaching": {
            "type": "null"
          },
          "is_giving": {
            "type": "null"
          },
          "is_serving": {
            "type": "null"
          },
          "membership_date": {
            "type": "null"
          },
          "regular_attendee": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "hub_poid": {
        "type": "integer"
      },
      "insert_datetime_utc": {
        "type": "string"
      },
      "ip_address": {
        "properties": {
          "insert_datetime_utc": {
            "type": "null"
          },
          "string": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "last_name": {
        "type": "string"
      },
      "marriage_segment": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "married": {
        "properties": {
          "bool": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "middle_name": {
        "type": "string"
      },
      "miscellaneous": {
        "properties": {
          "attribute": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "value": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "name_suffix": {
        "type": "null"
      },
      "name_title": {
        "type": "null"
      },
      "newlywed": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "parent": {
        "properties": {
          "bool": {
            "items": {
              "type": "boolean"
            },
            "type": "array"
          },
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "likelihood_expecting": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "person_id": {
        "type": "integer"
      },
      "phone": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "number": {
            "items": {
              "type": "integer"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "type": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "property_rights": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "psychographic_cluster": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "religion": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "religious_segment": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      },
      "separated": {
        "properties": {
          "bool": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "significant_other": {
        "properties": {
          "first_name": {
            "type": "null"
          },
          "insert_datetime_utc": {
            "type": "null"
          },
          "last_name": {
            "type": "null"
          },
          "middle_name": {
            "type": "null"
          },
          "name_suffix": {
            "type": "null"
          },
          "name_title": {
            "type": "null"
          },
          "suppressed_datetime_utc": {
            "type": "null"
          }
        },
        "type": "object"
      },
      "suppressed_datetime_utc": {
        "type": "string"
      },
      "target_group": {
        "properties": {
          "insert_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "string": {
            "items": {
              "type": "string"
            },
            "type": "array"
          },
          "suppressed_datetime_utc": {
            "items": {
              "type": "string"
            },
            "type": "array"
          }
        },
        "type": "object"
      }
    },
    "type": "object"
  },
  "type": "array"
}
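【Comments】:

Is there anything special about the format of your JSON? For example, is every record on a new line, or does every record begin with a line containing only { and end with }, with indentation inside? If so, a simple file-parsing script might help :)

My code to split JSON by each valid group is csplit -n 6 -f <FILE_NAME>_ <FILE> '/\(?:[^|(?R)])*\/' where -f just adds a prefix to the output files.

See also ***.com/questions/68718175/…, which splits JSON/CSV and compresses at the same time.

【Answer 1】: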

Use this code at the Linux command prompt:

split -b 53750k <your-file>
cat xa* > <your-file>

See this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
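One caveat with a byte-based split is that the cut falls wherever the byte limit lands, so the individual pieces are generally not valid JSON on their own; only the re-joined file is. A minimal Python sketch of that effect, using made-up records (the field names are borrowed from the schema in the question, the values and the 40-byte piece size are purely illustrative):

```python
import json

# Made-up sample records standing in for the real (private) data.
records = [{"person_id": i, "first_name": f"name{i}"} for i in range(3)]
payload = json.dumps(records).encode("utf-8")


def split_bytes(data, size):
    """Cut data into fixed-size pieces, like `split -b`, ignoring record boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def is_valid_json(piece):
    """Return True if the piece parses as a complete JSON document."""
    try:
        json.loads(piece)
        return True
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False


pieces = split_bytes(payload, 40)
print([is_valid_json(p) for p in pieces])       # validity of each piece on its own
print(json.loads(b"".join(pieces)) == records)  # the re-joined data parses again
```

So this approach works for moving a big file around and reassembling it, but not for producing chunks that can each be loaded as JSON.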

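【Comments】:

What is xa* supposed to be?

xa* is the default name given to the newly generated split files. You can run ls -lrt to see them.

Only if you want a glance at your JSON structure without using it any further, because you lose the file structure.

【Answer 2】: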

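Answering whether Python or Node is better for the task would be an opinion, and we are not allowed to voice our opinions on Stack Overflow. You have to decide for yourself which one you have more experience with and which one you want to work with, Python or Node.

If you go with Node, there are modules that do streaming JSON parsing and can help you with the task. For example:

https://www.npmjs.com/package/JSONStream
https://www.npmjs.com/package/stream-json
https://www.npmjs.com/package/json-stream

If you go with Python, there are streaming JSON parsers here as well:

https://github.com/kashifrazzaqui/json-streamer
https://github.com/danielyule/naya
http://www.enricozini.org/blog/2011/tips/python-stream-json/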
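To give an idea of what the streaming approach looks like in Python, here is a rough sketch that uses only the standard library rather than any of the parsers linked above. It assumes the file is one top-level JSON array of records (as the schema in the question suggests); `iter_array_items`, `split_json_array`, the output naming and the buffer size are all names and choices made up for this illustration, and the separator handling is lenient rather than a strict validator.

```python
import json


def iter_array_items(fp, buf_size=1 << 16):
    """Yield the elements of a top-level JSON array one at a time,
    reading the text file object fp in buf_size-character chunks."""
    decoder = json.JSONDecoder()
    buf, eof, started = "", False, False
    while True:
        buf = buf.lstrip()
        if not started:
            if not buf:
                buf = fp.read(buf_size)
                if not buf:
                    raise ValueError("empty input")
                continue
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buf, started = buf[1:], True
            continue
        if buf.startswith(","):          # separator before the next element
            buf = buf[1:].lstrip()
        if buf.startswith("]"):          # end of the array
            return
        try:
            value, end = decoder.raw_decode(buf)
            # A value that touches the end of the buffer may be cut off
            # (e.g. a number split across reads), so read more first.
            complete = eof or end < len(buf)
        except json.JSONDecodeError:
            if eof:
                raise
            complete = False
        if not complete:
            chunk = fp.read(buf_size)
            if chunk:
                buf += chunk
            else:
                eof = True
            continue
        yield value
        buf = buf[end:]


def split_json_array(src_path, out_prefix, records_per_file=200_000):
    """Write the records of src_path into numbered files, each a valid JSON
    array holding at most records_per_file records. Returns the file paths."""
    paths, out, count = [], None, 0
    with open(src_path, encoding="utf-8") as src:
        try:
            for record in iter_array_items(src):
                if out is None:
                    paths.append(f"{out_prefix}{len(paths):04d}.json")
                    out = open(paths[-1], "w", encoding="utf-8")
                    out.write("[")
                else:
                    out.write(",\n")
                json.dump(record, out)
                count += 1
                if count == records_per_file:
                    out.write("]")       # close the array properly
                    out.close()
                    out, count = None, 0
            if out is not None:
                out.write("]")
        finally:
            if out is not None:
                out.close()
    return paths
```

Something like `split_json_array("big.json", "part_")` would then produce `part_0000.json`, `part_0001.json`, and so on. Only one record plus a read buffer is held in memory at a time; the records are re-serialized on output, so whitespace and escaping may differ from the source even though the data is the same.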

【Comments】:

【Answer 3】:

Consider using jq to preprocess your JSON file.

It can split and stream your large JSON file.

jq is like sed for JSON data - you can use it to slice 
and filter and map and transform structured data with 
the same ease that sed, awk, grep and friends let you play with text.

See the official documentation and this question for more information.

Extra: regarding your first question, jq is written in C, so it is faster than Python/Node, isn't it?

【Comments】:

【Answer 4】:

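Snowflake has a very special treatment for JSON, and once we understand it, the design is easy to draw up.

    JSON/Parquet/Avro/XML are considered semi-structured data, and they are stored in Snowflake as the VARIANT data type.

    When loading JSON data from a stage location, set the flag strip_outer_array=true

    copy into <table> from @~/<file>.json file_format = (type = 'JSON' strip_outer_array = true);

    Each row cannot exceed 16 MB compressed when loaded into Snowflake.

    Snowflake data loading works well if the file size is split into the range of 10-100 MB.

Use utilities that can split the file on line boundaries and keep each file at no more than 100 MB; that brings the power of parallelism as well as accuracy for your data.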

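Given your dataset size, you will get roughly 320 small files of about 100 MB each (32 GB / 100 MB).

That would mean 320 parallel processes, which is not possible. So choose an X-Large warehouse (16 v-cores and 32 threads): 320 / 32 = 10 rounds. Depending on your network bandwidth, loading the data should take no more than a few minutes; even at 3 seconds per round, the rounds themselves add up to only about 30 seconds.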

For warehouse configuration and throughput details, see the semi-structured data loading best practices.

【Comments】:

【Answer 5】:

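The simplest way for me is:

import json
# Assumes the top level is a JSON array and that the whole file fits in memory.
with open("<your_file>") as f:  # replace <your_file> with the path to your file
    records = json.load(f)
chunk_size = 200  # records per chunk
for i in range(0, len(records), chunk_size):
    print(records[i:i + chunk_size])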
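The slicing above needs the whole list in memory, which the question wants to avoid for a 32GB file. The same chunking idea can be applied lazily to any iterator of records; a small sketch, where `chunked` is just a helper name made up for this example:

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable, one chunk at a time."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


for chunk in chunked(range(5), 2):
    print(chunk)
# prints [0, 1] then [2, 3] then [4]
```

Fed with a generator that yields records from a streaming parser, this keeps only one chunk in memory at a time.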

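【Comments】:

Your answer could be improved by adding more information about what the code does and how it helps the OP.

【Answer 6】:

Use bash to split and compress at the same time, producing files of ~100MB each (-C 1000000000 caps each piece at 1 GB of whole lines before gzip is applied):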

cat bigfile.json  | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'

See more: https://***.com/a/68718176/132438
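For anyone who would rather do this from Python (for example inside a Lambda), a rough equivalent can be sketched with the standard gzip module. This assumes the input has one record per line; `split_and_gzip` and its output naming are invented for this illustration, and unlike the split command this sketch writes a line longer than the limit into a part of its own.

```python
import gzip


def split_and_gzip(src_path, out_prefix, max_bytes=1_000_000_000):
    """Write src_path as numbered .gz parts. Each part holds whole lines only
    and at most max_bytes of uncompressed data. Returns the part paths."""
    paths, out, written = [], None, 0
    with open(src_path, "rb") as src:
        try:
            for line in src:
                # Start a new part when this line would push the current one over the cap.
                if out is None or written + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    paths.append(f"{out_prefix}{len(paths):04d}.gz")
                    out = gzip.open(paths[-1], "wb")
                    written = 0
                out.write(line)
                written += len(line)
        finally:
            if out is not None:
                out.close()
    return paths
```

The file is read line by line, so memory use stays at roughly one line plus the gzip buffers regardless of the input size.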

【Comments】:
