Split a large JSON file into multiple smaller files

Posted 2023-02-16


【Title】: Split a large json file into multiple smaller files 【Posted】: 2017-08-21 19:09:18 【Question】:

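I have a large JSON file, about 5 million records and about 32GB in size, that I need to load into our Snowflake data warehouse. I need to break this file into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function; unfortunately I haven't written code in either yet. I have C# and a lot of SQL experience, and learning both Node and Python is on my to-do list, so why not dive right in, right!?

My first question is: "Which language would better serve this function? Python or Node.JS?"

I know I don't want to read the entire JSON file into memory (or even the smaller output files). I need to be able to "stream" it in and out into new files based on a record count (200k), properly close the JSON objects, and then continue into another new file for the next 200k, and so on. I know Node can do this, but if Python can do it too, I feel it would be easier to get started quickly with the other ETL work I'll be doing soon.

My second question is: "Based on your recommendation above, can you also recommend which modules I should require/import to get started, mainly as it pertains to not pulling the entire JSON file into memory?" Maybe some tips, tricks, or a "how would you do it"? And if you're feeling really generous, some code examples to help me dive into the deep end on this?

I can't include a sample of the JSON data, as it contains personal information. But I can provide the JSON schema ...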


  "$schema": "http://json-schema.org/draft-04/schema#",
  "items": 
    "properties": 
      "activities": 
        "properties": 
          "activity_id": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "frontlineorg_id": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "import_id": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "is_source": 
            "items": 
              "type": "boolean"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "address": 
        "properties": 
          "city": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "congress_dist_name": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "congress_dist_number": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "congress_end_yr": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "congress_number": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "congress_start_yr": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "county": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "formatted": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "latitude": 
            "items": 
              "type": "number"
            ,
            "type": "array"
          ,
          "longitude": 
            "items": 
              "type": "number"
            ,
            "type": "array"
          ,
          "number": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "observes_dst": 
            "items": 
              "type": "boolean"
            ,
            "type": "array"
          ,
          "post_directional": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "pre_directional": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "school_district": 
            "items": 
              "properties": 
                "school_dist_name": 
                  "items": 
                    "type": "string"
                  ,
                  "type": "array"
                ,
                "school_dist_type": 
                  "items": 
                    "type": "string"
                  ,
                  "type": "array"
                ,
                "school_grade_high": 
                  "items": 
                    "type": "string"
                  ,
                  "type": "array"
                ,
                "school_grade_low": 
                  "items": 
                    "type": "string"
                  ,
                  "type": "array"
                ,
                "school_lea_code": 
                  "items": 
                    "type": "integer"
                  ,
                  "type": "array"
                
              ,
              "type": "object"
            ,
            "type": "array"
          ,
          "secondary_number": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "secondary_unit": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "state": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "state_house_dist_name": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "state_house_dist_number": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "state_senate_dist_name": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "state_senate_dist_number": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "street": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suffix": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "timezone": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "utc_offset": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "zip": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "age": 
        "type": "integer"
      ,
      "anniversary": 
        "properties": 
          "date": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "baptism": 
        "properties": 
          "church_id": 
            "type": "null"
          ,
          "date": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "birth_dd": 
        "type": "integer"
      ,
      "birth_mm": 
        "type": "integer"
      ,
      "birth_yyyy": 
        "type": "integer"
      ,
      "church_attendance": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "likelihood": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "cohabiting": 
        "properties": 
          "confidence": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "likelihood": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "dating": 
        "properties": 
          "bool": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "divorced": 
        "properties": 
          "bool": 
            "items": 
              "type": "null"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "likelihood_considering": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "education": 
        "properties": 
          "est_level": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "email": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "is_work_school": 
            "items": 
              "type": "boolean"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "engaged": 
        "properties": 
          "insert_datetime_utc": 
            "type": "null"
          ,
          "likelihood": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "est_income": 
        "properties": 
          "est_level": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "ethnicity": 
        "type": "string"
      ,
      "first_name": 
        "type": "string"
      ,
      "formatted_birthdate": 
        "type": "string"
      ,
      "gender": 
        "type": "string"
      ,
      "head_of_household": 
        "properties": 
          "bool": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "home_church": 
        "properties": 
          "church_id": 
            "type": "null"
          ,
          "group_participant": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "is_coaching": 
            "type": "null"
          ,
          "is_giving": 
            "type": "null"
          ,
          "is_serving": 
            "type": "null"
          ,
          "membership_date": 
            "type": "null"
          ,
          "regular_attendee": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "hub_poid": 
        "type": "integer"
      ,
      "insert_datetime_utc": 
        "type": "string"
      ,
      "ip_address": 
        "properties": 
          "insert_datetime_utc": 
            "type": "null"
          ,
          "string": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "last_name": 
        "type": "string"
      ,
      "marriage_segment": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "married": 
        "properties": 
          "bool": 
            "items": 
              "type": "boolean"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "middle_name": 
        "type": "string"
      ,
      "miscellaneous": 
        "properties": 
          "attribute": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "value": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "name_suffix": 
        "type": "null"
      ,
      "name_title": 
        "type": "null"
      ,
      "newlywed": 
        "properties": 
          "bool": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "parent": 
        "properties": 
          "bool": 
            "items": 
              "type": "boolean"
            ,
            "type": "array"
          ,
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "likelihood_expecting": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "person_id": 
        "type": "integer"
      ,
      "phone": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "number": 
            "items": 
              "type": "integer"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "type": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "property_rights": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "psychographic_cluster": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "religion": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "religious_segment": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      ,
      "separated": 
        "properties": 
          "bool": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "significant_other": 
        "properties": 
          "first_name": 
            "type": "null"
          ,
          "insert_datetime_utc": 
            "type": "null"
          ,
          "last_name": 
            "type": "null"
          ,
          "middle_name": 
            "type": "null"
          ,
          "name_suffix": 
            "type": "null"
          ,
          "name_title": 
            "type": "null"
          ,
          "suppressed_datetime_utc": 
            "type": "null"
          
        ,
        "type": "object"
      ,
      "suppressed_datetime_utc": 
        "type": "string"
      ,
      "target_group": 
        "properties": 
          "insert_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "string": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          ,
          "suppressed_datetime_utc": 
            "items": 
              "type": "string"
            ,
            "type": "array"
          
        ,
        "type": "object"
      
    ,
    "type": "object"
  ,
  "type": "array"

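【Comments】:

Is there anything special about your JSON formatting? For example, is each record on a new line, or does each record start with a line containing only { and end with a line containing only }, with indentation inside? If so, a simple file-parsing script might help :)

My code to split the JSON by each valid group is csplit -n 6 -f <FILE_NAME>_ <FILE> '/\(?:[^|(?R)])*\/' where -f just adds a prefix to the output files.

See also ***.com/questions/68718175/…, which splits JSON/CSV and compresses at the same time.

【Answer 1】:

Use these commands at a Linux command prompt. The first splits the file into pieces of a fixed byte size; the second concatenates the pieces back together.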

split -b 53750k <your-file>
cat xa* > <your-file>

See this link: https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
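To illustrate what a fixed-size byte split does to JSON, here is a minimal Python sketch. The helper names (`split_bytes`, `is_valid_json`) and the sample records are made up for the demonstration; it mimics the idea behind `split -b` and `cat`, not their exact behaviour.

```python
import json


def split_bytes(data: bytes, size: int) -> list:
    """Cut data into pieces of at most `size` bytes, ignoring record boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def is_valid_json(blob: bytes) -> bool:
    """Return True if the blob parses as a complete JSON document."""
    try:
        json.loads(blob)
        return True
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False


# Hypothetical sample data standing in for the real records.
records = [{"person_id": n, "first_name": f"name{n}"} for n in range(5)]
original = json.dumps(records).encode("utf-8")

pieces = split_bytes(original, 40)             # like `split -b 40`
valid = [is_valid_json(p) for p in pieces]     # is each piece usable on its own?
rejoined = b"".join(pieces)                    # like `cat xa* > file`

print(valid)
print(rejoined == original)
```

The pieces are cut in the middle of records, so they are only useful for inspection or for reassembling the original file, not for loading individually.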

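【Comments】:

What is xa* supposed to be?

xa* is the default name of the newly generated split files. You can run ls -lrt to see them.

Only use this if you want to glance at your JSON structure without working with it further, because you lose the file structure.

【Answer 2】:

Answering whether Python or Node is better suited to the task would be an opinion, and we are not allowed to voice our opinions on Stack Overflow. You have to decide for yourself which one you have more experience with and which one you want to work with, Python or Node.

If you go with Node, there are modules that do streaming JSON parsing and can help you with this task, for example:

https://www.npmjs.com/package/JSONStream
https://www.npmjs.com/package/stream-json
https://www.npmjs.com/package/json-stream

If you go with Python, there are streaming JSON parsers here as well:

https://github.com/kashifrazzaqui/json-streamer
https://github.com/danielyule/naya
http://www.enricozini.org/blog/2011/tips/python-stream-json/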
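To give a feel for what streaming parsing means, here is a rough sketch that uses only Python's standard library rather than any of the packages linked above. It assumes the input is one top-level JSON array, as in the schema from the question; the function name `iter_array_items` and the `read_size` parameter are hypothetical. It is deliberately lenient about commas and is not tuned for speed.

```python
import io
import json


def iter_array_items(fp, read_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Only one buffer of text plus the current element is held in memory.
    """
    decoder = json.JSONDecoder()
    buf, started, eof = "", False, False
    while True:
        buf = buf.lstrip()
        if buf and not started:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buf, started = buf[1:], True
            continue
        if buf and started:
            if buf[0] == ",":      # separator between elements (accepted leniently)
                buf = buf[1:]
                continue
            if buf[0] == "]":      # end of the array
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise          # malformed, not merely incomplete
            else:
                rest = buf[end:].lstrip()
                # Accept the element only once its terminator has arrived, so a
                # value cut off at the buffer edge (e.g. "1e" of "1e5") is not
                # emitted too early.
                if rest[:1] in (",", "]"):
                    yield item
                    buf = rest
                    continue
        if eof:
            raise ValueError("unexpected end of input or malformed array")
        chunk = fp.read(read_size)  # fetch more text and try again
        eof = not chunk
        buf += chunk


# Usage with an in-memory stand-in for the real file:
sample = io.StringIO('[{"person_id": 1}, {"person_id": 2}, {"person_id": 3}]')
count = sum(1 for _ in iter_array_items(sample, read_size=8))
print(count)
```

With a real file, `fp` would be something like `open(path, encoding="utf-8")`. Note that a syntax error in the input makes this sketch keep reading until the end of the file before it raises, so a dedicated streaming parser is the safer choice for production use.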

【Comments】:

【Answer 3】:

Consider using jq to preprocess your JSON file.

It can split and stream your large JSON file.

jq is like sed for JSON data - you can use it to slice 
and filter and map and transform structured data with 
the same ease that sed, awk, grep and friends let you play with text.

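See the official documentation and this question for more information.

Bonus: regarding your first question, jq is written in C, so it is faster than Python/Node, isn't it?

【Comments】:

【Answer 4】:

Snowflake has a very special treatment for JSON, and once you understand it, the design is easy to sketch out.

When loading the JSON data from the stage location, set the flag strip_outer_array = true: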

copy into <table> from @~/<file>.json file_format = (type = 'JSON' strip_outer_array = true);

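When loading into Snowflake, each row cannot exceed 16MB compressed.

Use utilities that split the file on line boundaries and keep each file no larger than 100MB; that brings the power of parallelism as well as accuracy to your data load.

Given your data set size, you will get about 31K small files (100MB each).

That would mean 31k parallel processes, which is not possible. So choose an X-Large warehouse (16 v-cores and 32 threads): 31k / 32 = (roughly) 1,000 rounds. Depending on your network bandwidth, loading the data should take no more than a few minutes. Even if we assume 3 seconds per round, it could load the data in about 50 minutes.

For warehouse configuration and throughput details, see the semi-structured data loading best practices.

【Comments】:

【Answer 5】:

The simplest way for me is: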

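import json

# Note: json.load reads the whole file into memory, so this only suits files that fit in RAM.
with open("<your_file>") as f:
    json_file = json.load(f)  # expects a top-level JSON array
chunks = 200  # number of records per chunk
for i in range(0, len(json_file), chunks):
    print(json_file[i:i+chunks])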
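The same chunking idea can be applied lazily, so that neither the input nor a whole output file has to sit in memory. The sketch below assumes the records arrive from some iterator (for example a streaming parser); `write_chunks`, the `part` prefix, and the file naming are illustrative choices, not part of the answer above.

```python
import itertools
import json
import os
import tempfile

_END = object()  # sentinel marking an exhausted iterator


def write_chunks(records, chunk_size, out_dir, prefix="part"):
    """Write records into numbered files, chunk_size records per file.

    Each output file is a valid JSON array; records are written one by one.
    """
    paths = []
    it = iter(records)
    for index in itertools.count():
        chunk = itertools.islice(it, chunk_size)   # lazy view of the next chunk
        first = next(chunk, _END)
        if first is _END:                          # no records left
            break
        path = os.path.join(out_dir, f"{prefix}_{index:05d}.json")
        with open(path, "w", encoding="utf-8") as out:
            out.write("[")
            out.write(json.dumps(first))
            for record in chunk:
                out.write(",\n")
                out.write(json.dumps(record))
            out.write("]")                         # close the array properly
        paths.append(path)
    return paths


# Usage with generated stand-in records; the real job would use 200_000 per file.
with tempfile.TemporaryDirectory() as tmp:
    demo_records = ({"person_id": n} for n in range(7))
    demo_paths = write_chunks(demo_records, 3, tmp)
    print([os.path.basename(p) for p in demo_paths])
```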

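【Comments】:

Your answer could be improved by adding more information about what the code does and how it helps the OP.

【Answer 6】:

Split and compress at the same time with bash, producing files of ~100MB each once compressed (split -C 1000000000 puts at most 1,000,000,000 bytes of complete lines into each piece before gzip runs):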

cat bigfile.json  | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'

See more: https://***.com/a/68718176/132438

【Comments】:
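For readers who would rather stay in Python, the pipeline above can be approximated as follows. This is a sketch under the assumption that the input has one JSON record per line; `split_lines_gzip` and its parameters are hypothetical names. Unlike `split -C`, it never breaks a line: a line longer than `max_bytes` is written whole into its own file.

```python
import gzip
import io
import os
import tempfile


def split_lines_gzip(src, out_dir, prefix, max_bytes):
    """Copy whole lines from a binary stream into numbered .gz files.

    A new file is started before the uncompressed size would exceed max_bytes.
    """
    paths = []
    out = None
    written = 0
    for line in src:
        if out is None or (written and written + len(line) > max_bytes):
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"{prefix}{len(paths):04d}.gz")
            out = gzip.open(path, "wb")
            paths.append(path)
            written = 0
        out.write(line)
        written += len(line)
    if out is not None:
        out.close()
    return paths


# Usage with an in-memory stand-in; a real run would pass open(path, "rb")
# and max_bytes=1_000_000_000.
demo_lines = b"".join(b'{"person_id": %d}\n' % n for n in range(10))
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = split_lines_gzip(io.BytesIO(demo_lines), tmp, "output_prefix", 50)
    print([os.path.basename(p) for p in demo_paths])
```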
