How to convert nested JSON to CSV with pandas

Posted 2023-02-23

[Title] How to convert nested json in csv with pandas [Posted] 2022-01-20 09:26:49 [Question]:

I have a nested JSON file (100k rows) that looks like this:

{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}

I am trying to create a CSV so that it can be loaded easily into an RDBMS. I am trying to use json_normalize() in pandas, but I hit an error before even getting that far.

with open('transactions.json') as data_file:    
    data = json.load(data_file)

JSONDecodeError: Extra data: line 2 column 1 (char 466)

[Comments]:

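Question: does your data file contain 100k lines, each holding a separate valid JSON record, or is it all one very long JSON structure?

Each line has a separate valid JSON record. When opened in Excel, each record looks like one row.

Do all records have the structure shown in your example? Converting this nested structure to a flat CSV will be a challenge, and you have to decide what to do with the "Segment" list: does it go into one cell? Does each of its elements go into its own cell? What do you want to do with the key-value pairs in each element?

Each segment should go into its own cell.

I have edited my answer to provide a more complete solution.

[Answer 1]: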

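If your problem stems from reading the JSON file itself, then I would use:

json.loads()

and then use

pd.read_csv()

If your problem stems from converting the JSON dict to a DataFrame, you can use: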

test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}

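import json
from io import StringIO

import pandas as pd

# serialize the dict to a JSON string and read it back
# (wrapping it in StringIO avoids passing a literal JSON string to read_json)
df = pd.read_json(StringIO(json.dumps(test)), convert_axes=True)

# 'unpack' each Segment dict into a Series and merge the new columns with the original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)

# remove the original column of dicts
df.drop('Segment', axis=1, inplace=True)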

This would be my approach, but there may be more convenient ones.
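
Since the question mentions json_normalize(), here is a minimal sketch of how it could flatten the nested Segment list into one row per segment. The `records` list is made-up sample data modeled on the question's record and trimmed to a few fields, and producing one row per segment is an assumption about the desired output rather than something this answer prescribes.

```python
import pandas as pd

# Hypothetical sample modeled on the question's record, trimmed to a few fields.
records = [
    {
        "UniqueId": "4224f3c9-323c-e911-a820-a7f2c9e35195",
        "Itinerary": "MUC-CPH-ARN-MUC",
        "Segment": [
            {"DepartureAirportCode": "MUC", "ArrivalAirportCode": "CPH", "SegmentNumber": "1"},
            {"DepartureAirportCode": "ARN", "ArrivalAirportCode": "MUC", "SegmentNumber": "2"},
        ],
    },
]

# One output row per element of "Segment"; the listed top-level fields
# are repeated on every row.
flat = pd.json_normalize(records, record_path="Segment", meta=["UniqueId", "Itinerary"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = flat.to_csv(index=False)
```

This layout repeats the transaction fields on every segment row, which tends to suit a relational table of segments keyed by UniqueId.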

[Comments]:

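Hi, the problem is in the conversion from the JSON dict to the DataFrame. Your solution works for a single row, but I have a file with 100k rows similar to the one I showed, and it fails when reading those rows.

Then you should follow @joanis's approach and loop over the individual lines to read them.

[Answer 2]: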

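Step 1: Loop over the file of records

Since your file has one JSON record per line, you need to loop over all the records in the file, which you can do like this: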

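import json
from io import StringIO

import pandas as pd

with open('transactions.json', encoding="utf8") as data_file:
    for line in data_file:
        data = json.loads(line)
        # or (StringIO avoids passing a literal JSON string to read_json)
        df = pd.read_json(StringIO(line), convert_axes=True)
        # do something with data or df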
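
As an alternative to the explicit loop, pandas can read a file holding one JSON record per line in a single call through the `lines=True` option of `read_json`. The sketch below uses a small in-memory sample (made up, loosely modeled on the question's record) in place of transactions.json; the nested Segment lists stay as Python lists in their column.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-line sample standing in for transactions.json.
sample = (
    '{"UniqueId": "a1", "Itinerary": "MUC-CPH", '
    '"Segment": [{"DepartureAirportCode": "MUC", "ArrivalAirportCode": "CPH"}]}\n'
    '{"UniqueId": "b2", "Itinerary": "ARN-MUC", '
    '"Segment": [{"DepartureAirportCode": "ARN", "ArrivalAirportCode": "MUC"}]}\n'
)

# lines=True tells pandas to parse each line as its own JSON record.
df = pd.read_json(StringIO(sample), lines=True)
```

For the real file, the same call would presumably be `pd.read_json('transactions.json', lines=True)`; the Segment column still needs flattening afterwards.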

Step 2: Write the CSV file

Now you can combine this with csv.writer to convert the file to a CSV file.

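import csv

records = []  # placeholder: fill with the parsed JSON records from step 1

with open('transactions.csv', "w", encoding="utf8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # Loop over each record, somehow:
    for record in records:
        row = list(record.values())  # build the list with the row contents
        writer.writerow(row)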
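
If the number of segments varies between records, csv.DictWriter may be a convenient variant of this step, because it fills missing columns with a default value. The sketch below is only an illustration: the `rows` data and the `Segment1`/`Segment2` column names are hypothetical, and a StringIO buffer stands in for the output file.

```python
import csv
from io import StringIO

# Hypothetical, already-flattened records; the second one lacks "Segment2".
rows = [
    {"UniqueId": "a1", "Segment1": "MUC-CPH", "Segment2": "ARN-MUC"},
    {"UniqueId": "b2", "Segment1": "MUC-CPH"},
]
fieldnames = ["UniqueId", "Segment1", "Segment2"]

out = StringIO()  # stands in for open('transactions.csv', 'w', newline='')
# restval is the value written for any field missing from a row.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
```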

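Putting it all together

I would read the first record to get its keys and write them out as the CSV header, then read the whole file and convert it one record at a time: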

import copy
import csv
import json
import pandas as pd

# Read the first JSON record to get the keys that we'll use as headers for the CSV file
with open('transactions.json', encoding="utf8") as data_file:
    keys = list(json.loads(next(data_file)).keys())

# Our CSV headers are going to be the keys from the first row, except for
# segments, which we'll replace (arbitrarily) by three numbered segment column
# headings.
keys.pop()
base_keys = copy.copy(keys)
keys.extend(["Segment1", "Segment2", "Segment3"])

with open('transactions.csv', "w", encoding="utf8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(keys)  # Write the CSV headers

    with open('transactions.json', encoding="utf8") as data_file:
        for line in data_file:
            data = json.loads(line)
            row = [data[k] for k in base_keys] + data["Segment"]
            writer.writerow(row)

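The resulting CSV file will still hold one whole record in each SegmentN column, written as the string form of the Python dict. If you want to format each segment differently, you can define a format_segment(segment) function and replace data["Segment"] with this list comprehension: [format_segment(segment) for segment in data["Segment"]]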
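
A minimal sketch of what such a format_segment(segment) function might look like follows. The 'DEP-ARR' output format, the `segment_cells` helper, and the choice to pad or truncate to three cells (matching the three SegmentN headers) are all assumptions made for illustration.

```python
def format_segment(segment):
    """Hypothetical formatter: render one segment dict as 'DEP-ARR'."""
    return f'{segment["DepartureAirportCode"]}-{segment["ArrivalAirportCode"]}'


def segment_cells(segments, width=3):
    """Format each segment and pad (or truncate) to exactly `width` cells."""
    cells = [format_segment(s) for s in segments[:width]]
    return cells + [""] * (width - len(cells))


# Hypothetical record with two segments, trimmed to the fields used here.
data = {"Segment": [
    {"DepartureAirportCode": "MUC", "ArrivalAirportCode": "CPH"},
    {"DepartureAirportCode": "ARN", "ArrivalAirportCode": "MUC"},
]}
cells = segment_cells(data["Segment"])
```

Padding keeps every row the same width as the header row, so records with fewer than three segments do not shift columns.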
