How to read a large json in pandas?

Posted 2023-03-11


【Title】: How to read a large json in pandas? 【Posted】: 2018-03-29 03:49:57 【Question】:

My code is: data_review=pd.read_json('review.json') and my review data looks like this:


    {
        // string, 22 character unique review id
        "review_id": "zdSx_SD6obEhz9VrW9uAWA",

        // string, 22 character unique user id, maps to the user in user.json
        "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

        // string, 22 character business id, maps to business in business.json
        "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

        // integer, star rating
        "stars": 4,

        // string, date formatted YYYY-MM-DD
        "date": "2016-03-09",

        // string, the review itself
        "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

        // integer, number of useful votes received
        "useful": 0,

        // integer, number of funny votes received
        "funny": 0,

        // integer, number of cool votes received
        "cool": 0
    }

But I get the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

My json file does not contain any comments and it is 3.8 GB! I just downloaded the file from here to practice: link

When I use the code below, it throws the same error:

import json
with open('review.json') as json_file:
    data = json.load(json_file)

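【Comments】:

There is a problem with your path/file argument. Make sure the file exists in the folder you are running python from. Maybe add more details about how and from where you call this script.

You cannot have comments in a json file: ***.com/questions/244777/can-comments-be-used-in-json Can you try running the code with a clean .json file?

@LukasAnsteeg I'm pretty sure it never gets as far as parsing the json, because of some earlier error.

@sascha Yes, I have checked it carefully, but it does work.

@LukasAnsteeg This is presumably the code of pandas's read_json.

【Answer 1】: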

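Perhaps the file you are reading contains multiple json objects, rather than the single json object or array that json.load(json_file) and pd.read_json('review.json') expect. These methods are meant to read a file that holds a single json object.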

From what I have seen of the yelp dataset, your file must contain something like this:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

So it is important to realize that this is not a single json document, but multiple json objects in one file.
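To make that distinction concrete, here is a minimal sketch. It uses a made-up two-record string in place of the real file (the `sample` contents are hypothetical) and shows that parsing the whole text as one document fails, while parsing it line by line succeeds. It only illustrates the parsing behaviour and is not a diagnosis of the OSError in the question.

```python
import json

# Hypothetical two-record sample standing in for review.json
sample = '{"review_id": "a", "stars": 5}\n{"review_id": "b", "stars": 3}\n'

# Parsing the whole text as a single JSON document fails at the second object
try:
    json.loads(sample)
    whole_file_ok = True
except json.JSONDecodeError as err:
    whole_file_ok = False
    print("whole-file parse failed:", err.msg)

# Parsing each non-empty line as its own JSON document works
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(records), "records parsed")
```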

To read this data into a pandas DataFrame, the following solution should work:

import json
import pandas as pd

with open('review.json') as json_file:
    data = json_file.readlines()
    # The line below may take at least 8-10 minutes for 4-5 million rows. It converts every string in the list into an actual json object.
    data = list(map(json.loads, data))

df = pd.DataFrame(data)

Given the size of the data, I think your machine will need quite a while to load it into a DataFrame.
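If only some of the fields are needed, one way to reduce memory use is to iterate over the file object directly instead of calling readlines(), keeping just those fields. The sketch below is a variant under stated assumptions, not the answerer's method: the helper `iter_records` and the chosen fields are hypothetical, and a small in-memory buffer stands in for review.json.

```python
import io
import json
import pandas as pd

def iter_records(lines, fields):
    """Yield one dict per non-empty line, keeping only the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}

# Hypothetical stand-in for open('review.json'); a real file object iterates the same way
fake_file = io.StringIO(
    '{"review_id": "a", "stars": 5, "text": "long text"}\n'
    '{"review_id": "b", "stars": 3, "text": "more text"}\n'
)

df = pd.DataFrame(iter_records(fake_file, fields=("review_id", "stars")))
print(df)
```

Dropping the long "text" field before building the DataFrame is the main saving here; whether that is acceptable depends on what you plan to do with the reviews.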

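【Comments】:

Is there any solution within pandas, without a for loop, for a big json file that has one json object per line?

@devssh, see the answer below! Just pass lines=True and chunksize=<something> to pandas.read_json. You still need to iterate over the JsonReader it returns to access the file contents, but you would have to take a similar approach anyway to avoid loading the whole file into memory. Some details: pandas.pydata.org/pandas-docs/stable/user_guide/…

【Answer 2】: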

If you don't want to use a for loop, the following should do the trick:

import pandas as pd

df = pd.read_json("foo.json", lines=True)

This handles the case where your json file looks similar to this:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}

It turns it into a DataFrame consisting of a single column, foo, with three rows.
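As a quick check of that claim, here is a small sketch that feeds the three-line example through read_json. The io.StringIO buffer is only a stand-in for foo.json so that the snippet runs without a file on disk.

```python
import io
import pandas as pd

# Three line-delimited JSON objects, as in the example above
text = '{"foo": "bar"}\n{"foo": "baz"}\n{"foo": "qux"}\n'

# lines=True tells pandas to parse one JSON object per line
df = pd.read_json(io.StringIO(text), lines=True)
print(df.shape)
print(df["foo"].tolist())
```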

You can read more in the pandas docs.

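【Comments】:

If you downvote, please explain why this answer is inadequate.

Not sure why you were downvoted! If the OP's "json" file is actually a line-delimited list of json objects, then yours is the cleaner solution, and it makes full use of pandas. (People often confuse these two kinds of "json"... I think line-delimited json should always have a .jsonl extension.) Yours is also better because, if the jsonl file is large, you can set a chunksize so that you get a JsonReader instead of a DataFrame. That lets you avoid loading the whole jsonl file into memory. (Although lines=True is a fairly recent pandas feature...)

Note that the newline-delimited json format seems to be called "ndjson": ndjson.org

【Answer 3】: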

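Passing the parameters lines=True and chunksize=X creates a reader that reads a specific number of lines at a time.

You then have to loop over it to display each chunk.

Here is a piece of code to illustrate: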

import pandas as pd
import json
chunks = pd.read_json('../input/data.json', lines=True, chunksize = 10000)
for chunk in chunks:
    print(chunk)
    break

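The reader produces several chunks depending on the length of your json (counted in lines). For example, I have a json of 100 000 lines containing X objects; if I set chunksize = 10 000, I get 10 chunks.

In the code I gave, I added a break so that only the first chunk is printed, but if you remove it you will get the 10 chunks one after another.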
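In practice you usually want to do something with each chunk rather than print it. The sketch below, which is an illustration under assumed data rather than part of the answer, keeps only the matching rows from each chunk and concatenates them at the end. The sample records, the `stars >= 4` filter and the chunk size of 4 are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large line-delimited file: 10 small records
text = "".join(
    '{"stars": %d, "useful": %d}\n' % (i % 5 + 1, i % 2) for i in range(10)
)

# With chunksize set, read_json returns an iterator of DataFrames
reader = pd.read_json(io.StringIO(text), lines=True, chunksize=4)

kept = []
n_chunks = 0
for chunk in reader:
    n_chunks += 1
    # Keep only the rows of interest so the full file never sits in memory
    kept.append(chunk[chunk["stars"] >= 4])

result = pd.concat(kept, ignore_index=True)
print(n_chunks, len(result))
```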

【Comments】:

【Answer 4】:

If your json file contains multiple objects rather than a single one, the following should work:

import json

data = []
# Use a context manager so the file is closed once all lines are read
with open('sample.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

Note the difference between json.load and json.loads.

json.loads() expects a (valid) JSON string, i.e. {"foo": "bar"}. So if your json file looks like what @Mant1c0r3 described, json.loads is the appropriate choice.
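A short sketch of that difference: json.loads takes a string, whereas json.load takes a file-like object and parses everything it contains. The io.StringIO buffer here is only a stand-in for an opened file.

```python
import io
import json

text = '{"foo": "bar"}'

# json.loads parses a string that is already in memory
from_string = json.loads(text)

# json.load reads from a file-like object and parses its entire contents
from_file = json.load(io.StringIO(text))

print(from_string == from_file)
```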

【Comments】:
