Unpack JSON lines to pandas dataframe
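Posted: 2021-04-14 17:32:31
Question:
I am working with the JSON Lines format and trying to "unpack" the dictionary objects into a single list. Because each line uses a list to hold the dictionary objects, I have not found an earlier post that deals with this case. The data looks like this, with a bunch of nested dictionaries inside a list object: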
0        [{'created_at': 'Sun Jun 14 20:20:28 +0000 202...
1        [{'created_at': 'Sat Jul 25 22:30:14 +0000 202...
2        [{'created_at': 'Sat May 30 02:22:04 +0000 202...
3        [{'created_at': 'Tue May 05 16:54:05 +0000 202...
4        [{'created_at': 'Sat Jun 20 13:50:23 +0000 202...
                               ...
17453    [{'created_at': 'Mon Apr 13 01:01:10 +0000 202...
17454    [{'created_at': 'Fri Jul 17 09:00:50 +0000 202...
17455    [{'created_at': 'Sun Jun 21 00:51:54 +0000 202...
17456    [{'created_at': 'Tue Jun 02 18:23:49 +0000 202...
17457    [{'created_at': 'Thu May 28 00:27:01 +0000 202...
What I have tried so far:
import pandas as pd

with open('data') as file:
    lines = file.read().splitlines()
df_inter = pd.DataFrame(lines)
df_inter.columns = ['json_element']
For nested dictionaries, I would use pd.json_normalize(df_inter['json_element'].apply(json.loads)), as suggested in this post. However, how do I unpack multiple dictionary objects into one row?
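For reference, here is a minimal sketch of what pd.json_normalize does with nested dictionaries. The two sample lines are made up for illustration and are not taken from the actual data file; each one is assumed to hold a single dictionary rather than a list.

```python
import json

import pandas as pd

# Hypothetical JSON lines, each holding one nested dictionary.
lines = [
    '{"id": 1, "entities": {"hashtags": ["Tenet"], "urls": []}}',
    '{"id": 2, "entities": {"hashtags": [], "urls": ["http://example.com"]}}',
]

# Parse each line, then flatten nested dictionaries into dotted column names.
records = [json.loads(line) for line in lines]
flat = pd.json_normalize(records)

print(flat.columns.tolist())
```

The nested "entities" dictionary ends up as separate columns such as entities.hashtags and entities.urls, while list values are kept as-is inside their cells.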
Edit
Because the data is large, here is part of a single row:
[{'created_at': 'Sun Jun 14 20:20:28 +0000 2020', 'id': 1272262651100434433, 'id_str': '1272262651100434433', 'truncated': False, 'display_text_range': [0, 243], 'entities': {'hashtags': [{'text': 'Tenet', 'indices': [82, 88]}], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1272262640753094656, 'id_str': '1272262640753094656', 'indices': [244, 267], 'media_url': 'http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg'...]
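Note that the sample above is shown in Python literal syntax (single quotes, False) rather than strict JSON (double quotes, false). If the file itself were stored that way, which is only an assumption here and may simply be an artifact of printing, json.loads would reject the line. A rough sketch of a fallback using ast.literal_eval, on a shortened made-up line:

```python
import ast
import json

# Hypothetical line written as a Python literal, not as strict JSON.
line = "[{'id': 1, 'truncated': False, 'display_text_range': [0, 243]}]"

# json.loads expects double-quoted strings and lowercase false, so this fails.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates Python literals (lists, dicts, bools, numbers, strings).
records = ast.literal_eval(line)

print(parsed_as_json, records[0]["truncated"])
```

If the file contains valid JSON, json.loads remains the simpler and faster choice.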
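Comments:
It depends on the structure of your dictionaries: whether every element of the list has the same structure, and how many nested dictionaries there are. Please provide more comprehensive sample data.
Thanks, I have added sample data to the post.
@Kapocsi You are right. I have edited the post.

Answer 1:
If your data file looks like this: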
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
[{"created_at": "Sun Jun 14 20:20:28 +0000 2020", "id": 1272262651100434433, "id_str": "1272262651100434433", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [{"text": "Tenet", "indices": [82, 88]}], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 1272262640753094656, "id_str": "1272262640753094656", "indices": [244, 267], "media_url": "http://pbs.twimg.com/media/Eaf8IYsWsAAHVHV.jpg"}]}}]
You can use the following code to get one dataframe row per line of the jsonl file.
import json
import pandas as pd

with open('data') as f:
    # Each line is a JSON list; [0] takes the first dictionary in that list.
    df = pd.DataFrame(json.loads(line)[0] for line in f)
Your df will look like this:
                       created_at                   id               id_str  truncated display_text_range                                           entities
0  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
1  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
2  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
3  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
4  Sun Jun 14 20:20:28 +0000 2020  1272262651100434433  1272262651100434433      False           [0, 243]  {'hashtags': [{'text': 'Tenet', 'indices': [82...
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   created_at          5 non-null      object
 1   id                  5 non-null      int64
 2   id_str              5 non-null      object
 3   truncated           5 non-null      bool
 4   display_text_range  5 non-null      object
 5   entities            5 non-null      object
dtypes: bool(1), int64(1), object(4)
memory usage: 333.0+ bytes
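The code above keeps only the first dictionary of each line (the [0] index) and leaves entities as a nested object. If some lines hold more than one dictionary, or the nested fields should become columns of their own, a variation along these lines might help. This is only a sketch: the sample lines are invented, and combining itertools.chain with pd.json_normalize is a suggestion rather than part of the answer above.

```python
import json
from itertools import chain

import pandas as pd

# Hypothetical JSON lines: each line is a list holding one or more dictionaries.
lines = [
    '[{"id": 1, "entities": {"hashtags": [{"text": "Tenet"}]}}]',
    '[{"id": 2, "entities": {"hashtags": []}}, {"id": 3, "entities": {"hashtags": []}}]',
]

# Parse every line and chain the per-line lists into one flat list of dictionaries.
records = list(chain.from_iterable(json.loads(line) for line in lines))

# One row per dictionary; nested dictionaries become dotted column names.
df = pd.json_normalize(records)
print(df)
```

With a real file, the lines list would be replaced by iterating over the open file handle, as in the answer's code. Note that this yields one row per dictionary, not one row per line, so a line with two dictionaries contributes two rows.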
Comments:
Thank you, sir. This is amazingly simple code: there is no need to unpack the list first and then convert it to a dataframe.