How to extract data from a list of dicts into a pandas DataFrame?

【Title】: How to extract data from a list of dicts, into a pandas dataframe? 【Posted】: 2021-01-02 14:59:08 【Question】:

This is part of the JSON file I got after running a Python script that uses the Telethon API.

[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}]

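As you can see, the Python script has scraped the chat history of a particular Telegram channel. All I need is to store the date and message parts of the JSON in a separate dataframe, so that I can apply the appropriate filters and produce the right output. Can anyone help me with this?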

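【Answer 1】: This assumes the object returned from the API is not a string (e.g. '[{...}, {...}]'). If it is a string, use data = json.loads(data) first. The 'date' and the corresponding 'message' can be extracted from the list of dicts with a list comprehension. Iterate over each dict in the list and use dict.get for each key. If the key does not exist, dict.get returns None.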
import pandas as pd

# where data is the list of dicts, unpack the desired keys and load into pandas
df = pd.DataFrame([{'date': i.get('date'), 'message': i.get('message')} for i in data])

# display(df)
                        date                                                                                                                                                            message
0  2020-09-03T14:51:03+00:00  Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
1  2020-09-03T11:48:18+00:00                                                                                                                                                               None

Alternatively

If you want to skip the records where 'message' is None:
df = pd.DataFrame([{'date': i['date'], 'message': i['message']} for i in data if i.get('message')])

                      date                                                                                                                                                            message
 2020-09-03T14:51:03+00:00  Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
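
Since the question mentions applying filters afterwards, here is a minimal sketch of how that might look once the two columns are in a dataframe. The shortened `raw` payload, the `cutoff` timestamp and the "product" keyword are made up for illustration; only the 'date' and 'message' keys come from the question.

```python
import json
import pandas as pd

# Hypothetical payload, shortened from the sample in the question.
raw = (
    '[{"_": "Message", "date": "2020-09-03T14:51:03+00:00", '
    '"message": "Looking for product managers"}, '
    '{"_": "MessageService", "date": "2020-09-03T11:48:18+00:00"}]'
)

# Parse only if the API handed back a string rather than a list of dicts.
data = json.loads(raw) if isinstance(raw, str) else raw

df = pd.DataFrame([{'date': i.get('date'), 'message': i.get('message')} for i in data])

# Convert the ISO 8601 strings to timezone-aware timestamps so they compare correctly.
df['date'] = pd.to_datetime(df['date'], utc=True)

# Example filter 1: messages on or after an (assumed) cutoff that actually have text.
cutoff = pd.Timestamp("2020-09-03T12:00:00", tz="UTC")
recent = df[(df['date'] >= cutoff) & df['message'].notna()]

# Example filter 2: case-insensitive keyword match; rows without text count as no match.
keyword = df[df['message'].str.contains("product", case=False, na=False)]

print(recent)
print(keyword)
```

Parsing 'date' first matters because the raw values are plain strings, so comparisons against a timestamp would not behave as date comparisons until they are converted.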

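【Answer 2】:

I think you should use json.loads and then json_normalize to convert the JSON into a dataframe, where max_level controls how many levels of nested dicts are flattened.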

from pandas import json_normalize
import json
d = '[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}]'
f = json.loads(d)
print(json_normalize(f, max_level=2))
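
json_normalize flattens every key, so to end up with just the two columns the question asks for, the result still has to be narrowed down. A rough sketch of that step, using a shortened and hypothetical version of the sample string, might be:

```python
import json
import pandas as pd

# Hypothetical payload, shortened from the sample in the question.
d = (
    '[{"_": "Message", "id": 4589, '
    '"to_id": {"_": "PeerChannel", "channel_id": 1399858792}, '
    '"date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers"}, '
    '{"_": "MessageService", "id": 4588, '
    '"to_id": {"_": "PeerChannel", "channel_id": 1399858792}, '
    '"date": "2020-09-03T11:48:18+00:00", '
    '"action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}}]'
)

# Nested dicts become dotted column names such as 'to_id.channel_id'.
flat = pd.json_normalize(json.loads(d), max_level=2)
print(list(flat.columns))

# Keep only the columns of interest; records without a 'message' key get a missing value.
subset = flat[['date', 'message']]
print(subset)
```

Compared with the list comprehension in the first answer, this keeps the nested fields available (for example 'action.inviter_id') in case they are needed for filtering later.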
