How to Read a Text File of Dictionaries into a DataFrame
【Posted】: 2019-06-26 15:04:04
【Question】: I have a text file of *** Royale statistics from Kaggle. It is formatted as Python dictionaries, one per line. I am struggling to work out how to read it into a DataFrame in a meaningful way, and I am curious what the best approach is. Each dictionary is fairly complex and contains nested lists.
Raw dataset here: https://www.kaggle.com/s1m0n38/***-royale-matches-dataset
{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
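【Answer 1】: I saved your data to a .json file. From there it is just a matter of iterating over each line, treating it as its own JSON document, and loading it into a DataFrame with pandas.json_normalize. I made some guesses about how you want the df to look, but I came up with this:
Note: Valid JSON requires double quotes rather than single quotes, so I used replace to fix that. Be careful that this does not corrupt the data itself, for example a name that contains an apostrophe.
Note: The way I got this working, I had to merge 'right' and 'left' into one list, so the information about which side a player was on is lost. If you need it, you could use a dict comprehension as a workaround.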
import json
import pandas as pd

with open('cr.json', 'r') as f:
    df = None
    for line in f:
        # JSON needs double quotes, so swap the single quotes before parsing
        data = json.loads(line.replace("'", '"'))
        # the right and left players had to be combined into one list;
        # there may be a way around this, but I did not find one
        df1 = pd.json_normalize([data['players']['right'], data['players']['left']],
                                'deck',
                                ['name', 'trophy', 'clan'],
                                meta_prefix='player.',
                                errors='ignore')
        # pd.concat silently skips the initial None
        df = pd.concat([df, df1])

# the two unnamed deck columns (0 and 1) are the card name and card level
df.rename(columns={0: 'player.troop.name', 1: 'player.troop.level'},
          inplace=True)
print(df)
This prints:
player.troop.name player.troop.level player.name player.clan \
0 Mega Minion 9 gpa raid TwoFiveOne
1 Electro Wizard 3 gpa raid TwoFiveOne
2 Arrows 11 gpa raid TwoFiveOne
3 Lightning 5 gpa raid TwoFiveOne
4 Tombstone 9 gpa raid TwoFiveOne
5 The Log 2 gpa raid TwoFiveOne
6 Giant 9 gpa raid TwoFiveOne
7 Bowler 5 gpa raid TwoFiveOne
8 Fireball 9 Supr4 battusai
9 Archers 12 Supr4 battusai
10 Goblins 12 Supr4 battusai
11 Minions 11 Supr4 battusai
12 Bomber 12 Supr4 battusai
13 The Log 2 Supr4 battusai
14 Barbarians 12 Supr4 battusai
15 Royal Giant 13 Supr4 battusai
0 Ice Spirit 10 TITAN The Wolves
1 Valkyrie 9 TITAN The Wolves
2 Hog Rider 9 TITAN The Wolves
3 Inferno Tower 9 TITAN The Wolves
4 Goblins 12 TITAN The Wolves
5 Musketeer 9 TITAN The Wolves
6 Zap 12 TITAN The Wolves
7 Fireball 9 TITAN The Wolves
8 Royal Giant 13 Supr4 battusai
9 Ice Wizard 2 Supr4 battusai
10 Bomber 12 Supr4 battusai
11 Knight 12 Supr4 battusai
12 Fireball 9 Supr4 battusai
13 Barbarians 12 Supr4 battusai
14 The Log 2 Supr4 battusai
15 Archers 12 Supr4 battusai
0 Miner 3 Victor @LA PERLA NEGRA
1 Ice Golem 9 Victor @LA PERLA NEGRA
2 Spear Goblins 12 Victor @LA PERLA NEGRA
3 Minion Horde 12 Victor @LA PERLA NEGRA
4 Inferno Tower 8 Victor @LA PERLA NEGRA
5 The Log 2 Victor @LA PERLA NEGRA
6 Skeleton Army 6 Victor @LA PERLA NEGRA
7 Fireball 10 Victor @LA PERLA NEGRA
8 Royal Giant 13 Supr4 battusai
9 Ice Wizard 2 Supr4 battusai
10 Bomber 12 Supr4 battusai
11 Knight 12 Supr4 battusai
12 Fireball 9 Supr4 battusai
13 Barbarians 12 Supr4 battusai
14 The Log 2 Supr4 battusai
15 Archers 12 Supr4 battusai
player.trophy
0 4258
1 4258
2 4258
3 4258
4 4258
5 4258
6 4258
7 4258
8 4325
9 4325
10 4325
11 4325
12 4325
13 4325
14 4325
15 4325
0 4237
1 4237
2 4237
3 4237
4 4237
5 4237
6 4237
7 4237
8 4296
9 4296
10 4296
11 4296
12 4296
13 4296
14 4296
15 4296
0 4300
1 4300
2 4300
3 4300
4 4300
5 4300
6 4300
7 4300
8 4267
9 4267
10 4267
11 4267
12 4267
13 4267
14 4267
15 4267
And df.iloc[0] looks like this:
player.troop.name Mega Minion
player.troop.level 9
player.name gpa raid
player.trophy 4258
player.clan TwoFiveOne
Name: 0, dtype: object
You can adjust the json_normalize arguments to get the shape you want, but I hope this is enough to get you going.
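The note above about replacing quotes can be made concrete with a small sketch. The line below is hypothetical (it is not taken from the dataset) and simply contains an apostrophe inside a value; under that assumption, the blind replace produces invalid JSON, while ast.literal_eval reads the Python literal directly.

```python
import json
from ast import literal_eval

# Hypothetical line with an apostrophe inside a value (not from the dataset).
line = "{'name': \"O'Neil\", 'trophy': '4258'}"

# Blind replacement turns the apostrophe into a stray double quote,
# so the resulting string is no longer valid JSON.
try:
    json.loads(line.replace("'", '"'))
    replaced_ok = True
except json.JSONDecodeError:
    replaced_ok = False

# literal_eval parses the Python literal as written, so no replacement is needed.
parsed = literal_eval(line)

print(replaced_ok)
print(parsed['name'])
```

If the file really is a dump of Python dicts, parsing with literal_eval avoids the quoting problem altogether; the replace approach is fine as long as no value contains a quote character.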
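【Answer 2】: According to the synopsis of this dataset on Kaggle, each dictionary represents a match between two players. I think it makes sense for each row of the DataFrame to hold all the features of a single match.
This can be done in a few short steps.
- Store all the match dictionaries (each line of the Kaggle dataset) in a list: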
matches = [
    {'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'},
    {'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'},
    {'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
]
- Create a DataFrame from the list above. This automatically populates the columns holding each match's type, time, and result:
df = pd.DataFrame(matches)
- Then use some simple logic to populate the columns holding the deck, trophy, clan, and name of the left and right players in each match:
sides = ['right', 'left']
player_keys = ['deck', 'trophy', 'clan', 'name']
for side in sides:
    for key in player_keys:
        # apply already covers the whole column, so no row-by-row loop is needed
        df[side + '_' + key] = df['players'].apply(lambda x: x[side][key])
df = df.drop('players', axis=1)  # no longer need this after populating the other columns
df = df.iloc[:, ::-1]  # made sense to display columns in order of player info from left to right,
                       # followed by general match info at the far right of the dataframe
The resulting DataFrame looks like this:
left_name left_clan left_trophy left_deck right_name right_clan right_trophy right_deck type time result
0 Supr4 battusai 4325 [[Fireball, 9], [Archers, 12], [Goblins, 12], ... gpa raid TwoFiveOne 4258 [[Mega Minion, 9], [Electro Wizard, 3], [Arrow... ladder 2017-07-12 [2, 0]
1 Supr4 battusai 4296 [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ... TITAN The Wolves 4237 [[Ice Spirit, 10], [Valkyrie, 9], [Hog Rider, ... ladder 2017-07-12 [1, 0]
2 Supr4 battusai 4267 [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ... Victor @LA PERLA NEGRA 4300 [[Miner, 3], [Ice Golem, 9], [Spear Goblins, 1... ladder 2017-07-12 [0, 1]
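As a possible alternative to the nested loop above, pandas.json_normalize can flatten the nested player dictionaries in one call. The sketch below uses a single shortened, hypothetical match in the same shape as the dataset rows; the `players_` prefix stripping is only there to mimic the column names used in this answer, and column order is not guaranteed to match the table above.

```python
import pandas as pd

# One shortened, hypothetical match in the same shape as the dataset rows.
matches = [{
    'players': {
        'right': {'deck': [['Giant', '9']], 'trophy': '4258',
                  'clan': 'TwoFiveOne', 'name': 'gpa raid'},
        'left': {'deck': [['Fireball', '9']], 'trophy': '4325',
                 'clan': 'battusai', 'name': 'Supr4'},
    },
    'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12',
}]

# json_normalize flattens nested dicts into columns; lists stay as single cells.
wide = pd.json_normalize(matches, sep='_')

# Strip the leading 'players_' so the names match the loop-based version.
wide.columns = [c.replace('players_', '', 1) for c in wide.columns]

print(sorted(wide.columns))
print(wide.loc[0, 'left_name'])
```

Each match still occupies exactly one row, with the decks kept as lists of [card, level] pairs.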
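【Answer 3】: The other answers only work on the toy data shown in the OP. This answer deals with the actual file from Kaggle and how to clean it.
- The Kaggle file, matches.txt, consists of rows of nested dicts.
- Within the file, each row has 4 top-level keys: ['players', 'type', 'result', 'time'].
- Reading in the file makes each row a str.
- Convert each row from str to dict with ast.literal_eval.
- Some rows are not correctly formatted and will raise a SyntaxError.
- The data can then be converted to a DataFrame with pandas.json_normalize.
Imports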
import pandas as pd
from ast import literal_eval
Clean the file
# store the data
data = list()
# store the broken rows
broken_row = list()
# read in the file
with open('matches.txt', 'r', encoding='utf-8') as f:
    # read the rows
    rows = f.readlines()
    for row in rows:
        # try to convert a row from a string to dict
        try:
            row = literal_eval(row)
            data.append(row)
        except SyntaxError:
            # keep the malformed row so it can be inspected later
            broken_row.append(row)
            continue
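The cleaning loop above can also be wrapped in a small helper, sketched here under a few assumptions: `parse_rows` is a hypothetical name, the three sample rows are invented, and the helper additionally catches ValueError, which ast.literal_eval raises for text that parses as Python but is not a plain literal (for example an unquoted word).

```python
from ast import literal_eval


def parse_rows(rows):
    """Split rows into parsed dicts and the raw strings that failed to parse."""
    data, broken = [], []
    for row in rows:
        try:
            data.append(literal_eval(row))
        except (SyntaxError, ValueError):
            # SyntaxError: truncated or unbalanced row
            # ValueError: valid syntax, but not a plain Python literal
            broken.append(row)
    return data, broken


# Invented rows: one good, one truncated, one with an unquoted value.
rows = ["{'type': 'ladder'}\n", "{'type': 'ladder'\n", "{'type': ladder}\n"]
data, broken = parse_rows(rows)
print(len(data), len(broken))
```

Keeping the broken rows, rather than discarding them, makes it possible to check how much of the file was lost and whether the malformed lines can be repaired by hand.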
Convert data to a long DataFrame
For each match, every card in 'players.right.deck' and 'players.left.deck' gets a separate row.
# convert data to a dataframe
df = pd.json_normalize(data)
# add a unique id for each row, which can be used to identify players for a particular game
df['id'] = df.index
# split the list of lists in right.deck and left.deck to separate rows
players = df[['id', 'players.right.deck', 'players.left.deck']].apply(pd.Series.explode).reset_index(drop=True)
# drop the original columns
df.drop(columns=['players.right.deck', 'players.left.deck'], inplace=True)
# right.deck and left.deck are still a list with two values, which need to have separate columns
players[['right.deck.name', 'right.deck.number']] = pd.DataFrame(players.pop('players.right.deck').values.tolist())
players[['left.deck.name', 'left.deck.number']] = pd.DataFrame(players.pop('players.left.deck').values.tolist())
# separate the result column into two columns
df[['right.result', 'left.result']] = pd.DataFrame(df.pop('result').values.tolist())
# merge df with players
df = df.merge(players, on='id')
df.head(8)
type time players.right.trophy players.right.clan players.right.name players.left.trophy players.left.clan players.left.name id right.result left.result right.deck.name right.deck.number left.deck.name left.deck.number
0 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Mega Minion 9 Fireball 9
1 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Electro Wizard 3 Archers 12
2 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Arrows 11 Goblins 12
3 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Lightning 5 Minions 11
4 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Tombstone 9 Bomber 12
5 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 The Log 2 The Log 2
6 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Giant 9 Barbarians 12
7 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Bowler 5 Royal Giant 13
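The explode step above can also be written with DataFrame.explode, which accepts a list of columns in pandas 1.3 and later, provided the lists in each row have matching lengths. This is a minimal sketch on an invented two-card deck, not the code used to produce the table above.

```python
import pandas as pd

# Invented single match with two cards per deck.
df = pd.DataFrame({
    'id': [0],
    'players.right.deck': [[['Giant', '9'], ['Bowler', '5']]],
    'players.left.deck': [[['Fireball', '9'], ['Archers', '12']]],
})

# Explode both deck columns together; 'id' is repeated for every new row.
# Passing a list of columns assumes pandas >= 1.3.
long_df = df.explode(['players.right.deck', 'players.left.deck'], ignore_index=True)

print(long_df.shape)
print(long_df['id'].tolist())
```

Each resulting cell is still a [card, level] pair, so the step that splits the pair into two columns is still needed afterwards.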
Convert data to a wide DataFrame
This option uses a flatten_json function.
For each match, every card in 'players.right.deck' and 'players.left.deck' gets its own separate columns.
# convert data to a wide dataframe
df = pd.DataFrame([flatten_json(x) for x in data])
# display(df.head(3))
players_right_deck_0_0 players_right_deck_0_1 players_right_deck_1_0 players_right_deck_1_1 players_right_deck_2_0 players_right_deck_2_1 players_right_deck_3_0 players_right_deck_3_1 players_right_deck_4_0 players_right_deck_4_1 players_right_deck_5_0 players_right_deck_5_1 players_right_deck_6_0 players_right_deck_6_1 players_right_deck_7_0 players_right_deck_7_1 players_right_trophy players_right_clan players_right_name players_left_deck_0_0 players_left_deck_0_1 players_left_deck_1_0 players_left_deck_1_1 players_left_deck_2_0 players_left_deck_2_1 players_left_deck_3_0 players_left_deck_3_1 players_left_deck_4_0 players_left_deck_4_1 players_left_deck_5_0 players_left_deck_5_1 players_left_deck_6_0 players_left_deck_6_1 players_left_deck_7_0 players_left_deck_7_1 players_left_trophy players_left_clan players_left_name type result_0 result_1 time
0 Mega Minion 9 Electro Wizard 3 Arrows 11 Lightning 5 Tombstone 9 The Log 2 Giant 9 Bowler 5 4258 TwoFiveOne gpa raid Fireball 9 Archers 12 Goblins 12 Minions 11 Bomber 12 The Log 2 Barbarians 12 Royal Giant 13 4325 battusai Supr4 ladder 2 0 2017-07-12
1 Ice Spirit 10 Valkyrie 9 Hog Rider 9 Inferno Tower 9 Goblins 12 Musketeer 9 Zap 12 Fireball 9 4237 The Wolves TITAN Royal Giant 13 Ice Wizard 2 Bomber 12 Knight 12 Fireball 9 Barbarians 12 The Log 2 Archers 12 4296 battusai Supr4 ladder 1 0 2017-07-12
2 Miner 3 Ice Golem 9 Spear Goblins 12 Minion Horde 12 Inferno Tower 8 The Log 2 Skeleton Army 6 Fireball 10 4300 @LA PERLA NEGRA Victor Royal Giant 13 Ice Wizard 2 Bomber 12 Knight 12 Fireball 9 Barbarians 12 The Log 2 Archers 12 4267 battusai Supr4 ladder 0 1 2017-07-12
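The flatten_json function used above is not shown in this answer. The following is a minimal sketch of what such a helper might look like, assuming keys are joined with an underscore and list positions are used as key parts, which is consistent with column names such as players_right_deck_0_0. It is an illustration, not necessarily the function the answer relied on.

```python
def flatten_json(nested, sep='_'):
    """Flatten nested dicts and lists into a single-level dict (illustrative sketch)."""
    out = {}

    def _flatten(value, prefix=''):
        if isinstance(value, dict):
            for key, child in value.items():
                _flatten(child, f'{prefix}{key}{sep}')
        elif isinstance(value, list):
            # list positions become part of the key, e.g. deck_0_1
            for i, child in enumerate(value):
                _flatten(child, f'{prefix}{i}{sep}')
        else:
            # leaf value: drop the trailing separator from the accumulated key
            out[prefix[:-len(sep)]] = value

    _flatten(nested)
    return out


# Shortened, invented match in the same shape as the dataset rows.
match = {'players': {'right': {'deck': [['Giant', '9']], 'name': 'gpa raid'}},
         'result': ['2', '0']}
flat = flatten_json(match)
print(flat)
```

Because every list element becomes its own key, matches with the same deck size produce the same set of columns, which is what makes the wide layout possible.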