Create DataFrame from list of dicts - where values are lists themselves [duplicate]
【Posted】: 2018-07-27 20:07:00
【Question】: Hi, I want to create a DataFrame from a list of dicts whose items are lists. When the items are scalars (see test below), the call to pd.DataFrame works as expected:
test = [{'points': 40, 'time': '5:00', 'year': 2010},
        {'points': 25, 'time': '6:00', 'month': "february"},
        {'points': 90, 'time': '9:00', 'month': 'january'},
        {'points_h1': 20, 'month': 'june'}]
pd.DataFrame(test)
      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1  february    25.0        NaN  6:00     NaN
2   january    90.0        NaN  9:00     NaN
3      june     NaN       20.0   NaN     NaN
However, if the items are themselves lists, I get a result that seems unexpected:
test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ["february"]},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]
pd.DataFrame(test)
      month    points  points_h1          time          year
0       NaN  [40, 50]        NaN  [5:00, 4:00]  [2010, 2011]
1  february        25        NaN          6:00           NaN
2   january        90        NaN          9:00           NaN
3      june       NaN       20.0           NaN           NaN
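To work around this, I use pd.concat([pd.DataFrame(z) for z in test]), but this is relatively slow: a new DataFrame has to be created for every element of the list, which carries a lot of overhead. Am I missing something?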
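For reference, the workaround described above can be run as a self-contained snippet. This is a minimal sketch that assumes the list-valued test data shown earlier; ignore_index=True is an addition that is not in the original call, used here only so the stacked rows get a clean 0..n-1 index.

```python
import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ['february']},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]

# Build one small DataFrame per dict (each list becomes a column),
# then stack them; keys missing from a dict end up as NaN.
# ignore_index=True is an assumption added here to renumber the rows.
df = pd.concat([pd.DataFrame(z) for z in test], ignore_index=True)
print(df)
```

Each pd.DataFrame(z) call is a separate constructor invocation, which is the per-element overhead the question refers to.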
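【Comments】:
@Idlehands, I think you are right... this does appear to be a duplicate, thanks!
【Answer 1】: Although this can be done within pandas itself, it seems less difficult to do in plain Python, at least if you have the raw data.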
import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ["february"]},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]

newtest = []
for t in test:
    # zip(*t.values()) regroups the per-key lists into one tuple per row;
    # each tuple is then paired with the keys to build one scalar dict per row
    newtest.extend([{k: v for (k, v) in zip(t.keys(), values)}
                    for values in zip(*t.values())])

df = pd.DataFrame(newtest)
print(df)
Result:
      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1       NaN    50.0        NaN  4:00  2011.0
2  february    25.0        NaN  6:00     NaN
3   january    90.0        NaN  9:00     NaN
4      june     NaN       20.0   NaN     NaN
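The key step in this approach is the transposition done by zip(*t.values()). A small walkthrough on a single dict may make it clearer; the variable names transposed and rows are illustrative only and do not appear in the answer.

```python
# One list-valued dict, as in the first element of the question's data.
t = {'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}

# Unpacking the value lists into zip() groups the i-th items together,
# giving one tuple per future row.
transposed = list(zip(*t.values()))
print(transposed)
# [(40, '5:00', 2010), (50, '4:00', 2011)]

# Pairing each tuple with the keys rebuilds one scalar-valued dict per row.
rows = [dict(zip(t.keys(), values)) for values in transposed]
print(rows)
# [{'points': 40, 'time': '5:00', 'year': 2010},
#  {'points': 50, 'time': '4:00', 'year': 2011}]
```

Note that zip() stops at the shortest input, so if the lists inside one dict had different lengths, the extra items would be dropped silently.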
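【Answer 2】: There are several ways to get this data in using pandas, but as you have found, it gets quite heavy. My suggestion is to pad your data before passing it to pandas: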
import numpy as np  # pd.np was removed in pandas 2.0; use numpy directly
import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'month': ['february'], 'points': [25], 'time': ['6:00']},
        {'month': ['january'], 'points': [90], 'time': ['9:00']},
        {'month': ['june'], 'points_h1': [20]}]

def pad_data(data):
    # Set a dictionary with all the keys
    result = {k: [] for i in data for k in i.keys()}
    for i in data:
        # Determine the longest value as padding for NaNs
        pad = max([len(j) for j in i.values()])
        # Create padding dictionary and update current
        # (note: this mutates the input dicts in place)
        padded = {key: [np.nan] * pad for key in result.keys() if key not in i.keys()}
        i.update(padded)
        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)
    return result

# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
#  'points': [40, 50, 25, 90, nan],
#  'points_h1': [nan, nan, nan, nan, 20],
#  'time': ['5:00', '4:00', '6:00', '9:00', nan],
#  'year': [2010, 2011, nan, nan, nan]}

df = pd.DataFrame(pad_data(test), dtype='O')
print(df)
# (column order may differ depending on the pandas/Python version)
#       month points points_h1  time  year
# 0       NaN     40       NaN  5:00  2010
# 1       NaN     50       NaN  4:00  2011
# 2  february     25       NaN  6:00   NaN
# 3   january     90       NaN  9:00   NaN
# 4      june    NaN        20   NaN   NaN
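Since the question is about speed, it may be worth timing the per-dict concat workaround against a pre-flattening approach on your own data. The following is a rough sketch only: via_concat, via_flatten, and big are hypothetical names introduced for illustration, the input is just the sample data repeated, and actual timings will depend on the pandas version and the shape of the real data.

```python
import timeit

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
        {'points': [25], 'time': ['6:00'], 'month': ['february']},
        {'points': [90], 'time': ['9:00'], 'month': ['january']},
        {'points_h1': [20], 'month': ['june']}]

def via_concat(data):
    # One DataFrame per dict, then a single concat (the question's workaround).
    return pd.concat([pd.DataFrame(d) for d in data], ignore_index=True)

def via_flatten(data):
    # Flatten to one scalar-valued dict per row, then a single constructor call.
    rows = [dict(zip(d.keys(), values))
            for d in data
            for values in zip(*d.values())]
    return pd.DataFrame(rows)

# Hypothetical larger input: the sample data repeated 50 times (200 dicts).
big = test * 50

t_concat = timeit.timeit(lambda: via_concat(big), number=3)
t_flatten = timeit.timeit(lambda: via_flatten(big), number=3)
print(f"concat:  {t_concat:.4f} s for 3 runs")
print(f"flatten: {t_flatten:.4f} s for 3 runs")
```

Both functions should produce the same rows for this input, so the comparison isolates the cost of building many small DataFrames versus one.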