Create DataFrame from list of dicts - where values are lists themselves [duplicate]

Posted 2023-03-12

[Title]: Create DataFrame from list of Dicts - Where values are lists themselves [duplicate] [Posted]: 2018-07-27 20:07:00 [Question]:

Hello, I would like to create a DataFrame from a list of dicts whose items are lists. When the items are scalars (see test below), the call to pd.DataFrame works as expected:

test = [{'points': 40, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points': 90, 'time': '9:00', 'month': 'january'},
{'points_h1': 20, 'month': 'june'}]

pd.DataFrame(test)

    month    points  points_h1  time    year
0   NaN      40.0    NaN        5:00    2010.0
1   february 25.0    NaN        6:00    NaN
2   january  90.0    NaN        9:00    NaN
3   june      NaN    20.0        NaN    NaN

However, if the items are themselves lists, I get what seems to be an unexpected result:

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
{'points': [25], 'time': ['6:00'], 'month': ["february"]},
{'points': [90], 'time': ['9:00'], 'month': ['january']},
{'points_h1': [20], 'month': ['june']}]

pd.DataFrame(test)

        month      points   points_h1          time            year
   0    NaN      [40, 50]   NaN         [5:00, 4:00]    [2010, 2011]
   1    february       25   NaN                 6:00             NaN
   2    january        90   NaN                 9:00             NaN
   3    june          NaN   20.0                 NaN             NaN

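To work around this I use pd.concat([pd.DataFrame(z) for z in test]), but that is relatively slow: it builds a new DataFrame for every element of the list, which carries a lot of overhead. Am I missing something?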

[Comments]:

@Idlehands, I think you are right... it does look like a duplicate, thanks!

[Answer 1]:

While this may be possible within pandas itself, it seems less difficult in plain Python, at least if you have the raw data.

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]}, {'points': [25], 'time': ['6:00'], 'month': ["february"]}, {'points': [90], 'time': ['9:00'], 'month': ['january']}, {'points_h1': [20], 'month': ['june']}]

newtest = []
for t in test:
    # zip(*t.values()) yields one tuple per list position; pair each tuple with the keys
    newtest.extend([{k: v for (k, v) in zip(t.keys(), values)} for values in zip(*t.values())])

df = pd.DataFrame(newtest)
print (df)

Result:

      month  points  points_h1  time    year
0       NaN    40.0        NaN  5:00  2010.0
1       NaN    50.0        NaN  4:00  2011.0
2  february    25.0        NaN  6:00     NaN
3   january    90.0        NaN  9:00     NaN
4      june     NaN       20.0   NaN     NaN
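
One caveat with the loop above: `zip(*t.values())` stops at the shortest list, so a record whose lists differ in length would silently lose values. Below is a minimal sketch of a variant that pads instead, assuming missing entries should become `None` (which pandas treats as a missing value). The function name `expand_records` is hypothetical and not part of the original answer.

```python
from itertools import zip_longest


def expand_records(records):
    """Turn dicts of lists into one dict of scalars per list position."""
    rows = []
    for record in records:
        keys = list(record.keys())
        # zip_longest pads shorter lists with None instead of truncating
        for values in zip_longest(*record.values(), fillvalue=None):
            rows.append(dict(zip(keys, values)))
    return rows


test = [
    {'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
    {'points': [25], 'time': ['6:00'], 'month': ['february']},
    {'points': [90], 'time': ['9:00'], 'month': ['january']},
    {'points_h1': [20], 'month': ['june']},
]

rows = expand_records(test)
print(len(rows))  # 5
print(rows[1])    # {'points': 50, 'time': '4:00', 'year': 2011}
```

The resulting `rows` list can be handed to `pd.DataFrame` in the same way as `newtest`.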

[Comments]:

[Answer 2]:

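With pandas there are several ways to wrangle the data, but as you have found, they get quite heavy. My suggestion is to pad your data before passing it to pandas: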

import pandas as pd

test = [{'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
 {'month': ['february'], 'points': [25], 'time': ['6:00']},
 {'month': ['january'], 'points': [90], 'time': ['9:00']},
 {'month': ['june'], 'points_h1': [20]}]

def pad_data(data):

    # Set a dictionary with all the keys
    result = {k: [] for i in data for k in i.keys()}

    for i in data:

        # Determine the longest value as padding for NaNs
        pad = max([len(j) for j in i.values()])

        # Create padding dictionary and update current
        # (float('nan') stands in for pd.np.nan, which was deprecated and later removed from pandas)
        padded = {key: [float('nan')]*pad for key in result.keys() if key not in i.keys()}
        i.update(padded)

        # Finally extend to result dictionary
        for key, val in i.items():
            result[key].extend(val)

    return result

# Padded data looks like this:
#
# {'month': [nan, nan, 'february', 'january', 'june'],
#  'points': [40, 50, 25, 90, nan],
#  'points_h1': [nan, nan, nan, nan, 20],
#  'time': ['5:00', '4:00', '6:00', '9:00', nan],
#  'year': [2010, 2011, nan, nan, nan]}

df = pd.DataFrame(pad_data(test), dtype='O')
print(df)

#       month points points_h1  time  year
# 0       NaN     40       NaN  5:00  2010
# 1       NaN     50       NaN  4:00  2011
# 2  february     25       NaN  6:00   NaN
# 3   january     90       NaN  9:00   NaN
# 4      june    NaN        20   NaN   NaN
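
Note that `pad_data` calls `i.update(padded)`, which modifies the dictionaries in the caller's list. Below is a sketch of a variant that leaves the input untouched, assuming `float('nan')` as the filler value. The name `pad_columns` and its `fill` parameter are hypothetical and not part of the original answer.

```python
def pad_columns(records, fill=float('nan')):
    """Build one column-oriented dict, padding absent keys with `fill`."""
    # Collect every key once, preserving first-seen order
    columns = {key: [] for record in records for key in record}

    for record in records:
        # Number of rows this record contributes
        height = max((len(values) for values in record.values()), default=0)
        for key, column in columns.items():
            values = list(record.get(key, []))
            # Pad missing or short lists so every column grows by `height`
            column.extend(values + [fill] * (height - len(values)))

    return columns


test = [
    {'points': [40, 50], 'time': ['5:00', '4:00'], 'year': [2010, 2011]},
    {'month': ['february'], 'points': [25], 'time': ['6:00']},
    {'month': ['january'], 'points': [90], 'time': ['9:00']},
    {'month': ['june'], 'points_h1': [20]},
]

columns = pad_columns(test)
print(list(columns))      # ['points', 'time', 'year', 'month', 'points_h1']
print(columns['points'])  # [40, 50, 25, 90, nan]
```

Because every column ends up the same length, `columns` can be passed to `pd.DataFrame` directly, and `test` can be reused afterwards since it was not modified.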

[Comments]: