Splitting a column whose elements are dictionaries into many columns [duplicate]

Posted 2023-03-11

【Title】: Splitting a column whose elements are dictionaries into many columns [duplicate] 【Posted】: 2015-01-01 11:02:55 【Question】:

I have a pandas DataFrame with a single column whose elements are dictionaries. It is the result of the following code:

dg # is a pandas dataframe with columns ID and VALUE. Many rows contain the same ID

def seriesFeatures(series):
    """This functions receives a series of VALUE for the same ID and extracts
    tens of complex features from the series, storing them into a dictionary"""
    dico = dict()
    # calculateFeatureN stands for the real feature computations (not shown)
    dico['feature1'] = calculateFeature1(series)
    dico['feature2'] = calculateFeature2(series)
    # Many more features
    dico['feature50'] = calculateFeature50(series)
    return dico

grouped = dg.groupby(['ID'])
# Note: newer pandas releases reject this dict-renaming form of agg on a single column
dh = grouped['VALUE'].agg( {'all_features' : lambda s: seriesFeatures(s)} )
dh.reset_index()
# Here I get a dh DataFrame of a single column 'all_features' and
# dictionaries stored on its values. The keys are the feature's names

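I need to split this 'all_features' column into as many columns as there are features, and to do it efficiently (I have too many rows and columns, and I cannot change the seriesFeatures function). The output should be a DataFrame with columns ID, FEATURE1, FEATURE2, FEATURE3, ..., FEATURE50. What is the best way to do this?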

Edit

A concrete, simple example:

import pandas as pd

dg = pd.DataFrame( [ [1,10] , [1,15] , [1,13] , [2,14] , [2,16] ] , columns=['ID','VALUE'] )

def seriesFeatures(series):
    dico = dict()
    dico['feature1'] = len(series)
    dico['feature2'] = series.sum()
    return dico

grouped = dg.groupby(['ID'])
dh = grouped['VALUE'].agg( {'all_features' : lambda s: seriesFeatures(s)} )
dh.reset_index()

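But when I try to wrap it in pd.Series or pd.DataFrame, it says that an index must be provided when the data are all scalar values. If I provide index=['feature1','feature2'], I get strange results, for example with: dh = grouped['VALUE'].agg( {'all_features' : lambda s: pd.DataFrame( seriesFeatures(s) , index=['feature1','feature2'] )} )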
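The error and the "strange results" described above can be reproduced outside of groupby. The following is a minimal sketch, assuming a hand-written dictionary with the two feature values from the first group of the example. It suggests that pd.DataFrame is the constructor that needs an index for all-scalar data, that supplying one repeats each scalar on every row, and that pd.Series simply uses the dictionary keys as its index.

```python
import pandas as pd

# Hand-written stand-in for the dictionary returned by seriesFeatures (assumption)
features = {"feature1": 3, "feature2": 38}

# A DataFrame built from scalars only has no way to infer its row index
try:
    pd.DataFrame(features)
    raised = False
except ValueError:
    raised = True

# Supplying an index repeats every scalar once per index label,
# which gives a 2x2 frame instead of one row of features
broadcast = pd.DataFrame(features, index=["feature1", "feature2"])

# A Series takes the dictionary keys as its index and needs nothing else
as_series = pd.Series(features)

print(raised)
print(broadcast)
print(as_series)
```

Under these assumptions, the Series form is the one that maps one dictionary to one set of labelled values.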

【Comments】:

Thanks for the example! I've updated my answer. 【Answer 1】:

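I think you should wrap the dictionary in a Series, which will then be expanded within the groupby call (but use apply instead of agg, since the result is no longer an aggregated (scalar) value):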

dh = grouped['VALUE'].apply(lambda s: pd.Series(seriesFeatures(s)))

After that, you can reshape the result into the desired format.

For your simple example, this seems to work:

In [22]: dh = grouped['VALUE'].apply(lambda x: pd.Series(seriesFeatures(x)))
In [23]: dh

Out[23]:
ID
1   feature1     3
    feature2    38
2   feature1     2
    feature2    30
dtype: int64

In [26]: dh.unstack().reset_index()
Out[26]:
   ID  feature1  feature2
0   1         3        38
1   2         2        30

【Comments】:

Thanks. I didn't know about unstack; this looks like a nice solution.