How to extract/split columns that are lists in a data frame to separate unique columns?

[Posted]: 2020-05-22 18:40:12

[Question]:

I have a data frame with several columns, like this:

              Age                         G                      GS 
INDEX1  [27, 25, 22, 30, 30]    [76, 79, 80, 76, 77]    [76, 79, 80, 76, 77]    
INDEX2  [24, 23, 21, 32, 34]    [77, 76, 81, 75, 77]    [77, 76, 81, 75, 77]    

How can I split every list into its own separate columns? Ideally, once I am done, my data frame would look like this:

       Age   Age1  Age2   Age3   Age4   G    G1   G2   G3   G4
INDEX1  27    25    22     30     30    76   79   80   76   77  ...
... 

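In case it helps, I built this data frame from a dictionary. I have searched Stack Overflow and tried several similar solutions, but none of them seem to work. The following converts the column correctly, but for some reason it creates two extra NaN columns. If someone knows how to do this across the whole data frame, I can drop the extra NaN columns afterwards: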

df1 = pd.DataFrame(converted['Age'].values.tolist())
df1


    0   1   2   3    4       5   6
0   27  25  22  30  30.0    NaN NaN
1   31  29  33  27  33.0    NaN NaN
2   22  21  26  21  33.0    NaN NaN
3   29  24  31  33  27.0    NaN NaN
4   30  21  31  31  32.0    NaN NaN
... ... ... ... ... ... ... ...
1727    28  27  28  20  26.0    NaN NaN
1728    20  29  27  24  20.0    NaN NaN
1729    30  31  34  25  26.0    NaN NaN
1730    31  26  34  21  21.0    NaN NaN
1731    22  24  20  28  25.0    NaN NaN
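
A possible explanation for the two extra NaN columns, offered as an assumption because the full data is not shown: pd.DataFrame(list_of_lists) creates as many columns as the longest list and pads shorter rows with NaN. A few rows holding seven values would therefore produce columns 5 and 6, and the float values in column 4 (30.0) hint that at least one row holds fewer than five values. The sketch below uses a small made-up frame (the third row is deliberately longer) to show how the list lengths could be checked.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame; the third row is deliberately longer.
converted = pd.DataFrame(
    {
        "Age": [
            [27, 25, 22, 30, 30],
            [24, 23, 21, 32, 34],
            [22, 21, 26, 21, 33, 40, 41],
        ]
    }
)

# Length of the list stored in each cell.
lengths = converted["Age"].map(len)
print(lengths.value_counts())

# Rows whose list does not have the expected five elements.
odd_rows = converted[lengths != 5]
print(odd_rows)

# Expanding pads the shorter lists with NaN up to the longest list.
expanded = pd.DataFrame(converted["Age"].tolist(), index=converted.index)
print(expanded.shape)
```

Once the unusual rows are identified, they can be fixed or truncated before the column is expanded.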

I tried some other solutions, but I get an error on the Age column. It may have something to do with hidden values, but I am not sure.

df2 = pd.DataFrame()

for col in converted.columns:
    # names of new columns
    feature_columns  = [ "{col}_feature1".format(col=col), "{col}_feature2".format(col=col), "{col}_feature3".format(col=col)
                       , "{col}_feature4".format(col=col)
                       , "{col}_feature5".format(col=col)]
    # split current column
    df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
                                                                    feature_columns[1]: s[1],
                                                                    feature_columns[2]: s[2],
                                                                    feature_columns[3]: s[3],
                                                                    feature_columns[4]: s[4] }) )

print (df2)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Age'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-180-53ed0043f9d8> in <module>
      7                        , "{col}_feature5".format(col=col)]
      8     # split current column
----> 9     df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
     10                                                                    feature_columns[1]: s[1],
     11                                                                    feature_columns[2]: s[2],

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    377             except ValueError:
    378                 raise KeyError(key)
--> 379         return super().get_loc(key, method=method, tolerance=tolerance)
    380 
    381     @Appender(_index_shared_docs["get_indexer"])

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Age'
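
A note on this traceback: it passes through pandas/core/indexes/range.py, which suggests that the columns of df are a RangeIndex (integer labels 0, 1, 2, ...), so the label 'Age' cannot be found there. The loop iterates over converted.columns but indexes df. Assuming converted is the frame that actually holds the list columns, indexing converted instead should avoid the KeyError. The sketch below uses a small made-up converted frame, keeps the {col}_featureN naming from the question, and assumes every list holds at least five values.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
converted = pd.DataFrame(
    {
        "Age": [[27, 25, 22, 30, 30], [24, 23, 21, 32, 34]],
        "G": [[76, 79, 80, 76, 77], [77, 76, 81, 75, 77]],
    },
    index=["INDEX1", "INDEX2"],
)

parts = []
for col in converted.columns:
    # Names of the five new columns for this source column.
    feature_columns = ["{col}_feature{n}".format(col=col, n=n) for n in range(1, 6)]
    # Index `converted` (not `df`) and keep the first five items of each list.
    part = pd.DataFrame(
        [row[:5] for row in converted[col]],
        index=converted.index,
        columns=feature_columns,
    )
    parts.append(part)

# Put the per-column pieces side by side.
df2 = pd.concat(parts, axis=1)
print(df2)
```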

Edit: I tried the solution listed here: Pandas split column of lists into multiple columns

It did not work for me. Thanks for any suggestions!

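[Comments]:

@G.Anderson Thanks for the reply and the welcome! I tried explode, but it puts each value on a separate row. That is actually where I started; I am trying to lay my data out horizontally. Is it possible to explode into new columns?

Sorry, I misunderstood the request.

[Answer 1]: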

Use:

new_df = pd.concat([pd.DataFrame(col.tolist(), index = df.index).add_prefix(i) 
                    for i, col in df.items()], axis = 1)
print(new_df)
        Age0  Age1  Age2  Age3  Age4  G0  G1  G2  G3  G4  GS0  GS1  GS2  GS3  \
INDEX1    27    25    22    30    30  76  79  80  76  77   76   79   80   76   
INDEX2    24    23    21    32    34  77  76  81  75  77   77   76   81   75   

        GS4  
INDEX1   77  
INDEX2   77  

It is better to set the index only once:

new_df = pd.concat([pd.DataFrame(col.tolist()).add_prefix(i) 
                    for i, col in df.items()], axis = 1)
new_df.index = df.index
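
For reference, here is a self-contained sketch of the same approach, using the sample rows from the question as made-up input. The final rename step is an optional addition, not part of the answer above: it turns Age0, G0 and GS0 into Age, G and GS so the columns match the layout the question asked for.

```python
import pandas as pd

# Sample data copied from the question; the frame name `df` follows the answer.
df = pd.DataFrame(
    {
        "Age": [[27, 25, 22, 30, 30], [24, 23, 21, 32, 34]],
        "G": [[76, 79, 80, 76, 77], [77, 76, 81, 75, 77]],
        "GS": [[76, 79, 80, 76, 77], [77, 76, 81, 75, 77]],
    },
    index=["INDEX1", "INDEX2"],
)

# Expand each list column into numbered columns, then join them side by side.
new_df = pd.concat(
    [pd.DataFrame(col.tolist()).add_prefix(name) for name, col in df.items()],
    axis=1,
)
new_df.index = df.index

# Optional: rename "Age0" -> "Age", "G0" -> "G", "GS0" -> "GS".
new_df = new_df.rename(columns={name + "0": name for name in df.columns})
print(new_df.columns.tolist())
```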

[Comments]:

This is exactly what I was looking for. I can't believe it was that simple! Thank you so much!
