Iterate over all columns in a pandas DataFrame to split on a delimiter

Posted 2023-03-12

【Posted】: 2018-11-17 22:56:51
【Question】:

I have a DataFrame that looks like this:

    name   val
0   cat    ['Furry: yes', 'Fast: yes', 'Slimy: no', 'Living: yes']
1   dog    ['Furry: yes', 'Fast: yes', 'Slimy: no', 'Living: yes']
2   snail  ['Furry: no', 'Fast: no', 'Slimy: yes', 'Living: yes']
3   paper  ['Furry: no', 'Fast: no', 'Slimy: no', 'Living: no']

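For each item in the list in the val column, I want to split the item on the ':' delimiter. Then I want item[0] to become the column name and item[1] to become the value for that column, like this: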

    name   Furry  Fast  Slimy  Living
0   cat    yes    yes   no     yes
1   dog    yes    yes   no     yes
2   snail  no     no    yes    yes
3   paper  no     no    no     no

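I tried using apply(pd.Series) on the val column, but that still leaves me with many columns that I would have to split by hand, or I would have to figure out how to iterate over all the columns and split each one. I would prefer to split from scratch and create the column names as I go. Any idea how to achieve this?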

【Answer 1】:

Use apply and split to build a dictionary from each list:

df.val = df.val.apply(lambda x: dict([i.split(': ') for i in x]))

Then use apply and pd.Series to expand the dictionaries into columns:

df.join(df.val.apply(pd.Series)).drop(columns='val')

    name Furry  Fast Slimy Living
0    cat   yes   yes    no    yes
1    dog   yes   yes    no    yes
2  snail    no    no   yes    yes
3  paper    no    no    no     no
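
For reference, the two steps above can be put together into a self-contained sketch. The sample data is rebuilt from the question and assumes the val column holds real Python lists rather than strings; the names `as_dicts` and `result` are introduced here for illustration only. It leaves the original column untouched instead of overwriting it, and uses the keyword form `drop(columns=...)`.

```python
import pandas as pd

# Sample data rebuilt from the question (assumption: val holds lists, not strings).
df = pd.DataFrame({
    "name": ["cat", "dog", "snail", "paper"],
    "val": [
        ["Furry: yes", "Fast: yes", "Slimy: no", "Living: yes"],
        ["Furry: yes", "Fast: yes", "Slimy: no", "Living: yes"],
        ["Furry: no", "Fast: no", "Slimy: yes", "Living: yes"],
        ["Furry: no", "Fast: no", "Slimy: no", "Living: no"],
    ],
})

# Step 1: turn each list of "key: value" strings into a dict.
as_dicts = df["val"].apply(lambda items: dict(item.split(": ") for item in items))

# Step 2: expand each dict into columns, join them back, drop the original column.
result = df.join(as_dicts.apply(pd.Series)).drop(columns="val")
print(result)
```

Note that `dict(...)` requires every split to yield exactly two parts, so any item without the `': '` separator raises a ValueError.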

【Discussion】:

Thanks. But I get this error: ValueError: dictionary update sequence element #5 has length 1; 2 is required. How can I modify the code you provided to get around this error?

【Answer 2】:

pd.DataFrame accepts a list of dictionaries directly. So you can build a DataFrame with a list comprehension and then join it.

L = [dict(i.split(': ') for i in x) for x in df['val']]

df = df[['name']].join(pd.DataFrame(L))

print(df)

    name Fast Furry Living Slimy
0    cat  yes   yes    yes    no
1    dog  yes   yes    yes    no
2  snail   no    no    yes   yes
3  paper   no    no     no    no

【Discussion】:

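@jpp Thanks for the code. Doing this without apply(pd.Series) looks promising, since that approach is not optimal. But I get the following error: ValueError: dictionary update sequence element #5 has length 1; 2 is required. Any idea how to modify the code to get around this error?

@guru, sorry, I'm not sure. You may have to give a minimal data example that demonstrates how you get the error. Otherwise, we would just be guessing.

@jpp Apologies, I inadvertently shared some sample data. But fortunately I figured it out! Some of the lists in that column had a few extra elements, so I removed those elements. Then your code worked perfectly! Thank you!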
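
The error discussed above arises because `dict()` needs every element to be a pair, and an item without the `': '` separator splits into a single piece. As an alternative to removing the extra elements by hand, here is a minimal sketch that skips them while parsing. The helper `parse_items` and the stray `'unexpected'` entry are hypothetical, made up for illustration; the actual extra elements in the asker's data are not shown in the thread.

```python
import pandas as pd

def parse_items(items, sep=": "):
    """Hypothetical helper: build a dict from 'key: value' strings,
    silently skipping any item that does not contain the separator."""
    pairs = {}
    for item in items:
        key, found, value = item.partition(sep)
        if found:  # partition returns an empty separator when sep is absent
            pairs[key] = value
    return pairs

df = pd.DataFrame({
    "name": ["cat", "paper"],
    "val": [
        ["Furry: yes", "Fast: yes", "unexpected"],  # made-up stray element
        ["Furry: no", "Fast: no"],
    ],
})

# Same pattern as Answer 2: list of dicts -> DataFrame -> join on the index.
parsed = pd.DataFrame([parse_items(x) for x in df["val"]], index=df.index)
result = df[["name"]].join(parsed)
print(result)
```

Using `str.partition` also splits only at the first separator, so a value that itself contains `': '` stays intact. Whether silently dropping malformed items is acceptable depends on the data; logging or collecting them may be preferable.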
