Faster way to flatten list in Pandas Dataframe

Posted 2023-02-23


【Title】Faster way to flatten list in Pandas Dataframe 【Asked】2019-12-01 22:58:13 【Question】:

I have the dataframe below:

import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})

The result I want is:

df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat', 'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})

Is there a simple way to achieve this without using a for loop that walks every row and every sub-list, as I do here:

result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",",'').replace("'",'').replace('[','').replace(']','')
        flattened = pandas.DataFrame({'flattened_term':[z]})
        # Note: DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0
        result = result.append(flattened)

print(result)

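Thanks.

【Question comments】:

For starters, never append dataframes in a loop. Accumulate your results in a list and concatenate them at the end.

I'd like to ask: how did you get the first dataframe in the first place? If you have lists in your dataframe, you probably shouldn't be using a dataframe at that point.

The DataFrame is supplied from source in that structure.

What do you mean by "from source"?

@juanpa.arrivillaga you should tell him why he shouldn't append in a loop.

【Answer 1】: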

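There is certainly no way to avoid a loop here, at least not an implicit one. Pandas was not designed to hold list objects as elements; it handles numeric data wonderfully and strings quite well. In any case, your fundamental problem is that you are calling pd.DataFrame.append inside a loop, which is a quadratic-time algorithm (the entire dataframe is re-created on every iteration). You can probably get away with the following, which should be considerably faster: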

>>> df
                                               terms
0  [[the, boy, and, the goat], [a, girl, and, the...
1  [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
                          0
0      the boy and the goat
1        a girl and the cat
2     fish boy with the dog
3  when girl find the mouse
4        if dog see the cat
>>>
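For readers who want a standalone script, here is a minimal sketch of the same idea. It collects the joined strings in a plain Python list and builds the DataFrame once, which is what the comments recommend instead of appending in a loop. The variable names (flat, by_comprehension, by_explode) are illustrative and not from the thread. The second variant, based on Series.explode and Series.str.join, is an alternative that was not proposed in the answer; it assumes pandas 0.25 or later and is not necessarily faster than the comprehension.

```python
import pandas as pd

# Same data as in the question: each cell holds a list of word lists.
df = pd.DataFrame({
    "terms": [
        [["the", "boy", "and", "the goat"], ["a", "girl", "and", "the cat"]],
        [["fish", "boy", "with", "the dog"],
         ["when", "girl", "find", "the mouse"],
         ["if", "dog", "see", "the cat"]],
    ]
})

# Variant 1: flatten into a plain list first, then create the DataFrame once.
# This avoids re-creating the DataFrame on every iteration.
flat = [" ".join(words) for phrases in df["terms"] for words in phrases]
by_comprehension = pd.DataFrame({"terms": flat})

# Variant 2 (assumed alternative, not from the thread): explode the outer
# lists so each inner word list gets its own row, then join the words.
by_explode = (
    df["terms"]
    .explode()                 # one inner list per row, original index repeated
    .str.join(" ")             # join the words of each inner list
    .reset_index(drop=True)    # replace the repeated index with 0..n-1
    .to_frame()                # back to a one-column DataFrame named "terms"
)

print(by_comprehension)
```

Both variants still loop over every inner list; the gain over the original code comes from constructing the DataFrame a single time.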
