Improve Row Append Performance On Pandas DataFrames

Posted 2023-02-23

【Title】Improve Row Append Performance On Pandas DataFrames 【Posted】2015-03-11 20:11:55 【Question】:

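I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this: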

data = {"SomeCity": {"Date1": [record1, record2, record3, ...], "Date2": [...], ...}, ...}

In total there are a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            # each append returns a new, fully copied DataFrame
            df = df.append(recSeries, ignore_index=True)

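However, this runs painfully slowly. Before I look for a way to parallelize it, I just want to make sure I am not missing something obvious, since I am still quite new to Pandas.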

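【Comments】:

Have you looked at from_dict?

Appending rows to DataFrames is inherently inefficient. Try to create the whole DataFrame at its final size in one go. As EdChum said, in this case you can use from_dict to do that.

Thanks! I will give it a try and see how it goes.

【Answer 1】: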

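I was also using the DataFrame's append function inside a loop and was baffled by how slowly it ran.

Based on the correct answer on this page, here is a helpful example for anyone suffering through the same thing.

Python version: 3

pandas version: 0.20.3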

from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame
d = {}

# a counter used as the key for each entry added to "d"
i = 0

# example data to loop over and add to the DataFrame
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the DataFrame using 'from_dict'
# important: set the 'orient' parameter to "index" so that the keys become rows
df = DataFrame.from_dict(d, "index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
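To connect this back to the question, here is a minimal sketch of the same from_dict idea applied to a nested city/date/record structure. The sample values and the assumption that each date maps to a list of record dictionaries are illustrative only; the question does not show the real data.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data:
# city -> date -> list of record dicts (the list is an assumption).
data = {
    "SomeCity": {
        "Date1": [
            {"Timestamp": "2015-03-01", "Id": 1, "Price": 100.0},
            {"Timestamp": "2015-03-01", "Id": 2, "Price": 250.0},
        ],
        "Date2": [
            {"Timestamp": "2015-03-02", "Id": 1, "Price": 110.0},
        ],
    }
}

# Collect plain dicts first; no DataFrame is touched inside the loops.
rows = {}
for city, runs in data.items():
    for date_run, records in runs.items():
        for record in records:
            rows[len(rows)] = {
                "Date": record["Timestamp"],
                "HouseID": record["Id"],
                "Price": record["Price"],
            }

# Build the DataFrame once; orient="index" turns each key into a row label.
df = pd.DataFrame.from_dict(rows, orient="index")
print(df)
```

The only per-record work left in the loop is a dictionary insert, so the cost of constructing the DataFrame is paid a single time at the end.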

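【Comments】:

This example is definitely helpful!

This is certainly a fast approach, but since Python's default dictionaries are not ordered, the data in Excel may end up randomly shuffled. I strongly recommend using OrderedDict from the collections module.

This is really fast. An operation that used to take about 20 seconds now finishes in a few milliseconds. Thanks a lot :)

Great tip. Very useful. For my use case, this approach took me from 45+ minutes down to under 5 minutes.

I went from almost 2 hours to under 5 seconds xD Thanks!

【Answer 2】: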

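Appending rows to a list is far more efficient than appending them to a DataFrame. So you will want to:

1. append the rows to a list;
2. then convert the list into a DataFrame.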
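A minimal sketch of the list-then-convert idea, assuming each row is available as a small dictionary. The sample records and column names are illustrative and borrowed from the question, not taken from this answer.

```python
import pandas as pd

# Hypothetical records standing in for whatever the loop produces.
records = [
    {"Timestamp": "2015-03-01", "Id": 1, "Price": 100.0},
    {"Timestamp": "2015-03-02", "Id": 2, "Price": 250.0},
]

# First, append each row to a plain Python list (cheap, amortized O(1)).
my_list = []
for record in records:
    my_list.append([record["Timestamp"], record["Id"], record["Price"]])

# Then convert the whole list to a DataFrame in a single call.
df = pd.DataFrame(my_list, columns=["Date", "HouseID", "Price"])
print(df)
```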

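【Comments】:

Great and simple solution! For everyone looking for how to implement step 2: simply do df = pd.DataFrame(my_list, columns=['col1', 'col2']).

Really a nice, simple solution. Hope you get rewarded on the ArbaeenWalk.

【Answer 3】:

I think the best approach, if you know the data you are going to receive, is to allocate in advance.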

import numpy as np
import pandas as pd

random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)

df = pd.DataFrame(columns=range(100), index=range(200))
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix

This is the pattern I think makes the most sense. append will be faster if your DataFrame is very small, but it does not scale.

In [1]: import numpy as np; import pandas as pd

In [2]: random_matrix = np.random.randn(100, 100)
   ...: insert_df = pd.DataFrame(random_matrix)
   ...: df = pd.DataFrame(np.random.randn(100, 100))

In [3]: %timeit df.append(insert_df)
272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When we run this with a 100,000-row DataFrame, the difference becomes much more pronounced.

In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))

In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

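So we can see that append is about 17 times slower than inserting from a DataFrame (17.9 ms vs. 1.02 ms), and roughly 38 times slower than inserting from a numpy array (17.9 ms vs. 465 µs).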
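As a variation on the same preallocation idea (not the exact code from this answer), the buffer can be a plain NumPy array that is wrapped in a DataFrame only once at the end. The sizes and the random stand-in data below are assumptions for illustration; a side benefit of this variant is that the buffer stays a float array throughout.

```python
import numpy as np
import pandas as pd

n_chunks, chunk_rows, n_cols = 2, 100, 100  # illustrative sizes

# Allocate the full block up front, since the final shape is known.
buffer = np.empty((n_chunks * chunk_rows, n_cols))

rng = np.random.default_rng(0)
for k in range(n_chunks):
    # Stand-in for a chunk of incoming data.
    chunk = rng.standard_normal((chunk_rows, n_cols))
    start = k * chunk_rows
    # In-place write into the preallocated block; nothing is reallocated.
    buffer[start:start + chunk_rows, :] = chunk

# Wrap the filled array in a DataFrame once, at the end.
df = pd.DataFrame(buffer)
print(df.shape)
```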

【Comments】:

【Answer 4】:

Another approach is to collect the pieces in a list and then use pd.concat:

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    for i in range(1001):
        df_list.append(df)

    return pd.concat(df_list)


# some testing
df2 = concat(df)
df3 = append(df)

pd.testing.assert_frame_equal(df2,df3)

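%timeit concat(df)

20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit append(df)

275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is now the recommended way to concatenate rows in pandas:

"Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once."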
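Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the only one of the two options that still runs. A minimal sketch of "collect in a list, then concatenate once with the original DataFrame" follows; the incoming rows and the "name" key are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])

# Hypothetical new rows arriving one at a time.
incoming = [
    {"name": "spider", "num_legs": 8, "num_wings": 0},
    {"name": "fish", "num_legs": 0, "num_wings": 0},
]

# Accumulate the rows in a plain list instead of growing the DataFrame.
new_rows = []
for item in incoming:
    new_rows.append(item)

# Build one DataFrame from the list, then concatenate a single time.
new_df = pd.DataFrame(new_rows).set_index("name")
new_df.index.name = None
result = pd.concat([df, new_df])
print(result)
```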

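【Comments】:

【Answer 5】:

I had a similar problem where I had to append to a DataFrame many times but did not know the values before appending. I wrote a lightweight DataFrame-like data structure that is just blists() under the hood. I use it to accumulate all the data and then convert the output to a Pandas DataFrame when I am done. Here is a link to my project, all open source, so I hope it helps others: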

https://pypi.python.org/pypi/raccoon
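The general "accumulate first, convert at the end" idea can be sketched with plain Python lists. The RowBuffer class below is a hypothetical helper written for this illustration; it is not raccoon's API and does not use blist.

```python
import pandas as pd


class RowBuffer:
    """Hypothetical helper (not part of raccoon or pandas):
    collects rows column by column in plain Python lists."""

    def __init__(self, columns):
        self.columns = list(columns)
        self._data = {name: [] for name in self.columns}

    def append(self, row):
        # `row` is a dict; keys missing from it are stored as None.
        for name in self.columns:
            self._data[name].append(row.get(name))

    def to_dataframe(self):
        # Convert to a pandas DataFrame only once, when accumulation is done.
        return pd.DataFrame(self._data, columns=self.columns)


buf = RowBuffer(["Date", "HouseID", "Price"])
buf.append({"Date": "2015-03-01", "HouseID": 1, "Price": 100.0})
buf.append({"Date": "2015-03-02", "HouseID": 2})  # Price not known yet
df = buf.to_dataframe()
print(df)
```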

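【Comments】:

Nice library - adding it to my core MVP.

【Answer 6】:

In my case, I was loading a large number of DataFrames with the same columns from different files and wanted to append them to create one large DataFrame.

My solution was to first load all the DataFrames into a list and then concatenate them in one call: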

import pandas as pd

all_dfs = []
for path in all_files:
    df = ...  # load the DataFrame from this file (loader not specified in the original)
    all_dfs.append(df)

master_df = pd.concat(all_dfs, ignore_index=True)
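For a concrete, runnable version, here is a sketch that assumes the files are CSVs with identical columns. The file format, the file names, and the sample values are all assumptions; the answer does not say how the files are loaded.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Write two small CSV files with identical columns (sample data only).
    pd.DataFrame({"HouseID": [1, 2], "Price": [100.0, 250.0]}).to_csv(
        tmp_path / "part_0.csv", index=False)
    pd.DataFrame({"HouseID": [3], "Price": [175.0]}).to_csv(
        tmp_path / "part_1.csv", index=False)

    # Load every file into a list first; sorting keeps the order deterministic.
    all_files = sorted(tmp_path.glob("*.csv"))
    all_dfs = [pd.read_csv(path) for path in all_files]

# Concatenate once; ignore_index=True renumbers the rows 0..n-1.
master_df = pd.concat(all_dfs, ignore_index=True)
print(master_df)
```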

【Comments】:

This worked very well for me.
