Improve Pandas Merge performance

Posted 2023-02-16


Originally published: 2017-04-13 03:05:39

Question:

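I don't specifically have a performance issue with Pandas merge itself, as other posts suggest, but I have a class with a lot of methods that perform many merges on datasets.

The class has around 10 groupby calls and around 15 merges. While groupby is pretty fast, out of a total execution time of 1.5 seconds for the class, around 0.7 seconds goes into those 15 merge calls.

I want to speed up those merge calls. Since I will run around 4000 iterations, saving 0.5 seconds in a single iteration would cut the total run time by around 30 minutes (4000 × 0.5 s = 2000 s), which would be great.

Any suggestions on what I should try? So far I have tried Cython and Numba, and Numba was slower.

Thanks

Edit 1: Adding sample code snippets. My merge statements: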

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')

And, implemented with joins, the same step becomes the following statements:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')

tmpDf.reset_index(inplace=True)
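As a side note on the chained joins above: `DataFrame.join` also accepts a list of DataFrames, so the chain can be written as a single call. The following is only a minimal sketch on made-up data. The frames `base`, `t1` and `t2` and their columns are hypothetical stand-ins, and it assumes the joined tables have unique index values and non-overlapping column names. Whether it is faster than the chain would need to be measured on the real data.

```python
import pandas as pd

# Hypothetical stand-ins for self.data, t1, t2 from the question.
base = pd.DataFrame({"APPT_NBR": [1, 2, 3], "a": [10, 20, 30]}).set_index("APPT_NBR")
t1 = pd.DataFrame({"APPT_NBR": [1, 2], "b": [1.0, 2.0]}).set_index("APPT_NBR")
t2 = pd.DataFrame({"APPT_NBR": [2, 3], "c": ["x", "y"]}).set_index("APPT_NBR")

# One left join against several indexed frames at once.
joined = base.join([t1, t2], how="left").reset_index()
print(joined)
```

Rows of `base` without a match in `t1` or `t2` get missing values in the corresponding columns, as with the chained version.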

Note that all of these are part of a function named: def merge_earlier_created_values(self):

And when I did a timed call via profilehooks with:

@timedcall(immediate=True)
def merge_earlier_created_values(self):

I get the following results:

Profiling the method with:

@profile(immediate=True)
def merge_earlier_created_values(self):

gives the following profile for the function using merge:

*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon     Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times

     71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
    7    0.283    0.040    0.283    0.040 built-in method gc.collect
   15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
   10    0.000    0.000    0.153    0.015 merge.py:26(merge)
   10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
  8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
    4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
    1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
  120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
  266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
  120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
  120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
   10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
   25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
   74    0.023    0.000    0.023    0.000 pandas.algos.take_2d_axis1_object_object
   50    0.022    0.000    0.022    0.000 method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects
  120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
   88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
    1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
  537    0.017    0.000    0.017    0.000 built-in method numpy.core.multiarray.empty
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
   24    0.000    0.000    0.014    0.001 internals.py:3625(take)
   10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
   10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
   15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
  234    0.000    0.000    0.013    0.000 common.py:158(isnull)
  234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
   15    0.000    0.000    0.012    0.001 generic.py:1609(take)
   20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)

The profile using joins is as follows:

65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
    7    0.293    0.042    0.293    0.042 built-in method gc.collect
   10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
   10    0.000    0.000    0.139    0.014 merge.py:26(merge)
  8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
    4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
   10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
    5    0.000    0.000    0.122    0.024 frame.py:4324(join)
    5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
    1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
  100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
  205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
  100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
  100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
   52    0.031    0.001    0.031    0.001 pandas.algos.take_2d_axis1_object_object
    5    0.000    0.000    0.030    0.006 base.py:2329(join)
   37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
    6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
    7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
    2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
    7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
    7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
   84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
   19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
  100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
    1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
   83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
   68    0.015    0.000    0.015    0.000 method 'copy' of 'numpy.ndarray' objects
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
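Reading the two listings above: in both, the largest single entry is the built-in gc.collect (about 0.28 to 0.29 s of tottime), apparently reached through _update_inplace and _check_setitem_copy, while merge.py:26(merge) accounts for roughly 0.14 to 0.15 s of cumulative time. A breakdown of this kind can also be produced with the standard library's cProfile and pstats. The sketch below assumes a hypothetical `merge_step` function and toy data in place of the real method.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def merge_step(left, right):
    # Hypothetical stand-in for one merge inside merge_earlier_created_values.
    return pd.merge(left, right, on="APPT_NBR", how="left")


left = pd.DataFrame({"APPT_NBR": np.arange(1000), "x": 1})
right = pd.DataFrame({"APPT_NBR": np.arange(1000), "y": 2})

profiler = cProfile.Profile()
profiler.enable()
result = merge_step(left, right)
profiler.disable()

# Sort by cumulative time and keep the top 10 entries, like the listings above.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time makes it easier to see which caller is responsible for an expensive built-in such as gc.collect.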

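As you can see, merge is slightly faster than join here (0.524 s versus 0.550 s). The difference is small, but over 4000 iterations it adds up to minutes.

Thanks

Comments:

Set the merge columns as the index and use df1.join(df2) instead.

Answer 1: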

I suggest that you set your merge columns as the index and use df1.join(df2) instead of merge; it is much faster.

Here are some examples, including timings:

In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

Here is a regular left merge on A and A2:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')

1 loop, best of 3: 441 ms per loop

Here is the same using join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')

1 loop, best of 3: 184 ms per loop

Now obviously, if you can set the index before looping, the gain in time is much greater:

# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

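Then, inside the loop, you get something that in this case is about 30 times faster than the original merge (14.3 ms versus 441 ms):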

In [5]: %%timeit
        x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop
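The pattern described above, indexing once before the loop and joining inside it, can be sketched as follows. This is an illustration on smaller made-up data, not the asker's code. The names `left` and `right` and the loop count are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df1 = pd.DataFrame({"A": np.arange(n), "B": rng.integers(0, 1000, n)})
df2 = pd.DataFrame({"A2": np.arange(n), "B2": rng.integers(0, 1000, n)})

# Pay the set_index cost once, outside the loop.
left = df1.set_index("A")
right = df2.set_index("A2")

results = []
for _ in range(3):  # stands in for the ~4000 iterations in the question
    results.append(left.join(right, how="left"))

# Reference result from the column-based merge, for comparison.
merged = df1.merge(df2, how="left", left_on="A", right_on="A2")
print(results[0].head())
```

The joined result is indexed by A rather than carrying A and A2 as columns, so downstream code may need a reset_index() call.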

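Comments:

This is a left merge/join. How does the how='left' argument from merge carry over to join?

Somehow I am not seeing much improvement on my dataset. If I convert all merges to joins, the time increases by about 0.1 to 0.3 seconds. If I convert only some merges to joins, I can reduce the time by about 0.2 seconds. Is there anything I am missing? Or any code I need to post?

Great solution, but make sure to retain the key column(s) in your df, because set_index drops them by default (e.g., use df1.set_index('A', inplace=True, drop=False)). There is also the issue that the original index may still be needed, but after joining it becomes d2.index, so it may be prudent to reset the index with .reset_index(inplace=True, drop=True) after joining. Finally, all join operations rearrange the rows by default, so if ordering matters you have to keep a unique key and re-sort the data (e.g. for visual inspection, or if the variables have a time component).

Answer 2:

set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.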

import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

%%timeit
    x = df1.merge(df2, how='left', left_on='A', right_on='A2')   
#1 loop, best of 3: 664 ms per loop

%%timeit  
    x = df1.set_index('A').join(df2.set_index('A2'), how='left') 
#1 loop, best of 3: 354 ms per loop

%%time 
    df1.set_index('A', inplace=True)
    df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
    x = df1.join(df2, how='left')  
#10 loops, best of 3: 80.4 ms per loop

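Even when the integer keys in the join column are not in the same order in the two tables, you can still expect a large speedup, about 8 times here (80.4 ms versus 664 ms).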
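Outside IPython, the same comparison can be run with the standard timeit module. The sketch below uses a smaller, made-up dataset and an arbitrary repeat count. The absolute numbers depend on the machine and the pandas version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
ids = rng.choice(np.arange(10 * n), size=n, replace=False)
df1 = pd.DataFrame({"A": ids, "B": rng.integers(0, 1000, n)})
df2 = pd.DataFrame({"A2": rng.permutation(ids), "B2": rng.integers(0, 1000, n)})

# Index once, as recommended, before timing the join.
left = df1.set_index("A")
right = df2.set_index("A2")

repeats = 5
t_merge = timeit.timeit(
    lambda: df1.merge(df2, how="left", left_on="A", right_on="A2"), number=repeats
)
t_join = timeit.timeit(lambda: left.join(right, how="left"), number=repeats)
print(f"merge: {t_merge / repeats * 1e3:.1f} ms per call")
print(f"join:  {t_join / repeats * 1e3:.1f} ms per call")

# Sanity check that both approaches pair up the same rows.
merged = df1.merge(df2, how="left", left_on="A", right_on="A2")
joined = left.join(right, how="left")
same = np.array_equal(merged["B2"].to_numpy(), joined["B2"].to_numpy())
print("same values:", same)
```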

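Comments:

A short explanation of why merging on the index is faster than merging on "ordinary" columns: an index has a hash table, which means you can look values up in amortized O(1). For an ordinary column you need O(n) in the worst case, which means merging two dfs of length n takes O(n^2) in the worst case.

In my case DataFrame.merge() is significantly faster (x5). I am doing a left join with a 3m+ row dataframe on the left and a 900+ row dataframe on the right. My indexes are strings, which is about the only explanation I can see.

Please note: the speed gain depends on whether your index is unique. If the index is not unique, merging two dataframes on the index may even take longer.

Does this still work with a multi-index? x = df1.set_index(['A','B']).join(df2.set_index(['A','B']), how='left') ?

@Intelligent-Infrastructure Yes, it does work with a multi-index. See the official documentation: pandas.pydata.org/docs/reference/api/….

Answer 3:

I don't know if this deserves a new answer, but personally the following tricks helped me further improve the joins I had to do on big DataFrames (millions of rows and hundreds of columns):

First, besides using set_index(index, inplace=True), you may also want to sort the index with sort_index(inplace=True). This speeds up the join a lot if your index is not ordered. For example, creating the DataFrames with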

import numpy as np
import pandas as pd

nbre_items = 100000

ids = np.arange(nbre_items)
np.random.shuffle(ids)  # shuffle the ids in place

df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)

np.random.shuffle(ids)  # reshuffle so df2 gets a different order

df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)

I got the following results:

%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And after sorting the index (which takes a limited amount of time):

df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
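A minimal sketch of the sorting step, assuming unique integer ids as in the example above: it checks that sorting changes only the row order and not the joined content, and uses Index.is_monotonic_increasing to confirm the index is sorted. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df1 = pd.DataFrame({"id": rng.permutation(n), "value": 1}).set_index("id")
df2 = pd.DataFrame({"id": rng.permutation(n), "value2": 2}).set_index("id")

unsorted_join = df1.join(df2)

# Sort both indexes once, then join.
df1_sorted = df1.sort_index()
df2_sorted = df2.sort_index()
sorted_join = df1_sorted.join(df2_sorted)

print(df1.index.is_monotonic_increasing, df1_sorted.index.is_monotonic_increasing)

# Same rows either way; only the row order differs.
same_rows = unsorted_join.sort_index().equals(sorted_join)
print("same rows:", same_rows)
```

Because the sort itself costs time, it pays off mainly when the sorted frames are joined repeatedly.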

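Second, you can split one of your DataFrames into several DataFrames with fewer columns each. This trick gave me mixed results, so be careful when using it. For example: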

for i in range(0, df2.shape[1], 100):
    df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')
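The loop above can be wrapped in a small helper and checked against a single join. This is only a sketch: `join_in_column_chunks` is a hypothetical name, the data is made up, and it uses how='left' rather than the how='outer' of the snippet above, which is equivalent here only because both frames share the same index.

```python
import numpy as np
import pandas as pd


def join_in_column_chunks(left, right, chunk_size=100, how="left"):
    """Join `right` onto `left` one slice of columns at a time (hypothetical helper)."""
    result = left
    for start in range(0, right.shape[1], chunk_size):
        # iloc slicing past the last column is safe; it just returns fewer columns.
        result = result.join(right.iloc[:, start:start + chunk_size], how=how)
    return result


rng = np.random.default_rng(0)
index = pd.RangeIndex(1000, name="id")
left = pd.DataFrame({"value": rng.random(1000)}, index=index)
right = pd.DataFrame(
    rng.random((1000, 250)), index=index, columns=[f"c{i}" for i in range(250)]
)

chunked = join_in_column_chunks(left, right, chunk_size=100)
direct = left.join(right, how="left")
print(chunked.shape, direct.shape)
```

As the answer notes, the benefit is not consistent, so it is worth timing both variants on the real data before adopting this.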

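Comments:

For a fair comparison you should include the two sort_index operations. You can use %%timeit for multi-line timing and put the code on the lines below it.

Thanks for the tip! I tested with both sort_index calls included in the %timeit and still got a complete process that is 3 times faster. So it still seems to help when the indexes are unordered.

Although the sort can take as long as a normal join itself, it does improve the join time for abnormally long joins (usually the first of several joins executed sequentially).

sort_index really helped me a lot! pd.concat() went from more than 10 seconds down to a fraction of a second!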
