Improved efficiency versus iterating over two large Pandas DataFrames

Posted 2023-02-23

【Title】: Improved efficiency versus iterating over two large Pandas Dataframes 【Posted】: 2019-04-28 13:36:44 【Question】:

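I have two HUGE Pandas dataframes with location-based values, and I need to update df1['count'] with the number of records from df2 that are less than 1000 m from each point in df1.

Here is a sample of my data, as imported into Pandas: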

df1 =       lat      long    valA   count
        0   123.456  986.54  1      0
        1   223.456  886.54  2      0
        2   323.456  786.54  3      0
        3   423.456  686.54  2      0
        4   523.456  586.54  1      0

df2 =       lat      long    valB
        0   123.456  986.54  1
        1   223.456  886.54  2
        2   323.456  786.54  3
        3   423.456  686.54  2
        4   523.456  586.54  1

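In reality, df1 has about 10 million rows and df2 has about 1 million.

I created a working nested FOR loop using the Pandas DF.itertuples() method. It works for smaller test data sets (df1 = 1k rows and df2 = 100 rows takes about an hour to complete), but the full data set is vastly larger and, by my calculations, would take years to complete. Here is my working code...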

import pandas as pd
import geopy.distance as gpd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

df1.sort_values(['long', 'lat'], inplace=True)
df2.sort_values(['long', 'lat'], inplace=True)

for irow in df1.itertuples():    
     count = 0
     indexLst = []        
     Location1 = (irow[1], irow[2])    

     for jrow in df2.itertuples():  
          Location2 = (jrow[1], jrow[2])                                      
          if gpd.distance(Location1, Location2).kilometers < 1:
             count += 1
             indexLst.append(jrow[0])    
     if count > 0:                  #only update DF if a match is found
         df1.at[irow[0],'count'] = (count)      
         df2.drop(indexLst, inplace=True)       #drop rows already counted from df2 to speed up next iteration

# save updated df1 to new csv file
outFileName = 'combined.csv'
df1.to_csv(outFileName, sep=',', index=False)

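Each point in df2 only needs to be counted once, because the points in df1 are evenly spaced. To that end, I added a drop statement that removes rows from df2 once they have been counted, in the hope of shortening later iterations. I also initially tried to write a merge/join statement instead of the nested loops, but without success.

At this stage, any help with improving the efficiency would be greatly appreciated!

Edit: the goal is to update the 'count' column in df1 with the number of points from df2 that are less than 1 km away, for example: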

df1 =       lat      long    valA   count
        0   123.456  986.54  1      3
        1   223.456  886.54  2      1
        2   323.456  786.54  3      9
        3   423.456  686.54  2      2
        4   523.456  586.54  1      5

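【Comments】:

Welcome @dP8884. To clarify, I understand the intent of this code is to take a lat/long pair from df1 and add to a counter the number of lat/long points in df2 that are less than 1 km away? So at the end you would have the lat/long in df1, with the count column updated to the number of points found in df2 within 1 km?

Yes, that's correct. I'll update my question to reflect what the expected output should look like. Thanks.

There should be a way to do some sort of filtering based on lat/long combinations that you know are out of range (i.e., lat or long more than one degree apart), but I don't know the best way to do that in your case.

【Answer 1】:

Having done this kind of thing frequently, I've found a few best practices:

1) Use numpy and numba as much as possible

2) Leverage parallelization as much as possible

3) Skip loops in favor of vectorized code (here we do use a loop, with numba, to take advantage of parallelization).

In this particular case, I want to point out the slowdown introduced by geopy. While it is a great package and produces very accurate distances (compared to the Haversine method), it is much slower (I have not looked into the implementation to find out why).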

import numpy as np
from geopy import distance

origin = (np.random.uniform(-90,90), np.random.uniform(-180,180))
dest = (np.random.uniform(-90,90), np.random.uniform(-180,180))

%timeit distance.distance(origin, dest)

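216 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

At that rate, calculating 10 million x 1 million distances would take approximately 2,160,000,000 seconds, or 600,000 hours. Even parallelism will only help so much.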
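The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes every distance call costs the measured 216 µs and that nothing runs in parallel; the variable names are illustrative.

```python
# Measured cost of one geopy distance call, in seconds (216 µs from the timing above)
per_call_s = 216e-6

# One distance per (df1 row, df2 row) pair: 10 million x 1 million
n_pairs = 10_000_000 * 1_000_000

total_s = per_call_s * n_pairs          # total single-threaded runtime in seconds
total_hours = total_s / 3600
total_years = total_hours / (24 * 365)

print(f"{total_s:.3g} s, {total_hours:,.0f} hours, {total_years:.1f} years")
```

That is on the order of decades of single-threaded compute, which is why a cheaper distance function matters more than the loop mechanics.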

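Because you are only interested in points that are very close together, I suggest using the Haversine distance (which is less accurate at greater distances).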

from numba import jit, prange, vectorize

@vectorize
def haversine(s_lat,s_lng,e_lat,e_lng):

    # approximate radius of earth in km
    R = 6373.0

    s_lat = s_lat*np.pi/180.0                      
    s_lng = np.deg2rad(s_lng)     
    e_lat = np.deg2rad(e_lat)                       
    e_lng = np.deg2rad(e_lng)  

    d = np.sin((e_lat - s_lat)/2)**2 + np.cos(s_lat)*np.cos(e_lat) * np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

%timeit haversine(origin[0], origin[1], dest[0], dest[1])

1.85 µs ± 53.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
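As a quick sanity check of the formula, the sketch below re-implements it with only the standard library. `haversine_km` is a hypothetical helper that is not part of the answer; it assumes the same Earth radius of 6373.0 km. One degree of latitude should come out at roughly 111.2 km, and a point compared with itself at exactly 0.

```python
import math

def haversine_km(s_lat, s_lng, e_lat, e_lng, R=6373.0):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    s_lat, s_lng, e_lat, e_lng = map(math.radians, (s_lat, s_lng, e_lat, e_lng))
    d = (math.sin((e_lat - s_lat) / 2) ** 2
         + math.cos(s_lat) * math.cos(e_lat) * math.sin((e_lng - s_lng) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(d))

# One degree of latitude along a meridian
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
# Distance from a point to itself
same_point = haversine_km(12.5, 45.0, 12.5, 45.0)
print(one_degree_lat, same_point)
```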

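That is already roughly a 100-fold improvement, but we can do better. You may have noticed the @vectorize decorator from numba. It turns the scalar Haversine function into a vectorized one that accepts arrays as input. We will take advantage of that in the next step: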

@jit(nopython=True, parallel=True)
def get_nearby_count(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: Second list of coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        x, y = coords[i]
        # returns an array of length k
        dist = haversine(x, y, coords2[:,0], coords2[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output

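You should now have an array whose length equals that of the first set of coordinates (10 million in your case). You can then simply assign it to your data frame as your count!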
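Assigning the result back to the data frame could look like the sketch below. It assumes the column names `lat` and `long` from the question. `count_nearby_numpy` is a hypothetical plain-NumPy stand-in for `get_nearby_count`, so that the snippet runs without numba, and the two tiny frames are made-up data. The stand-in builds a full n x k distance matrix, so it is only suitable for small inputs.

```python
import numpy as np
import pandas as pd

def count_nearby_numpy(coords, coords2, max_dist, R=6373.0):
    """Hypothetical NumPy stand-in for get_nearby_count (same inputs and output)."""
    lat1 = np.deg2rad(coords[:, 0])[:, None]    # shape (n, 1)
    lng1 = np.deg2rad(coords[:, 1])[:, None]
    lat2 = np.deg2rad(coords2[:, 0])[None, :]   # shape (1, k)
    lng2 = np.deg2rad(coords2[:, 1])[None, :]
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    dist = 2 * R * np.arcsin(np.sqrt(d))        # (n, k) matrix of distances in km
    return np.sum(dist < max_dist, axis=1)

# Made-up sample data with the same columns as in the question
df1 = pd.DataFrame({'lat': [10.0, 20.0], 'long': [30.0, 40.0], 'valA': [1, 2], 'count': 0})
df2 = pd.DataFrame({'lat': [10.001, 10.002, 50.0], 'long': [30.0, 30.001, 60.0], 'valB': [1, 2, 3]})

counts = count_nearby_numpy(df1[['lat', 'long']].to_numpy(),
                            df2[['lat', 'long']].to_numpy(), 1.0)
df1['count'] = counts.astype(int)   # one count per df1 row, in row order
print(df1)
```

With the numba version from this answer, the call would take the same two arrays: `get_nearby_count(df1[['lat', 'long']].to_numpy(), df2[['lat', 'long']].to_numpy(), 1.0)`.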

Timing test with 100,000 x 10,000:

n = 100_000
k = 10_000

coords1 = np.zeros((n, 2))
coords2 = np.zeros((k, 2))

coords1[:,0] = np.random.uniform(-90, 90, n)
coords1[:,1] = np.random.uniform(-180, 180, n)
coords2[:,0] = np.random.uniform(-90, 90, k)
coords2[:,1] = np.random.uniform(-180, 180, k)

%timeit get_nearby_count(coords1, coords2, 1.0)

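2.45 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, that still means you are looking at something like 20,000+ seconds for the full data set. And this was on a machine with 80 cores (using roughly 76 of them, judging by top).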
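The 20,000+ second figure follows from scaling the measured time by the number of coordinate pairs. The sketch below assumes the runtime grows linearly with the pair count, which is an assumption rather than a measured fact at full size; the variable names are illustrative.

```python
measured_s = 2.45                        # measured time for the 100,000 x 10,000 test
measured_pairs = 100_000 * 10_000        # pairs compared in the test
full_pairs = 10_000_000 * 1_000_000      # pairs in the full problem (df1 x df2)

scale = full_pairs / measured_pairs      # how many times larger the full problem is
estimated_s = measured_s * scale         # linear extrapolation
estimated_hours = estimated_s / 3600

print(f"{scale:,.0f}x larger, about {estimated_s:,.0f} s ({estimated_hours:.1f} h)")
```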

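That's the best I can do for now. Good luck (also, this is my first post, so thanks for inspiring me to contribute!)

PS: You might also look into Dask arrays and the function map_blocks() to parallelize this function (instead of relying on prange). How you partition the data may affect the total execution time.

PPS: 1,000,000 x 100,000 (100x smaller than your full set) took 3 min 27 s (207 seconds), so the scaling appears to be linear and a bit forgiving.

PPPS: An implementation with a simple latitude-difference filter: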

@jit(nopython=True, parallel=True)
def get_nearby_count_vlat(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: List of port coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    coords2_abs = np.abs(coords2)
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        point = coords[i]
        # subsetting coords2 to reduce haversine calc time. Value .02 is from playing with Gmaps and will need to change for max_dist > 1.0
        coords2_filtered = coords2[np.abs(point[0] - coords2[:,0]) < .02]
        # in case of no matches
        if coords2_filtered.shape[0] == 0: continue
        # returns an array with one distance per filtered point
        dist = haversine(point[0], point[1], coords2_filtered[:,0], coords2_filtered[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output
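The code comment notes that the hard-coded .02 came from experimenting and must change for max_dist > 1.0. One possible way to derive the threshold from max_dist instead is sketched below. It assumes the same spherical Earth radius of 6373.0 km used in haversine, and `lat_band_degrees` is a hypothetical helper, not part of the answer. On a sphere, two points within max_dist of each other can differ in latitude by at most max_dist / R radians.

```python
import math

def lat_band_degrees(max_dist_km, R=6373.0):
    """Largest latitude difference (in degrees) two points within max_dist_km can have."""
    return math.degrees(max_dist_km / R)

band_1km = lat_band_degrees(1.0)   # threshold for the 1 km case in the question
band_5km = lat_band_degrees(5.0)   # threshold if max_dist were raised to 5 km
print(band_1km, band_5km)
```

For 1 km this gives roughly 0.009 degrees, so .02 is a safe but loose bound. Using the derived value, perhaps with a small safety margin, would keep the filter valid for other values of max_dist.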

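【Comments】:

Thank you! This is great and very well explained. Let me digest it a bit more and implement it on a sample of my data, then I'll let you know how it turns out. Also, this is my first post too, so thanks for the feedback, as it seems to have given me enough privileges to start upvoting (yours being the first, of course). :)

Eliot K made a good point about speeding things up by reducing the search space, but geographic coordinates give me a headache. I think I found a way to filter the results quickly, but only on latitude. I tested it quickly on my laptop (quad-core). My original method took 70 s (100k coords x 50k coords), while the quick latitude-distance filter brought it down to 2.27 s. That was with completely random coordinates. You can definitely improve the filtering, especially with a sorted df2. I've added the change to the code above. Filtering on longitude didn't seem worth it (it costs more relative to the Haversine calculation).

Thanks ernestk and Eliot K. This appears to have reduced my process from approximately 90 years to …

【Answer 2】:

I did something similar recently, though not with lat/long, and I only needed to find the nearest point and its distance. For that I used scipy.spatial.cKDTree, which was quite fast.

I think that in your case you could use the query_ball_point() function.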

from scipy import spatial
import pandas as pd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# Build the index
tree = spatial.cKDTree(df1[['long', 'lat']])
# Then query the index

You should give it a try.
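The query step might look like the sketch below. It is one possible approach, not the answerer's code. Since a KD-tree measures straight-line distance, the sketch does not index raw degrees: it first maps each lat/long to a point on the unit sphere and converts the 1 km radius to the equivalent chord length. It also builds the tree on the df2 points and queries it with the df1 points, because the count is wanted per df1 row. `to_unit_xyz` and `count_within` are hypothetical helpers, the Earth radius of 6373.0 km is borrowed from the other answer, and `return_length=True` needs SciPy 1.3 or newer.

```python
import numpy as np
from scipy import spatial

EARTH_RADIUS_KM = 6373.0  # assumed spherical Earth, same value as in the Haversine answer

def to_unit_xyz(lat_deg, lng_deg):
    """Convert lat/long in degrees to 3-D points on the unit sphere."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lng = np.deg2rad(np.asarray(lng_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lng),
                            np.cos(lat) * np.sin(lng),
                            np.sin(lat)))

def count_within(lat1, lng1, lat2, lng2, max_dist_km=1.0):
    """For each point in set 1, count the points of set 2 within max_dist_km."""
    tree = spatial.cKDTree(to_unit_xyz(lat2, lng2))
    # Great-circle distance d corresponds to chord length 2*sin(d / (2R)) on the unit sphere
    chord = 2.0 * np.sin(max_dist_km / (2.0 * EARTH_RADIUS_KM))
    return tree.query_ball_point(to_unit_xyz(lat1, lng1), r=chord, return_length=True)

# Made-up coordinates: two "df1" points and four "df2" points
counts = count_within([0.0, 10.0], [0.0, 10.0],
                      [0.0, 0.005, 0.5, 10.0], [0.0, 0.0, 0.0, 10.001])
print(counts)
```

With the data frames from the question, the call would be `count_within(df1['lat'], df1['long'], df2['lat'], df2['long'])`. Note that query_ball_point includes points exactly at the radius, whereas the loop in the question uses a strict less-than comparison.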
