python - How to find the nearest 8 points for each point in a point set with more than 1M points with python pandas
Posted: 2019-06-18 03:00:40
【Problem description】: I have hundreds of gz files, each containing the coordinates of roughly 0.5M~1M rectangular boxes. Every box has a unique index called localIdx, and its coordinates are llx, lly, urx, ury. I can get each box's center via x=(llx+urx)/2, y=(lly+ury)/2, which turns the boxes into points. Now, for each point (box), I want to find the nearest 8 points (boxes) and return their localIdx.
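For reference, a minimal sketch of this box-to-point conversion, assuming each gz file is a gzipped CSV with columns llx, lly, urx, ury and localIdx (the filename is hypothetical):

import pandas as pd

# read one gzipped CSV of boxes (hypothetical filename and layout)
df = pd.read_csv("boxes_0001.csv.gz", compression="gzip")
df["x"] = (df["llx"] + df["urx"]) / 2  # box center x
df["y"] = (df["lly"] + df["ury"]) / 2  # box center y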
Here is my workflow:
1. read the gz files with python pandas
2. set the column 'localIdx' of each point as the index
3. get the height and width of each box: h=ury-lly, w=urx-llx
4. for each point, keep only the points whose x is within current_point_x +/- 20*w and whose y is within current_point_y +/- 20*h
5. convert the filtered points' x/y and the current point's x/y into two 2D numpy arrays
6. get the Euclidean distance with scipy.spatial.distance.cdist
7. merge the result of step 6 back into the filtered DataFrame to map the localIdx
8. select the 8 nearest localIdx values and join them into a string
9. assign that localIdx string to each point
Here is the core function in my code:
import numpy as np
import scipy.spatial

def seek_norm_list(line, target_df=None, rmax=None, nmax=None, keycol=None):
    if line.padType == 'DUT':
        res_id = []
        key_value = line[keycol]
        current_pad = np.array([[line.xbbox, line.ybbox]])
        h, w = line['h'], line['w']
        # search window: +/- 20 box-heights in y, +/- 20 box-widths in x
        h1, h2 = line.ybbox - h * 20, line.ybbox + h * 20
        w1, w2 = line.xbbox - w * 20, line.xbbox + w * 20
        target_mask = ((target_df['xbbox'] > w1) & (target_df['xbbox'] < w2) &
                       (target_df['ybbox'] > h1) & (target_df['ybbox'] < h2))
        target_df = target_df[target_mask].copy()
        nbh_blks = line.nbh_blk.split(":")
        a = np.array(list(zip(target_df.xbbox, target_df.ybbox)))
        if len(a) > 0:
            # distance from every candidate to the current pad; cdist returns
            # an (n, 1) array, so flatten it before assigning as a column
            d = scipy.spatial.distance.cdist(a, current_pad)
            target_df['dist'] = d.ravel()
            key_target = target_df[target_df[keycol] == key_value].copy()
            key_target.sort_values(by='dist', inplace=True)
            res_target = key_target[key_target.dist < rmax]
            keep_id = list(res_target['localIdx'])
            if line['localIdx'] in keep_id:
                keep_id.remove(line['localIdx'])  # drop the point itself
            if len(keep_id) > int(nmax):
                keep_id = keep_id[:int(nmax)]     # keep the nmax nearest
            for bk in nbh_blks:
                for idx in keep_id:
                    if bk in idx:
                        res_id.append(idx)
            line['normList'] = ":".join(res_id)
            line['refCount'] = len(res_id)
            if len(res_id) > 0:
                # keep_id is sorted by distance, so first/last give min/max
                min_id, max_id = keep_id[0], keep_id[-1]
                line['minDist'] = res_target.loc[min_id, 'dist']
                line['maxDist'] = res_target.loc[max_id, 'dist']
            else:
                line['minDist'] = ''
                line['maxDist'] = ''
        else:
            line['normList'], line['refCount'] = '', ''
            line['minDist'], line['maxDist'] = '', ''
        return line
    else:
        line['normList'], line['refCount'] = '', ''
        line['minDist'], line['maxDist'] = '', ''
        return line
This is very, very slow for each gz file, and in my case there are about 600 files with more than 120M rows in total. I used multiprocessing on my 16-core machine. I want to get the results within 3 hours. Is that possible with Python?
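For context, the multiprocessing setup described above might look like the sketch below, reusing the seek_norm_list function from the question; the file pattern, output naming, and the rmax/nmax/keycol argument values are hypothetical stand-ins:

import glob
import pandas as pd
from multiprocessing import Pool

def process_file(path):
    # hypothetical per-file pipeline: read, then apply the core function row by row
    df = pd.read_csv(path, compression="gzip")
    df = df.apply(seek_norm_list, axis=1, target_df=df,
                  rmax=100.0, nmax=8, keycol="blockName")  # made-up parameter values
    df.to_csv(path.replace(".gz", ".out.gz"), index=False, compression="gzip")

if __name__ == "__main__":
    with Pool(processes=16) as pool:  # one worker per core
        pool.map(process_file, sorted(glob.glob("*.gz")))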
【Question comments】:
【Answer 1】: K nearest neighbors. You can use sklearn.
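A minimal sketch of that suggestion, assuming the DataFrame df already has the center columns x/y (as computed in the question) and a localIdx column; the rmax filtering and per-block logic are omitted:

from sklearn.neighbors import NearestNeighbors

xy = df[["x", "y"]].to_numpy()

# k=9: the query set equals the training set, so each point's nearest
# neighbour is itself; the self-match is stripped below
nn = NearestNeighbors(n_neighbors=9, algorithm="kd_tree", n_jobs=-1).fit(xy)
dist, idx = nn.kneighbors(xy)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-match column

local_idx = df["localIdx"].astype(str).to_numpy()
df["normList"] = [":".join(local_idx[row]) for row in idx]
df["minDist"] = dist[:, 0]   # kneighbors returns columns sorted by distance
df["maxDist"] = dist[:, -1]

With a KD-tree each query is roughly O(log n), and all 1M queries happen in one vectorised call instead of 1M per-row pandas filters.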
【Discussion】:
Thanks! I read the API of sklearn NearestNeighbors, but I found that in my case I would still need to compute >100M times, because I want to find the nearest points of every point. It looks like it would still be slow.
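For a rough sense of scale: a KD-tree answers all the queries in one vectorised call rather than >100M explicit pairwise distances. A self-contained sketch on synthetic data (the point count and coordinates are made up):

import numpy as np
from scipy.spatial import cKDTree

pts = np.random.rand(1_000_000, 2)  # synthetic stand-in for the box centers
tree = cKDTree(pts)                 # built once, O(n log n)
# k=9 because each point's nearest neighbour is itself
dist, idx = tree.query(pts, k=9, workers=-1)  # workers requires SciPy >= 1.6
dist, idx = dist[:, 1:], idx[:, 1:]           # drop the self-matches

This typically takes on the order of seconds per million 2D points, so 600 files should fit well inside a 3-hour budget.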