Is there a way to speed up querying latitude-longitude of postcodes on a large dataframe using pgeocode?

Posted 2023-04-13

【Title】: Is there a way to speed up querying latitude-longitude of postcodes on a large dataframe using pgeocode? 【Posted】: 2021-10-20 05:38:06 【Question】:

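I have a dataframe of roughly 100k rows containing postcodes and country codes. I would like to get the latitude and longitude of each location and save them in two new columns. I have working code for a sample of the dataframe (e.g. 100 rows), but running it on the whole dataframe takes a very long time (> 1 hour). I am new to Python, and I suspect there should be a faster way to do this: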

postcode

country_code

get_lat(pcode, country)

get_long(pcode, country)

Below is a sample of my code.

import pgeocode
import numpy as np
import pandas as pd

#Sample data
df = pd.DataFrame({'postcode': ['3011','3083','3071','2660','9308','9999'], 'country_code': ['NL','NL','NL','BE','BE','DE']})

#There are blank postcodes and postcodes that pgeocode cannot return any value, so I am using try-except (e.g. last row in sample dataframe): 
#function to get latitude 
def get_lat(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        x = nomi.query_postal_code(pcode).latitude
        return x
    except Exception:
        return np.nan

#function to get longitude
def get_long(pcode, country):
    try:
        nomi = pgeocode.Nominatim(country)
        x = nomi.query_postal_code(pcode).longitude
        return x
    except Exception:
        return np.nan

#Find and create new columns for latitude-longitude based on postcode (ex: 5625) and country of postcode (ex: NL)
df['latitude'] = np.vectorize(get_lat)(df['postcode'],df['country_code'])
df['longitude'] = np.vectorize(get_long)(df['postcode'],df['country_code'])

【Comments】:

【Answer 1】:

As an alternative solution, I downloaded the txt files from the following site: http://download.geonames.org/export/zip/

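After downloading the files, simply import the txt file and join it to your data. This is much faster, but static: you are working with a snapshot of the postal code database taken at an earlier point in time.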
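
The import-and-join step could look roughly like the sketch below. It assumes the column layout described in the GeoNames postal code readme (tab-separated, no header row); the helper names `load_geonames` and `add_coordinates` are hypothetical, and the inline rows are made-up placeholders rather than real GeoNames records. Averaging the coordinates when several places share one postal code is a simplification chosen here so that the join never multiplies rows.

```python
import io

import pandas as pd

# Column layout assumed from the GeoNames postal code readme.
GEONAMES_COLUMNS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1", "admin_name2", "admin_code2",
    "admin_name3", "admin_code3", "latitude", "longitude", "accuracy",
]


def load_geonames(source):
    """Read a tab-separated dump and keep one coordinate pair per postcode."""
    table = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=GEONAMES_COLUMNS,
        # Read codes as text so that leading zeros survive.
        dtype={"country_code": str, "postal_code": str},
    )
    coords = table.groupby(
        ["country_code", "postal_code"], as_index=False
    )[["latitude", "longitude"]].mean()
    return coords.rename(columns={"postal_code": "postcode"})


def add_coordinates(df, coords):
    """Left-join coordinates; postcodes without a match get NaN."""
    return df.merge(coords, how="left", on=["country_code", "postcode"])


# Made-up placeholder rows standing in for a downloaded file.
rows = [
    ["NL", "3011", "Place A", "", "", "", "", "", "", "51.0", "4.0", "6"],
    ["NL", "3011", "Place B", "", "", "", "", "", "", "53.0", "6.0", "6"],
    ["FR", "01000", "Place C", "", "", "", "", "", "", "46.0", "5.0", "6"],
]
sample_file = io.StringIO("\n".join("\t".join(row) for row in rows))

df = pd.DataFrame({
    "postcode": ["3011", "9999", "01000"],
    "country_code": ["NL", "DE", "FR"],
})
result = add_coordinates(df, load_geonames(sample_file))
print(result)
```

With a real download, `load_geonames` would be given the path of the unzipped txt file instead of the in-memory sample. A single merge replaces the 100k individual lookups, which is where the speed-up comes from.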

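Another advantage is that you can inspect the file and check the postcode format. With pgeocode it is harder to keep track of the accepted postcode formats and to understand why a query returns null.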
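
One possible way to do that format check is to reduce every postcode to a "shape" (digits become `9`, letters become `A`) and compare the shapes found in the reference file with those in your own data. This is only a sketch: `postcode_shape` and `shape_counts` are hypothetical helpers, and both input lists are invented for illustration.

```python
from collections import Counter


def postcode_shape(code):
    """Map digits to '9' and letters to 'A'; keep every other character."""
    shape = []
    for ch in str(code).strip():
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)
    return "".join(shape)


def shape_counts(codes):
    """Count how often each postcode shape occurs."""
    return Counter(postcode_shape(code) for code in codes)


# Invented examples: codes from a reference file vs. codes from your own data.
reference_codes = ["3011", "3083", "2660"]
my_codes = ["3011", "3011 AB", "", "2660"]

known_shapes = set(shape_counts(reference_codes))
unexpected = [code for code in my_codes if postcode_shape(code) not in known_shapes]

print(shape_counts(my_codes))
print(unexpected)
```

Codes whose shape never occurs in the reference file (blank values, extra suffixes, stray spaces) are the likely candidates for lookups that come back empty, so they can be cleaned or set aside before joining.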

【Discussion】:
