Apply postcode API call to each row in dataframe

Posted 2023-04-15


[Title]: Apply postcode API call to each row in dataframe [Asked]: 2021-09-22 15:08:55 [Question]:

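In the code block below I have a dataframe, geo, that I want to iterate over to get the easting, northing, longitude and latitude for each UK postcode in geo. I have written one function that calls the API and another that returns the four variables.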

I have tested the get_data call with a postcode to show that it works (this is a public API that anyone can use):

import requests 
import pandas as pd

geo = spark.table('property_address').toPandas()


def call_api(url: str) -> dict:
  postcode_response = requests.get(url)
  return postcode_response.json()

def get_data(postcode):
  # Interpolate the postcode into the URL; without the braces the
  # literal word "postcode" is sent to the API.
  url = f"http://api.getthedata.com/postcode/{postcode}"

  req = requests.get(url)

  results = req.json()['data']
  easting = results['easting']
  northing = results['northing']
  latitude = results['latitude']
  longitude = results['longitude']

  return easting, northing, latitude, longitude

get_data('SW1A 1AA')

Out[108]: (529090, 179645, '51.501009', '-0.141588')

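What I want to do is run this for every row in geo and return the result as a dataset. My research led me to apply, and I based my attempt on this guide.

I am trying to pass in a column of geo called property_postcode and iterate over each row to return the values. Here is my attempt: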

def get_columns(row):
  column_name = 'property_postcode'
  api_param = row[column_name]
  easting,northing,latitude,longitude = get_data(api_param)
  row['east'] = easting
  row['north'] = northing
  row['lat'] = latitude
  row['long'] = longitude
  return row

geo = geo.apply(get_columns, axis=1)

display(geo)

The error I get is

`JSONDecodeError: Expecting value: line 1 column 1 (char 0)`

which doesn't tell me much. Looking for help or pointers.
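For context, `Expecting value: line 1 column 1 (char 0)` is the message Python's JSON decoder raises when the text it receives is empty or does not begin with a JSON value, so at least one response body during the apply was not JSON. The sketch below shows one way to guard against that. `parse_body` is a hypothetical helper name, and the `"status"`/`"match"` check mirrors the field used in the answer's code rather than anything confirmed from the API's documentation.

```python
import json


def parse_body(status_code: int, body: str):
    """Return the 'data' dict from a response body, or None if unusable."""
    # An error status or an empty body cannot contain a result.
    if status_code != 200 or not body.strip():
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # The body was not JSON (for example an HTML error page).
        return None
    # Assumption: a successful lookup is a dict with status == "match".
    if not isinstance(payload, dict) or payload.get("status") != "match":
        return None
    return payload.get("data")
```

With requests, this could be called as `parse_body(resp.status_code, resp.text)` in place of `resp.json()["data"]`, so a bad row yields None instead of stopping the whole apply.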

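[Comments]:

Are you looking for a Pandas solution or a PySpark solution?

Either. I'd guess PySpark probably has more power behind it.

[Answer 1]:

Instead of trying to set the values of the east, north, lat and long columns inside the function, return them from the function.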

import requests
import pandas as pd

# geo = spark.table('property_address').toPandas()


def call_api(url: str) -> dict:
    postcode_response = requests.get(url)
    return postcode_response.json()


def get_data(postcode):
    url = f"http://api.getthedata.com/postcode/{postcode}"
    req = requests.get(url)

    if req.json()["status"] == "match":
        results = req.json()["data"]
        easting = results.get("easting")
        northing = results.get("northing")
        latitude = results.get("latitude")
        longitude = results.get("longitude")
    else:
        easting = None
        northing = None
        latitude = None
        longitude = None

    return easting, northing, latitude, longitude


def get_columns(code):
    api_param = code
    return get_data(api_param)


df = pd.DataFrame(
    {
        "property_postcode": [
            "BE21 6NZ",
            "SW1A 1AA",
            "W1A 1AA",
            "DE21",
            "B31",
            "ST16 2NY",
            "S65 1EN",
        ]
    }
)

df[["east", "north", "lat", "long"]] = df.apply(
    lambda row: get_columns(row["property_postcode"]), axis=1, result_type="expand"
)

print(df)

property_postcode	east	north	lat	long
BE21 6NZ	NaN	NaN	None	None
SW1A 1AA	529090	179645	51.501009	-0.141588
W1A 1AA	528887	181593	51.518561	-0.143799
DE21	NaN	NaN	None	None
B31	NaN	NaN	None	None
ST16 2NY	391913	323540	52.809346	-2.121413
S65 1EN	444830	394082	53.44163	-1.326573
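To see what `result_type="expand"` does without touching the network, here is a small sketch that swaps the API call for a stub. `fake_get_data` and `FAKE_RESULTS` are hypothetical stand-ins for `get_data`; the one known postcode reuses the values shown in the question's output.

```python
import pandas as pd

# Stand-in for the API: one known postcode, everything else is "no match".
FAKE_RESULTS = {"SW1A 1AA": (529090, 179645, "51.501009", "-0.141588")}


def fake_get_data(postcode):
    """Return (easting, northing, latitude, longitude) or four Nones."""
    return FAKE_RESULTS.get(postcode, (None, None, None, None))


df = pd.DataFrame({"property_postcode": ["SW1A 1AA", "B31"]})

# Each row returns a 4-tuple; "expand" spreads it across four columns.
expanded = df.apply(
    lambda row: fake_get_data(row["property_postcode"]),
    axis=1,
    result_type="expand",
)
expanded.columns = ["east", "north", "lat", "long"]

# Join on the shared index to attach the new columns to the original rows.
result = df.join(expanded)
```

Naming the expanded columns and joining is equivalent in effect to assigning to `df[["east", "north", "lat", "long"]]` directly; it just keeps the intermediate frame around for inspection.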

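[Comments]:

Looks good, I'll review it tomorrow and get back to you. Many thanks.

Thanks mate, nice work. I needed to create a subset of my geo dataset containing only the property postcode and then merge it back. Because of data quality issues some postcodes are invalid, which can cause this to fail. When running the apply, is there anything that can be done to skip, ignore or return nulls for those rows? Otherwise I will just need to run DQ earlier in the script.

I actually ran into this problem while testing the code with some dummy data. I didn't investigate it further, but you could probably do something in the get_data function, for example return NA for easting, northing, etc. for invalid postcodes. Could you post some sample data containing valid and invalid postcodes?

Invalid ones would be "DE21 6" or "B31"; valid ones are "ST16 2NY" or "S65 1EN".

I have updated the answer to return NaN/None for easting, northing, etc. for invalid postcodes.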
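The subset-and-merge approach mentioned in the comments could look roughly like the sketch below: look up each distinct postcode once, then left-merge the coordinates back onto the full dataset. `lookup_unique` and `stub_get_data` are hypothetical names, the `id` column is made up for illustration, and the stub stands in for the real `get_data` so the example runs offline.

```python
import pandas as pd

COLUMNS = ["property_postcode", "east", "north", "lat", "long"]


def lookup_unique(postcodes, get_data):
    """Call get_data once per distinct, non-null postcode."""
    unique = pd.Series(postcodes).dropna().drop_duplicates()
    rows = [(pc, *get_data(pc)) for pc in unique]
    return pd.DataFrame(rows, columns=COLUMNS)


def stub_get_data(postcode):
    """Offline stand-in for get_data; unknown postcodes return Nones."""
    known = {"SW1A 1AA": (529090, 179645, "51.501009", "-0.141588")}
    return known.get(postcode, (None, None, None, None))


geo = pd.DataFrame(
    {"id": [1, 2, 3], "property_postcode": ["SW1A 1AA", "B31", "SW1A 1AA"]}
)

coords = lookup_unique(geo["property_postcode"], stub_get_data)

# A left merge keeps every row of geo; invalid postcodes get nulls.
merged = geo.merge(coords, on="property_postcode", how="left")
```

Deduplicating first means a postcode that appears many times in geo costs only one API call, and the left merge keeps rows with invalid postcodes rather than dropping them.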
