Vectorizing a function in pandas
Asked: 2014-12-20 00:39:03
Question: I have a dataframe containing a list of latitude/longitude coordinates:
import pandas as pd

d = {'Provider ID': {0: '10001',
                     1: '10005',
                     2: '10006',
                     3: '10007',
                     4: '10008',
                     5: '10011',
                     6: '10012',
                     7: '10016',
                     8: '10018',
                     9: '10019'},
     'latitude': {0: '31.215379379000467',
                  1: '34.22133455500045',
                  2: '34.795039606000444',
                  3: '31.292159523000464',
                  4: '31.69311635000048',
                  5: '33.595265517000485',
                  6: '34.44060759100046',
                  7: '33.254429322000476',
                  8: '33.50314015000049',
                  9: '34.74643089500046'},
     'longitude': {0: ' -85.36146587999968',
                   1: ' -86.15937514799964',
                   2: ' -87.68507485299966',
                   3: ' -86.25539902199966',
                   4: ' -86.26549483099967',
                   5: ' -86.66531866799966',
                   6: ' -85.75726760699968',
                   7: ' -86.81407933399964',
                   8: ' -86.80242858299965',
                   9: ' -87.69893502799965'}}
df = pd.DataFrame(d)
My goal is to use the haversine function to compute the distance in km between every pair of items:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km
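For reference, the function above evaluates the haversine formula. Written out, with \varphi for latitude and \lambda for longitude (both in radians) and R = 6367 km as in the code:

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```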
I want to end up with a dataframe like result_df below, where the values are the distances between each pair of provider IDs:
result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])
I can do this in a loop, but it is terribly slow. I'm looking for help converting it to a vectorized approach:
for first_hospital_coordinates in result_df.columns:
    for second_hospital_coordinates in result_df.index:
        # look up each provider's coordinates as scalar floats
        L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values[0]
        O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values[0]
        L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values[0]
        O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values[0]
        distance = haversine(O1, L1, O2, L2)
        # row = second provider, column = first provider
        result_df.loc[second_hospital_coordinates, first_hospital_coordinates] = distance
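Comments:
I answered a similar question: ***.com/questions/25767596/… It is close, and the haversine part is the same, although creating the 10x10 matrix efficiently is the main problem here.
Answer 1: To vectorize this code, you need to operate on the complete dataframe rather than on individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,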
import numpy as np
from math import radians

def h2(df, p):
    inrad = df.applymap(radians)
    dlon = inrad.longitude - inrad.longitude[p]
    dlat = inrad.latitude - inrad.latitude[p]
    lat1 = pd.Series(index=df.index, data=[df.latitude[p] for i in range(len(df.index))])
    # note: df.latitude and lat1 are still in degrees here; inrad holds the radian values
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
    # note: the haversine formula uses arcsin(sqrt(a)); this line computes 1/sin(sqrt(a)) instead
    c = 2 * 1/np.sin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index=df.index, columns=df.index)
for c in df2.columns:
    df2[c] = h2(df, c)
print(df2)
This should produce the following (I'm not sure I have the right answer; my goal was to vectorize the code):
Provider ID 10001 10005 10006 10007 \
Provider ID
10001 inf 5.021936e+05 5.270062e+05 1.649088e+06
10005 5.021936e+05 inf 9.294868e+05 4.985233e+05
10006 5.270062e+05 9.294868e+05 inf 4.548412e+05
10007 1.649088e+06 4.985233e+05 4.548412e+05 inf
10008 1.460299e+06 5.777248e+05 5.246954e+05 3.638231e+06
10011 6.723581e+05 2.004199e+06 1.027439e+06 6.394402e+05
10012 4.559090e+05 3.265536e+06 7.573411e+05 4.694125e+05
10016 7.680036e+05 1.429573e+06 9.105474e+05 7.517467e+05
10018 7.096548e+05 1.733554e+06 1.020976e+06 6.701920e+05
10019 5.436342e+05 9.278739e+05 2.891822e+07 4.638858e+05
Provider ID 10008 10011 10012 10016 \
Provider ID
10001 1.460299e+06 6.723581e+05 4.559090e+05 7.680036e+05
10005 5.777248e+05 2.004199e+06 3.265536e+06 1.429573e+06
10006 5.246954e+05 1.027439e+06 7.573411e+05 9.105474e+05
10007 3.638231e+06 6.394402e+05 4.694125e+05 7.517467e+05
10008 inf 7.766998e+05 5.401081e+05 9.496953e+05
10011 7.766998e+05 inf 1.341775e+06 4.220911e+06
10012 5.401081e+05 1.341775e+06 inf 1.119063e+06
10016 9.496953e+05 4.220911e+06 1.119063e+06 inf
10018 8.236437e+05 1.242451e+07 1.226941e+06 5.866259e+06
10019 5.372119e+05 1.051748e+06 7.514774e+05 9.362341e+05
Provider ID 10018 10019
Provider ID
10001 7.096548e+05 5.436342e+05
10005 1.733554e+06 9.278739e+05
10006 1.020976e+06 2.891822e+07
10007 6.701920e+05 4.638858e+05
10008 8.236437e+05 5.372119e+05
10011 1.242451e+07 1.051748e+06
10012 1.226941e+06 7.514774e+05
10016 5.866259e+06 9.362341e+05
10018 inf 1.048895e+06
10019 1.048895e+06 inf
[10 rows x 10 columns]
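The inf values on the diagonal of this output come from computing 1/np.sin(np.sqrt(a)) with a = 0, where the haversine formula calls for np.arcsin(np.sqrt(a)); the cosines in h2 are also taken of degrees rather than radians. A minimal sketch of the same one-column-at-a-time idea with both points changed, assuming float latitude/longitude columns indexed by provider ID (h2_fixed and coords are hypothetical names, not part of the original answer):

```python
import numpy as np
import pandas as pd

def h2_fixed(df, p):
    """Distance in km from provider p to every row of df (degrees in, km out)."""
    rad = np.radians(df[['latitude', 'longitude']])   # convert once, up front
    lat_p = rad.loc[p, 'latitude']
    lon_p = rad.loc[p, 'longitude']
    dlat = rad['latitude'] - lat_p
    dlon = rad['longitude'] - lon_p
    a = (np.sin(dlat / 2) ** 2
         + np.cos(rad['latitude']) * np.cos(lat_p) * np.sin(dlon / 2) ** 2)
    # arcsin, not 1/sin; 6367 km is the radius used in the question
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# Three of the providers from the question, already converted to float.
coords = pd.DataFrame(
    {'latitude': [31.215379379000467, 34.22133455500045, 34.795039606000444],
     'longitude': [-85.36146587999968, -86.15937514799964, -87.68507485299966]},
    index=pd.Index(['10001', '10005', '10006'], name='Provider ID'))

# One vectorized call per column; each call fills a whole column of the matrix.
dist = pd.DataFrame({p: h2_fixed(coords, p) for p in coords.index})
print(dist.round(1))
```

With these changes the diagonal is zero and the matrix is symmetric, which is a quick sanity check that the values are plausible distances.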
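Comments:
I changed a few things: moving inrad = df.applymap(radians) out of the function and into preprocessing means it isn't executed 4000+ times on my real data set. Then, instead of the list comprehension, data = np.empty(len(df.index)); data.fill(df.latitude[column]); lat1 = pd.Series(index=df.index, data=data) seems to be a bit faster. Thanks for the help!
@DataSwede Could you post the solution as an edit so we can see what the final answer looks like? Thanks!!
Answer 2: You don't need anything fancy, just a few modifications to your function.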
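First, don't use the math library. If you're doing real math or science, you're probably better off with numpy.
Second, we're going to use the dataframe method apply. What apply does is take a function and run every row (axis=1) or every column (axis=0) through it, building a new pandas object from all of the returned values. So we need to set up haversine to take a row of a dataframe and unpack the values. It becomes: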
def haversine(row):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    import numpy as np
    # convert all of the row to radians
    row = np.radians(row)
    # unpack the values for convenience
    lat1 = row['lat1']
    lat2 = row['lat2']
    lon1 = row['lon1']
    lon2 = row['lon2']
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km
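OK, now we need to get your dataframe into shape. In your question everything is a string, which isn't good for doing math. So using your variable d, I said:
df = pandas.DataFrame(d).set_index('Provider ID').astype(float).rename(columns={'latitude': 'lat', 'longitude': 'lon'})
This creates the dataframe of strings, sets the provider as the index, and converts all of the columns to floats since we're doing math. The columns are also renamed to lat and lon, so that once suffixes are added they match the lat1/lon1/lat2/lon2 names the haversine function looks up.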
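Now we need to build rows that hold two sets of coordinates. For that we'll use the shift method and join the result to the original dataframe. Doing it all at once looks like this: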
df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())
lat1 lon1 lat2 lon2
Provider ID
10001 31.215379 -85.361466 NaN NaN
10005 34.221335 -86.159375 31.215379 -85.361466
10006 34.795040 -87.685075 34.221335 -86.159375
10007 31.292160 -86.255399 34.795040 -87.685075
10008 31.693116 -86.265495 31.292160 -86.255399
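lsuffix and rsuffix append "1" and "2" to the column names during the join operation.
The "2" columns come from df.shift(), and you'll notice that they equal the "1" columns of the previous row. You'll also see that the first row of the "2" columns is NaN, because there is nothing previous to the first row.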
So now we can apply the haversine function:
distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001 NaN
10005 342.261590
10006 153.567591
10007 411.393751
10008 44.566642
10011 214.661170
10012 125.775583
10016 163.973219
10018 27.659157
10019 160.901128
dtype: float64
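Because the numpy functions work elementwise, the same formula also accepts whole columns at once, which avoids the per-row Python call that apply makes. A minimal sketch, assuming float columns named lat1, lon1, lat2, lon2 as in the joined frame above (haversine_cols and pairs are hypothetical names, not part of the original answer):

```python
import numpy as np
import pandas as pd

def haversine_cols(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees (scalars, arrays, or Series)."""
    lat1, lon1, lat2, lon2 = (np.radians(x) for x in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# Three of the providers from the question, paired with the previous row via shift().
pairs = pd.DataFrame(
    {'lat': [31.215379379000467, 34.22133455500045, 34.795039606000444],
     'lon': [-85.36146587999968, -86.15937514799964, -87.68507485299966]},
    index=pd.Index(['10001', '10005', '10006'], name='Provider ID'))
pairs = pairs.join(pairs.shift(), lsuffix='1', rsuffix='2')

# Column-wise: one call for the whole frame.
distance_vec = haversine_cols(pairs['lat1'], pairs['lon1'], pairs['lat2'], pairs['lon2'])
# Row-wise with apply, for comparison.
distance_row = pairs.apply(
    lambda r: haversine_cols(r['lat1'], r['lon1'], r['lat2'], r['lon2']), axis=1)
print(distance_vec)
```

Both forms should give the same Series; the column-wise call simply hands the arithmetic to numpy in one pass.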
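Comments:
Ah, you were so close! The distance needs to be from every provider ID to every other provider ID, not just to the row below it. The output should be a 10x10 matrix with the distances as values.
Ah, I see, @DataSwede. Glad you got it sorted out.
Answer 3: You should be able to operate on the whole thing at once. I'm not very familiar with Pandas, so I'll just work with the underlying numpy arrays. Using your data d: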
df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only
# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]
b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]
# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis] # shape 10x10
# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g))) # meters
Back to a pandas.DataFrame:
df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)
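I haven't tried any performance testing. There is a lot of intermediate array creation going on, and that could be expensive; using the optional output argument of the ufuncs might alleviate it.
The same thing with in-place operations: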
df = pd.DataFrame(d)
df_A = df.astype(float)
# copy so the in-place ufuncs below do not write into df_A
z = df_A.values[:, 1:].copy()
# convert to radians first, so the cosines below are taken of radians
np.radians(z, z)
# cos(Lat1) * cos(Lat2)
w = np.cos(z[:, 0])
w = w * w[:, np.newaxis]  # w.shape is (10,10)
# sin^2(dLat/2) and sin^2(dLon/2)
z = z[..., np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z, z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]
# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:, 1, :], w, z[:, 1, :])
# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis=1)
np.sqrt(z, z)
np.arcsin(z, z)
np.multiply(z, 6367000 * 2, z)  # meters
df_B = pd.DataFrame(z, index=df['Provider ID'].values, columns=df['Provider ID'].values)
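One way to gain confidence in a broadcast version is to compare it against the scalar haversine function from the question. A minimal sketch, assuming latitudes and longitudes are passed as separate 1-D sequences in degrees (pairwise_haversine and haversine_scalar are hypothetical names; the radius of 6367 km is the one used in the question):

```python
from math import radians, cos, sin, asin, sqrt
import numpy as np

def haversine_scalar(lon1, lat1, lon2, lat2):
    """Scalar reference, same formula as in the question (km)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))

def pairwise_haversine(lat_deg, lon_deg, radius=6367.0):
    """All-pairs distance matrix (km) built with numpy broadcasting."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, np.newaxis] - lat          # shape (n, n)
    dlon = lon[:, np.newaxis] - lon          # shape (n, n)
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, np.newaxis] * np.cos(lat) * np.sin(dlon / 2) ** 2)
    return radius * 2 * np.arcsin(np.sqrt(a))

# Three of the providers from the question.
lat = [31.215379379000467, 34.22133455500045, 34.795039606000444]
lon = [-85.36146587999968, -86.15937514799964, -87.68507485299966]

matrix = pairwise_haversine(lat, lon)
expected = np.array([[haversine_scalar(lon[j], lat[j], lon[i], lat[i])
                      for j in range(len(lat))] for i in range(len(lat))])
print(np.allclose(matrix, expected))
```

The double loop is only there as a cross-check on a small sample; for the real data the broadcast call alone produces the full matrix.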