Inverse Distance Weighted (IDW) Interpolation with Python
【Posted】2011-03-07 11:53:11

【Question】What is the best way to calculate inverse distance weighted (IDW) interpolation in Python, for point locations?
Some background: Currently I'm using RPy2 to interface with R and its gstat module. Unfortunately, the gstat module conflicts with arcgisscripting, which I worked around by running RPy2-based analysis in a separate process. Even if this issue is resolved in a recent/future release, and efficiency can be improved, I'd still like to remove my dependency on installing R.

The gstat website does provide a standalone executable, which would be easier to package with my python script, but I'm still hoping for a Python solution that doesn't require writing to disk multiple times and launching external processes. The number of calls to the interpolation function, for separate sets of points and values, can approach 20,000 in the processing I'm performing.

I specifically need to interpolate for points, so using the IDW function in ArcGIS to generate rasters sounds even worse than using R, performance-wise..... unless there is a way to efficiently mask out only the points I need. Even with this modification, I wouldn't expect performance to be all that great. I will look into this option as another alternative. UPDATE: The problem here is that you are tied to the cell size you are using. If you reduce the cell size to get better accuracy, processing takes a long time. You also need to follow up by extracting by points..... an all-around ugly method if you want values for specific points.

I have looked at the scipy documentation, but it doesn't look like there is a straightforward way to calculate IDW.

I'm thinking of rolling my own implementation, possibly using some of the scipy functionality to locate the nearest points and calculate distances.

Am I missing something obvious? Is there a python module I haven't seen that does exactly what I want? Is creating my own implementation with the aid of scipy a wise choice?
【Answer 1】Changed 20 Oct: this class Invdisttree combines inverse-distance weighting and scipy.spatial.KDTree. Forget the original brute-force answer; this is the method of choice for scattered-data interpolation.
""" invdisttree.py: inverse-distance-weighted interpolation using KDTree
fast, solid, local
"""
from __future__ import division
import numpy as np
from scipy.spatial import cKDTree as KDTree
# http://docs.scipy.org/doc/scipy/reference/spatial.html
__date__ = "2010-11-09 Nov" # weights, doc
#...............................................................................
class Invdisttree:
""" inverse-distance-weighted interpolation using KDTree:
invdisttree = Invdisttree( X, z ) -- data points, values
interpol = invdisttree( q, nnear=3, eps=0, p=1, weights=None, stat=0 )
interpolates z from the 3 points nearest each query point q;
For example, interpol[ a query point q ]
finds the 3 data points nearest q, at distances d1 d2 d3
and returns the IDW average of the values z1 z2 z3
(z1/d1 + z2/d2 + z3/d3)
/ (1/d1 + 1/d2 + 1/d3)
= .55 z1 + .27 z2 + .18 z3 for distances 1 2 3
q may be one point, or a batch of points.
eps: approximate nearest, dist <= (1 + eps) * true nearest
p: use 1 / distance**p
weights: optional multipliers for 1 / distance**p, of the same shape as q
stat: accumulate wsum, wn for average weights
How many nearest neighbors should one take ?
a) start with 8 11 14 .. 28 in 2d 3d 4d .. 10d; see Wendel's formula
b) make 3 runs with nnear= e.g. 6 8 10, and look at the results --
|interpol 6 - interpol 8| etc., or |f - interpol*| if you have f(q).
I find that runtimes don't increase much at all with nnear -- ymmv.
p=1, p=2 ?
p=2 weights nearer points more, farther points less.
In 2d, the circles around query points have areas ~ distance**2,
so p=2 is inverse-area weighting. For example,
(z1/area1 + z2/area2 + z3/area3)
/ (1/area1 + 1/area2 + 1/area3)
= .74 z1 + .18 z2 + .08 z3 for distances 1 2 3
Similarly, in 3d, p=3 is inverse-volume weighting.
Scaling:
if different X coordinates measure different things, Euclidean distance
can be way off. For example, if X0 is in the range 0 to 1
but X1 0 to 1000, the X1 distances will swamp X0;
rescale the data, i.e. make X0.std() ~= X1.std() .
A nice property of IDW is that it's scale-free around query points:
if I have values z1 z2 z3 from 3 points at distances d1 d2 d3,
the IDW average
(z1/d1 + z2/d2 + z3/d3)
/ (1/d1 + 1/d2 + 1/d3)
is the same for distances 1 2 3, or 10 20 30 -- only the ratios matter.
In contrast, the commonly-used Gaussian kernel exp( - (distance/h)**2 )
is exceedingly sensitive to distance and to h.
"""
# anykernel( dj / av dj ) is also scale-free
# error analysis, |f(x) - idw(x)| ? todo: regular grid, nnear ndim+1, 2*ndim
def __init__( self, X, z, leafsize=10, stat=0 ):
assert len(X) == len(z), "len(X) %d != len(z) %d" % (len(X), len(z))
self.tree = KDTree( X, leafsize=leafsize ) # build the tree
self.z = z
self.stat = stat
self.wn = 0
self.wsum = None;
def __call__( self, q, nnear=6, eps=0, p=1, weights=None ):
# nnear nearest neighbours of each query point --
q = np.asarray(q)
qdim = q.ndim
if qdim == 1:
q = np.array([q])
if self.wsum is None:
self.wsum = np.zeros(nnear)
self.distances, self.ix = self.tree.query( q, k=nnear, eps=eps )
interpol = np.zeros( (len(self.distances),) + np.shape(self.z[0]) )
jinterpol = 0
for dist, ix in zip( self.distances, self.ix ):
if nnear == 1:
wz = self.z[ix]
elif dist[0] < 1e-10:
wz = self.z[ix[0]]
else: # weight z s by 1/dist --
w = 1 / dist**p
if weights is not None:
w *= weights[ix] # >= 0
w /= np.sum(w)
wz = np.dot( w, self.z[ix] )
if self.stat:
self.wn += 1
self.wsum += w
interpol[jinterpol] = wz
jinterpol += 1
return interpol if qdim > 1 else interpol[0]
#...............................................................................
if __name__ == "__main__":
import sys
N = 10000
Ndim = 2
Nask = N # N Nask 1e5: 24 sec 2d, 27 sec 3d on mac g4 ppc
Nnear = 8 # 8 2d, 11 3d => 5 % chance one-sided -- Wendel, mathoverflow.com
leafsize = 10
eps = .1 # approximate nearest, dist <= (1 + eps) * true nearest
p = 1 # weights ~ 1 / distance**p
cycle = .25
seed = 1
exec "\n".join( sys.argv[1:] ) # python this.py N= ...
np.random.seed(seed )
np.set_printoptions( 3, threshold=100, suppress=True ) # .3f
print "\nInvdisttree: N %d Ndim %d Nask %d Nnear %d leafsize %d eps %.2g p %.2g" % (
N, Ndim, Nask, Nnear, leafsize, eps, p)
def terrain(x):
""" ~ rolling hills """
return np.sin( (2*np.pi / cycle) * np.mean( x, axis=-1 ))
known = np.random.uniform( size=(N,Ndim) ) ** .5 # 1/(p+1): density x^p
z = terrain( known )
ask = np.random.uniform( size=(Nask,Ndim) )
#...............................................................................
invdisttree = Invdisttree( known, z, leafsize=leafsize, stat=1 )
interpol = invdisttree( ask, nnear=Nnear, eps=eps, p=p )
print "average distances to nearest points: %s" % \
np.mean( invdisttree.distances, axis=0 )
print "average weights: %s" % (invdisttree.wsum / invdisttree.wn)
# see Wikipedia Zipf's law
err = np.abs( terrain(ask) - interpol )
print "average |terrain() - interpolated|: %.2g" % np.mean(err)
# print "interpolate a single point: %.2g" % \
# invdisttree( known[0], nnear=Nnear, eps=eps )
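As a quick sanity check of the docstring's .55 / .27 / .18 example -- a minimal usage sketch of my own, not part of the original answer, assuming the class above is saved as invdisttree.py (the data values are made up):

    import numpy as np
    from invdisttree import Invdisttree  # the class above

    # Three 1-d data points at x = 1, 2, 3 with values 10, 20, 30.
    X = np.array([[1.0], [2.0], [3.0]])
    z = np.array([10.0, 20.0, 30.0])
    idw = Invdisttree(X, z)

    # A query at x = 0 lies at distances 1 2 3 from the data points,
    # so the p=1 weights are ~ .55 .27 .18 and the estimate is ~ 16.4.
    print(idw(np.array([0.0]), nnear=3, p=1))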
【Comments】
- Denis, earlier you asked how many points I'd have... I'll have at most a couple of thousand source points, so it isn't a concern. This is very helpful, thank you!
- @majgis, you're welcome. N=100000, Nask=100000 take 24 sec 2d, 27 sec 3d on my old mac g4 ppc. (For interpolating 2d data to a uniform grid, matplotlib.delaunay is ~10 times faster -- see scipy.org/Cookbook/Matplotlib/Gridding_irregularly_spaced_data)
- See the warning here: "IDW is a terrible choice in almost every case...". Nonetheless, IDW may have a reasonable combination of simplicity, speed, and smoothness for your data.
- @denfromufa, creativecommons.org/licenses/by-nc-sa/3.0 is non-commercial. Is that OK?
- @denis, then it can't be integrated into scipy

【Answer 2】Edit: @Denis is right -- a linear Rbf (e.g. scipy.interpolate.Rbf with "function='linear'") isn't the same as IDW...
(Note that all of these will use excessive amounts of memory if you're using a large number of points!)

Here's a simple example of IDW:
    def simple_idw(x, y, z, xi, yi):
        dist = distance_matrix(x,y, xi,yi)

        # In IDW, weights are 1 / distance
        weights = 1.0 / dist

        # Make weights sum to one
        weights /= weights.sum(axis=0)

        # Multiply the weights for each interpolated point by all observed Z-values
        zi = np.dot(weights.T, z)
        return zi
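One caveat the original doesn't handle: if a query point coincides exactly with an observation, dist contains a zero and 1.0 / dist divides by zero. A sketch of one common guard -- the eps floor and the function name are my own, not part of the answer:

    import numpy as np

    def simple_idw_safe(x, y, z, xi, yi, eps=1e-12):
        # Same as simple_idw, but floor the distances at a tiny eps so a
        # coincident query point gets essentially all of its weight from
        # the matching observation instead of dividing by zero.
        dist = np.maximum(distance_matrix(x, y, xi, yi), eps)
        weights = 1.0 / dist
        weights /= weights.sum(axis=0)
        return np.dot(weights.T, z)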
Whereas, a linear Rbf would be:
    def linear_rbf(x, y, z, xi, yi):
        dist = distance_matrix(x,y, xi,yi)

        # Mutual pairwise distances between observations
        internal_dist = distance_matrix(x,y, x,y)

        # Now solve for the weights such that misfit at the observations is minimized
        weights = np.linalg.solve(internal_dist, z)

        # Multiply the weights for each interpolated point by the distances
        zi = np.dot(dist.T, weights)
        return zi
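One property worth noting (my addition, not in the original answer): because the weights solve internal_dist @ weights = z, linear_rbf reproduces the observed values exactly at the observation points, which plain IDW does not. A quick check, assuming linear_rbf above plus the distance_matrix helper defined just below:

    import numpy as np

    x, y, z = np.random.random((3, 10))
    # At the observations, dist equals internal_dist (a symmetric matrix),
    # so zi = internal_dist @ weights = z, up to floating-point error.
    print(np.allclose(linear_rbf(x, y, z, x, y), z))  # True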
(Using the distance_matrix function here:)
    def distance_matrix(x0, y0, x1, y1):
        obs = np.vstack((x0, y0)).T
        interp = np.vstack((x1, y1)).T

        # Make a distance matrix between pairwise observations
        # Note: from <http://stackoverflow.com/questions/1871536>
        # (Yay for ufuncs!)
        d0 = np.subtract.outer(obs[:,0], interp[:,0])
        d1 = np.subtract.outer(obs[:,1], interp[:,1])

        return np.hypot(d0, d1)
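For what it's worth, scipy ships a pairwise-distance routine that computes the same matrix; a minimal equivalent sketch (the function name is my own):

    import numpy as np
    from scipy.spatial.distance import cdist

    def distance_matrix_cdist(x0, y0, x1, y1):
        # Entry [i, j] is the Euclidean distance from observation i
        # to interpolation point j, same as distance_matrix above.
        return cdist(np.vstack((x0, y0)).T, np.vstack((x1, y1)).T)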
Putting it all together into a nice copy-paste example produces some quick comparison plots:

(three comparison images: Homemade IDW, Scipy's Rbf with function=linear, Homemade linear Rbf -- source: jkington at www.geology.wisc.edu)
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.interpolate import Rbf

    def main():
        # Setup: Generate data...
        n = 10
        nx, ny = 50, 50
        x, y, z = map(np.random.random, [n, n, n])
        xi = np.linspace(x.min(), x.max(), nx)
        yi = np.linspace(y.min(), y.max(), ny)
        xi, yi = np.meshgrid(xi, yi)
        xi, yi = xi.flatten(), yi.flatten()

        # Calculate IDW
        grid1 = simple_idw(x,y,z,xi,yi)
        grid1 = grid1.reshape((ny, nx))

        # Calculate scipy's RBF
        grid2 = scipy_idw(x,y,z,xi,yi)
        grid2 = grid2.reshape((ny, nx))

        grid3 = linear_rbf(x,y,z,xi,yi)
        print(grid3.shape)
        grid3 = grid3.reshape((ny, nx))

        # Comparisons...
        plot(x,y,z,grid1)
        plt.title('Homemade IDW')

        plot(x,y,z,grid2)
        plt.title("Scipy's Rbf with function=linear")

        plot(x,y,z,grid3)
        plt.title('Homemade linear Rbf')

        plt.show()

    def simple_idw(x, y, z, xi, yi):
        dist = distance_matrix(x,y, xi,yi)

        # In IDW, weights are 1 / distance
        weights = 1.0 / dist

        # Make weights sum to one
        weights /= weights.sum(axis=0)

        # Multiply the weights for each interpolated point by all observed Z-values
        zi = np.dot(weights.T, z)
        return zi

    def linear_rbf(x, y, z, xi, yi):
        dist = distance_matrix(x,y, xi,yi)

        # Mutual pairwise distances between observations
        internal_dist = distance_matrix(x,y, x,y)

        # Now solve for the weights such that misfit at the observations is minimized
        weights = np.linalg.solve(internal_dist, z)

        # Multiply the weights for each interpolated point by the distances
        zi = np.dot(dist.T, weights)
        return zi

    def scipy_idw(x, y, z, xi, yi):
        interp = Rbf(x, y, z, function='linear')
        return interp(xi, yi)

    def distance_matrix(x0, y0, x1, y1):
        obs = np.vstack((x0, y0)).T
        interp = np.vstack((x1, y1)).T

        # Make a distance matrix between pairwise observations
        # Note: from <http://stackoverflow.com/questions/1871536>
        # (Yay for ufuncs!)
        d0 = np.subtract.outer(obs[:,0], interp[:,0])
        d1 = np.subtract.outer(obs[:,1], interp[:,1])

        return np.hypot(d0, d1)

    def plot(x,y,z,grid):
        plt.figure()
        plt.imshow(grid, extent=(x.min(), x.max(), y.max(), y.min()))
        plt.scatter(x,y,c=z)
        plt.colorbar()

    if __name__ == '__main__':
        main()
【Comments】
- function="linear" is r, not 1/r. (Does that matter? There are half a dozen choices of function=, wildly different -- some work well for some data.)
- @Denis: it uses 1/function() to weight things. The docs could be clearer in that regard, but it wouldn't make sense any other way. Otherwise, points farther from the interpolated point would be weighted higher, and the interpolated values near a particular point would have values closest to the points farthest away. The choice of function matters (a lot!), and IDW is usually the wrong choice. However, the goal is to produce results that seem reasonable to the person doing the interpolation, so if IDW works, it works. Scipy's Rbf doesn't give you much flexibility, regardless.
- @joe, using en.wikipedia.org/wiki/Radial_basis_function for notation: phi(r) Gaussian and inverse-multiquadratic decrease with distance, the others increase?! Rbf solves Aw = z to be exact at the cj (the w coefficients can vary wildly, print rbf.A rbf.nodes); then for phi(r) = r, y(x) = sum( wj |x - cj| ), strange. Will post (mho of) inverse-distance weighting in a bit, then we can compare.
- @Joe, nice comparison and plots, ++. How about us writing a separate note somewhere on "advantages and disadvantages of RBFs"?
- @Denis, thanks! One of these days I will, but it'll have to wait until after prelims. (Grad school is a bit of a time sink!) Incidentally, I believe you were the one who e-mailed me an "interpolation rant" regarding a discussion on the scipy mailing list a while back. Sorry I never replied! Likewise, one of these days I'll write a blog post or something similar...

【Answer 3】I also needed something fast, so I started from the @joerington solution above and eventually ended up with numba.
I am always experimenting between scipy, numpy and numba, and choosing whichever is best. For this problem I use numba, since the extra temporary memory is negligible and it gives super speed.

Using numpy there is a trade-off between memory and speed. For example, on a 16 GB machine, if you want to compute the interpolation of 50000 points against 50000 other points, it will run out of memory or be incredibly slow, no matter what. (The 50000 x 50000 float64 distance matrix alone is 50000 * 50000 * 8 bytes, i.e. about 20 GB.)

So to save memory we need to use for loops, to get minimal temporary memory allocation. But writing for loops in numpy would mean giving up the possible vectorization. For that we have numba: you can add a numba jit to a function that does for loops over numpy arrays, and it will vectorize it efficiently on the hardware, with extra parallelism across multiple cores. It gives better speed for the large-array case, and you can even run it on a GPU without writing cuda.
A very simple snippet would be computing the distance matrix; in the IDW case we need the inverse distance matrix. But you can do something similar even for methods other than IDW.

Also, on custom methods for calculating the hypotenuse, I have a few points of caution here:
    import numpy as np
    import numba as nb

    @nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
    def f2(d0, d1):
        print('Numba with parallel')
        res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
        for i in nb.prange(d0.shape[0]):
            for j in range(d1.shape[0]):
                res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
        return res
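A usage sketch of my own (the array sizes, names, and the eps guard are made up, and assume the imports above): feed f2 two float64 coordinate arrays, then turn the distance matrix into normalized inverse-distance weights:

    obs = np.random.random((1000, 2))    # observation coordinates
    pts = np.random.random((2000, 2))    # query coordinates
    z = np.random.random(1000)           # observed values

    d = f2(obs, pts)                     # (1000, 2000) distance matrix
    w = 1.0 / np.maximum(d, 1e-12)       # inverse distances, guarded against 0
    w /= w.sum(axis=0)                   # weights sum to one per query point
    zi = w.T @ z                         # IDW estimates at the query points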
Also, recent numba is becoming compatible with scikit, so that's a +1.

References:
- Why np.hypot and np.subtract.outer very fast compared to vanilla broadcast? Using Numba for speedup numpy in parallel for distance matrix calculation
- Custom dtype in numpy for lattitude, longitude for faster distance matrix/krigging/IDW interpolation calculations