Create neighborhood list of large dataset / fasten up

Posted 2023-04-18

Original title: Create neighborhood list of large dataset / fasten up
Originally posted: 2018-01-02 11:40:19

Question:

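I want to create a weight matrix based on distance. My current code is shown below and works for smaller data samples. For the large dataset (569,424 individuals in 24,077 locations), however, it does not run through. The problem arises in the nb2blocknb function. So my question is: how can I optimize my code for large datasets?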

library(spdep)  # provides dnearneigh, nb2blocknb and nb2listw

# load all survey data
DHS <- read.csv("Daten/final.csv")

# define coordinates matrix with one row per unique location
coormat <- cbind(DHS$location, DHS$lon_s, DHS$lat_s)
colnames(coormat) <- c("location", "lon_s", "lat_s")
loc <- as.data.frame(unique(coormat))
coor <- cbind(loc$lon_s, loc$lat_s)

# get a list of neighboring locations that are within 50 km of each other
neighbor <- dnearneigh(coor, d1 = 0, d2 = 50, row.names = loc$location, longlat = TRUE, bound = c("GE", "LE"))

# get neighborhood list on individual level
nb <- nb2blocknb(neighbor, as.character(DHS$location))

# weight matrix in list format
nbweights.lw <- nb2listw(nb, style = "B", zero.policy = TRUE)

Thank you very much for your help!

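Comments:

Some related Q&As: "How to assign several names to lat-lon observations" and "Geographic distance between 2 lists of lat/lon coordinates".

Where do the dnearneigh and nb2blocknb functions come from? Please also specify the packages used.

They are from the spdep package.

Answer 1: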

You are trying to perform about 1.3e10 distance calculations. The result would take up gigabytes of memory.
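As a rough check of that estimate, here is a small back-of-the-envelope calculation. It assumes the figure comes from pairing each of the 569,424 individuals with each of the 24,077 locations, and it assumes 8 bytes per stored distance (an R double); both are assumptions, not statements from the answer.

```r
# Sizes taken from the question
n_individuals <- 569424
n_locations   <- 24077

# Assumption: one distance per (individual, location) pair
n_pairs <- n_individuals * n_locations

# Assumption: each distance is stored as an 8-byte double
bytes_needed <- n_pairs * 8
gb_needed    <- bytes_needed / 1e9

format(n_pairs, scientific = TRUE, digits = 3)
round(gb_needed)
```

Under these assumptions the dense result would be on the order of 100 GB, which is why restricting the search is needed.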

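I think you want to limit either the maximum distance or the number of nearest neighbours you are looking for. Try nn2 from the RANN package:

   library('RANN')
   nearest_neighbours_w_distance <- nn2(coordinatesA, coordinatesB, 10)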

Note that this operation is not symmetric (swapping coordinatesA and coordinatesB gives different results).
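A minimal sketch of that asymmetry, using made-up toy coordinates (the values and the names a_near_b and b_near_a are hypothetical). In nn2 the first argument is the data set that is searched and the second is the set of query points, so the result has one row per query point.

```r
library(RANN)

# Toy coordinates (hypothetical values, planar units)
coordinatesA <- matrix(c(0, 0,
                         1, 0,
                         5, 5), ncol = 2, byrow = TRUE)
coordinatesB <- matrix(c(0.1, 0,
                         4,   4), ncol = 2, byrow = TRUE)

# For each row of coordinatesB, find the 2 nearest rows of coordinatesA
a_near_b <- nn2(coordinatesA, coordinatesB, k = 2)

# Swapped: for each row of coordinatesA, find the 2 nearest rows of coordinatesB
b_near_a <- nn2(coordinatesB, coordinatesA, k = 2)

dim(a_near_b$nn.idx)  # one row per point in coordinatesB
dim(b_near_a$nn.idx)  # one row per point in coordinatesA
```

The indices in nn.idx refer to rows of the first argument, so the two results are not interchangeable.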

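You also have to first convert your GPS coordinates to a coordinate reference system in which Euclidean distances can be calculated, for example UTM (code not tested):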

   library("sp")
   # Convert a two-column (lon, lat) matrix of GPS coordinates to UTM
   gps2utm <- function(gps_coordinates_matrix, utmzone) {
      pts <- SpatialPoints(gps_coordinates_matrix[, 1:2],
                           proj4string = CRS("+proj=longlat +datum=WGS84"))
      # the "+" before ellps was missing in the original string
      return(spTransform(pts, CRS(paste0("+proj=utm +zone=", utmzone, " +ellps=WGS84"))))
   }
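Once coordinates are in metres, limiting the maximum distance could look like the sketch below. It is only an illustration under stated assumptions: utm_xy holds made-up projected coordinates, max_k is an assumed upper bound on neighbours per point (it has to be large enough for real data), and the 50 km radius is taken from the question. It relies on the searchtype = "radius" option of nn2, which marks missing neighbours with index 0.

```r
library(RANN)

# Hypothetical projected coordinates in metres (e.g. UTM easting/northing)
utm_xy <- matrix(c(     0,      0,
                    30000,      0,
                    30000,  30000,
                   200000, 200000), ncol = 2, byrow = TRUE)

max_k  <- 3       # assumed cap on neighbours per point, the point itself included
radius <- 50000   # 50 km expressed in metres

# Fixed-radius search: at most max_k neighbours within the radius of each point
res <- nn2(utm_xy, utm_xy, k = max_k, searchtype = "radius", radius = radius)

# nn.idx is 0 where no neighbour was found; drop those and the point itself
neighbours <- lapply(seq_len(nrow(utm_xy)), function(i) {
  idx <- res$nn.idx[i, ]
  idx[idx != 0 & idx != i]
})

neighbours
```

The resulting list holds, for each location, the row indices of the other locations within 50 km; the fourth toy point is isolated and gets an empty entry.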
