Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join
Posted: 2017-11-07 02:05:09
Question: I was answering these two questions and arrived at a workable solution, but I could not pass arguments through fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case I am mixing several match_funs: a custom match_fun_stringdist, plus == and <= for exact and criteria matching.
The error message I get is:
# Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...): object 'ignore_case' not found
# Data:
library(data.table, quietly = TRUE)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
# Solution:
library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin/blob/master/R/stringdist_join.R
match_fun_stringdist <- function(v1, v2, ...) {
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  dists <- stringdist::stringdist(v1, v2, method = method, ...)
  ret <- dplyr::data_frame(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}
# Call fuzzy_join
fuzzy_join(data1, data2,
           by = list(x = c("Address1", "AREACODE", "Year1"),
                     y = c("Address2", "AREA_CODE", "Year2")),
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left",
           ignore_case = FALSE,
           method = "dl",
           max_dist = 99,
           distance_col = "dist"
) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%
  top_n(1, Year2) %>%
  select(unique_id, Address1.dist, everything())
#> Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...): object 'ignore_case' not found
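Answer 1: I think the error occurs because the arguments passed to each of the multiple match_funs get mixed up, i.e. extra arguments such as ignore_case (originally meant for the stringdist match_fun) cannot also be passed into the match_fun for <=.
The solution is to define my own match_fun with fixed parameters. See below, where I define my own match_fun_stringdist with the parameters fixed inside it. I also implemented this in another question/answer: https://***.com/a/44383103/4663008.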
# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  # Can't pass these parameters in from fuzzy_join because of multiple
  # incompatible match_funs, so I set them here.
  ignore_case <- FALSE
  method <- "dl"
  max_dist <- 99
  distance_col <- "dist"

  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }

  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    dists <- rep(NA, length(v1))
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }

  ret <- dplyr::data_frame(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}
And call fuzzy_join:
fuzzy_join(data1, data2,
           by = list(x = c("Address1", "AREACODE", "Year1"),
                     y = c("Address2", "AREA_CODE", "Year2")),
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left")
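The "object not found" message can be reproduced without fuzzyjoin. The sketch below is a base-R illustration, not fuzzyjoin's internals: call_all, mf_free and mf_formal are hypothetical names, and it assumes (as the mf(..., ...) call in the error message suggests) that the extra arguments are forwarded to every match function through `...`.

```r
# Hypothetical stand-in for a caller that forwards every extra argument to
# each match function through `...`. Errors are captured as messages.
call_all <- function(match_funs, x, y, ...) {
  lapply(match_funs, function(mf) {
    tryCatch(mf(x, y, ...), error = function(e) conditionMessage(e))
  })
}

# ignore_case is a free variable here, as in the extracted function, so a
# value arriving through `...` never binds to it.
mf_free <- function(v1, v2, ...) {
  if (ignore_case) {
    v1 <- tolower(v1)
    v2 <- tolower(v2)
  }
  v1 == v2
}

# Declaring ignore_case as a formal argument lets the forwarded value bind.
mf_formal <- function(v1, v2, ignore_case = FALSE, ...) {
  if (ignore_case) {
    v1 <- tolower(v1)
    v2 <- tolower(v2)
  }
  v1 == v2
}

res <- call_all(list(mf_free, mf_formal, `<=`), "A", "a", ignore_case = TRUE)

res[[1]]  # an error message mentioning ignore_case
res[[2]]  # TRUE: the forwarded ignore_case was bound and used
res[[3]]  # an error message: `<=` does not accept a third argument
```

If that assumption holds, two separate things go wrong in the original call: the extracted function never sees ignore_case, and the operators reject the extra arguments. Fixing the parameters inside the function, as in the answer above, avoids both.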
Comments:
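I get the same error. This issue is documented here: github.com/dgrtwo/fuzzyjoin/issues/50. It turned out that it worked for me once I used backticks instead of single quotes. I added further comments to the GitHub thread showing my example code.
OK, I did use backticks; in my case, however, I was trying to supply a complex list of match_funs (match_fun = list(match_fun_stringdist, `==`, `<=`)).
How did you get it to work without defining max_dist inside match_fun_stringdist? I get the error "Error in eval_tidy(xs[[i]], unique_output) : object 'max_dist' not found", even though I define it in the fuzzy_join call as you did.
I am trying to do something similar, and I think a "function factory" might be a good option, so that you do not have to write a bunch of similar custom functions: adv-r.hadley.nz/function-factories.html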
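The function-factory idea from the last comment could look roughly like the sketch below. It is only an illustration: make_match_fun_stringdist is a hypothetical name, the length-difference shortcut from the answer is left out for brevity, and the stringdist and tibble packages are assumed to be installed.

```r
# Hypothetical factory: returns a two-argument match function with the
# parameters captured in its enclosing environment, so nothing has to be
# passed through fuzzy_join's `...`.
make_match_fun_stringdist <- function(max_dist = 2, method = "dl",
                                      ignore_case = FALSE,
                                      distance_col = NULL) {
  # Evaluate the arguments now so later changes to the caller's variables
  # do not affect the returned function.
  force(max_dist)
  force(method)
  force(ignore_case)
  force(distance_col)

  function(v1, v2) {
    if (ignore_case) {
      v1 <- tolower(v1)
      v2 <- tolower(v2)
    }
    dists <- stringdist::stringdist(v1, v2, method = method)
    ret <- tibble::tibble(include = (dists <= max_dist))
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- dists
    }
    ret
  }
}

# Two differently configured match functions from the same factory.
mf_loose <- make_match_fun_stringdist(max_dist = 99, distance_col = "dist")
mf_strict <- make_match_fun_stringdist(max_dist = 1, ignore_case = TRUE,
                                       distance_col = "dist")

out <- mf_strict(c("ABC", "abd"), c("abc", "xyz"))
print(out)

# Possible use with the data from the question (not run here):
# fuzzy_join(data1, data2,
#            by = list(x = c("Address1", "AREACODE", "Year1"),
#                      y = c("Address2", "AREA_CODE", "Year2")),
#            match_fun = list(mf_loose, `==`, `<=`),
#            mode = "left")
```

Each call to the factory yields an independent match function, so several string columns with different thresholds could be joined without writing a separate hard-coded function for each.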