Closures as solution to data merging idiom

Posted: 2011-12-09 11:35:00

Question:

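I'm trying to wrap my head around closures, and I think I've found a case where they might be useful.

I have the following pieces to work with:

    A set of regular expressions that clean state names, housed in a function

    A data.frame with state names (in the standardized form produced by the function above) and state ID codes, used to link the two (the "merge map")

The idea is that, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "Washington DC", "District of Columbia", etc.?), a single function returns the same data.frame with the state name column removed and only the state ID codes remaining. Subsequent merges can then happen consistently.

I can do this in any number of ways, but one way that seems particularly elegant is to house the merge map, the regular expressions, and the code that processes everything inside a closure (following the idea that a closure is a function with data).

Question 1: Is this a reasonable idea?

Question 2: If so, how do I do it in R?

Here is a stupidly simple function for cleaning state names that works on the example data: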

cleanStateNames <- function(x) {
  x <- tolower(x)                   # normalize case first
  x[grepl("columbia",x)] <- "DC"    # collapse District of Columbia variants
  x
}

Here is some example data that the final function will be run on:

dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", 
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia", 
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L, 
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809", 
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356", 
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340", 
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390", 
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361", 
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800", 
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597", 
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792", 
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481", 
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414", 
"9,685,744", "967,440"), class = "factor")), .Names = c("state", 
"pop08"), row.names = c(NA, 10L), class = "data.frame")

And an example merge map (the real one links FIPS codes to states, so it can't be generated this trivially):

merge_map <- data.frame(state=dat$state, id=seq(10) )

EDIT: Building on crippledlambda's answer below, here is an attempt at the function:

prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",  "california", "colorado", "connecticut", "delaware", "DC", "florida" ), id = 1:10), .Names = c("state", "id"), row.names = c(NA, -10L ), class = "data.frame")
  list(
    replace_merge_map=function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map=function() {
      merge_map
    },
    return_prepped_data.frame=function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat,merge_map)
      dat <- subset(dat,select=c(-state))
      dat
    }
  )
})

> prepForMerge$return_prepped_data.frame(dat)
        pop08 id
1   4,661,900  1
2     686,293  2
3   6,500,180  3
4   2,855,390  4
5  36,756,666  5
6   4,939,456  6
7   3,501,252  7
8     591,833  9
9     873,092  8
10 18,328,340 10

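Two problems remain before I would consider this question solved:

    Calling prepForMerge$return_prepped_data.frame(dat) every time is painful. Is there any way to have a default function so that I can just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but maybe there is at least a convention for a default function.

    How do I avoid mixing data and code in the merge_map definition? Ideally I would clean merge_map elsewhere, then just grab it and store it inside the closure.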

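Comments:

This doesn't seem like a natural place to use a closure to me.

@hadley Could you post that as an answer (since it falls within the scope of Question 1), and perhaps elaborate on what makes this a non-ideal situation for a closure (is it that the data don't really need to be updated?)? I'm doing fine on the "how" of closures but struggling with "why/when" to use them.

Answer 1: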

I may be missing the point of your question, but here is one way you could use a closure:

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   function(patt,newtext) {
+     statenames <- tolower(statenames)
+     statenames[grepl(patt,statenames)] <- newtext
+     statenames
+   }
+ })
> 
> replaceStateNames("columbia","DC")
 [1] "alabama"     "alaska"      "arizona"     "arkansas"    "california" 
 [6] "colorado"    "connecticut" "delaware"    "DC"          "florida"    
> replaceStateNames("alaska","palincountry")
 [1] "alabama"              "palincountry"         "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "florida"             
> replaceStateNames("florida","jebbushland")
 [1] "alabama"              "alaska"               "arizona"             
 [4] "arkansas"             "california"           "colorado"            
 [7] "connecticut"          "delaware"             "district of columbia"
[10] "jebbushland"    
> 

More generally, you can replace statenames with your data frame definition and return a function (or a list of functions) that uses this data frame without it being passed as an argument in the function call. An example (note that I've used the ignore.case=TRUE argument in grepl):

> replaceStateNames <- local({
+   statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas", 
+                   "California", "Colorado", "Connecticut", "Delaware",
+                   "District of Columbia", "Florida")
+   list(justreturn=function(patt,newtext) {
+     statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+     statenames
+   },reassign=function(patt,newtext) {
+     statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+     statenames
+   })
+ })

Just like the first example:

> replaceStateNames$justreturn("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Just return the lexically scoped value of statenames, to check that the original value is unchanged:

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"              "Alaska"               "Arizona"             
 [4] "Arkansas"             "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"             "District of Columbia"
[10] "Florida"             

Do the same thing, but make the change "permanent":

> replaceStateNames$reassign("columbia","DC")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

Note that the value of statenames attached to these functions has now changed.

> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
 [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California" 
 [6] "Colorado"    "Connecticut" "Delaware"    "DC"          "Florida"    

In any case, you can replace statenames with a data frame, and replace these simple functions with a "merge map" or any other mapping you like.
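As a rough sketch of that substitution, the closure below holds a small data frame instead of a character vector. The names makeStateLookup, example_map and lookup_id, as well as the three-row map, are made up for illustration and are not part of the question's data.

```r
## Hypothetical sketch: a closure holding a data frame "merge map".
## makeStateLookup and the three-row map are illustrative assumptions.
makeStateLookup <- function(map) {
  force(map)  # evaluate the argument now so the closure keeps this value
  function(states) {
    ## match() returns NA for names that are absent from the map
    map$id[match(tolower(states), map$state)]
  }
}

example_map <- data.frame(state = c("alabama", "alaska", "arizona"),
                          id = 1:3, stringsAsFactors = FALSE)
lookup_id <- makeStateLookup(example_map)
lookup_id(c("Alaska", "ARIZONA", "Narnia"))
## [1]  2  3 NA
```

The map never appears in the call to lookup_id; it travels with the function through its enclosing environment.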

Edit

Speaking of "merge", is this what you are looking for? An implementation of the first ?merge example using a closure:

> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+                       nationality = c("US", "Australia", "US", "UK", "Australia"),
+                       deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+                       "Ripley", "Ripley", "McNeil", "R Core")),
+                     title = c("Exploratory Data Analysis",
+                       "Modern Applied Statistics ...",
+                       "LISP-STAT",
+                       "Spatial Statistics", "Stochastic Simulation",
+                       "Interactive Data Analysis",
+                       "An Introduction to R"),
+                     other.author = c(NA, "Ripley", NA, NA, NA, NA,
+                       "Venables & Smith"))
> 
> mergewithauthors <- with(list(authors=authors),function(books) 
+   merge(authors, books, by.x = "surname", by.y = "name"))
> 
> mergewithauthors(books)
   surname nationality deceased                         title other.author
1   McNeil   Australia       no     Interactive Data Analysis         <NA>
2   Ripley          UK       no            Spatial Statistics         <NA>
3   Ripley          UK       no         Stochastic Simulation         <NA>
4  Tierney          US       no                     LISP-STAT         <NA>
5    Tukey          US      yes     Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics ...       Ripley

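Edit 2

To read a file into an object that will be lexically bound, you can do any of the following:

fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ...
  }
})

fn <- with(list(data=read.csv("filename.csv")),
     function(...) {
       ...
     }
   )

## note: with() on a data frame exposes its columns, not `data`, to the function
fn <- with(local(data <- read.csv("filename.csv")),
     function(...) {
       ...
     }
   )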

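and so on. (I'm assuming function(...) will have something to do with your "merge_map".) You could also use evalq instead of local. To "bring in" objects that reside in the global space (or an enclosing environment), you can just do the following:

globalobj <- value      ## could be from read.csv()
fn <- local({
  localobj <- globalobj ## if globalobj is not locally defined, 
                        ## R will look in enclosing environment
                        ## in this case, the globalenv()
  function(...) {
    ...
  }
})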

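Modifying globalobj afterwards will then not change the localobj attached to the function (since almost (?) everything in R follows pass-by-value semantics). You can also use with instead of local, as in the examples above.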
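For the merge map in the question, one possible way to combine these ideas is a small factory function that takes an already-cleaned merge map and returns a single function, so that it can be called directly as prepForMerge(dat). This is only a sketch under assumptions: makePrepForMerge is a hypothetical name, the two-row data are invented, and a compact cleanStateNames is repeated so the example runs on its own.

```r
## Hypothetical sketch: makePrepForMerge is an assumed name, not from the post.
cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia", x)] <- "DC"
  x
}

makePrepForMerge <- function(merge_map) {
  force(merge_map)  # capture the current value of merge_map in the closure
  function(dat) {
    dat$state <- cleanStateNames(dat$state)
    dat <- merge(dat, merge_map, by = "state")
    ## drop the state column, keep everything else (including id)
    dat[, setdiff(names(dat), "state"), drop = FALSE]
  }
}

## merge_map is built and cleaned elsewhere, then handed to the factory
merge_map <- data.frame(state = c("alabama", "DC"), id = c(1L, 9L),
                        stringsAsFactors = FALSE)
prepForMerge <- makePrepForMerge(merge_map)

## Changing the global object afterwards does not affect the captured value
merge_map$id <- c(100L, 900L)

toy <- data.frame(state = c("Alabama", "District of Columbia"),
                  pop = c(10, 20), stringsAsFactors = FALSE)
res <- prepForMerge(toy)
```

Here res keeps the pop column and gains the ids 1 and 9 from the captured map, not the later 100 and 900. Because the factory returns one function rather than a list, there is no `$` accessor to type, at the cost of losing helpers such as show_merge_map.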

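Comments:

Great answer, thank you. You are loading the data as code. I assume that would be done from a file you have just read with read.csv, or whatever, inside the closure, but what if you want to store a copy of an existing object in the function? E.g. merge_map has already been created in earlier code and you want to bring it into the closure?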
