How to correct a list of misspellings in R all at once


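I have a complete list of misspellings, and I want to fix all of them in one go. Is there a simple way to do this without writing a long chain of ifelse statements?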

vegas <-  c("North Las Vegas","N Las Vegas", "LAS VEGAS", "Las vegas","N. Las Vegas", "las vegas", "Las  Vegas", "Las Vegas ", "South Las Vegas", "La Vegas", "Las Vegas, NV", "LasVegas",
"110 Las Vegas", "C Las Vegas", "Henderson and Las vegas",
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada", 
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass", 
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", "NORTH LAS VEGAS", "North Las Vegas ", "Vegas")

data <- structure(list(city = c("Las Vegas", "Henderson", "North Las Vegas", 
"Boulder City", "N Las Vegas", "Paradise", "LAS VEGAS", "Nellis AFB", 
"Las vegas", "Blue Diamond", "N. Las Vegas", "Summerlin", "Spring Valley", 
"HENDERSON", "las vegas", "Enterprise", "Las  Vegas", "Clark", 
"Las Vegas ", "Nellis Air Force Base", "South Las Vegas", "henderson", 
"Nellis Afb", "La Vegas", "Las Vegas, NV", "LasVegas", "Summerlin South", 
"110 Las Vegas", "Black Rock City", "boulder city", "C Las Vegas", 
"Centennial Hills", "Central Henderson", "Citibank", "City Center", 
"Decatur", "Green Valley", "Henderson (Green Valley)", "Henderson and Las vegas", 
"Henderston", "Hendserson", "Hnederson", "Lake Las Vegas", "Lake Mead", 
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada", 
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass", 
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", 
"Nellis", "NELLIS AFB", "Nevada", "NORTH LAS VEGAS", "North Las Vegas ", 
"Pahrump", "Seven Hills", "Sunrise", "Sunrise Manor", "Vegas", 
"W Henderson", "W Spring Valley", "Whitney"), count = c(29361L, 
4892L, 1547L, 269L, 26L, 24L, 19L, 16L, 14L, 12L, 12L, 11L, 9L, 
8L, 8L, 7L, 5L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -69L), class = c("tbl_df", 
"tbl", "data.frame"))

So every row containing one of these misspellings should end up with the correct spelling, "Las Vegas".

Answer

Here is a solution that is very similar to the proposed mgsub approach, but uses only base R functions (you may also want to add "Lake Las Vegas" to the list):

vegas <-  c("North Las Vegas","N Las Vegas", "LAS VEGAS", "Las vegas","N. Las Vegas", "las vegas", "Las  Vegas", "Las Vegas ", "South Las Vegas", "La Vegas", "Las Vegas, NV", "LasVegas",
    "110 Las Vegas", "C Las Vegas", "Henderson and Las vegas",
    "las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada", 
    "Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass", 
    "Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", "NORTH LAS VEGAS", "North Las Vegas ", "Vegas")

data <- structure(list(city = c("Las Vegas", "Henderson", "North Las Vegas", 
    "Boulder City", "N Las Vegas", "Paradise", "LAS VEGAS", "Nellis AFB", 
    "Las vegas", "Blue Diamond", "N. Las Vegas", "Summerlin", "Spring Valley", 
    "HENDERSON", "las vegas", "Enterprise", "Las  Vegas", "Clark", 
    "Las Vegas ", "Nellis Air Force Base", "South Las Vegas", "henderson", 
    "Nellis Afb", "La Vegas", "Las Vegas, NV", "LasVegas", "Summerlin South", 
    "110 Las Vegas", "Black Rock City", "boulder city", "C Las Vegas", 
    "Centennial Hills", "Central Henderson", "Citibank", "City Center", 
    "Decatur", "Green Valley", "Henderson (Green Valley)", "Henderson and Las vegas", 
    "Henderston", "Hendserson", "Hnederson", "Lake Las Vegas", "Lake Mead", 
    "las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada", 
    "Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass", 
    "Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", 
    "Nellis", "NELLIS AFB", "Nevada", "NORTH LAS VEGAS", "North Las Vegas ", 
    "Pahrump", "Seven Hills", "Sunrise", "Sunrise Manor", "Vegas", 
    "W Henderson", "W Spring Valley", "Whitney"), count = c(29361L, 
        4892L, 1547L, 269L, 26L, 24L, 19L, 16L, 14L, 12L, 12L, 11L, 9L, 
        8L, 8L, 7L, 5L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -69L), class = c("tbl_df", 
            "tbl", "data.frame"))

## takes a list of c(pattern, replacement) pairs and applies each pair to `string` with gsub();
## extra arguments in ... are passed on to gsub()
multisub <- function(replacement.list, string, ...) {
    mygsub <- function(l, x) gsub(pattern = l[1], replacement = l[2], x, ...)
    Reduce(mygsub, replacement.list, init = string, right = TRUE)
}

## make sure the matches correspond to entire string by adding delimiters
vegas <- paste0("^", vegas, "$")

## generate replacement list
## (one c(pattern, replacement) pair per misspelling)
mylist <- lapply(vegas, function(p) c(p, "Las Vegas"))

## perform multiple replacement
data$city_replaced <- multisub(mylist, data$city)
data
#>                        city count            city_replaced
#> 1                 Las Vegas 29361                Las Vegas
#> 2                 Henderson  4892                Henderson
#> 3           North Las Vegas  1547                Las Vegas
#> 4              Boulder City   269             Boulder City
#> 5               N Las Vegas    26                Las Vegas
#> 6                  Paradise    24                 Paradise
#> 7                 LAS VEGAS    19                Las Vegas
#> 8                Nellis AFB    16               Nellis AFB
#> 9                 Las vegas    14                Las Vegas
#> 10             Blue Diamond    12             Blue Diamond
#> 11             N. Las Vegas    12                Las Vegas
#> 12                Summerlin    11                Summerlin
#> 13            Spring Valley     9            Spring Valley
#> 14                HENDERSON     8                HENDERSON
#> 15                las vegas     8                Las Vegas
#> 16               Enterprise     7               Enterprise
#> 17               Las  Vegas     5                Las Vegas
#> 18                    Clark     4                    Clark
#> 19               Las Vegas      4                Las Vegas
#> 20    Nellis Air Force Base     4    Nellis Air Force Base
#> 21          South Las Vegas     4                Las Vegas
#> 22                henderson     3                henderson
#> 23               Nellis Afb     3               Nellis Afb
#> 24                 La Vegas     2                Las Vegas
#> 25            Las Vegas, NV     2                Las Vegas
#> 26                 LasVegas     2                Las Vegas
#> 27          Summerlin South     2          Summerlin South
#> 28            110 Las Vegas     1                Las Vegas
#> 29          Black Rock City     1          Black Rock City
#> 30             boulder city     1             boulder city
#> 31              C Las Vegas     1                Las Vegas
#> 32         Centennial Hills     1         Centennial Hills
#> 33        Central Henderson     1        Central Henderson
#> 34                 Citibank     1                 Citibank
#> 35              City Center     1              City Center
#> 36                  Decatur     1                  Decatur
#> 37             Green Valley     1             Green Valley
#> 38 Henderson (Green Valley)     1 Henderson (Green Valley)
#> 39  Henderson and Las vegas     1                Las Vegas
#> 40               Henderston     1               Henderston
#> 41               Hendserson     1               Hendserson
#> 42                Hnederson     1                Hnederson
#> 43           Lake Las Vegas     1           Lake Las Vegas
#> 44                Lake Mead     1                Lake Mead
#> 45                las Vegas     1                Las Vegas
#> 46    Las Vegas & Henderson     1                Las Vegas
#> 47           Las Vegas East     1                Las Vegas
#> 48         Las Vegas Nevada     1                Las Vegas
#> 49             Las Vegas NV     1                Las Vegas
#> 50         Las Vegas Valley     1                Las Vegas
#> 51               Las Vegas,     1                Las Vegas
#> 52               Las Vegass     1                Las Vegas
#> 53               Las Vergas     1                Las Vegas
#> 54                Los Vegas     1                Las Vegas
#> 55            N E Las Vegas     1                Las Vegas
#> 56            N W Las Vegas     1                Las Vegas
#> 57                   Nellis     1                   Nellis
#> 58               NELLIS AFB     1               NELLIS AFB
#> 59                   Nevada     1                   Nevada
#> 60          NORTH LAS VEGAS     1                Las Vegas
#> 61         North Las Vegas      1                Las Vegas
#> 62                  Pahrump     1                  Pahrump
#> 63              Seven Hills     1              Seven Hills
#> 64                  Sunrise     1                  Sunrise
#> 65            Sunrise Manor     1            Sunrise Manor
#> 66                    Vegas     1                Las Vegas
#> 67              W Henderson     1              W Henderson
#> 68          W Spring Valley     1          W Spring Valley
#> 69                  Whitney     1                  Whitney

Created on 2020-03-10 by the reprex package (v0.3.0)

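Edit: With the approach above you can concatenate several replacement lists and apply them all in one call. It also supports partial matching, although we explicitly turned that off here by anchoring the patterns with vegas <- paste0("^", vegas, "$").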
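
As a minimal sketch of that idea, the example below chains two replacement lists and then shows what an unanchored pattern does. The `henderson` vector, the shortened `vegas_small` vector, and the `make_pairs` helper are assumptions made up for illustration; only `multisub` comes from the answer.

```r
# Same helper as in the answer: apply each c(pattern, replacement) pair in turn
multisub <- function(replacement.list, string, ...) {
    mygsub <- function(l, x) gsub(pattern = l[1], replacement = l[2], x, ...)
    Reduce(mygsub, replacement.list, init = string, right = TRUE)
}

# Hypothetical helper: anchor each variant and pair it with the correct spelling
make_pairs <- function(variants, correct) {
    lapply(paste0("^", variants, "$"), function(p) c(p, correct))
}

# Illustrative misspelling lists (a shortened Vegas list, an assumed Henderson list)
vegas_small <- c("LAS VEGAS", "Las vegas", "Los Vegas")
henderson   <- c("HENDERSON", "henderson", "Henderston", "Hendserson", "Hnederson")

# Concatenate both lists and replace everything in one call
pairs  <- c(make_pairs(vegas_small, "Las Vegas"), make_pairs(henderson, "Henderson"))
cities <- c("Las vegas", "Hnederson", "Los Vegas", "Boulder City", "Lake Las Vegas")
fixed  <- multisub(pairs, cities)

# Without the ^...$ anchors a pattern also matches inside a longer string
partial <- multisub(list(c("Vegass", "Vegas")), "Lake Las Vegass")
```

Because the patterns are regular expressions, characters such as "." or "(" in a misspelling act as metacharacters unless they are escaped or `fixed = TRUE` is passed through `...` (in which case the `^` and `$` anchors no longer work).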

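If you have only one city and a list of its alternative spellings, you can also simply match them and replace them directly (using the original data data.frame and the original vegas vector, before the "^" and "$" anchors were added):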

data$city[data$city %in% vegas] <- "Las Vegas"
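
The same exact-match idea can be stretched to several cities with a lookup vector. This is only a sketch: the `variants` list and the short `cities` vector are assumptions invented for the example, not data from the question.

```r
cities <- c("LAS VEGAS", "Hnederson", "Boulder City", "Los Vegas", "henderson")

# Hypothetical lookup: list names are the correct spellings, values are known variants
variants <- list(
    "Las Vegas" = c("LAS VEGAS", "Los Vegas"),
    "Henderson" = c("Hnederson", "henderson")
)

# Flatten into a named vector: name = variant, value = correct spelling
lookup <- setNames(
    rep(names(variants), lengths(variants)),
    unlist(variants, use.names = FALSE)
)

# Replace only the entries that appear in the lookup; everything else is untouched
hit <- cities %in% names(lookup)
cities[hit] <- unname(lookup[cities[hit]])
```

Since this compares whole strings rather than regular expressions, punctuation in a variant needs no escaping, but a variant such as "Las Vegas " with a trailing space has to be listed explicitly or trimmed beforehand.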

Another answer

I don't fully understand your example, but you could use the Levenshtein distance to check for close matches (such as misspellings). For R, see …
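
One possible way to try this in base R is `adist()` from the utils package, which computes a generalized Levenshtein distance. The sketch below is not taken from the answer; the short `cities` vector and the threshold `max_dist` are assumptions and would need tuning on real data.

```r
cities <- c("Las Vegas", "Las Vegass", "Las Vergas", "Los Vegas",
            "Henderson", "Lake Las Vegas")
target <- "Las Vegas"

# adist() returns a distance matrix (rows = cities, one column for the target);
# drop() turns the single column into a plain vector
d <- drop(adist(cities, target))

# Assumption: at most 2 edits counts as a misspelling of the target
max_dist <- 2
corrected <- cities
corrected[d <= max_dist] <- target
```

A distance threshold catches typos that were never listed, but it can also merge names that are genuinely different, so the matched values are worth reviewing before overwriting the column.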
