R | Learning a Bit of stringr and Regular Expressions
Posted R语言与SPSS
########### Automated data collection with R
##### Regular expressions and basic string functions
library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## √ ggplot2 2.2.1 √ purrr 0.2.4
## √ tibble 1.4.2 √ dplyr 0.7.4
## √ tidyr 0.7.2 √ stringr 1.2.0
## √ readr 1.1.1 √ forcats 0.2.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
raw.data <- "555-1234Moe Szyslak(636) 555-0113Burns,C. Montgomery555-6542Rev.Timothy Lovejoy555 8904Ned Flanders 636-555-3226Simpson,Homer5553642 DR. Julius Hibbert "
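# Extract the names. [:alpha:] matches any letter; for this ASCII-only text that means a-z and A-Z (upper and lower case).
# Start by matching letters only.
# str_extract_all() returns a list; unlist() flattens it into a character vector, which is convenient when extracting from a single string in practice.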
unlist(str_extract_all(raw.data,"[:alpha:]"))
## [1] "M" "o" "e" "S" "z" "y" "s" "l" "a" "k" "B" "u" "r" "n" "s" "C" "M"
## [18] "o" "n" "t" "g" "o" "m" "e" "r" "y" "R" "e" "v" "T" "i" "m" "o" "t"
## [35] "h" "y" "L" "o" "v" "e" "j" "o" "y" "N" "e" "d" "F" "l" "a" "n" "d"
## [52] "e" "r" "s" "S" "i" "m" "p" "s" "o" "n" "H" "o" "m" "e" "r" "D" "R"
## [69] "J" "u" "l" "i" "u" "s" "H" "i" "b" "b" "e" "r" "t"
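# A minimal sketch of why unlist() is used above. The two-element vector `x` is a made-up input, not part of the original data.
# It shows that str_extract_all() returns one list element per input string, and what flattening does to that structure.

```r
library(stringr)

# Hypothetical input: two strings instead of one
x <- c("555-1234Moe", "Ned 8904")

res <- str_extract_all(x, "[[:alpha:]]+")
is.list(res)   # TRUE: the result is a list
lengths(res)   # number of matches found in each input string

# unlist() flattens everything into a single character vector,
# so the link between a match and its source string is lost
flat <- unlist(res)
flat
```

# With a single input string, as in this post, flattening loses nothing; with several strings, keeping the list may be preferable.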
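# Now try the * quantifier. * means {0,} (zero or more), so at every position that holds a non-letter the pattern still succeeds with a zero-length match.
# As a result, each digit, space, and punctuation character shows up in the output as an empty string "".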
unlist(str_extract_all(raw.data,"[:alpha:]*"))
## [1] "" "" "" "" ""
## [6] "" "" "" "Moe" ""
## [11] "Szyslak" "" "" "" ""
## [16] "" "" "" "" ""
## [21] "" "" "" "" ""
## [26] "Burns" "" "C" "" ""
## [31] "Montgomery" "" "" "" ""
## [36] "" "" "" "" "Rev"
## [41] "" "Timothy" "" "Lovejoy" ""
## [46] "" "" "" "" ""
## [51] "" "" "Ned" "" "Flanders"
## [56] "" "" "" "" ""
## [61] "" "" "" "" ""
## [66] "" "" "" "Simpson" ""
## [71] "Homer" "" "" "" ""
## [76] "" "" "" "" "DR"
## [81] "" "" "Julius" "" "Hibbert"
## [86] "" ""
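# The zero-length matches are easier to see on a short string. This is a small sketch; the string `s` is invented for illustration.
# It compares * and + on the same input and counts the empty matches.

```r
library(stringr)

# Small illustrative string: two digits followed by two letters
s <- "12ab"

star <- str_extract_all(s, "[[:alpha:]]*")[[1]]
plus <- str_extract_all(s, "[[:alpha:]]+")[[1]]

star              # contains "" wherever no letter could be matched
plus              # only the run of letters
sum(star == "")   # how many zero-length matches * produced
```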
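# Use + instead. + means {1,} (one or more), so only runs of at least one letter are matched.
# The empty strings are gone and every word is extracted, but words that belong to the same name are not joined together.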
unlist(str_extract_all(raw.data,"[:alpha:]+"))
## [1] "Moe" "Szyslak" "Burns" "C" "Montgomery"
## [6] "Rev" "Timothy" "Lovejoy" "Ned" "Flanders"
## [11] "Simpson" "Homer" "DR" "Julius" "Hibbert"
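# A short check of the quantifier notation, using only the first part of the text (the shortened string `s` is just for brevity).
# It confirms that + and {1,} give the same matches, and shows how a higher lower bound changes the result.

```r
library(stringr)

s <- "555-1234Moe Szyslak(636) 555-0113Burns,C. Montgomery"

a <- str_extract_all(s, "[[:alpha:]]+")[[1]]
b <- str_extract_all(s, "[[:alpha:]]{1,}")[[1]]
identical(a, b)   # + is shorthand for {1,}

# {2,} requires at least two letters, so the lone initial "C" is dropped
c2 <- str_extract_all(s, "[[:alpha:]]{2,}")[[1]]
c2
```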
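# Why are they not joined? Because the names contain spaces!
# Add a space to the character class. Here I typed a literal space with the space bar; \s also works.
# Note that we now match "a letter or a space", so [:alpha:] has to sit inside an outer pair of brackets, [[:alpha:] ]; the space cannot be added inside [:alpha:] itself.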
unlist(str_extract_all(raw.data,"[[:alpha:] ]+"))
## [1] "Moe Szyslak" " " "Burns"
## [4] "C" " Montgomery" "Rev"
## [7] "Timothy Lovejoy" " " "Ned Flanders "
## [10] "Simpson" "Homer" " DR"
## [13] " Julius Hibbert "
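# Using \s for whitespace brings up the question of escaping.
# In short: in an R string literal the backslash is itself an escape character, so \s and \d must be written as "\\s" and "\\d". A bare "\s" is not a valid escape sequence, and R stops with an error.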
unlist(str_extract_all(raw.data,"[[:alpha:]\\s]+"))
## [1] "Moe Szyslak" " " "Burns"
## [4] "C" " Montgomery" "Rev"
## [7] "Timothy Lovejoy" " " "Ned Flanders "
## [10] "Simpson" "Homer" " DR"
## [13] " Julius Hibbert "
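# A small sketch of what the doubled backslash does. The raw-string line assumes R 4.0.0 or later; the string `tabbed` is invented.
# It also shows that \s is broader than a literal space, which matters if the text contains tabs or newlines.

```r
library(stringr)

# "\\s" in R source is a two-character string: a backslash and an s
nchar("\\s")
writeLines("\\s")   # shows what the regex engine actually receives

# Raw strings (assumes R >= 4.0.0) avoid the doubled backslash
identical("\\s", r"(\s)")

# \s covers more than the space typed with the space bar, e.g. tabs
tabbed <- "Moe\tSzyslak"
with_space <- str_extract(tabbed, "[[:alpha:] ]+")     # stops at the tab
with_class <- str_extract(tabbed, "[[:alpha:]\\s]+")   # continues across the tab
with_space
with_class
```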
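# "Moe Szyslak" and "Timothy Lovejoy" now come out as wanted, but the other names do not. What is wrong?
# Some names, such as "Burns,C. Montgomery", contain both a comma and a period, and these two characters are what splits the matches.
# So the next step is to add , and . to the character class alongside whitespace.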
unlist(str_extract_all(raw.data,"[[:alpha:]\\s,.]+"))
## [1] "Moe Szyslak" " " "Burns,C. Montgomery"
## [4] "Rev.Timothy Lovejoy" " " "Ned Flanders "
## [7] "Simpson,Homer" " DR. Julius Hibbert "
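# The names are what we want, but why do matches consisting of a single space appear?
# Look at the text "555-1234Moe Szyslak(636) 555-0113Burns,C. Montgomery": there is a space between "(636)" and "555", and it is matched on its own.
# Remove that space from the original text.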
raw.data2 <- "555-1234Moe Szyslak(636)555-0113Burns,C. Montgomery555-6542Rev.Timothy Lovejoy555 8904Ned Flanders 636-555-3226Simpson,Homer5553642 DR. Julius Hibbert "
unlist(str_extract_all(raw.data2,"[[:alpha:]\\s,.]+"))
## [1] "Moe Szyslak" "Burns,C. Montgomery" "Rev.Timothy Lovejoy"
## [4] " " "Ned Flanders " "Simpson,Homer"
## [7] " DR. Julius Hibbert "
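# Instead of reading the raw text by eye, the stray matches can be located programmatically. A sketch using str_locate_all() on the original raw.data; the variable names are illustrative.
# It lists the positions of the whitespace-only matches and the three characters in front of each one.

```r
library(stringr)

raw.data <- "555-1234Moe Szyslak(636) 555-0113Burns,C. Montgomery555-6542Rev.Timothy Lovejoy555 8904Ned Flanders 636-555-3226Simpson,Homer5553642 DR. Julius Hibbert "
pattern <- "[[:alpha:]\\s,.]+"

hits <- str_extract_all(raw.data, pattern)[[1]]
loc  <- str_locate_all(raw.data, pattern)[[1]]   # matrix with start/end columns

# Positions of the matches that contain nothing but whitespace
blank <- str_trim(hits) == ""
loc[blank, , drop = FALSE]

# The three characters before each whitespace-only match show its context
context <- str_sub(raw.data, loc[blank, "start"] - 3, loc[blank, "start"] - 1)
context
```

# Both stray matches sit inside phone numbers, one after "(636)" and one after "555", which is why deleting a single space only removes one of them.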
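# The first single-space match is now gone (the one between "555" and "8904" remains). In real work, editing the raw text like this is not an option; instead, refine the regex or filter the results by character count.
# The stray matches appear because the character class includes the space. How can they be suppressed?
# Change + to {2,}, so that a match must be at least two characters long.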
unlist(str_extract_all(raw.data,"[[:alpha:]. ,]{2,}"))
## [1] "Moe Szyslak" "Burns,C. Montgomery" "Rev.Timothy Lovejoy"
## [4] "Ned Flanders " "Simpson,Homer" " DR. Julius Hibbert "
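# One possible refinement of the regex, offered as a sketch rather than the author's method: require each match to start and end with a letter.
# This drops the single-space matches and the leading/trailing spaces in one step. It assumes every name contains at least two letters.

```r
library(stringr)

raw.data <- "555-1234Moe Szyslak(636) 555-0113Burns,C. Montgomery555-6542Rev.Timothy Lovejoy555 8904Ned Flanders 636-555-3226Simpson,Homer5553642 DR. Julius Hibbert "

# Start on a letter, end on a letter; letters, periods, commas and
# spaces are allowed in between
pattern <- "[[:alpha:]][[:alpha:]., ]*[[:alpha:]]"

names_clean <- str_extract_all(raw.data, pattern)[[1]]
names_clean
```

# The greedy middle part first runs to the end of the allowed characters and then backtracks until the last character is a letter, which is what strips the trailing space from "Ned Flanders ".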
# Alternatively, keep the earlier pattern and filter the results by length with nchar().
name <- unlist(str_extract_all(raw.data2,"[[:alpha:]\\s,.]+"))
name
## [1] "Moe Szyslak" "Burns,C. Montgomery" "Rev.Timothy Lovejoy"
## [4] " " "Ned Flanders " "Simpson,Homer"
## [7] " DR. Julius Hibbert "
name[nchar(name)>2]
## [1] "Moe Szyslak" "Burns,C. Montgomery" "Rev.Timothy Lovejoy"
## [4] "Ned Flanders " "Simpson,Homer" " DR. Julius Hibbert "
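# A variant of the length filter, as a sketch: trim whitespace first and drop whatever becomes empty. `name_trimmed` and `result` are illustrative names.
# Unlike nchar(name) > 2, this does not depend on a length threshold, so a whitespace-only match of any length is removed and a short real value is kept.

```r
library(stringr)

raw.data2 <- "555-1234Moe Szyslak(636)555-0113Burns,C. Montgomery555-6542Rev.Timothy Lovejoy555 8904Ned Flanders 636-555-3226Simpson,Homer5553642 DR. Julius Hibbert "
name <- unlist(str_extract_all(raw.data2, "[[:alpha:]\\s,.]+"))

# Trim surrounding whitespace, then drop anything that became empty
name_trimmed <- str_trim(name)
result <- name_trimmed[name_trimmed != ""]
result
```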