Regex named groups in R
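[Posted]: 2018-01-29 19:41:39
[Question]: I am, to all intents and purposes, a Python user, and I use the Pandas library daily. Named capture groups in regular expressions are extremely useful there. For example, it is relatively simple to extract every occurrence of specific words or phrases and write the results, as concatenated strings, into new columns of a dataframe. An example of how this can be done is given below: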
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
'We all love TEXT',
'Where is the TXT or txt textfile',
'Words and words',
'Just a few works',
'See the text',
'both words and text'],columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
                           origText
0                 Here is some text
1                  We all love TEXT
2  Where is the TXT or txt textfile
3                   Words and words
4                  Just a few works
5                      See the text
6               both words and text

Dataframe of matches (with multi-index)
--------------------
        textOcc wordOcc
  match
0 0        text     NaN
1 0        TEXT     NaN
2 0         TXT     NaN
  1         txt     NaN
  2        text     NaN
3 0         NaN    Word
  1         NaN    word
5 0        text     NaN
6 0         NaN    word
  1        text     NaN

Collapsed and concatenated matches
----------------------------------
            textOcc      wordOcc
0              text
1              TEXT
2  TXT///txt///text
3                    Word///word
5              text
6              text         word

Final joined dataframe
----------------------
                           origText           textOcc      wordOcc
0                 Here is some text              text
1                  We all love TEXT              TEXT
2  Where is the TXT or txt textfile  TXT///txt///text
3                   Words and words                    Word///word
4                  Just a few works               NaN          NaN
5                      See the text              text
6               both words and text              text         word
I have printed each stage to make it easy to follow.
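For readers less familiar with pandas, the same named-group extraction can be sketched using only the standard-library `re` module. This is a minimal sketch, not the asker's code: the helper name `collect_named_matches` is hypothetical, and it simply reuses the regex and the `///` separator from the example above.

```python
import re

# Same pattern as in the question; each alternative is its own named group.
pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)

def collect_named_matches(text, sep="///"):
    """Return {group_name: joined matches} for one string (hypothetical helper)."""
    found = {name: [] for name in pattern.groupindex}
    for m in pattern.finditer(text):
        for name, value in m.groupdict().items():
            if value is not None:  # only the alternative that matched is set
                found[name].append(value)
    return {name: sep.join(values) for name, values in found.items()}

rows = [
    "Where is the TXT or txt textfile",
    "both words and text",
    "Just a few works",
]
results = [collect_named_matches(row) for row in rows]
for row, res in zip(rows, results):
    print(row, "->", res)
```

Because the pattern is an alternation, each individual match fills exactly one of the two groups and leaves the other as `None`, which is why the pandas output above shows `NaN` in the opposite column.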
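The question is: can I do something similar in R? I have searched online but cannot find anything that describes the use of named groups (although I am new to R, so I may be searching for the wrong library or the wrong descriptive terms).
I have been able to identify the items that contain one or more matches, but I cannot see how to extract the specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as the Python example above) is: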
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
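The regex seems to be working, and the correct rows are identified as containing matches (that is, every row except row 5 in the example above). My question, however, is whether I can produce output similar to what Python generates, where the specific matches are extracted and listed in new dataframe columns named after the group names in the regex.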
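[Answer 1]: Base R does capture the information about the group names, but it has no convenient helper for extracting the matches by name. I wrote a wrapper called regcapturedmatches to help. You can use it like this:
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"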
m<-regexpr(myRegex, origText, perl=T, ignore.case=T)
regcapturedmatches(origText,m)
which returns
     textOcc wordOcc
[1,] "text"  ""
[2,] "TEXT"  ""
[3,] "TXT"   ""
[4,] ""      "Word"
[5,] ""      ""
[6,] "text"  ""
[7,] ""      "word"
[Comments]:
Absolutely perfect. Thank you for sharing your code. Using regexpr() seems to return only the first match, but using gregexpr() returns all matches. Exactly what I wanted.
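For readers coming from the Python side, the first-match versus all-matches distinction noted in the comment has a rough analogue in the standard-library `re` module: `re.search` stops at the first match, while `re.finditer` walks every match. The sketch below reuses the question's regex; the variable names are illustrative only, and the comparison with regexpr() and gregexpr() is an analogy, not a statement about how the R functions are implemented.

```python
import re

pattern = re.compile(r"(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)", flags=re.I)
line = "Where is the TXT or txt textfile"

# First match only (comparable in spirit to regexpr() in R).
first = pattern.search(line)
first_groups = first.groupdict() if first else {}

# Every match in the string (comparable in spirit to gregexpr() in R).
all_groups = [m.groupdict() for m in pattern.finditer(line)]

print(first_groups)
print(all_groups)
```

This mirrors the answer's output, where row 3 shows only "TXT" because a single match per string was requested.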