CountVectorizer token_pattern to not catch underscore

Posted 2023-03-12

Asked: 2021-08-23 16:03:38

Question:

CountVectorizer's default token pattern treats the underscore as a letter:

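from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())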

This gives:

['in', 'rain', 'spain_stays', 'the']

This makes sense, because AFAIK '\w' is equivalent to [a-zA-Z0-9_]. What I need is:

['in', 'rain', 'spain', 'stays', 'the']

So I tried replacing '\w' with [a-zA-Z0-9]:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

But I get:

['in', 'rain', 'the']

How can I get what I need? Any ideas are welcome.

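Comments:

\w also matches _, so there is no word boundary between these two characters: n_

So what can I use instead of '\w' to get the desired result?

Without word boundaries, you could use e.g. [^\W_]+ regex101.com/r/zN3Oax/1 Or use boundaries in the form of lookarounds: (?:(?<=[\s_])|(?<=^))[^\W_]+(?=[\s_]|$) regex101.com/r/QaREpI/1

Thanks, that works. Is there a difference between the two?

Answer 1: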

There is no word boundary between n and _, because \w also matches the underscore.
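The missing boundary can be checked with the standard `re` module alone. The snippet below is a minimal sketch, assuming the corpus text has already been lowercased (as CountVectorizer does by default); the variable names are illustrative and not part of the original post.

```python
import re

# Lowercased version of the corpus sentence from the question.
text = "the rain in spain_stays"

# Second pattern from the question: ASCII letters/digits wrapped in \b.
# \b only matches where a \w character meets a non-\w character,
# and "_" counts as \w, so "spain" and "stays" have no boundary between them.
ascii_with_boundaries = re.compile(r"(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b")

tokens = ascii_with_boundaries.findall(text)
print(tokens)

# Direct check: is there a word boundary between "n" and "_"?
boundary_inside = re.search(r"n\b_", "spain_stays")
print(boundary_inside)
```

Neither "spain" nor "stays" is returned, because each of them touches the underscore on one side and so fails one of the two \b assertions.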

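Match 2 or more word characters excluding the underscore, allowing only whitespace or an underscore (or the start/end of the string) on the left and right: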

(?<![^\s_])[^\W_]{2,}(?![^\s_])

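The pattern matches:

(?<![^\s_]) Negative lookbehind: asserts a whitespace boundary or an underscore to the left
[^\W_]{2,} Matches 2 or more word characters, excluding the underscore
(?![^\s_]) Negative lookahead: asserts a whitespace boundary or an underscore to the right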

See the regex demo.
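As a rough sketch of how the lookaround pattern behaves on the question's sentence, the snippet below applies it with `re.findall` after lowercasing. The `tokenize` helper and the `TOKEN_PATTERN` name are hypothetical and only stand in for the vectorizer's tokenization step.

```python
import re

# Lookaround pattern from the answer: 2+ word characters without "_",
# bounded on each side by whitespace, "_" or the start/end of the string.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"


def tokenize(text):
    """Hypothetical helper: lowercase the text, then collect all matches."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize("The rain in spain_stays")
print(sorted(tokens))
```

The same pattern string should be usable as the `token_pattern` argument of CountVectorizer; it contains no capturing groups, so each match is returned as a whole token.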

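A much broader match would be [^\W_]{2,}, but note that this does not take boundaries into account. It simply matches runs of word characters that contain no underscore.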

See the different number of matches in this regex demo.
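To make the difference concrete, here is a small comparison of the two patterns. The sample string is made up for illustration (it is not from the original post) and adds a hyphenated word, which is where the two patterns diverge.

```python
import re

# Hypothetical sample: an underscore-joined pair, a hyphenated pair,
# and a one-letter fragment before an underscore.
sample = "spain_stays foo-bar x_yz"

# Broad pattern: any run of 2+ word characters that are not "_".
broad = re.findall(r"[^\W_]{2,}", sample)

# Bounded pattern: the same run, but only when whitespace, "_" or the
# start/end of the string sits on both sides.
bounded = re.findall(r"(?<![^\s_])[^\W_]{2,}(?![^\s_])", sample)

print(broad)
print(bounded)
```

The broad pattern also picks up "foo" and "bar", since it ignores the hyphen next to them. The bounded pattern rejects both, because a hyphen is neither whitespace nor an underscore.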
