What is the default list of stopwords used in Lucene's StopFilter?

Posted 2023-03-12

Asked: 2013-07-05 20:24:47

Question:

Lucene has a default stop filter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html). Does anyone know which words are in its list?

Answer 1:

The default stop words used by StandardAnalyzer and EnglishAnalyzer come from StopAnalyzer.ENGLISH_STOP_WORDS_SET, as found in the source file:

"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"

StopFilter itself does not define a default set of stop words; the set must be passed to it.
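To make the effect of that word list concrete, here is a minimal plain-Java sketch of what stop-word removal does to a piece of text. It deliberately does not use Lucene: the class name `StopWordSketch` and the helper `removeStopWords` are hypothetical names chosen for illustration, and splitting on whitespace is a simplifying assumption that stands in for a real tokenizer. Only the 33 words themselves come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // The 33 words from the list above, copied verbatim.
    public static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper (not a Lucene API): lowercases each
    // whitespace-separated token and drops it if it is a stop word.
    public static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !ENGLISH_STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, garden]
        System.out.println(removeStopWords("The quick brown fox is in the garden"));
    }
}
```

The sketch lowercases before the lookup because every entry in the list is lowercase, so a token such as "The" would otherwise slip through an exact-match check.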

Comments:

I'm using Lucene 5.5.0 to extract keywords. I set up the stop-word filter with tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET); but Lucene does not filter out the stop words. Is there something I'm missing?
