Querying part-of-speech tags with Lucene 7 OpenNLP
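For fun and learning, I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once the text is indexed, I can search for a sequence of POS tags and find all sentences that match that sequence. I already have the indexing part, but I am stuck on the query part. I know that Solr might have some functionality for this, and I have already checked its code (which was not that self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, because I want to stay independent of any search engine built on top.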
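Idea

Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applying the Lucene OpenNLP tokenizer results in: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs] [.]
Next, applying the Lucene OpenNLP POS tagging results in: [DT] [JJ] [JJ] [NN] [VBD] [IN] [DT] [JJ] [NNS] [.]

Input sentence 2: Give it to me, baby!
Applying the Lucene OpenNLP tokenizer results in: [Give] [it] [to] [me] [,] [baby] [!]
Next, applying the Lucene OpenNLP POS tagging results in: [VB] [PRP] [TO] [PRP] [,] [UH] [.]

Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards, etc.)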
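To pin down what "exact match" means here, the following is a minimal plain-Java sketch that involves no Lucene at all. The class name `PosSequenceMatch` and the method `containsSequence` are hypothetical names chosen for illustration; the sketch only states the matching rule (the query tags must appear as one contiguous run inside the sentence's tag sequence) that the index is later expected to reproduce.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
class PosSequenceMatch {

    // True if `query` occurs as one contiguous run inside `tags`.
    static boolean containsSequence(List<String> tags, List<String> query) {
        // indexOfSubList returns -1 when the run does not occur.
        return Collections.indexOfSubList(tags, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> sentence1 =
                Arrays.asList("DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", ".");
        List<String> sentence2 =
                Arrays.asList("VB", "PRP", "TO", "PRP", ",", "UH", ".");
        List<String> query = Arrays.asList("JJ", "NN", "VBD");

        System.out.println(containsSequence(sentence1, query)); // sentence 1 contains the run
        System.out.println(containsSequence(sentence2, query)); // sentence 2 does not
    }
}
```

Order and adjacency both matter under this rule: the same three tags in a different order, or with another tag in between, would not count as a match.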
Indexing

First, I created my own class com.example.OpenNLPAnalyzer:
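import java.io.IOException;
import opennlp.tools.postag.POSModel;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter;
import org.apache.lucene.analysis.opennlp.OpenNLPTokenizer;
import org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp;
import org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp;
import org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp;
import org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.util.AttributeFactory;

public class OpenNLPAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        try {
            // The model files are loaded from the classpath.
            ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());
            TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
            NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);
            SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
            NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);
            Tokenizer source = new OpenNLPTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);
            POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
            NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);
            // Perhaps we should also use a lower-case filter here?
            TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
            // Very important: token types (the POS tags) are not indexed on their own.
            // We need to store them as payloads, otherwise we cannot search on them.
            TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);
            return new TokenStreamComponents(source, payloadFilter);
        } catch (IOException e) {
            // Keep the original exception as the cause instead of only its message.
            throw new RuntimeException(e);
        }
    }
}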
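Note that we are using a TypeAsPayloadTokenFilter wrapped around the OpenNLPPOSFilter. This means that our POS tags will be indexed as payloads, and our query, whatever it ends up looking like, will have to search on payloads as well.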
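A rough way to picture what that implies for querying: the indexed terms are still the words, and each tag only rides along with one occurrence of a word. The sketch below is a plain-Java model under that assumption, not Lucene API; `PayloadModel`, `Posting`, and `countMatches` are hypothetical names. It shows that a payload check can only narrow down occurrences of a term that was actually indexed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of "term + payload" postings; not Lucene API.
class PayloadModel {

    // One occurrence of a term: its position and the payload stored with it.
    static final class Posting {
        final int position;
        final String payload;

        Posting(int position, String payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    // term -> occurrences. Only the word is a key; the tag is carried as payload.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(String word, String tag, int position) {
        postings.computeIfAbsent(word, k -> new ArrayList<>()).add(new Posting(position, tag));
    }

    // Look the term up first, then keep only occurrences whose payload matches.
    int countMatches(String term, String requiredPayload) {
        int count = 0;
        for (Posting p : postings.getOrDefault(term, new ArrayList<>())) {
            if (p.payload.equals(requiredPayload)) {
                count++;
            }
        }
        return count;
    }

    // Builds the model for input sentence 1 and its tags.
    static PayloadModel sentence1() {
        String[] words = {"The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dogs", "."};
        String[] tags = {"DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", "."};
        PayloadModel model = new PayloadModel();
        for (int i = 0; i < words.length; i++) {
            model.add(words[i], tags[i], i);
        }
        return model;
    }

    public static void main(String[] args) {
        PayloadModel model = sentence1();
        System.out.println(model.countMatches("brown", "JJ")); // the word is a term and its payload agrees
        System.out.println(model.countMatches("JJ", "type"));  // "JJ" was never indexed as a term
    }
}
```

In this model a tag can never be the entry point of a lookup, which is the limitation a payload-only design runs into when the query consists of nothing but tags.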
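Querying

This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7; it seems that querying on payloads has changed several times in older versions. Documentation is extremely scarce. It is not even clear what the proper field name to query is now: is it "word", or "type", or something else? For example, I tried this code, which does not return any search results: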
// Step 1: Indexing
final String body = "The quick brown fox jumped over the lazy dogs.";
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
Document document = new Document();
document.add(new TextField("body", body, Field.Store.YES));
writer.addDocument(document);
writer.close();
// Step 2: Querying
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
final String queryText = "JJ";
Term term = new Term(fieldName, queryText);
SpanQuery match = new SpanTermQuery(term);
BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));
System.out.println(query.toString());
TopDocs topDocs = searcher.search(query, topN);
Any help is greatly appreciated here.
Why not use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run an ordinary query? So in your analyzer:
// ... (tokenizer, sentence detector and POS tagger set up as in the question)
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
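The reason this can work with a plain phrase query is that a synonym is indexed at the same position as the token it belongs to, so consecutive tags sit at consecutive positions just like consecutive words. The following is a minimal plain-Java model of that idea, not Lucene API; `SynonymPositionModel` and `matchesPhrase` are hypothetical names, and the model ignores scoring and analysis of the query.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of a positional index with same-position synonyms; not Lucene API.
class SynonymPositionModel {

    // term -> positions at which the term occurs
    private final Map<String, Set<Integer>> positions = new HashMap<>();

    // Index the word and its tag at the same position, the way a synonym would be.
    void add(String word, String tag, int position) {
        positions.computeIfAbsent(word, k -> new HashSet<>()).add(position);
        positions.computeIfAbsent(tag, k -> new HashSet<>()).add(position);
    }

    // Exact phrase: terms[i] must occur at position start + i for some start.
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) {
            return false;
        }
        for (int start : positions.getOrDefault(terms[0], new HashSet<>())) {
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                all = positions.getOrDefault(terms[i], new HashSet<>()).contains(start + i);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    // Builds the model for input sentence 1 and its tags.
    static SynonymPositionModel sentence1() {
        String[] words = {"The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dogs", "."};
        String[] tags = {"DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", "."};
        SynonymPositionModel model = new SynonymPositionModel();
        for (int i = 0; i < words.length; i++) {
            model.add(words[i], tags[i], i);
        }
        return model;
    }

    public static void main(String[] args) {
        SynonymPositionModel model = sentence1();
        System.out.println(model.matchesPhrase("JJ", "NN", "VBD")); // tags at positions 2, 3, 4
        System.out.println(model.matchesPhrase("brown", "fox"));    // words at positions 2, 3
        System.out.println(model.matchesPhrase("fox", "brown"));    // wrong order, no match
    }
}
```

One consequence visible in the model is that words and tags share one term space, so a phrase may mix them, and a word spelled exactly like a tag would be indistinguishable from that tag.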
And the indexing side:
static Directory index() throws Exception {
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
writer.addDocument(doc("Give it to me, baby!"));
writer.close();
return index;
}
static Document doc(String body){
Document document = new Document();
document.add(new TextField(FIELD, body, Field.Store.YES));
return document;
}
And the searching side:
static void search(Directory index, String searchPhrase) throws Exception {
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
Query query = parser.parse(searchPhrase);
System.out.println(query);
TopDocs topDocs = searcher.search(query, topN);
System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
for(ScoreDoc scoreDoc: topDocs.scoreDocs){
Document doc = searcher.doc(scoreDoc.doc);
System.out.printf(" %s\n", doc.get(FIELD));
}
}
And then use them like this:
public static void main(String[] args) throws Exception {
Directory index = index();
search(index, "\"JJ NN VBD\""); // search the sequence of POS tags
search(index, "\"brown fox\""); // search a phrase
search(index, "\"fox brown\""); // search a phrase (no hits)
search(index, "baby"); // search a word
search(index, "\"TO PRP\""); // search the sequence of POS tags
}
The result looks like this:
body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
Give it to me, baby!