A Quick Start to Lucene Full-Text Search in Java

Posted 2021-05-02 三更编程菌


1. Workflow Overview

[Figure: Lucene indexing and search workflow]

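1. Green marks the indexing process: the original content to be searched is indexed to build an index library. The steps are: identify the original content to be searched ---> collect documents ---> create documents ---> analyze documents ---> index documents.

2. Red marks the search process, which retrieves content from the index library. The steps are: the user submits a request through the search interface ---> create a query ---> execute the search against the index library ---> render the search results.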

2. Building the Index Library

2.1. Obtaining the Original Documents

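The original documents are simply the content we want to make searchable. They can be web pages on the Internet, rows in a database, files on disk, and so on.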

2.2. Creating Document Objects

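Before building the index, the original content has to be turned into documents (Document). A Document contains multiple fields (Field), and the fields hold pieces of the original content. Each file can be treated as one Document. A Field has two parts, a name and a value, so you can think of it as a key/value pair. A Field also has a type that describes the kind of data it holds; we do not need to manage it separately, because it follows from which Field class we instantiate. Taking files on disk as the example, each file can carry four fields: name (file name), path (file path), content (file content), and size (file size), as shown below:

Field	Value
name	Helloworld.java
path	f://Helloworld.java
content	public class HelloWorld...
size	4562

Note: every document has a unique number, the document id.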

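A Field also has three attributes:

Analyzed: whether the field's content is tokenized. This only matters for fields whose content we want to query.
Indexed: whether the terms produced by analysis, or the entire field value, are added to the index. Only indexed content can be found by a search. For example, product names and product descriptions are analyzed and then indexed; order numbers and ID card numbers are not analyzed but are still indexed, because all of these will later serve as query conditions.
Stored: whether the field value is stored in the document. Only stored fields can be read back from a Document, for example product names and order numbers. Any field you will later need to retrieve from a Document must be stored.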

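In short: whether you need to query inside the content decides whether to analyze it, whether you need to search on it decides whether to index it, and whether you need to display it decides whether to store it.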

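A table accompanies this section to show how each Field class sets these three attributes.

[Table: Field classes and their analyzed / indexed / stored settings]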
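To make the three attributes concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene: the class `FieldFlagsSketch`, the type `ToyField`, and the field names `orderNo` and the sample values are all assumptions made up for illustration. It only shows the behavior described above: indexed content can be found, stored content can be read back, and analysis decides whether matching happens on words or on the exact value.

```java
import java.util.*;

// Toy model of the analyzed / indexed / stored flags.
// Every name below is hypothetical; none of this is Lucene's API.
final class FieldFlagsSketch {

    static final class ToyField {
        final String name;
        final String value;
        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        ToyField(String name, String value, boolean analyzed, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:term" -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // One map of stored field values per document id
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int add(ToyField... fields) {
        int docId = storedValues.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Analyzed fields are split into lowercase words;
                // non-analyzed fields are indexed as one exact term.
                String[] terms = f.analyzed
                        ? f.value.toLowerCase(Locale.ROOT).split("\\W+")
                        : new String[] {f.value};
                for (String term : terms) {
                    if (!term.isEmpty()) {
                        postings.computeIfAbsent(f.name + ":" + term, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedValues.add(stored);
        return docId;
    }

    // Only indexed content can be found.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    // Only stored content can be read back; returns null otherwise.
    String get(int docId, String field) {
        return storedValues.get(docId).get(field);
    }

    public static void main(String[] args) {
        FieldFlagsSketch index = new FieldFlagsSketch();
        int id = index.add(
                new ToyField("name", "Spring Guide", true, true, true),
                new ToyField("orderNo", "A-1001", false, true, true),
                new ToyField("content", "a long body of text", true, true, false),
                new ToyField("path", "f:/docs/guide.txt", false, false, true));
        System.out.println(index.search("name", "spring"));            // hit through an analyzed word
        System.out.println(index.search("orderNo", "A-1001"));         // hit through the exact value
        System.out.println(index.search("path", "f:/docs/guide.txt")); // not indexed, so no hit
        System.out.println(index.get(id, "content"));                  // not stored
        System.out.println(index.get(id, "path"));                     // stored
    }
}
```

In this toy, `path` behaves like a store-only field and `content` like a field that is searchable but cannot be displayed, which mirrors the trade-off in the summary above.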

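2.3. Analyzing Documents

Put simply, analysis is tokenization: it picks out the keywords and discards the noise. Typical steps are splitting the string on whitespace to get a list of keywords, converting letters to lowercase, removing punctuation, and removing stop words (words that should be filtered out).

Original document content: Lucene is a Java full-text search engine. Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications.

Tokens obtained after analysis: lucene, java, full, search, engine, ...

Each keyword produced by analysis is wrapped in a Term object. A Term has two parts: the name of the field the keyword came from, and the keyword text. The same keyword extracted from different fields yields different Terms.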
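The analysis steps can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class `AnalysisSketch`, its stop-word list, and the `field:token` string used to stand in for a Term are all made up here, and a real analyzer such as the one configured later in this article may produce different tokens.

```java
import java.util.*;

// Toy analyzer: split, lowercase, drop stop words.
// Hypothetical names; this is not Lucene's Analyzer API.
final class AnalysisSketch {

    // A tiny stop-word list chosen only for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "be", "but", "can", "is", "not", "that", "the", "to"));

    // Split on anything that is not a letter or digit, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A term pairs the field name with the token, so the same word
    // in two different fields gives two different terms.
    static List<String> terms(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : analyze(text)) {
            terms.add(field + ":" + token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [lucene, java, full, text, search, engine]
        System.out.println(analyze("Lucene is a Java full-text search engine."));
        // prints [name:lucene] and then [content:lucene]
        System.out.println(terms("name", "Lucene"));
        System.out.println(terms("content", "Lucene"));
    }
}
```

Because this sketch splits on every non-alphanumeric character, "full-text" becomes two tokens, `full` and `text`.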

2.4. Creating the Index

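With the concepts covered, the first step is to add the dependencies. The following JARs are required; from left to right they provide file I/O, Chinese word segmentation, and Lucene itself:

commons-io, IK-Analyzer-1.0-SNAPSHOT, lucene-analyzers-common-7.4.0, lucene-core-7.4.0, lucene-queryparser-7.4.0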

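Next, here is the method that creates the index:

    // Imports needed by this method (they go at the top of the enclosing class file).
    // The IKAnalyzer package name assumes the usual IK Analyzer distribution.
    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongPoint;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    public void createIndex() throws Exception {
        // 1. Create a Directory object that specifies where the index library is saved.
        // To keep the index library in memory instead:
        // Directory directory = new RAMDirectory();
        // Keep the index library on disk
        Directory directory = FSDirectory.open(new File("F:\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // 3. Read the files on disk and create one Document per file
        File dir = new File("F:\\texts");
        File[] files = dir.listFiles();
        if (files == null) {
            // listFiles() returns null when the directory is missing or unreadable
            files = new File[0];
        }
        for (File f : files) {
            // File name
            String fileName = f.getName();
            // File path
            String filePath = f.getPath();
            // File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");
            // File size
            long fileSize = FileUtils.sizeOf(f);
            // 4. Create the fields
            // TextField arguments: field name, field value, whether to store the value
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new StoredField("path", filePath);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            // LongPoint makes size searchable by range; StoredField makes it retrievable
            Field fieldSizeValue = new LongPoint("size", fileSize);
            Field fieldSizeStore = new StoredField("size", fileSize);
            // Create the Document
            Document document = new Document();
            // Add the fields to the Document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSizeValue);
            document.add(fieldSizeStore);
            // 5. Write the Document to the index library
            indexWriter.addDocument(document);
        }
        // 6. Close the IndexWriter
        indexWriter.close();
    }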

3. Querying the Index

3.1. Keyword Queries

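The code is very similar to the indexing code above. The most basic search looks up content by keyword:

    // Additional imports needed by this method
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public void searchIndex() throws Exception {
        // 1. Create a Directory object that points to the index library
        Directory directory = FSDirectory.open(new File("F:\\index").toPath());
        // 2. Create an IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);
        // 3. Create an IndexSearcher; its constructor takes the IndexReader
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 4. Create a Query, here a TermQuery.
        // A TermQuery is not analyzed, so the term must match an indexed token exactly.
        Query query = new TermQuery(new Term("name", "spring"));
        // 5. Run the query to get a TopDocs object
        // Argument 1: the query; argument 2: the maximum number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // 6. Total number of matching documents
        System.out.println("Total hits: " + topDocs.totalHits);
        // 7. Get the list of hits
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // 8. Print the stored fields of each hit
        for (ScoreDoc doc : scoreDocs) {
            // Document id
            int docId = doc.doc;
            // Load the Document by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
        }
        // 9. Close the IndexReader
        indexReader.close();
    }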

3.2. Sentence Queries

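We can also search with free text, such as a whole sentence. The sentence is first analyzed into terms, and the files whose fields match those terms are returned. The code looks like this:

    // Additional import needed by this method
    import org.apache.lucene.queryparser.classic.QueryParser;

    // The method name searchByQueryParser is chosen here for illustration
    public void searchByQueryParser() throws Exception {
        // Create a QueryParser. Argument 1: the default search field; argument 2: the analyzer
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // Use the QueryParser to build a Query from a sentence
        // ("Lucene is a full-text search toolkit written in Java")
        Query query = queryParser.parse("lucene是一个Java开发的全文检索工具包");
        printResult(query);
    }

    private void printResult(Query query) throws Exception {
        IndexReader indexReader =
                DirectoryReader.open(FSDirectory.open(new File("F:\\index").toPath()));
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Run the query and return at most 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc doc : scoreDocs) {
            // Document id
            int docId = doc.doc;
            // Load the Document by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
        }
        indexReader.close();
    }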

3.3. Range Queries

We can also query by range:

    // Create a Query on the size field: minimum 0, maximum 100 (both bounds inclusive)
    Query query = LongPoint.newRangeQuery("size", 0L, 100L);
    printResult(query);
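The inclusive bounds are easy to get wrong, so here is a small plain-Java sketch of the same idea. It is not how Lucene stores points internally; the class `RangeSketch` and the file names are assumptions for illustration, and a sorted `TreeMap` simply stands in for an index that is ordered by value.

```java
import java.util.*;

// Toy range lookup over a sorted map; hypothetical names, not Lucene's API.
final class RangeSketch {

    // size in bytes -> names of the files with that size, kept sorted by size
    private final TreeMap<Long, List<String>> bySize = new TreeMap<>();

    void add(String name, long size) {
        bySize.computeIfAbsent(size, k -> new ArrayList<>()).add(name);
    }

    // Returns the names of all files with min <= size <= max, ordered by size.
    List<String> range(long min, long max) {
        List<String> hits = new ArrayList<>();
        if (min > max) {
            return hits; // an empty range matches nothing
        }
        for (List<String> names : bySize.subMap(min, true, max, true).values()) {
            hits.addAll(names);
        }
        return hits;
    }

    public static void main(String[] args) {
        RangeSketch index = new RangeSketch();
        index.add("a.txt", 0L);
        index.add("b.txt", 100L);
        index.add("c.txt", 101L);
        index.add("d.txt", 50L);
        // prints [a.txt, d.txt, b.txt]: 0 and 100 are included, 101 is not
        System.out.println(index.range(0L, 100L));
    }
}
```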

4. Maintaining the Index

4.1. Adding Documents

    public void addDocument() throws Exception {
        // Create an IndexWriter; IKAnalyzer must be used as the analyzer
        IndexWriter indexWriter =
                new IndexWriter(FSDirectory.open(new File("F:\\index").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
        // Create a Document
        Document document = new Document();
        // Add fields to the Document (the values are sample text)
        document.add(new TextField("name", "新添加的文件", Field.Store.YES));
        document.add(new TextField("content", "新添加的文件内容", Field.Store.NO));
        document.add(new StoredField("path", "新添加的文件路径"));
        // Write the Document to the index library
        indexWriter.addDocument(document);
        // Close the index library
        indexWriter.close();
    }

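4.2. Deleting Documents

    public void deleteAllDocument() throws Exception {
        // Open the IndexWriter the same way as in addDocument()
        IndexWriter indexWriter =
                new IndexWriter(FSDirectory.open(new File("F:\\index").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
        // Delete all documents
        indexWriter.deleteAll();
        // Close the index library
        indexWriter.close();
    }

    // Delete the documents that match a condition
    public void deleteDocumentByQuery() throws Exception {
        IndexWriter indexWriter =
                new IndexWriter(FSDirectory.open(new File("F:\\index").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
        // Delete every document whose name field contains the term "apache"
        indexWriter.deleteDocuments(new Term("name", "apache"));
        indexWriter.close();
    }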

4.3. Updating Documents

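Under the hood, an update is a delete followed by an add.

    public void updateDocument() throws Exception {
        // Open the IndexWriter the same way as in addDocument()
        IndexWriter indexWriter =
                new IndexWriter(FSDirectory.open(new File("F:\\index").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
        // Create a new Document
        Document document = new Document();
        // Add fields to the Document (the values are sample text)
        document.add(new TextField("name", "更新之后的文档", Field.Store.YES));
        document.add(new TextField("name1", "更新之后的文档2", Field.Store.YES));
        document.add(new TextField("name2", "更新之后的文档3", Field.Store.YES));
        // Update: delete every document matching the term, then add the new Document
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // Close the index library
        indexWriter.close();
    }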
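The delete-then-add behavior has one consequence worth seeing in code: every document that matches the term is removed, and exactly one new document is added. The following is a minimal plain-Java sketch of that behavior. The class `UpdateSketch` and its methods are hypothetical and model documents as simple maps; this is not Lucene's implementation.

```java
import java.util.*;

// Toy model of "update = delete by term, then add"; hypothetical names, not Lucene's API.
final class UpdateSketch {

    // Each toy document is a map of field name -> value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Deletes every document whose field contains the term as a lowercase word,
    // then adds the replacement. Returns how many documents were deleted.
    int update(String field, String term, Map<String, String> replacement) {
        int before = docs.size();
        docs.removeIf(d -> containsTerm(d.get(field), term));
        int deleted = before - docs.size();
        docs.add(replacement);
        return deleted;
    }

    private static boolean containsTerm(String value, String term) {
        if (value == null) {
            return false;
        }
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+")).contains(term);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(Collections.singletonMap("name", "Spring in Action"));
        index.add(Collections.singletonMap("name", "spring boot notes"));
        index.add(Collections.singletonMap("name", "Lucene basics"));
        int deleted = index.update("name", "spring", Collections.singletonMap("name", "updated document"));
        System.out.println(deleted);      // prints 2: both "spring" documents were removed
        System.out.println(index.size()); // prints 2: "Lucene basics" plus the replacement
    }
}
```

If no document matches the term, nothing is deleted and the call behaves like a plain add.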

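After all this, Lucene may look quite impressive, but one caveat: in real projects we rarely build directly on Lucene. Instead we usually choose a search engine such as Elasticsearch, which is itself built on Lucene. Get familiar with Lucene first, and we will talk about ES in a later post.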
