干货｜全文检索Solr集成HanLP中文分词

Posted 2021-05-02 全球人工智能

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了干货｜全文检索Solr集成HanLP中文分词相关的知识，希望对你有一定的参考价值。

“全球人工智能”拥有十多万AI产业用户，核心用户来自：北大，清华，中科院，麻省理工，卡内基梅隆，斯坦福，哈佛，牛津，剑桥......以及谷歌，腾讯，百度，脸谱，微软，阿里，海康威视，英伟达......等全球名校和名企。

以前发布过HanLP的Lucene插件，后来很多人跟我说其实Solr更流行（反正我是觉得既然Solr是Lucene的子项目，那么稍微改改配置就能支持Solr），于是就抽空做了个Solr插件出来，开源在Github上，欢迎改进。HanLP中文分词solr插件支持Solr5.x，兼容Lucene5.x。

快速上手

将hanlp-portable.jar和hanlp-solr-plugin.jar共两个jar放入${webapp}/WEB-INF/lib下
修改solr core的配置文件${core}/conf/schema.xml：

 
   
   
 
  
    
    
    <fieldType name="text_cn" class="solr.TextField">
  
    
    
        <analyzer type="index">
  
    
    
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
  
    
    
        </analyzer>
  
    
    
        <analyzer type="query">
  
    
    
            <!-- 切记不要在query中开启index模式 -->
  
    
    
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
  
    
    
        </analyzer>
  
    
    
    </fieldType>
  
    
    
    <!-- 业务系统中需要分词的字段都需要指定type为text_cn -->
  
    
    
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
  
    
    
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>

Solr5中文分词器详细配置

对于新手来说，上面的两步可能太简略了，不如看看下面的step by step。本教程使用Solr5.2.1，理论上兼容solr5.x。

放置jar

将上述两个jar放到solr-5.2.1/server/solr-webapp/webapp/WEB-INF/lib目录下。

启动solr

首先在solr-5.2.1\bin目录下启动solr：

 
   
   
 
  
    
    
  solr start -f

用浏览器打开http://localhost:8983/solr/#/，看到如下页面说明一切正常：

干货｜全文检索Solr集成HanLP中文分词

创建core

在solr-5.2.1\server\solr下新建一个目录，取个名字比如叫one，将示例配置文件solr-5.2.1\server\solr\configsets\sample_techproducts_configs\conf拷贝过来，接着修改schema.xml中的默认域type，搜索

 
   
   
 
  
    
    
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  
    
    
          ...
  
    
    
      </fieldType>

替换为

 
   
   
 
  
    
    
      <!-- 默认文本类型: 指定使用HanLP分词器，同时开启索引模式。
  
    
    
    通过solr自带的停用词过滤器，使用"stopwords.txt"（默认空白）过滤。
  
    
    
    在搜索的时候，还支持solr自带的同义词词典。-->
  
    
    
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  
    
    
        <analyzer type="index">
  
    
    
          <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
  
    
    
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  
    
    
          <!-- 取消注释可以启用索引期间的同义词词典
  
    
    
          <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
  
    
    
          -->
  
    
    
          <filter class="solr.LowerCaseFilterFactory"/>
  
    
    
        </analyzer>
  
    
    
        <analyzer type="query">
  
    
    
          <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
  
    
    
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  
    
    
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  
    
    
          <filter class="solr.LowerCaseFilterFactory"/>
  
    
    
        </analyzer>
  
    
    
      </fieldType>

意思是默认文本字段类型启用HanLP分词器，text_general还开启了solr默认的各种filter。

solr允许为不同的字段指定不同的分词器，由于绝大部分字段都是text_general类型的，可以说这种做法比较适合新手。如果你是solr老手的话，你可能会更喜欢单独为不同的字段指定不同的分词器及其他配置。如果你的业务系统中有其他字段，比如location，summary之类，也需要一一指定其type="text_general"。切记，否则这些字段仍旧是solr默认分词器，会造成这些字段“搜索不到”。

另外，切记不要在query中开启indexMode，否则会影响PhaseQuery。indexMode只需在index中开启一遍即可，要不然它怎么叫indexMode呢。

如果你不需要solr提供的停用词、同义词等filter，如下配置可能更适合你：

 
   
   
 
  
    
    
    <fieldType name="text_cn" class="solr.TextField">
  
    
    
        <analyzer type="index">
  
    
    
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
  
    
    
        </analyzer>
  
    
    
        <analyzer type="query">
  
    
    
            <!-- 切记不要在query中开启index模式 -->
  
    
    
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
  
    
    
        </analyzer>
  
    
    
    </fieldType>
  
    
    
    <!-- 业务系统中需要分词的字段都需要指定type为text_cn -->
  
    
    
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
  
    
    
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>