Compiling, Installing, and Testing Sphinx Chinese Word Segmentation on CentOS: CoreSeek
Posted 路漫漫其修远兮
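To support Chinese word segmentation you also need to download Coreseek. You can search for it on the official site; I am using version 4.1 here.
Baidu Cloud download link: https://pan.baidu.com/s/1slNIyHf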
tar -zxvf coreseek-4.1-beta.tar.gz
cd coreseek-4.1-beta
cd mmseg-3.2.14/
./bootstrap    # check the build environment

libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config'.
libtoolize: copying file `config/ltmain.sh'
libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.in and
libtoolize: rerunning libtoolize, to keep the correct libtool macros in-tree.
libtoolize: Consider adding `-I m4' to ACLOCAL_AMFLAGS in Makefile.am.
+ autoheader
+ automake --add-missing --copy
+ autoconf
./configure --prefix=/usr/local/mmseg3
------------------------------------------------------------------------
Configuration:

  Source code location:       .
  Compiler:                   gcc
  Compiler flags:             -g -O2
  Host System Type:           x86_64-redhat-linux-gnu
  Install path:               /usr/local/mmseg3

  See config.h for further configuration information.
------------------------------------------------------------------------
make && make install
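Before moving on, it may be worth confirming that the install landed where the later steps expect it. The sketch below assumes the /usr/local/mmseg3 prefix used above and checks only the paths this post refers to later (bin/mmseg, etc/, include/mmseg/ and lib/). The function name check_mmseg_install is made up for illustration and is not part of mmseg.

```shell
#!/bin/sh
# Hypothetical helper: verify that an mmseg install prefix contains the
# paths used later in this post. Returns 0 if all exist, 1 otherwise.
check_mmseg_install() {
    prefix="$1"
    status=0
    for path in bin/mmseg etc include/mmseg lib; do
        if [ -e "$prefix/$path" ]; then
            echo "ok       $prefix/$path"
        else
            echo "missing  $prefix/$path"
            status=1
        fi
    done
    return "$status"
}

# Usage (the prefix is the one passed to ./configure above):
# check_mmseg_install /usr/local/mmseg3
```

If anything is reported missing, the CoreSeek configure step further down would most likely fail to find the mmseg headers or libraries, so it is cheaper to catch it here.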
Create a text file in the source directory to test the segmenter.

cd /usr/local/mmseg3
cd /usr/local/src/coreseek-4.1-beta/mmseg-3.2.14/src
vim test.txt

Contents of test.txt:
山东省德州市 北京朝阳市 中国北京 中国德州 中国山东德州

cd /usr/local/mmseg3/bin
./mmseg -d /usr/local/mmseg3/etc/ /usr/local/src/coreseek-4.1-beta/mmseg-3.2.14/src/test.txt

山东省/x 德州市/x /x /x 北京/x 朝阳市/x 中国/x 北京/x 中国/x 德州/x 中国/x 山东/x 德州/x
Word Splite took: 0 ms.
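The segmenter marks each token with a trailing /x, which is hard to read on one long line. As a small sketch, assuming the output format shown above, the following filter prints one token per line and drops empty tokens and the timing message. The name split_tokens is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn mmseg's "token/x token/x ..." output into one
# token per line. Words that do not end in /x (such as the timing line)
# and empty tokens are discarded.
split_tokens() {
    tr ' ' '\n' |
        sed -n 's#/x$##p' |
        grep -v '^$'
}

# Usage:
# ./mmseg -d /usr/local/mmseg3/etc/ test.txt | split_tokens
```

This only reshapes the text that mmseg already printed; it does not change how the words are segmented.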
cd /usr/local/src/coreseek-4.1-beta/csft-4.1    # from here on, csft can be treated as sphinx
sh buildconf.sh    # run the script as a test; if it reports no problems, the build can proceed
./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql

You can now run 'make install' to build and install Sphinx binaries.
On a multi-core machine, try 'make -j4 install' to speed up the build.

Updates, articles, help forum, and commercial support, consulting, training,
and development services are available at http://sphinxsearch.com/

Thank you for choosing Sphinx!
make && make install
make[3]: Entering directory `/usr/local/src/coreseek-4.1-beta/csft-4.1'
mkdir -p /usr/local/coreseek/var/data && mkdir -p /usr/local/coreseek/var/log
make[3]: Leaving directory `/usr/local/src/coreseek-4.1-beta/csft-4.1'
make[2]: Leaving directory `/usr/local/src/coreseek-4.1-beta/csft-4.1'
make[1]: Leaving directory `/usr/local/src/coreseek-4.1-beta/csft-4.1'

Next, open the MySQL client and create a table to test with.

create table kecheng(id int primary key auto_increment,name varchar(50),info varchar(50))charset utf8;
insert into kecheng(name,info) values('java','java是一门很牛的语言,性能整体来说比php要强,但是不如php开发速度快');
insert into kecheng(name,info) values('redis','redis是一种内存缓存数据库,比memcache支持的数据格式多');
insert into kecheng(name,info) values('memcache','memcache支持简单的key value形式,不像redis支持持久化');
insert into kecheng(name,info) values('jquery','jquery是一种前端脚本,结合php和java可以做web开发');
cd /usr/local/coreseek/    # this is effectively the sphinx directory
cd bin
ls    # the layout is similar to a stock sphinx install
cd /usr/local/coreseek/etc
cp sphinx.conf.dist csft.conf
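-- This table stores the highest id that has already been indexed,
-- so the whole table does not have to be re-indexed on every update.
CREATE TABLE index_table(
    Counter_id int unsigned not null primary key auto_increment,
    Max_id int unsigned not null comment 'highest id that has already been indexed'
);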
Edit the configuration file csft.conf:

source src1
{
    # data source type. mandatory, no default value
    # known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
    # database type
    type = mysql

    #####################################################################
    ## SQL settings (for 'mysql' and 'pgsql' types)
    #####################################################################

    # some straightforward parameters for SQL source types
    # (connection settings, self-explanatory)
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = test
    sql_port = 3306 # optional, default is 3306
    .....
    # set the connection character set
    sql_query_pre = SET NAMES utf8
    # turn off the MySQL query cache
    sql_query_pre = SET SESSION query_cache_type=OFF

    # mandatory, integer document ID field MUST be the first selected column
    # (the default query against the sample table is commented out)
    #sql_query = \
    # SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
    # FROM documents

    # the data to index; if the primary key is not named id, alias it to id,
    # e.g. SELECT tid AS id FROM tableName
    sql_query = SELECT id,name,info FROM kecheng

    # SQL run after the main query. index_table stores the last indexed primary key id,
    # so later runs only need to pick up the newest rows instead of the whole table
    sql_query_post = REPLACE INTO index_table SELECT 1,MAX(id) FROM kecheng;
    .....
    # fields returned when querying with the search tool; all columns here (testing only)
    sql_query_info = SELECT * FROM kecheng WHERE id=$id
    .....
}

index test1
{
    .....
    # where the index files are created
    path = /usr/local/coreseek/var/data/test1

    # document attribute values (docinfo) storage mode
    # switch to Chinese
    charset_type = zh_cn.utf-8
    # dictionary directory
    charset_dictpath = /usr/local/mmseg3/etc/
    .....
}

#----------------
source zengliangsuoyin : src1
{
    # fetch the rows that have not been indexed yet
    sql_query = SELECT id,name,info FROM kecheng WHERE id > (SELECT max_id FROM index_table)
    # the sql_query_post that writes the last id back to index_table is inherited from src1,
    # so it does not need to be repeated here
}

# an index inherits from another index (test1), not from a source
index zengliangsuoyin : test1
{
    source = zengliangsuoyin
    # must differ from test1's path, otherwise the two indexes overwrite each other's files
    path = /usr/local/coreseek/var/data/zengliangsuoyin
}

Save and exit.
cd /usr/local/coreseek/bin/
./indexer --all

using config file '/usr/local/coreseek/etc/csft.conf'...    -- the config file in use; the name matches the file copied earlier
indexing index 'test1'...
WARNING: attribute 'group_id' not found - IGNORING
WARNING: attribute 'date_added' not found - IGNORING
WARNING: Attribute count is 0: switching to none docinfo
collected 5 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 5 docs, 351 bytes
total 0.178 sec, 1971 bytes/sec, 28.07 docs/sec
indexing index 'test1stemmed'...
WARNING: attribute 'group_id' not found - IGNORING
WARNING: attribute 'date_added' not found - IGNORING
WARNING: Attribute count is 0: switching to none docinfo
collected 5 docs, 0.0 MB    -- five documents found, i.e. five MySQL rows, so the database connection works
sorted 0.0 Mhits, 100.0% done
total 5 docs, 351 bytes
total 0.007 sec, 47677 bytes/sec, 679.16 docs/sec
skipping non-plain index 'dist1'...
skipping non-plain index 'rt'...
total 4 reads, 0.000 sec, 0.3 kb/call avg, 0.0 msec/call avg
total 12 writes, 0.000 sec, 0.2 kb/call avg, 0.0 msec/call avg
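The "collected N docs" lines are the quickest sign that the indexer reached the database. As a rough sketch, assuming the log format shown above, this filter pulls out the document count per index so it can be compared with the row count in MySQL. The name collected_docs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read indexer output on stdin and print
# "<index name> <collected doc count>" for every index that was built.
collected_docs() {
    awk '
        /^indexing index/ {
            name = $3
            gsub(/[^A-Za-z0-9_]/, "", name)   # strip the quotes and trailing dots
        }
        /^collected [0-9]+ docs/ { print name, $2 }
    '
}

# Usage:
# ./indexer --all | collected_docs
```

A count that differs from SELECT COUNT(*) FROM kecheng would suggest that sql_query is not selecting the rows you expect.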
./search php
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file '/usr/local/coreseek/etc/csft.conf'...
index 'test1': query 'php ': returned 3 matches of 3 total in 0.000 sec

displaying matches:
1. document=1, weight=2500
    id=1
    group_id=1
    group_id2=5
    date_added=2017-02-08 06:22:36
    title=test one
    content=this is my test document number one. also checking search within phrases.
2. document=2, weight=1500
    id=2
    group_id=1
    group_id2=6
    date_added=2017-02-08 06:22:36
    title=test two
    content=this is my test document number two
3. document=5, weight=1500
    (document not found in db)

words:
1. 'php': 3 documents, 5 hits    -- number of occurrences

index 'test1stemmed': query 'php ': returned 3 matches of 3 total in 0.000 sec

displaying matches:
1. document=1, weight=2500
    id=1
    group_id=1
    group_id2=5
    date_added=2017-02-08 06:22:36
    title=test one
    content=this is my test document number one. also checking search within phrases.
2. document=2, weight=1500
    id=2
    group_id=1
    group_id2=6
    date_added=2017-02-08 06:22:36
    title=test two
    content=this is my test document number two
3. document=5, weight=1500
    (document not found in db)

words:
1. 'php': 3 documents, 5 hits

Testing is done. The next step is installing the PHP extension.
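The csft.conf above also defines an incremental index, zengliangsuoyin, but this post never runs it. One common Sphinx pattern, sketched here under assumptions, is to rebuild only that index and then fold it into test1 with indexer --merge. Because the inherited sql_query_post advances max_id after every run, the merge should follow each incremental run or the new rows would be dropped from the next one. DRY_RUN and the function names are invented for this example, and by default the script only prints the commands. If searchd is running you would probably also want --rotate, which is left out because searchd is not started in this post.

```shell
#!/bin/sh
# Sketch: refresh the incremental index and fold it into the main index.
# Index names and paths come from csft.conf above; DRY_RUN and the
# function names are assumptions made for this example.
INDEXER=/usr/local/coreseek/bin/indexer
CONFIG=/usr/local/coreseek/etc/csft.conf
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

refresh_delta() {
    # Index only the rows added since the max_id recorded in index_table.
    run "$INDEXER" --config "$CONFIG" zengliangsuoyin || return 1
    # Merge the incremental index into the main index so the new rows are kept.
    run "$INDEXER" --config "$CONFIG" --merge test1 zengliangsuoyin
}

# Usage:
# refresh_delta              # print what would be run
# DRY_RUN=0; refresh_delta   # actually run the indexer
```

Merging requires the two indexes to live at different paths, so check the path settings in csft.conf before running this for real.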