Ditch MySQL LIKE queries: use Sphinx for proper full-text indexing
Posted Liberal-man
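Sphinx is a full-text search engine designed to work with SQL databases. It can be paired with MySQL or PostgreSQL to provide more capable search than the database itself offers, and it also ships a storage engine plugin designed specifically for MySQL. Time to leave LIKE queries behind.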
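A single Sphinx index can hold up to 100 million records, and a query against 10 million records takes well under a second (0.x s). Building an index over 1 million records takes only 3 to 4 minutes, an index over 10 million records can be built within 50 minutes, and rebuilding an incremental index that holds only the newest 100,000 records takes a few dozen seconds.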
1. Installation
Environment: CentOS 6.5
yum install sphinx -y
The default configuration directory is /etc/sphinx/, which contains the configuration file sphinx.conf. Here is my configuration:
# Data source; here it is MySQL
source src1
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = beego_blog
    sql_port = 3306 # optional, default is 3306
    # SQL used to fetch rows from the database when the index is built
    sql_query = \
        SELECT id, userid, UNIX_TIMESTAMP(posttime) AS posttime, title, content, tags \
        FROM tb_post
    sql_attr_uint = userid
    sql_attr_timestamp = posttime
    sql_query_info = SELECT * FROM tb_post WHERE id=$id
}
# Index 1
index test1
{
    # data source to index
    source = src1
    # path of the index files
    path = /var/lib/sphinx/test1
    # how document info (attributes) is stored: extern
    docinfo = extern
    charset_type = sbcs
}
# Index 2
index testrt
{
    type = rt
    rt_mem_limit = 32M
    path = /var/lib/sphinx/testrt
    charset_type = utf-8
    rt_field = title
    rt_field = content
    rt_attr_uint = userid
}
indexer
{
    mem_limit = 32M
}
searchd
{
    listen = 0.0.0.0:9312 # address on which the search service listens
    listen = 9306:mysql41
    log = /var/log/sphinx/searchd.log
    query_log = /var/log/sphinx/query.log
    read_timeout = 5
    max_children = 30
    pid_file = /var/run/sphinx/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
    binlog_path = /var/lib/sphinx
}
Next, build the index. We will use the index test1 configured above, which pulls its data from MySQL, so first create the table and some rows in MySQL:
CREATE TABLE `tb_post` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`userid` mediumint(8) unsigned NOT NULL DEFAULT '0' COMMENT 'user id',
`title` varchar(100) NOT NULL DEFAULT '' COMMENT 'title',
`content` mediumtext NOT NULL COMMENT 'content',
`tags` varchar(100) NOT NULL DEFAULT '' COMMENT 'tags',
`posttime` datetime NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'publish time',
PRIMARY KEY (`id`)
);
INSERT INTO `tb_post` VALUES ('1', '1', 'epoll边沿触发漏报消息包问题', '开发一个即时通讯后台,底层的网络收发使用 epoll + main loop实现网络事件', ',技术,', '2016-08-05 11:50:02');
INSERT INTO `tb_post` VALUES ('2', '1', 'epoll 边沿触发和水平触发区别实战讲解', 'epoll,看结果发现只接入了两条,还有3条没接入。说明高并发时,会出现客户端连接不上的问题。', ',技术,', '2016-08-05 22:03:23');
INSERT INTO `tb_post` VALUES ('3', '1', '快速排序算法', '快速排序算法是一个挺经典的算法,值得我们学习', ',技术,', '2016-08-05 23:08:00');
Build the index:
[root@centos6 data]# indexer test1
Sphinx 2.0.8-id64-release (r3831)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinx/sphinx.conf'...
indexing index 'test1'...
collected 37 docs, 0.8 MB
sorted 0.1 Mhits, 100.0% done
total 37 docs, 833156 bytes
total 0.082 sec, 10061176 bytes/sec, 446.81 docs/sec
total 3 reads, 0.000 sec, 57.7 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 40.2 kb/call avg, 0.0 msec/call avg
As you can see, 37 documents were indexed. We can test the result from the command line:
[root@centos6 libertyblog]# search epoll|more
Sphinx 2.0.8-id64-release (r3831)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinx/sphinx.conf'...
index 'test1': query 'epoll ': returned 2 matches of 2 total in 0.000 sec
displaying matches:
1. document=59, weight=2831, userid=1, posttime=Fri Aug 5 22:03:23 2016
id=59
userid=1
title=epoll ???????????????
content=开发一个即时通讯后台,底层的网络收发使用 epoll + main loop实现网络事件
......
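The query matched two documents; to save space, not all of the output is listed. The line "1. document=59, weight=2831" means that the matching document has id 59 and weight 2831. All of this was done on the command line. To serve queries to other programs, we also need to start the searchd daemon: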
[root@centos6 data]# service searchd start
Starting searchd: Sphinx 2.0.8-id64-release (r3831)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinx/sphinx.conf'...
WARNING: compat_sphinxql_magics=1 is deprecated; please update your application and config
listening on 127.0.0.1:9312
listening on all interfaces, port=9306
precaching index 'test1'
precaching index 'testrt'
precached 2 indexes in 0.002 sec
[  OK  ]
It started successfully and is bound to port 9312. Let's check its status:
[root@centos6 data]# searchd --status
Sphinx 2.0.8-id64-release (r3831)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinx/sphinx.conf'...
searchd status
--------------
uptime: 252
connections: 1
maxed_out: 0
command_search: 0
command_excerpt: 0
command_update: 0
command_keywords: 0
command_persist: 0
command_status: 1
command_flushattrs: 0
agent_connect: 0
agent_retry: 0
queries: 0
dist_queries: 0
query_wall: 0.000
query_cpu: OFF
dist_wall: 0.000
dist_local: 0.000
dist_wait: 0.000
query_reads: OFF
query_readkb: OFF
query_readtime: OFF
avg_query_wall: 0.000
avg_query_cpu: OFF
avg_dist_wall: 0.000
avg_dist_local: 0.000
avg_dist_wait: 0.000
avg_query_reads: OFF
avg_query_readkb: OFF
avg_query_readtime: OFF
Now let's access the service from a third-party client (written in Go):
package main

import (
	"fmt"
	"log"

	"github.com/yunge/sphinx"
)

func main() {
	SphinxClient := sphinx.NewClient().SetServer("localhost", 0).SetConnectTimeout(5000)
	if err := SphinxClient.Error(); err != nil {
		log.Fatal(err)
	}

	// Query: the first argument is the keyword to search for, the second is
	// the index name (test1), and the third is a free-form comment.
	res, err := SphinxClient.Query("epoll", "test1", "search article!")
	if err != nil {
		log.Fatal(err)
	}

	// Collect the ids of the matching documents as a comma-separated string.
	var articleIDs string
	for _, match := range res.Matches {
		articleIDs += fmt.Sprintf("%d,", match.DocId)
	}
	log.Println(articleIDs)

	SphinxClient.Close()
}
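The output contains the ids 1 and 2 but not 3, which shows that the index lookup is accurate: document 3 does not contain the word epoll, while documents 1 and 2 do. That completes the test, and the feature can now be wired into the search box of your own site. The old approach was a fuzzy query in the database, LIKE '%keyword%', which is very inefficient and keeps users waiting once the table grows. With a dedicated index the search can be orders of magnitude faster.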
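When rows are inserted or updated, the index needs an incremental update, which is simple. With searchd running, pass --rotate so that indexer builds new index files and then signals searchd to switch to them. Note that with the configuration above this command re-runs the full sql_query rather than indexing only the new rows: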
[root@centos6 data]# indexer --rotate test1
Sphinx 2.0.8-id64-release (r3831)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinx/sphinx.conf'...
indexing index 'test1'...
collected 37 docs, 0.8 MB
sorted 0.1 Mhits, 100.0% done
total 37 docs, 833156 bytes
total 0.081 sec, 10184036 bytes/sec, 452.26 docs/sec
total 3 reads, 0.000 sec, 57.7 kb/call avg, 0.0 msec/call avg
total 9 writes, 0.000 sec, 40.2 kb/call avg, 0.1 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=12074).
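It is best to put the incremental indexing into crontab so that it runs on a schedule and keeps the index up to date. The entry below runs it every day at 02:00: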
0 2 * * * indexer --rotate test1
Created 2015-08-10 in Hangzhou; updated 2016-08-19 in Hangzhou
This article is also published on the following platforms:
- LIBERALMAN: http://api.liberalman.cn:40000/article/69
- CSDN: http://blog.csdn.net/socho/article/details/52251177
- Jianshu (简书): http://www.jianshu.com/p/5dff17e2da7b