Elasticsearch
Posted by 起航追梦人
1. Install elasticsearch-rtf (a Chinese distribution of Elasticsearch that comes with the common Chinese-analysis plugins preinstalled, which makes it easy for newcomers to learn and test).
Search for elasticsearch-rtf on https://github.com/ and download the latest version, then run elasticsearch.bat in the bin folder from cmd.
2. Open 127.0.0.1:9200 in a browser. If it returns output like the following, the installation succeeded:
------------------------------------
{ "name" : "ewadZmQ", "cluster_name" : "elasticsearch", "cluster_uuid" : "-BfaRD5ETwuGxlEEPqJNqQ", "version" : { "number" : "5.1.1", "build_hash" : "5395e21", "build_date" : "2016-12-06T12:36:15.409Z", "build_snapshot" : false, "lucene_version" : "6.3.0" }, "tagline" : "You Know, for Search" }
---------------------------------------
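The same check can be scripted. Below is a minimal sketch (standard library only, not from the original post) that reads the root endpoint and prints the version and tagline:

# Quick programmatic check that the node is up (illustrative sketch)
import json
from urllib.request import urlopen

with urlopen('http://127.0.0.1:9200') as resp:
    info = json.load(resp)

print(info['version']['number'])  # e.g. "5.1.1"
print(info['tagline'])            # "You Know, for Search"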
3. Install the head plugin
1) Search for elasticsearch-head on GitHub and download the top result.
2) Install Node.js (http://nodejs.cn/download/). After the install finishes, run node -v; if it prints a version number such as v6.10.3, Node.js is installed. Then run npm -v; if it prints a version such as 3.10.10, npm works as well (Node.js bundles npm).
3) Install cnpm (http://npm.taobao.org/) by running the following in cmd: npm install -g cnpm --registry=https://registry.npm.taobao.org
4) In cmd, change into the elasticsearch-head directory and run cnpm install; when that finishes, run cnpm run start.
5) Open http://localhost:9100 in a browser.
The page reports that it cannot connect to http://127.0.0.1:9200/. Why? By default Elasticsearch does not allow cross-origin requests from third-party front ends such as head, so the connection is refused.
Solution: append the following settings to the end of the elasticsearch.yml file in the config folder of elasticsearch-rtf:
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"
Restart Elasticsearch, and head connects successfully.
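To confirm the CORS settings took effect, you can send a request with an Origin header and inspect the response. A hedged sketch (assumes the requests package is installed; any HTTP client works):

# Elasticsearch should now answer cross-origin requests with an
# Access-Control-Allow-Origin header (illustrative sketch)
import requests

resp = requests.get('http://127.0.0.1:9200',
                    headers={'Origin': 'http://localhost:9100'})
print(resp.headers.get('Access-Control-Allow-Origin'))  # expect "*"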
6) Download Kibana 5.1.1 (matching the Elasticsearch version, 5.1.1) from https://www.elastic.co/downloads/past-releases. In cmd, run kibana.bat in the bin folder, then open http://127.0.0.1:5601/. If the page loads, the installation succeeded.
7) Write Scrapy data into Elasticsearch:
a. First, in the virtual environment, install elasticsearch-dsl (a high-level Python client for Elasticsearch) from cmd:
pip install elasticsearch-dsl
b. Create a models folder and an es_types.py file inside it. Define the field types, then run the file once to create the index:
from datetime import datetime
from elasticsearch_dsl import DocType, Date, Nested, Boolean, analyzer, \
    InnerObjectWrapper, Completion, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

# Register a default connection to the local Elasticsearch node
connections.create_connection(hosts=['localhost'])

class ArticleType(DocType):
    # Document type for crawled articles
    title = Text(analyzer="ik_max_word")
    create_date = Date()
    praise_nums = Integer()
    fav_nums = Integer()
    comment_nums = Integer()
    tags = Text(analyzer="ik_max_word")
    front_image_url = Keyword()
    url_object_id = Keyword()
    front_image_path = Keyword()
    url = Keyword()
    content = Text(analyzer="ik_max_word")

    class Meta:
        index = 'jobbole'
        doc_type = 'article'

if __name__ == '__main__':
    ArticleType.init()  # create the index and mapping
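Once ArticleType.init() has created the mapping, the same index can be queried through elasticsearch-dsl's Search interface. A minimal sketch (illustrative, not from the original post; the query term is arbitrary):

from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=['localhost'])

# Full-text match against the ik_max_word-analyzed title field
s = Search(index='jobbole').query('match', title='python')
for hit in s[:5].execute():  # fetch at most five hits
    print(hit.meta.score, hit.title)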
c. Define a pipeline class in pipelines.py:
class ElasticsearchPipeline(object):
    # Write items into Elasticsearch
    def process_item(self, item, spider):
        # Convert the item into an Elasticsearch document
        article = ArticleType()
        article.title = item['title']
        article.create_date = item['create_date']
        article.content = remove_tags(item['content'])  # remove_tags() strips HTML tags
        article.front_image_url = item['front_image_url']
        article.front_image_path = item['front_image_path']
        article.praise_nums = item['praise_nums']
        article.fav_nums = item['fav_nums']
        article.comment_nums = item['comment_nums']
        article.url = item['url']
        article.tags = item['tags']
        article.meta.id = item['url_object_id']
        article.save()  # persist the document
        return item
d. Then register the ElasticsearchPipeline class from pipelines.py in settings.py:
ITEM_PIPELINES = {'spider.pipelines.ElasticsearchPipeline': 1}
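The number is the pipeline's priority: lower values run earlier. If other pipelines are enabled as well, order them explicitly; a sketch (MysqlTwistedPipeline is a hypothetical second pipeline, not defined in this post):

ITEM_PIPELINES = {
    'spider.pipelines.MysqlTwistedPipeline': 1,   # runs first
    'spider.pipelines.ElasticsearchPipeline': 2,  # runs second
}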
e. Run the Scrapy spider. If the crawled data appears in the Data Browser tab at http://127.0.0.1:9100/, the configuration works.
Optimization: so that different spiders can share the same pipeline class, move the pipeline logic into the corresponding item class in items.py:
class JobboleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(input_processor=MapCompose(date_convert))
    praise_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    fav_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    comment_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags),
                        output_processor=Join(','))
    front_image_url = scrapy.Field(output_processor=MapCompose(returnValue))
    url_object_id = scrapy.Field(input_processor=MapCompose(get_md5))
    front_image_path = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()

    def get_insert_mysql(self):
        # Build the MySQL insert statement and its parameters
        insert_sql = """
            insert into jobbole(front_image_url, front_image_path, title, url,
                create_date, url_object_id, fav_nums, comment_nums, praise_nums,
                tags, content)
            values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums),
                comment_nums=VALUES(comment_nums), praise_nums=VALUES(praise_nums)
        """
        params = (self['front_image_url'][0], self['front_image_path'],
                  self['title'], self['url'], self['create_date'],
                  self['url_object_id'], self['fav_nums'], self['comment_nums'],
                  self['praise_nums'], self['tags'], self['content'])
        return insert_sql, params

    def save_to_elasticsearch(self):
        # Write this item into Elasticsearch
        article = ArticleType()
        article.title = self['title']
        article.create_date = self['create_date']
        article.content = remove_tags(self['content'])  # strip HTML tags
        article.front_image_url = self['front_image_url']
        if 'front_image_path' in self:
            article.front_image_path = self['front_image_path']
        article.praise_nums = self['praise_nums']
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.url = self['url']
        article.tags = self['tags']
        article.meta.id = self['url_object_id']
        article.save()  # persist the document
        return
Then have the pipeline class in pipelines.py simply call save_to_elasticsearch():
class ElasticsearchPipeline(object):
    # Write items into Elasticsearch; the item itself knows how to save
    def process_item(self, item, spider):
        item.save_to_elasticsearch()
        return item
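If some spiders yield item types that do not implement save_to_elasticsearch(), a slightly defensive variant (a sketch, not from the original post) lets mixed item types pass through the same pipeline:

class ElasticsearchPipeline(object):
    # Items without save_to_elasticsearch() pass through untouched,
    # so every spider in the project can share this pipeline
    def process_item(self, item, spider):
        if hasattr(item, 'save_to_elasticsearch'):
            item.save_to_elasticsearch()
        return item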