elasticsearch

Posted 起航追梦人


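1. Install elasticsearch-rtf (a Chinese distribution of elasticsearch that bundles plugins for Chinese text, convenient for beginners to learn and test with).

  Search for elasticsearch-rtf on https://github.com/, download the latest version, then run elasticsearch.bat from the bin folder in cmd.

2. Open 127.0.0.1:9200 in a browser. If it shows the following, the installation succeeded: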

------------------------------------

{
  "name" : "ewadZmQ",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-BfaRD5ETwuGxlEEPqJNqQ",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T12:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}
---------------------------------------
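
The same check can be scripted instead of done in a browser. The following is a minimal sketch using only the Python standard library; the helper names `fetch_info` and `summarize_info` are made up for illustration, and the sample data is copied from the response above so the snippet runs even without a node.

```python
import json
from urllib.request import urlopen


def fetch_info(base_url="http://127.0.0.1:9200"):
    """Fetch the root endpoint of a running node and decode the JSON body."""
    with urlopen(base_url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_info(info):
    """Build a one-line summary from the root endpoint response."""
    version = info["version"]
    return "{} {} (Lucene {})".format(
        info["cluster_name"], version["number"], version["lucene_version"]
    )


# Sample taken from the response shown above, so this runs without a node.
SAMPLE = json.loads("""
{
  "name": "ewadZmQ",
  "cluster_name": "elasticsearch",
  "version": {"number": "5.1.1", "lucene_version": "6.3.0"},
  "tagline": "You Know, for Search"
}
""")

print(summarize_info(SAMPLE))
```

Against a live node, `summarize_info(fetch_info())` should give the equivalent line for your own installation; if the node is not running, `fetch_info()` raises a connection error.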
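3. Install the head plugin

 1) Search GitHub for elasticsearch-head and download the first result.

 2) Install node.js (http://nodejs.cn/download/). After installing, run node -v; if it prints a version number such as v6.10.3, node.js is installed. Then run npm -v; if it prints a version number such as 3.10.10, npm is installed (node.js bundles npm).

 3) Install cnpm (http://npm.taobao.org/). In cmd, run: npm install -g cnpm --registry=https://registry.npm.taobao.org

 4) In cmd, change to the elasticsearch-head directory, run cnpm install, and when it finishes run cnpm run start

 5) Open http://localhost:9100 in a browser to see the head page.

The page reports that it cannot connect to http://127.0.0.1:9200/. Why? By default elasticsearch does not allow cross-origin requests from third-party pages, so head cannot connect.

 Fix: append the following settings to the end of elasticsearch.yml in the config folder of elasticsearch-rtf: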

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

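Restart elasticsearch; head can now connect.

6) Download and install Kibana 5.1.1 (matching elasticsearch 5.1.1) from https://www.elastic.co/downloads/past-releases. In cmd, run kibana.bat from the bin folder, then open http://127.0.0.1:5601/; if the page loads, the installation succeeded.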
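
To confirm the CORS settings took effect without opening head, one option is to send a request carrying an Origin header and look at the Access-Control-Allow-Origin response header. This is only a sketch: `fetch_cors_header` and `origin_allowed` are hypothetical helper names, and the exact header value a given elasticsearch version returns (a literal `*` or the echoed origin) is an assumption, which is why the check accepts both.

```python
from urllib.request import Request, urlopen


def fetch_cors_header(es_url="http://127.0.0.1:9200",
                      origin="http://localhost:9100"):
    """Send a GET with an Origin header; return Access-Control-Allow-Origin."""
    req = Request(es_url, headers={"Origin": origin})
    with urlopen(req, timeout=5) as resp:
        return resp.headers.get("Access-Control-Allow-Origin")


def origin_allowed(allow_origin, origin):
    """Return True if the header value permits the given origin."""
    return allow_origin is not None and allow_origin in ("*", origin)


# Offline demonstration with the values we would expect to see.
print(origin_allowed("*", "http://localhost:9100"))   # wildcard configured
print(origin_allowed(None, "http://localhost:9100"))  # header missing
```

With a running node, `origin_allowed(fetch_cors_header(), "http://localhost:9100")` performs the real check.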

 

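7) Write scrapy data into elasticsearch:
  a. In cmd, activate the virtual environment and install elasticsearch-dsl (a high-level Python interface to elasticsearch, used here from scrapy):
pip install elasticsearch-dsl
  b. Create a models folder, then create es_types.py in it. Define the field types and run the file to create the index: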
from datetime import datetime
from elasticsearch_dsl import DocType, Date, Nested, Boolean, analyzer, InnerObjectWrapper, Completion, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections
connections.create_connection(hosts=['localhost'])


class ArticleType(DocType):
    # Article document type
    title = Text(analyzer="ik_max_word")
    create_date = Date()
    praise_nums = Integer()
    fav_nums = Integer()
    comment_nums = Integer()
    tags = Text(analyzer="ik_max_word")
    front_image_url = Keyword()
    url_object_id = Keyword()
    front_image_path = Keyword()
    url = Keyword()
    content = Text(analyzer="ik_max_word")

    class Meta:
        index = 'jobbole'
        doc_type = 'article'


if __name__ == '__main__':
    # Create the index and its mapping in elasticsearch
    ArticleType.init()
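
For orientation, the field declarations above correspond roughly to the raw index body below, written in the elasticsearch 5.x style where mappings are keyed by document type. This is an approximation for reading purposes, not a dump of what elasticsearch-dsl actually sends; the names `text_field` and `ARTICLE_MAPPING` are made up for this sketch.

```python
import json


def text_field():
    """A text field analyzed with the ik_max_word Chinese analyzer."""
    return {"type": "text", "analyzer": "ik_max_word"}


# Approximate index body for the 'jobbole' index, doc type 'article'.
ARTICLE_MAPPING = {
    "mappings": {
        "article": {
            "properties": {
                "title": text_field(),
                "create_date": {"type": "date"},
                "praise_nums": {"type": "integer"},
                "fav_nums": {"type": "integer"},
                "comment_nums": {"type": "integer"},
                "tags": text_field(),
                "front_image_url": {"type": "keyword"},
                "url_object_id": {"type": "keyword"},
                "front_image_path": {"type": "keyword"},
                "url": {"type": "keyword"},
                "content": text_field(),
            }
        }
    }
}

print(json.dumps(ARTICLE_MAPPING, indent=2))
```

The split matters for search: text fields are tokenized by the analyzer, while keyword fields such as url are stored and matched as whole values.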

  c. Define a pipeline class in pipelines.py:
from w3lib.html import remove_tags

from models.es_types import ArticleType  # adjust to where the models folder sits in your project


class ElasticsearchPipeline(object):
    # Write data to elasticsearch
    def process_item(self, item, spider):
        # Convert the item into an elasticsearch document
        article = ArticleType()
        article.title = item['title']
        article.create_date = item['create_date']
        article.content = remove_tags(item['content'])  # remove_tags() strips HTML tags
        article.front_image_url = item['front_image_url']
        article.front_image_path = item['front_image_path']
        article.praise_nums = item['praise_nums']
        article.fav_nums = item['fav_nums']
        article.comment_nums = item['comment_nums']
        article.url = item['url']
        article.tags = item['tags']
        # Use url_object_id as the document id
        article.meta.id = item['url_object_id']

        article.save()  # save the document
        return item
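
The pipeline strips HTML from the content before indexing it. To illustrate what that step does, here is a standard-library stand-in called `strip_tags` (a hypothetical name, and not the `remove_tags()` function the pipeline calls). It keeps only text nodes, so its behavior on edge cases such as script blocks may differ from the real function.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_tags("<p>Hello <b>Elasticsearch</b></p>"))
```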

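  d. Register the ElasticsearchPipeline class from pipelines.py in settings.py:

ITEM_PIPELINES = {'spider.pipelines.ElasticsearchPipeline': 1}

  e. Run the scrapy spider. If the data shows up in the data browser at http://127.0.0.1:9100/, the setup works.

 Optimization: so that different spiders can share the same pipeline class, move the pipeline logic into the corresponding item class in item.py: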

class JobboleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(input_processor=MapCompose(date_convert))
    praise_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    fav_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    comment_nums = scrapy.Field(input_processor=MapCompose(number_convert))
    tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags), output_processor=Join(','))
    front_image_url = scrapy.Field(output_processor=MapCompose(returnValue))
    url_object_id = scrapy.Field(input_processor=MapCompose(get_md5))
    front_image_path = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()

    def get_insert_mysql(self):
        # Build the SQL statement and parameters for writing to mysql
        insert_sql = """
                    insert into jobbole(front_image_url,front_image_path,title,url,create_date,url_object_id,fav_nums,comment_nums,praise_nums,tags,content)
                    values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                    ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums),comment_nums=VALUES(comment_nums),praise_nums=VALUES(praise_nums)
                    """
        params = (self['front_image_url'][0], self['front_image_path'], self['title'], self['url'], self['create_date'],
                  self['url_object_id'], self['fav_nums'], self['comment_nums'], self['praise_nums'], self['tags'],
                  self['content'])
        return insert_sql, params

    def save_to_elasticsearch(self):
        # Write this item to elasticsearch
        article = ArticleType()
        article.title = self['title']
        article.create_date = self['create_date']
        article.content = remove_tags(self['content'])  # remove_tags() strips HTML tags
        article.front_image_url = self['front_image_url']
        # front_image_path is only set when the image was downloaded
        if 'front_image_path' in self:
            article.front_image_path = self['front_image_path']
        article.praise_nums = self['praise_nums']
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.url = self['url']
        article.tags = self['tags']
        article.meta.id = self['url_object_id']

        article.save()  # save the document

Then call save_to_elasticsearch() from the pipeline class in pipelines.py:

class ElasticsearchPipeline(object):
    # Write data to elasticsearch
    def process_item(self, item, spider):
        # Each item class knows how to convert and save itself
        item.save_to_elasticsearch()
        return item
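
The point of this refactoring is that the pipeline no longer needs to know which fields an item has; any item class that provides save_to_elasticsearch() will work. The sketch below shows the idea without scrapy or elasticsearch installed. `FakeArticleItem`, `FakeJobItem`, and `STORE` are hypothetical stand-ins invented for this example, and the save methods append to a list instead of talking to a real index.

```python
STORE = []  # stands in for the elasticsearch index in this sketch


class FakeArticleItem(dict):
    """Stand-in for an article item that knows how to store itself."""

    def save_to_elasticsearch(self):
        # A real item would build an ArticleType and call article.save().
        STORE.append(("article", self["title"]))


class FakeJobItem(dict):
    """Stand-in for an item type produced by a different spider."""

    def save_to_elasticsearch(self):
        STORE.append(("job", self["position"]))


class ElasticsearchPipeline(object):
    # Same shape as the pipeline above: delegate saving to the item.
    def process_item(self, item, spider):
        item.save_to_elasticsearch()
        return item


pipeline = ElasticsearchPipeline()
pipeline.process_item(FakeArticleItem(title="Hello"), spider=None)
pipeline.process_item(FakeJobItem(position="Python dev"), spider=None)
print(STORE)
```

Adding a new spider then only requires a new item class with its own save_to_elasticsearch(); the pipeline and the ITEM_PIPELINES setting stay unchanged.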

 

 
