python: Python Web Scraper for Octopress that pushes data to ES

Editor's note: this article was compiled by the editors of 小常识网 (cha138.com). It presents a Python web scraper for Octopress that pushes data to ES, in the hope that it serves as a useful reference.

# centos: libxslt-devel python-devel
# debian: 
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.11:9200'])

# start from a clean index: delete any existing copy, then create it again
drop_index = es_client.indices.delete(index='myindex-test', ignore=[400, 404])
create_index = es_client.indices.create(index='myindex-test', ignore=400)

def urlparser(title, url):
    # fetch the post and scrape its <title>
    # (the caller passes the same sitemap URL for both arguments; the page is fetched from url)
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags
    desc = soup.find_all(attrs={"class": "category"})
    tag_names = [x.text for x in desc]

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

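    # ingest payload into elasticsearch
    # (doc_type is only accepted by older Elasticsearch releases and clients)
    res = es_client.index(index="myindex-test", doc_type="docs", body=doc)
    #print(doc)
    # throttle: wait half a second between pages
    time.sleep(0.5)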

# collect every post URL listed in the sitemap
sitemap_feed = 'http://blog.ruanbekker.com/sitemap.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.find_all('loc')]

for x in urls:
    urlparser(x, x)
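
The script above reads the sitemap with BeautifulSoup's 'html.parser'. As an alternative, the same `<loc>` extraction can be sketched with the standard library's XML parser. This is a minimal sketch, not the original author's method: the helper name `extract_sitemap_urls` and the sample sitemap are hypothetical, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace declared by sitemaps that follow the sitemaps.org protocol
# (assumption: the target sitemap uses it).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper; accepts the sitemap as str or bytes.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Made-up sample sitemap, used only to illustrate the helper.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/blog/first-post/</loc></url>
  <url><loc>http://example.com/blog/second-post/</loc></url>
</urlset>"""

for url in extract_sitemap_urls(sample):
    print(url)
```

In the scraper, this would be called as `urls = extract_sitemap_urls(requests.get(sitemap_feed).content)`. Note that a sitemap without the namespace declaration would yield an empty list here.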
