Python web scraper for Octopress that pushes data to Elasticsearch (ES)
# centos: libxslt-devel python-devel
# debian:
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
es_client = Elasticsearch(['http://10.0.1.11:9200'])
# start from a clean index: drop it if it exists, then recreate it
drop_index = es_client.indices.delete(index='myindex-test', ignore=[400, 404])
create_index = es_client.indices.create(index='myindex-test', ignore=400)

def urlparser(title, url):
    # scrape title
    p = {}
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags
    tag_names = []
    desc = soup.findAll(attrs={"class":"category"})
    for x in desc:
        tag_names.append(x.text)

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch
    res = es_client.index(index="myindex-test", doc_type="docs", body=doc)
    #print(doc)
    time.sleep(0.5)

sitemap_feed = 'http://blog.ruanbekker.com/sitemap.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.findAll('loc')]

for x in urls:
    urlparser(x, x)
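
The script above reads the sitemap with BeautifulSoup's 'html.parser', although a sitemap is an XML document. As a minimal sketch of an alternative, the `<loc>` entries could be collected with the standard-library `xml.etree.ElementTree`. The helper name `extract_sitemap_urls` and the `SAMPLE_SITEMAP` data are hypothetical and only serve the illustration; they are not part of the original script.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the body of a sitemap.xml response
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/blog/post-one/</loc></url>
  <url><loc> http://example.com/blog/post-two/ </loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like '{namespace}loc'; keep only the local name
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            urls.append(element.text.strip())
    return urls


if __name__ == '__main__':
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

Under these assumptions, `extract_sitemap_urls(requests.get(sitemap_feed).content)` would yield the same kind of URL list as the `findAll('loc')` comprehension, and the result could be passed to `urlparser` in the same loop.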