Python: a script that scrapes a sitemap and extracts URLs, titles, and tags into Elasticsearch
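Introduction: this article was compiled by the editors of 小常识网 (cha138.com). It presents a Python script that scrapes a sitemap and extracts each page's URL, title, and tags into Elasticsearch, offered as a reference.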
# centos: libxslt-devel python-devel
# debian:
import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch
es_client = Elasticsearch(['http://10.0.1.11:9200'])
# Drop any existing index first, then recreate it, so every run starts clean.
drop_index = es_client.indices.delete(index='myindex-test', ignore=[400, 404])
create_index = es_client.indices.create(index='myindex-test', ignore=400)
def urlparser(title, url):
    # scrape title; note that `title` carries the URL of the page to fetch
    post = title
    page = requests.get(post).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags from <meta property="article:tag" content="..."> elements,
    # keeping them in document order
    tag_names = []
    desc = soup.findAll(attrs={"property": "article:tag"})
    for tag in desc:
        tag_names.append(tag['content'])

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch
    # (doc_type applies to older clusters; mapping types were removed in Elasticsearch 8.0)
    res = es_client.index(index="myindex-test", doc_type="docs", body=doc)

    # pause between requests to avoid hammering the site
    time.sleep(0.5)
sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.findAll('loc')]
for x in urls:
    urlparser(x, x)
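The two scraping steps above (collecting the `loc` entries from the sitemap, then reading each page's title and `article:tag` meta values) can be tried without network access or an Elasticsearch cluster. The following is only a minimal sketch under stated assumptions, not the original author's method: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` and `html.parser`, and the names `extract_sitemap_urls`, `ArticleMetaParser`, `extract_title_and_tags`, and the example.com sample data are hypothetical, introduced purely for illustration. It also assumes the sitemap declares the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by sitemaps that follow the sitemaps.org protocol (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


class ArticleMetaParser(HTMLParser):
    """Collect the <title> text and every article:tag meta value, in order."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.tags = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "article:tag":
            content = attrs.get("content")
            if content:
                self.tags.append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_tags(html):
    """Return (title, tags) for one HTML page."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return parser.title.strip(), parser.tags


# Hypothetical sample data standing in for a real sitemap and a real post.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-one/</loc></url>
  <url><loc>https://example.com/post-two/</loc></url>
</urlset>"""

sample_html = """<html><head>
<title>Post One</title>
<meta property="article:tag" content="python" />
<meta property="article:tag" content="elasticsearch" />
</head><body></body></html>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(sample_sitemap):
        print(url)
    print(extract_title_and_tags(sample_html))
```

Keeping the parsing in pure functions like these makes it possible to check the extraction logic against saved pages before pointing the script at a live site and a real index.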