bokeyuan_python: scraping cnblogs articles into MongoDB and reading them back--LOWBIPROGRAMMER

Posted Xcsg


# -*- coding: utf-8 -*-
import os

import requests
from lxml import etree
from pymongo import MongoClient


class Boke(object):
    def __init__(self):
        # Listing page of the Python category on cnblogs.
        self.url = "https://www.cnblogs.com/cate/python/"
        self.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.90 Safari/537.36 2345Explorer/9.3.2.17331'}

    def get_data(self, url):
        # Download a page and return its raw bytes.
        response = requests.get(url, headers=self.headers)
        return response.content

    def xml_data(self, data):
        # Parse the listing page, follow every post link and store its URL.
        html = etree.HTML(data)
        mes = html.xpath("//div[@class='post_item']")
        for i in mes:
            item = {}  # renamed from "dict" to avoid shadowing the builtin
            info_url = i.xpath("./div[@class='post_item_body']/h3/a/@href")[0]
            self.info_data(info_url)
            item['url'] = info_url
            self.write_dbs(item)

    def info_data(self, data):
        # "data" is the URL of a single post.
        # The directory is created but nothing is written to it below.
        path = "f:/woc/"
        if not os.path.exists(path):
            os.makedirs(path)
        mes = self.get_data(data)
        html = etree.HTML(mes)
        posts = html.xpath("//div[@id='topics']/div[@class='post']")
        # print(posts)
        for x in posts:
            article = {}
            title = x.xpath("./h1[@class='postTitle']/a/text()")[0]
            # //text() returns a list of text fragments, not one string.
            info = x.xpath("./div[@class='postBody']//text()")
            article['title'] = title
            article['info'] = info
            self.write1_dbs(article)

    def dbs(self):
        # Database "boke": collection "zhu" holds URLs, "info" holds articles.
        connect = MongoClient('127.0.0.1', 27017)
        conn = connect['boke']
        conn1 = conn['zhu']
        conn2 = conn['info']
        return conn1, conn2

    def write_dbs(self, data):
        # Insert one URL record, then read back and print the whole collection.
        conn1, conn2 = self.dbs()
        conn1.insert_one(data)
        result = conn1.find()
        for i in result:
            print(i)

    def write1_dbs(self, data):
        # Insert one article, then read back and print the whole collection.
        conn1, conn2 = self.dbs()
        conn2.insert_one(data)
        result = conn2.find()
        for i in result:
            print(i)

    def run(self):
        url = self.url
        data = self.get_data(url)
        self.xml_data(data)


if __name__ == '__main__':
    boke = Boke()
    boke.run()
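
One thing to watch with the script above: `insert_one` adds a new document on every run, so scraping the same listing twice stores each article twice, and `info` is saved as a list of raw text fragments. A minimal sketch of one way around both, assuming the URL is a good enough key for an article. The helper names `join_text` and `save_article` are hypothetical and not part of the original script; `update_one` with `upsert=True` is the standard pymongo call.

```python
from typing import Any, Dict, Iterable


def join_text(fragments: Iterable[str]) -> str:
    """Collapse the fragments returned by an XPath //text() query into one string."""
    parts = (fragment.strip() for fragment in fragments)
    # Drop fragments that were only whitespace, keep the rest one per line.
    return "\n".join(part for part in parts if part)


def save_article(collection: Any, url: str, title: str,
                 fragments: Iterable[str]) -> Dict[str, str]:
    """Insert or update one article, keyed by its URL (hypothetical helper).

    "collection" is expected to behave like a pymongo Collection.
    """
    doc = {"url": url, "title": title.strip(), "info": join_text(fragments)}
    # With upsert=True, update_one inserts the document when nothing matches
    # the filter and updates the existing one otherwise, so running the
    # scraper again does not create duplicates.
    collection.update_one({"url": url}, {"$set": doc}, upsert=True)
    return doc
```

With a real database this could be called as `save_article(MongoClient('127.0.0.1', 27017)['boke']['info'], info_url, title, info)` in place of the `insert_one` call in `write1_dbs`.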
