Scrapy project: crawl all Zhihu user profiles and save them to MongoDB

Posted by cwkcwk


This post walks through a Scrapy project that crawls Zhihu user profiles by following each user's followee list and saves the results to MongoDB. Three project files are shown below: the spider, the item pipeline, and the item definitions.

spider

import scrapy
import json, time, re
from zhihuinfo.items import ZhihuinfoItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/api/v4/members/eve-lee-55/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20']

    def parse(self, response):
        temp_data = json.loads(response.body.decode("utf-8"))["data"]
        count = len(temp_data)

        # 18 or fewer users on this page means we have reached the last page.
        if count <= 18:
            pass
        # Otherwise bump the offset so the spider requests the next page of this followee list.
        else:
            offset = re.findall(re.compile(r'&offset=(.*?)&'), response.url)[0]
            new_offset = int(offset) + 20
            print(new_offset)
            time.sleep(1)
            next_page = re.sub(r'offset=\d+', 'offset=' + str(new_offset), response.url)
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

        for i in temp_data:
            item = ZhihuinfoItem()
            item["name"] = i["name"]
            item["url_token"] = i["url_token"]
            item["headline"] = i["headline"]
            item["follower_count"] = i["follower_count"]
            item["answer_count"] = i["answer_count"]
            item["articles_count"] = i["articles_count"]
            item["id"] = i["id"]
            item["type"] = i["type"]

            # userinfo.txt records the url_token of every user already crawled, so the same
            # user is not scraped twice. Create an empty userinfo.txt before the first run.
            with open("userinfo.txt") as f:
                user_list = f.read()

            if i["url_token"] not in user_list:
                with open("userinfo.txt", "a") as f:  # "a" opens the file in append mode
                    f.write(i["url_token"] + "-----")
                yield item

            # Move on to this user's own followee list. The spider keeps spreading outward,
            # so in theory it can eventually reach every active, well-connected user.
            new_url = "https://www.zhihu.com/api/v4/members/" + i["url_token"] + "/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=20&limit=20"
            time.sleep(1)
            yield scrapy.Request(url=new_url, callback=self.parse)
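One optional refinement, not part of the original post: parse() re-reads userinfo.txt for every user, which gets slower as the file grows. The sketch below keeps the crawled url_tokens in an in-memory set that is loaded once when the spider starts; the helper names are illustrative and assume the same "url_token-----" file format used above.

import os


class SeenUsers(object):
    """Illustrative helper: load userinfo.txt once and answer membership checks from memory."""

    def __init__(self, path="userinfo.txt"):
        self.path = path
        # Create the file on the first run so the spider never crashes on a missing file.
        if not os.path.exists(path):
            open(path, "w").close()
        with open(path) as f:
            self.tokens = set(t for t in f.read().split("-----") if t)

    def __contains__(self, url_token):
        return url_token in self.tokens

    def add(self, url_token):
        # Record a newly crawled user both in memory and on disk.
        self.tokens.add(url_token)
        with open(self.path, "a") as f:
            f.write(url_token + "-----")

In the spider this would be created once, for example self.seen = SeenUsers() in __init__, and the per-user file read in parse() replaced with a check like: if i["url_token"] not in self.seen: self.seen.add(i["url_token"]); yield item.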




pipelines

import pymongo
# scrapy.conf was removed in newer Scrapy releases; get_project_settings is the drop-in way
# to read the project settings from a component like this pipeline.
from scrapy.utils.project import get_project_settings


class ZhihuinfoPipeline(object):
    def __init__(self):
        settings = get_project_settings()
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[dbname]
        self.post = tdb[settings["MONGODB_DOCNAME"]]

    def process_item(self, item, spider):
        zhihuzhihu = dict(item)
        # insert_one replaces the Collection.insert() method deprecated in PyMongo 3
        self.post.insert_one(zhihuzhihu)
        return item
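The pipeline reads its connection details from the project settings. A minimal settings.py sketch is shown below; the MONGODB_* key names are the ones the pipeline code expects, while the database and collection names, the pipeline path, and the extra politeness settings are example values to adjust for your own environment (the Zhihu API normally also needs a browser-like User-Agent).

# settings.py (sketch) -- example values, adjust to your environment

BOT_NAME = "zhihuinfo"

# MongoDB connection details read by ZhihuinfoPipeline
MONGODB_HOST = "127.0.0.1"
MONGODB_PORT = 27017
MONGODB_DBNAME = "zhihu"        # assumed database name
MONGODB_DOCNAME = "userinfo"    # assumed collection name

# Enable the pipeline (path assumes the default Scrapy project layout)
ITEM_PIPELINES = {
    "zhihuinfo.pipelines.ZhihuinfoPipeline": 300,
}

# Disable robots.txt enforcement if it blocks the API endpoints
ROBOTSTXT_OBEY = False

# A browser-like User-Agent is usually needed for the Zhihu API
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Throttle requests at the framework level instead of (or in addition to) time.sleep()
DOWNLOAD_DELAY = 1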


items

import scrapy


class ZhihuinfoItem(scrapy.Item):
    # Fields collected for each Zhihu user
    name = scrapy.Field()
    url_token = scrapy.Field()
    headline = scrapy.Field()
    follower_count = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    id = scrapy.Field()
    type = scrapy.Field()
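With the spider, items, and pipeline in place, the crawl is started from the project root with "scrapy crawl zhihu" (the spider's name attribute). A quick way to confirm that documents are reaching MongoDB is a short pymongo check like the one below; the database and collection names follow the assumed settings sketch above.

import pymongo

# Connect with the same host and port the pipeline uses.
client = pymongo.MongoClient(host="127.0.0.1", port=27017)
collection = client["zhihu"]["userinfo"]  # assumed MONGODB_DBNAME / MONGODB_DOCNAME values

# Number of users stored so far, plus a sample document.
print(collection.count_documents({}))
print(collection.find_one({}, {"_id": 0, "name": 1, "url_token": 1, "follower_count": 1}))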