Building a Search Engine with a Scrapy Distributed Crawler (imooc): Crawling Zhihu

Posted 2020-10-21 chimuyhs

Simulating a Zhihu login with Scrapy

Let Scrapy generate the zhihu.py file from the command line

Then activate the virtual environment

Create zhihu.py with the genspider command

scrapy genspider zhihu www.zhihu.com

Create a main.py file so that the spider can be run and debugged from the IDE

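# coding: utf-8

from scrapy.cmdline import execute  # execute() runs a scrapy command from a script

import sys
import os

# os.path.abspath(__file__) is the full path of this file (main.py);
# os.path.dirname() of that path is its parent directory, i.e. the project directory.
# Add the project directory to sys.path so the project's modules can be imported.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Call execute() to run the command "scrapy crawl zhihu"
execute(["scrapy", "crawl", "zhihu"])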
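The path handling in main.py can be tried on its own. The following is a minimal sketch, not part of the course code: the helper name add_project_root and the folder name ArticleSpider are hypothetical and only serve as an illustration.

```python
import os
import sys


def add_project_root(script_path):
    """Append the directory containing script_path to sys.path (once) and return it."""
    root = os.path.dirname(os.path.abspath(script_path))
    if root not in sys.path:
        sys.path.append(root)
    return root


# "ArticleSpider" is a hypothetical project folder; in main.py you would pass __file__.
root = add_project_root(os.path.join("ArticleSpider", "main.py"))
print(root)
```

Checking for the entry before appending keeps sys.path free of duplicates if the script is run repeatedly in the same interpreter.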

Before running main.py to debug, edit settings.py so that Scrapy does not obey robots.txt; otherwise many URLs will be filtered out:

ROBOTSTXT_OBEY = False

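Note: by default "." in a regular expression does not match a newline, so re.match can only match within the first line of the response. Add re.DOTALL so that the pattern can span the whole multi-line text:

match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)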
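The effect of the flag can be checked without running the spider. The sketch below uses a made-up HTML fragment in place of response.text (the token value abc123 is an assumption for illustration) and also shows re.search as a possible alternative that needs neither the leading ".*" nor the flag.

```python
import re

# Hypothetical multi-line HTML standing in for response.text.
html = '<html>\n<body>\n<input type="hidden" name="_xsrf" value="abc123"/>\n</body>\n</html>'

pattern = '.*name="_xsrf" value="(.*?)"'

# Without re.DOTALL, "." stops at the first newline. re.match is anchored at the
# start of the string, so it cannot reach the token on the third line.
no_flag = re.match(pattern, html)

# With re.DOTALL, "." also matches newlines, so the leading ".*" can span lines.
with_flag = re.match(pattern, html, re.DOTALL)

# re.search scans the whole string for the pattern, so no leading ".*" is needed.
searched = re.search(r'name="_xsrf" value="(.*?)"', html)

print(no_flag)
print(with_flag.group(1))
print(searched.group(1))
```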

Final code of zhihu.py:

# -*- coding: utf-8 -*-
import scrapy
import re
import json


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    }

    def parse(self, response):
        pass

    def start_requests(self):
        # Request the sign-up page first so that login() can read the _xsrf token from it
        return [scrapy.Request('https://www.zhihu.com/signup?next=%2F', callback=self.login, headers=self.headers)]

    def login(self, response):
        # re.DOTALL lets "." match newlines, so the pattern can span the whole page
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = match_obj.group(1)  # keep the token for the login form

        if xsrf:
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "YOUR_PHONE_NUMBER",  # placeholder: use your own account
                "password": "YOUR_PASSWORD"        # placeholder: use your own password
            }

            return [scrapy.FormRequest(
                url=post_url,
                formdata=post_data,
                headers=self.headers,
                callback=self.check_login  # pass the function itself; adding parentheses would call it
            )]

    def check_login(self, response):
        # Inspect the data returned by the server to decide whether the login succeeded
        text_json = json.loads(response.text)
        # "登陆成功" is the server's "login succeeded" message
        if "msg" in text_json and text_json["msg"] == "登陆成功":
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
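The check in check_login assumes the server always answers with JSON. As a hedged sketch, the same test can be written as a standalone helper that also tolerates a non-JSON body (for example an HTML error page). The name login_succeeded is hypothetical, and the expected message is taken from the code above rather than from any documented Zhihu API.

```python
import json


def login_succeeded(body):
    """Return True only if body is a JSON object whose "msg" field reports success."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # "登陆成功" ("login succeeded") is the message the spider above compares against.
    return isinstance(data, dict) and data.get("msg") == "登陆成功"


print(login_succeeded('{"msg": "登陆成功"}'))
print(login_succeeded('<html>error page</html>'))
```

Inside the spider, check_login could call such a helper with response.text before yielding the follow-up requests.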
