Python Scrapy: Details to Watch When Writing Automated Crawlers

Posted 2020-08-24

1. Simulating a browser on the first request

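In the spider file, add a start_requests method that sends a browser-like User-Agent header with the first request. For example: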

def start_requests(self):
    ua = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2050.400 QQBrowser/9.5.10169.400'}
    yield Request("http://www.baidu.com", headers=ua)

This requires the import: from scrapy.http import Request

2. Simulating a browser during automatic crawling

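Open settings.py and assign a value to USER_AGENT. Scrapy then sends this value on every request that does not set its own User-Agent header, including the follow-up requests the crawler generates automatically. For example: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2050.400 QQBrowser/9.5.10169.400'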

3. Commenting out the original start page

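If you define a start_requests method, comment out the original start_urls line: start_urls = ['http://www.baidu.com/']. Once start_requests is overridden, the first requests come from that method and start_urls is no longer read.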

4. The target site's robots.txt protocol

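If the crawler should not honor the target site's robots.txt rules, set the following in settings.py:

ROBOTSTXT_OBEY = False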
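
To make concrete what kind of check this setting turns off, here is a small standard-library sketch of a robots.txt lookup. It does not use Scrapy's own robots.txt handling. The robots.txt content, the "mybot" user agent, the example.com URLs, and the is_allowed helper are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A path under /private/ is disallowed; other paths are allowed.
    print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/private/page"))
    print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/public/page"))
```

With ROBOTSTXT_OBEY = True, Scrapy skips requests that the site's robots.txt disallows. With False, as in this article, no such filtering is applied.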
