Crawl It! Master Scrapy in Three Days

Posted 2021-08-30 二哥不像程序员


In "Master Scrapy in Three Days (Part 1)" we briefly covered how Scrapy works and built a tiny crawler with it. In this post we take a closer look at Scrapy.

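1. Implementing pagination

The content we want to scrape is often spread over several pages, so let's see how to follow pagination with Scrapy.

To move to the next page, we find the URL of the next page, build a request object for that URL, and hand the request to the engine.

Find the URL of the next page
Build the request object for that URL: scrapy.Request(url, callback)
- callback: the parse function that will handle the response
Hand it to the engine: yield scrapy.Request(url, callback)
Full parameter list: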

 scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None,
                         encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)

Example code:

import scrapy


class CarsDataSpider(scrapy.Spider):
    name = 'Cars_data'
    allowed_domains = ['12365auto.com']
    start_urls = ['http://www.12365auto.com/zlts/0-0-0-0-0-0_0-0-0-0-0-0-0-1.shtml']

    # Parse the data
    def parse(self, response):
        name_car = response.xpath(".//tbody//td[2]")
        item = {}
        item['name'] = name_car.extract()
        # Hand the scraped item to the engine
        yield item

        # extract_first() returns None when the link is missing, so check before using it
        next_url = response.xpath(".//div[@class='p_page']/a[1]/@href").extract_first()
        if next_url:
            # Resolve the (possibly relative) href against the current page URL
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
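
The href taken from a pagination link may be relative, root-relative, or missing altogether, so gluing it onto a fixed prefix is fragile. Below is a minimal sketch of the URL handling using only the standard library. The helper name build_next_url and the second sample href are made up for illustration; inside a spider, response.urljoin plays a similar role.

```python
from urllib.parse import urljoin


def build_next_url(page_url, href):
    """Return the absolute next-page URL, or None when no href was found."""
    if not href:
        # Last page, or the XPath matched nothing
        return None
    # urljoin resolves relative and root-relative links against the current page
    return urljoin(page_url, href.strip())


current = 'http://www.12365auto.com/zlts/0-0-0-0-0-0_0-0-0-0-0-0-0-1.shtml'
print(build_next_url(current, '0-0-0-0-0-0_0-0-0-0-0-0-0-2.shtml'))
print(build_next_url(current, None))
```

Returning None for a missing link lets the caller stop paginating with a plain if check instead of failing on string concatenation.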

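Other Request parameters

callback: the function that handles the response for this URL
meta: passes data between parse functions; meta also carries some data by default, such as download latency and request depth
dont_filter: defaults to False, which means requested URLs are filtered and a URL that has already been requested is not requested again. Set it to True for URLs that must be fetched repeatedly, for example forum pagination (such as Tieba) where the page data keeps changing. Requests built from start_urls are sent with filtering turned off, otherwise the spider might not start
method: selects a POST or GET request
headers: takes a dict; cookies do not go here
cookies: takes a dict, used only for cookies
body: takes a JSON string holding the POST data; used when sending a POST request with a payload (POST requests are covered in the next section)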
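
To make the effect of dont_filter concrete, here is a toy model of URL de-duplication. It is only a sketch: ToyScheduler is a made-up class, and Scrapy's real scheduler filters on request fingerprints rather than raw URL strings.

```python
class ToyScheduler:
    """Toy model of request de-duplication; not Scrapy's real scheduler."""

    def __init__(self):
        self.seen = set()   # URLs that were already queued once
        self.queue = []     # URLs waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        """Queue url unless it was seen before and filtering is enabled."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
print(scheduler.enqueue('http://example.com/page/1'))                    # first visit
print(scheduler.enqueue('http://example.com/page/1'))                    # duplicate
print(scheduler.enqueue('http://example.com/page/1', dont_filter=True))  # forced
```

The second call is dropped as a duplicate, while the third goes through because dont_filter=True bypasses the check.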

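2. Simulating login

Many pages only show the data we want after we log in, so the crawler has to be able to simulate a login. Plenty of third-party libraries can do this: requests can log in with a cookie, and selenium can type text into input elements and click the login button. Scrapy, which we use in this post, logs in the same way requests does, with a cookie.

If you don't know how to get the cookie, see this earlier post by 二哥:
Python爬虫｜反爬初体验

Logging in directly

The code for simulating login directly with a cookie looks like this: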

import scrapy


class CarsDataSpider(scrapy.Spider):
    name = 'Cars_data'
    allowed_domains = ['12365auto.com']
    start_urls = ['http://12365auto.com/']

    def start_requests(self):
        # Paste your own Cookie string here
        cookies = '...'
        # Convert the cookie string into a dict; split on the first '=' only,
        # because a cookie value may itself contain '='
        cookies_dict = {i.split('=', 1)[0]: i.split('=', 1)[1]
                        for i in cookies.split('; ') if '=' in i}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict
        )

    # Parse the data
    def parse(self, response):
        pass
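
A Cookie string copied from the browser can contain values with '=' in them, stray spaces, or empty segments. The following is a minimal sketch of a more forgiving conversion into the dict that the cookies parameter expects. The function name parse_cookie_header and the sample cookie string are invented for illustration.

```python
def parse_cookie_header(raw):
    """Turn a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        # Skip empty segments and attributes that have no value
        if not pair or '=' not in pair:
            continue
        # Split on the first '=' only, so values such as 'abc==' stay intact
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


print(parse_cookie_header('uid=42; token=abc==; ; flag'))
```

The resulting dict can be passed as cookies=... when building the request.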

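Logging in by sending a POST request

Besides using a cookie directly, Scrapy also offers two ways to send a POST request that logs in and obtains the cookie.

scrapy.FormRequest()

When using scrapy.FormRequest() to send the login POST request, we have to find the login request URL ourselves and build the request data the login needs.
We use GitHub as the example here. The approach is:

Find the login request URL: click the login button while capturing traffic; the URL turns out to be https://github.com/session
Work out the pattern of the request body: inspect the body of the POST request; all of its parameters appear in the previous response
Check whether the login succeeded: request the personal home page and see whether it contains the username

The implementation: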

import scrapy
import re


class GitSpider(scrapy.Spider):
    name = 'Git'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Hidden form fields taken from the login page
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        formdata = {
            "login": "ergebuxiang",
            "password": "123456",
            "authenticity_token": authenticity_token,
            "utf8": utf8,
            "commit": commit
        }
        # Drop fields that were not found on the page; form values must be strings
        formdata = {k: v for k, v in formdata.items() if v is not None}

        # Build the POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata=formdata,
            callback=self.parse_login
        )

    def parse_login(self, response):
        # A non-empty list means the marker text was found in the page after login
        res = re.findall(r"Learn Git and GitHub without any code!", response.text)
        print(res)
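
A form POST carries its fields as a URL-encoded body. The sketch below shows that encoding step with the standard library only, merging hidden fields scraped from the login page with the credentials. The helper name build_login_body and the token value are placeholders, and the code illustrates the idea rather than Scrapy's internals.

```python
from urllib.parse import urlencode, parse_qs


def build_login_body(hidden_fields, login, password):
    """Merge hidden form fields with credentials and URL-encode the result."""
    # Drop fields that were not found on the page (value is None)
    fields = {k: v for k, v in hidden_fields.items() if v is not None}
    fields.update({"login": login, "password": password})
    return urlencode(fields)


body = build_login_body(
    {"authenticity_token": "tok/123==", "utf8": None, "commit": "Sign in"},
    "ergebuxiang",
    "123456",
)
print(body)
print(parse_qs(body))
```

Special characters in the token are percent-encoded, and parse_qs shows the fields a server would decode from that body.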

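scrapy.FormRequest.from_response()

With scrapy.FormRequest() above, we had to work out the request body ourselves before logging in. Scrapy also provides a method that lets us log in with just the username and password. The code: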

import scrapy
import re


class Git2Spider(scrapy.Spider):
    name = 'Git2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response() finds the form in the response, pre-fills its fields
        # (hidden ones included) and takes the target URL from the form itself
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "login": "ergebuxiang",
                "password": "123456"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        # A non-empty list means the marker text was found in the page after login
        res = re.findall(r"Learn Git and GitHub without any code!", response.text)
        print(res)
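
The idea behind from_response() is that the values already present in the page's form are collected first, and the fields you supply then override them. Here is a simplified, standard-library sketch of that pre-filling step. InputCollector, prefill_form, and the sample HTML are invented for illustration; the real method also handles multiple forms, submit-button clicks, and other details.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute default to an empty string
            self.fields[name] = attr.get("value") or ""


def prefill_form(html, overrides):
    """Return the form fields found in html, updated with overrides."""
    collector = InputCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    fields.update(overrides)
    return fields


SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="tok123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

print(prefill_form(SAMPLE_HTML, {"login": "ergebuxiang", "password": "123456"}))
```

The hidden token is carried over automatically, which is why only the username and password need to be supplied.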
