scrapy

Posted 2022-11-12 hzls


Reference article: https://www.cnblogs.com/liuqingzheng/articles/10261760.html

1. Installation:

# Windows
    1. pip3 install wheel  # enables installing packages from wheel files; wheel index: https://www.lfd.uci.edu/~gohlke/pythonlibs
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/  # search for twisted and pick the build that matches your Python version
    6. Run pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy

# Linux
    1. pip3 install scrapy
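
After installing, it can help to confirm that the packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the list of import names are assumptions for illustration (note that pyopenssl is imported as `OpenSSL`).

```python
from importlib.util import find_spec

# Import names are assumptions: the pyopenssl package is imported as OpenSSL.
REQUIRED_MODULES = ["scrapy", "lxml", "OpenSSL", "twisted"]


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running the script with the same python3 that pip3 belongs to avoids checking a different environment by mistake.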

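2. Command line

#1 Getting help
    scrapy -h
    scrapy <command> -h

#2 There are two kinds of commands: project-only commands must be run from inside a project directory, while global commands work anywhere
    Global commands:
        startproject *  # create a project
        genspider *     # create a spider, e.g. cd myscrapy, then scrapy genspider tmall www.tmall.com
        settings        # show settings; inside a project directory, shows that project's values
        runspider       # run a spider from a standalone Python file without creating a project
        shell           # scrapy shell <url>: interactive debugging, e.g. checking whether selector rules are correct
        fetch           # fetch a single page outside of any project; can also show the headers
        view            # open the downloaded page in a browser, which helps reveal which data is loaded by ajax requests
        version *       # scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies
    Project-only commands:
        crawl *         # scrapy crawl tmall: run a spider; requires a project. Make sure ROBOTSTXT_OBEY = False is set in the settings file
        check           # run the spiders' contract checks
        list            # list the names of the spiders in the project
        edit            # open a spider in an editor; rarely used
        parse           # scrapy parse <url> --callback <callback>: verify that a callback works as expected
        bench           # scrapy bench: run a quick benchmark

#3 Official documentation
    https://docs.scrapy.org/en/latest/topics/commands.html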

3. Project structure and a brief introduction to spiders

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py

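File descriptions:

scrapy.cfg The project's main configuration, used when deploying Scrapy; spider-related settings live in settings.py.
items.py Defines the data storage templates used to structure scraped data, similar to a Django Model.
pipelines.py Item processing, typically persisting the structured data.
settings.py Configuration such as crawl depth, concurrency, and download delay. Note: setting names must be uppercase or they are ignored, e.g. USER_AGENT = 'xxxx'
spiders/ The spider directory, where you create files and write the crawling rules.

Note: spider files are usually named after the site's domain.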
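
The uppercase rule for settings.py can be illustrated with a small stand-alone sketch. This is a simplified imitation of the rule, not Scrapy's actual loader; `uppercase_settings` and `fake_settings` are hypothetical names, and the values are made up for the example.

```python
import types


def uppercase_settings(module):
    """Collect only the ALL-UPPERCASE attributes of a settings-like module."""
    return {
        name: getattr(module, name)
        for name in dir(module)
        if name.isupper()
    }


# A stand-in for settings.py (hypothetical values).
fake_settings = types.ModuleType("fake_settings")
fake_settings.USER_AGENT = "my-crawler/1.0"
fake_settings.DOWNLOAD_DELAY = 2
fake_settings.user_agent = "ignored because the name is lowercase"

if __name__ == "__main__":
    print(uppercase_settings(fake_settings))
```

Only USER_AGENT and DOWNLOAD_DELAY survive the filter; the lowercase user_agent is silently dropped, which is why a misspelled or lowercase setting has no effect.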

#tmall.py

# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urlencode
from ..items import MyprojectItem

class TmallSpider(scrapy.Spider):
    name = 'tmall'
    allowed_domains = ['www.tmall.com', 'httpbin.org']
    # URLs to crawl.
    # The parent class's start_requests normally reads start_urls; since this spider overrides start_requests, start_urls is not needed here.
    # start_urls = ['http://httpbin.org/get']

    # Custom request headers
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
    }

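    def __init__(self, pro=None, *args, **kwargs):
        super(TmallSpider, self).__init__(*args, **kwargs)
        # pro is the search keyword, passed on the command line with -a pro=...
        self.params = {
            'q': pro,
            'totalPage': 1,
            'jumpto': 1,
        }
        # A single URL string (not a list); start_requests below uses it directly
        self.start_urls = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)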

    # Override the parent class's start_requests
    def start_requests(self):
        # for url in self.start_urls:
        yield scrapy.Request(url=self.start_urls, callback=self.get_totallpage, dont_filter=True)  # dont_filter=True skips duplicate filtering

    # Callback for the first response (without an explicit callback, responses go to parse)
    def get_totallpage(self, response):
        # print('parsed')
        # print(response.text)
        res = response.css('[name="totalPage"]::attr(value)').extract_first()
        self.params['totalPage'] = int(res)
        for i in range(1, self.params['totalPage'] + 1):
        # for i in range(1, 2):
            self.params['jumpto'] = i
            self.url = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)
            yield scrapy.Request(url=self.url, callback=self.get_info, dont_filter=True)

    def get_info(self, response):
        elements = response.css('[class="product  "]')
        for element in elements:
            title = element.css('[class="productTitle"] a::attr(title)').extract_first()
            price = element.css('[class="productPrice"] em::attr(title)').extract_first()
            print(title, price)
            item = MyprojectItem()
            # scrapy.Item fields are set with dict-style access, not attributes
            item['title'] = title
            item['price'] = price
            yield item
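
The pagination logic in get_totallpage can also be written as pure functions, which makes it easy to test without a network connection. The sketch below is one possible variant, not the original author's code: `parse_total_pages` and `page_urls` are hypothetical helpers, and the fallback to one page when the totalPage value is missing is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://list.tmall.com/search_product.htm?"


def parse_total_pages(value, default=1):
    """Convert the scraped totalPage value to an int, falling back to `default`."""
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        # extract_first() returns None when the selector matches nothing
        return default


def page_urls(keyword, total_pages):
    """Yield one search URL per page without mutating shared state."""
    for page in range(1, total_pages + 1):
        params = {"q": keyword, "totalPage": total_pages, "jumpto": page}
        yield BASE_URL + urlencode(params)


if __name__ == "__main__":
    for url in page_urls("shirt", parse_total_pages("2")):
        print(url)
```

Building a fresh params dict per page avoids rewriting self.params while requests are still being scheduled, and the guarded conversion keeps the spider from crashing if the page layout changes.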

#items.py

import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()

Configuring a proxy IP:

#middlewares.py

from scrapy import signals
import requests

class MyprojectDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # return None

        # Fetch a proxy from the local proxy pool and attach it to the request
        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain

        # If the proxy IP has been banned, remove it from the pool and retry the request with a new one
        old_ip = request.meta['proxy'].split('//')[1]
        requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(old_ip))

        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return request


    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
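
The string handling around the proxy pool is where the middleware is easiest to get wrong, so it may be worth isolating. Here is a minimal sketch, assuming the pool at 127.0.0.1:5010 replies with a plain-text host:port as the middleware above implies; the helper names are hypothetical.

```python
from urllib.parse import urlsplit

POOL = "http://127.0.0.1:5010"  # proxy pool address used by the middleware


def proxy_address(proxy_url):
    """Return the host:port part of a proxy URL such as 'http://1.2.3.4:8080'."""
    return urlsplit(proxy_url).netloc


def delete_url(proxy_url, pool=POOL):
    """Build the pool's delete URL for a banned proxy."""
    return "{}/delete/?proxy={}".format(pool, proxy_address(proxy_url))


def proxy_meta_value(pool_response_text):
    """Turn the pool's plain-text reply into a value for request.meta['proxy']."""
    return "http://" + pool_response_text.strip()
```

Note that a downloader middleware only takes effect once it is listed in DOWNLOADER_MIDDLEWARES in settings.py, keyed by its import path (for example 'myproject.middlewares.MyprojectDownloaderMiddleware', where the package name is an assumption).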

Runner script:

# Create run.py in the project directory

from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装', '--nolog'])  # no log output
# execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装'])
# execute(['scrapy', 'crawl', 'tmall'])
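
The argument list passed to execute mirrors what would be typed on the command line. The following sketch shows one way to assemble it; `crawl_argv` is a hypothetical helper, not part of Scrapy.

```python
def crawl_argv(spider, nolog=False, **spider_args):
    """Build the argv list for a 'scrapy crawl' invocation."""
    argv = ["scrapy", "crawl", spider]
    for key, value in spider_args.items():
        # Each spider argument becomes "-a key=value" and is passed to the spider's __init__
        argv += ["-a", "{}={}".format(key, value)]
    if nolog:
        argv.append("--nolog")
    return argv
```

In run.py this could be used as execute(crawl_argv('tmall', nolog=True, pro='男装')), which produces the same list as the hand-written version.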
