Python library: Scrapy (work in progress, gaps remain)

Posted 2020-10-11

Scrapy: a fast, high-level screen-scraping and web-crawling framework.

Official site: http://scrapy.org/

Documentation: https://docs.scrapy.org/en/latest/

Installation: Scrapy on Windows 7 (2017-10-19)

Environment: Windows 7, Python 3.6.0, PyCharm 4.5. The Python directory is c:/python3/

Scrapy depends on quite a few libraries; at minimum it needs Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14.
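
To see which of these dependencies are already installed before running the installer, a check along the following lines can help. It is only a sketch: it assumes Python 3.8 or newer for `importlib.metadata` (the Python 3.6 setup described in these notes does not ship it), and the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed for this interpreter.
            found[name] = None
    return found


if __name__ == "__main__":
    # The three dependencies named in the text.
    for name, ver in installed_versions(["Twisted", "lxml", "pyOpenSSL"]).items():
        print(f"{name}: {ver or 'not installed'}")
```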

Reference: http://www.cnblogs.com/liuliliuli2017/p/6746440.html (installing the Scrapy crawler framework on Python 3, with common errors)

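Installing Twisted is where I ran into trouble. I worked around it as follows:

1. Download the .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (important: this site hosts a huge number of .whl files!).

My machine runs 64-bit Windows 7, so in principle I should have used Twisted-17.9.0-cp36-cp36m-win_amd64.whl, but the installer refused it. Taking a shot in the dark, I downloaded Twisted-17.9.0-cp36-cp36m-win32.whl instead and saved it as C:\Python3\Scripts\Twisted-17.9.0-cp36-cp36m-win32.whl

Run: python pip3.exe install Twisted-17.9.0-cp36-cp36m-win32.whl

Then run: python pip.exe install scrapy, and the installation succeeded.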
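
One possible reason for the amd64 wheel being rejected (these notes do not confirm it) is that the Python interpreter itself was a 32-bit build: pip matches a wheel's platform tag against the interpreter, not against the operating system. A minimal sketch for checking this follows; the function names are hypothetical, and it assumes an x86 or x64 machine.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    return struct.calcsize("P") * 8


def windows_wheel_tag():
    """Guess the Windows platform tag a wheel needs for this interpreter.

    Assumption: an x86/x64 machine; other architectures use other tags.
    """
    return "win_amd64" if interpreter_bits() == 64 else "win32"


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}, "
          f"{interpreter_bits()}-bit, wheel tag on Windows: {windows_wheel_tag()}")
```

If this reports a 32-bit interpreter, the win32 wheel is the matching one even on 64-bit Windows.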

Learning notes:

cd c:\Python3\zz\          # C:\Python3\zz\ is the folder where I keep my projects
python C:/Python3/Scripts/scrapy.exe startproject plant    # create a crawler project named plant

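C:\Python3\zz\plant\

├ scrapy.cfg:  the project's configuration file
├ plant/:  the project's Python module; you will add your code here later
├ plant/items.py:  the project's item definitions
├ plant/pipelines.py:  the project's pipelines
├ plant/settings.py:  the project's settings
└ plant/spiders/:  the directory that holds the spider code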
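
As a quick sanity check that `startproject` produced the layout listed above, something like the following could be run. It is a sketch only: `missing_entries` is a hypothetical helper, and it checks just the entries named in these notes (a real project contains a few more files).

```python
from pathlib import Path

# Relative paths taken from the layout listed in these notes.
EXPECTED = [
    "scrapy.cfg",
    "plant",
    "plant/items.py",
    "plant/pipelines.py",
    "plant/settings.py",
    "plant/spiders",
]


def missing_entries(project_root, expected=EXPECTED):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # The path below is the one used in these notes; adjust as needed.
    print(missing_entries(r"C:\Python3\zz\plant") or "layout looks complete")
```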

Edit items.py:

import scrapy


class DmozItem(scrapy.Item):
    # Fields that each scraped item will carry
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Write the first spider: create the file C:\Python3\zz\plant\plant\spiders\quotes_spider.py

The next two steps follow the tutorial at https://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project, but they raised errors on my machine; I will try again tomorrow.

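import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl quotes
    name = "quotes"

    def start_requests(self):
        # Pages to download; each response is passed to parse()
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends with "/", so the page number is the second-to-last piece
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # Save the raw response body to quotes-<page>.html
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)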
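
The file name in `parse` comes from splitting the URL on `/`: because the URL ends with a slash, the last piece is empty and the page number is the second-to-last piece. The standalone sketch below reproduces just that step without Scrapy; `page_filename` is a name invented for the illustration.

```python
def page_filename(url):
    """Build the output file name the same way the spider's parse() does."""
    # "http://quotes.toscrape.com/page/1/".split("/") gives
    # ['http:', '', 'quotes.toscrape.com', 'page', '1', ''], so [-2] is '1'.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


if __name__ == "__main__":
    for url in ("http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/"):
        print(url, "->", page_filename(url))
```

A URL without the trailing slash would yield `page` instead of the number, so this approach depends on the URL shape used in the tutorial.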

Go into the project folder and run:

cd c:\Python3\zz\plant
scrapy crawl quotes
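
If the `scrapy` command is not on PATH (earlier in these notes scrapy.exe is called through its full path), the same crawl can be started through the interpreter with `python -m scrapy`. Below is a hedged sketch that launches it from Python; `crawl_command` and `run_crawl` are hypothetical helper names, and the project path is the one used in these notes.

```python
import subprocess
import sys


def crawl_command(spider_name):
    """Build the argument list for `python -m scrapy crawl <spider_name>`."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_crawl(spider_name, project_dir):
    """Run the crawl inside project_dir (the folder containing scrapy.cfg)."""
    return subprocess.run(crawl_command(spider_name), cwd=project_dir).returncode


if __name__ == "__main__":
    # Requires Scrapy to be installed for this interpreter.
    sys.exit(run_crawl("quotes", r"C:\Python3\zz\plant"))
```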

....
