Web Scraping for Beginners 5-2: Downloading Images with the Scrapy Framework

Posted by min-r


scrapy startproject bmw

cd bmw

scrapy genspider bmw5 autohome.com.cn

Method 1: without using ImagesPipeline

bmw5.py:

import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, urls=urls)
            yield item
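The `img/@src` attributes on pages like this are typically protocol-relative (they start with `//`), which is why `parse` maps every URL through `response.urljoin`. A standalone sketch of what that resolution does, using a hypothetical thumbnail URL:

```python
from urllib.parse import urljoin  # response.urljoin() delegates to this

page_url = "https://car.autohome.com.cn/pic/series/65.html"   # from start_urls
thumb = "//car2.autoimg.cn/cardfs/product/g1/1.jpg"           # hypothetical src value

# A protocol-relative URL inherits the scheme of the page it came from.
print(urljoin(page_url, thumb))
# https://car2.autoimg.cn/cardfs/product/g1/1.jpg
```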

items.py:

import scrapy


class BmwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    urls = scrapy.Field()

settings.py (partial):

ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}

pipelines.py:

import os
from urllib import request


class BmwPipeline(object):
    def __init__(self):
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
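The `image_name = url.split('_')[-1]` line assumes the image file names end in an underscore-separated suffix, so the chunk after the last underscore can serve as a short, mostly unique file name. A quick sketch with a made-up URL:

```python
# Hypothetical thumbnail URL in the underscore-suffixed style this site uses.
url = "https://car2.autoimg.cn/cardfs/product/g1/t_autohomecar_123.jpg"

# split('_') breaks on every underscore; [-1] keeps only the final chunk.
image_name = url.split('_')[-1]
print(image_name)  # 123.jpg
```

Note this only works while the naming convention holds; URLs without underscores would yield the whole URL as the "name".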

Method 2: saving images with ImagesPipeline

Steps:

1. Define an Item with two fields: image_urls and images.
   image_urls holds the URLs of the images to download and must be a list.

2. When a download finishes, Scrapy stores the download details (file path, original URL, image checksum, etc.) in the item's images field.
3. In settings.py, set IMAGES_STORE, which tells Scrapy where to save the downloaded images.
   Also set IMAGES_URLS_FIELD, which names the item field holding the image URLs
   (note: this is essential; otherwise the image folder stays empty).
4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
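If the item uses Scrapy's default field name image_urls rather than a custom urls field, steps 3-4 reduce to a settings-only setup with no pipeline subclass. A minimal sketch of that settings.py (the project name bmw comes from the startproject command above; the imgs folder name is an assumption):

```python
# settings.py -- minimal sketch using Scrapy's stock ImagesPipeline,
# assuming the item exposes the standard image_urls / images fields.
import os

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Where downloaded images land on disk.
IMAGES_STORE = os.path.join(os.path.dirname(__file__), 'imgs')

# Only needed when the item field is NOT named image_urls, e.g.:
# IMAGES_URLS_FIELD = 'urls'
```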

Rewrite pipelines.py:

import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):  # subclass ImagesPipeline
    # Called before the download request is sent; this is what issues the request.
    def get_media_requests(self, item, info):
        # super() delegates to the parent implementation
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            request_object.item = item
        return request_objects

    def file_path(self, request, response=None, info=None):
        # Called when the image is about to be stored, to decide its storage path.
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE  # read IMAGES_STORE from settings
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):  # create the category folder if missing
            os.mkdir(category_path)
        image_name = path.replace('full/', '')
        image_path = os.path.join(category_path, image_name)
        return image_path
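By default, ImagesPipeline stores each file as full/&lt;sha1-of-url&gt;.jpg; the file_path override strips that full/ prefix and re-roots the file under its category folder. A walk-through of just that string handling, with a hypothetical hash and category name:

```python
import os

# What super().file_path() would return for some image request (hash is made up).
default_path = 'full/0a1b2c3d4e5f.jpg'

# Drop the "full/" prefix so only the file name remains.
image_name = default_path.replace('full/', '')
print(image_name)  # 0a1b2c3d4e5f.jpg

# Re-root under the category directory (names here are illustrative).
image_path = os.path.join('imgs', '车身外观', image_name)
```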

Rewrite settings.py:

import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'imgs')
IMAGES_URLS_FIELD = 'urls'

ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1,
}

To run a Scrapy spider from PyCharm, create a start.py in the project folder:

from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'bmw5'])
