Web Scraping Primer 5-2: Downloading Images with the Scrapy Framework
Posted by min-r
Preface: this article, compiled by the editors at cha138.com, introduces downloading images with the Scrapy framework; hopefully it offers some reference value.
scrapy startproject bmw
cd bmw
scrapy genspider bmw5 autohome.com.cn
Approach 1: without ImagesPipeline
bmw5.py:
import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, urls=urls)
            yield item
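The spider maps every scraped src through response.urljoin because the page's img src attributes are typically protocol-relative or site-relative rather than absolute. The same resolution can be sketched with the standard library's urljoin, which response.urljoin wraps (the src values below are hypothetical examples, not taken from the live page):

```python
from urllib.parse import urljoin

page_url = "https://car.autohome.com.cn/pic/series/65.html"
# hypothetical src values for illustration; real ones come from the <img> tags
srcs = [
    "//car3.autohome.com.cn/cardfs/series/t_abc.jpg",  # protocol-relative
    "/pic/series/65-1.html",                           # site-relative
]
full_urls = [urljoin(page_url, s) for s in srcs]
print(full_urls)
```

A protocol-relative URL inherits only the scheme (https), while a site-relative one inherits scheme and host; without this step urlretrieve would be handed strings it cannot fetch.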
items.py:
import scrapy


class BmwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    urls = scrapy.Field()
Relevant settings in settings.py:
ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}
pipelines.py:
import os
from urllib import request


class BmwPipeline(object):
    def __init__(self):
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
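Two details of this pipeline are worth isolating: the filename is whatever follows the last underscore in the URL, and each category directory is created with an exists-check before mkdir. A standalone sketch of both (the helper names here are illustrative, not part of the article's code); note that os.makedirs with exist_ok=True also sidesteps the small race between the exists check and the mkdir call:

```python
import os

def image_name_from_url(url):
    # mirrors the pipeline's naming rule: keep the part after the last '_'
    return url.split('_')[-1]

def ensure_dir(path):
    # create the directory if missing; safe to call repeatedly
    os.makedirs(path, exist_ok=True)
    return path
```

For example, image_name_from_url("https://example.com/pic/t_auto_4653859.jpg") yields "4653859.jpg". A caveat worth knowing: if two URLs in a category end in the same suffix, the later download silently overwrites the earlier file.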
Approach 2: saving images with ImagesPipeline
Steps:
1. Define an Item with two fields: image_urls and images. image_urls holds the list of image URLs to download.
2. When a download finishes, Scrapy stores the download details (local path, source URL, image checksum, etc.) in the item's images field.
3. In settings.py, configure IMAGES_STORE, the directory where downloaded images are saved. Also configure IMAGES_URLS_FIELD, the name of the item field that holds the image URLs.
(Note: this is essential here; without it the image folder stays empty.)
4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
Rewritten pipelines.py:
import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):  # subclass ImagesPipeline
    # called before the download requests are sent; it is what issues them
    def get_media_requests(self, item, info):
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            request_object.item = item  # attach the item so file_path can read it
        return request_objects

    # called when an image is about to be stored; returns the storage path
    def file_path(self, request, response=None, info=None):
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_store = settings.IMAGES_STORE  # fetch IMAGES_STORE
        category_path = os.path.join(images_store, category)
        if not os.path.exists(category_path):  # create the category folder if missing
            os.mkdir(category_path)
        image_name = path.replace('full/', '')
        image_path = os.path.join(category_path, image_name)
        return image_path
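The key move in the overridden file_path is rerouting the default path ImagesPipeline would have chosen (something like 'full/<sha1-of-url>.jpg') into a per-category folder. The path surgery can be sketched standalone (the function name is illustrative, not from the article, and the directory-creation step is omitted here):

```python
import os

def categorized_path(default_path, category, images_store):
    # default_path mimics ImagesPipeline's default, e.g. 'full/0a1b2c.jpg'
    image_name = default_path.replace('full/', '')       # strip the 'full/' prefix
    category_dir = os.path.join(images_store, category)  # e.g. imgs/exterior
    return os.path.join(category_dir, image_name)

print(categorized_path("full/0a1b2c.jpg", "exterior", "imgs"))
```

Because the default name is a hash of the image URL, this scheme avoids the filename collisions that the underscore-splitting in Approach 1 can produce.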
Rewritten settings.py:
import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'imgs')
IMAGES_URLS_FIELD = 'urls'
ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1,
}
To run Scrapy from PyCharm, create a start.py in the project folder:
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'bmw5'])