Crawling Sogou Lofter Images with Scrapy

Posted by luozhiyun

tags:


The spider drives the JSON endpoint behind Sogou Image's LOFTER channel: each request returns a page of 15 recommended pictures, and the original-size URL of every entry is loaded into an item for the images pipeline.

lofter.py

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy.loader import ItemLoader

from tutorial.items import LofterSpiderItem


class LofterSpider(scrapy.Spider):
    name = "lofter"
    allowed_domains = ["pic.sogou.com"]
    start_urls = ['http://pic.sogou.com/']

    # URL template for the LOFTER channel's recommendation API; {0} is the start offset
    start_answer_url = "http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=LOFTER&tag=%E5%85%A8%E9%83%A8&start={0}&len=15"

    headers = {
        "HOST": "pic.sogou.com",
        "Referer": "http://pic.sogou.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def parse(self, response):
        # The start page itself carries nothing we need; go straight to the JSON API at offset 0
        yield scrapy.Request(self.start_answer_url.format(0), headers=self.headers, callback=self.parse_url)

    def parse_url(self, response):
        ans_json = json.loads(response.text)
        for ans in ans_json.get('all_items', []):
            image_url = ans['ori_pic_url']
            item_loader = ItemLoader(item=LofterSpiderItem(), response=response)
            item_loader.add_value("lofter_image_url", image_url)
            lofter_item = item_loader.load_item()
            yield lofter_item

        # Page forward in steps of 15 while the API keeps returning items
        if ans_json.get('all_items'):
            yield scrapy.Request(self.start_answer_url.format(ans_json['startIndex'] + 15),
                                 headers=self.headers, callback=self.parse_url)
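For reference, parse_url expects the endpoint to return JSON of roughly the following shape. Only the keys the spider actually reads (all_items, ori_pic_url, startIndex) are confirmed by the code; the example values are illustrative:

# Sketch of the payload parse_url consumes; values are placeholders
{
    "startIndex": 0,      # offset of this page; the next request asks for startIndex + 15
    "all_items": [
        {"ori_pic_url": "http://example.com/pic-0001.jpg"},  # original-size image URL
        # ... up to 15 entries per page (len=15 in the query string)
    ]
}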

settings.py

import os

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.TutorialImagePipeline': 1,
}
# IMAGES_URLS_FIELD = "front_image_url"  # unused here: the custom image pipeline reads the field itself
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'image')
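ITEM_PIPELINES registers a TutorialImagePipeline that the original post never shows. Below is a minimal sketch of pipelines.py, assuming the class subclasses Scrapy's built-in ImagesPipeline and downloads every URL in the item's lofter_image_url list (the class bodies are an assumption, not the author's code):

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TutorialPipeline(object):
    # Pass-through pipeline, kept only because settings.py registers it
    def process_item(self, item, spider):
        return item


class TutorialImagePipeline(ImagesPipeline):
    # Assumed behaviour: issue a download request for each collected URL
    def get_media_requests(self, item, info):
        for image_url in item["lofter_image_url"]:
            yield scrapy.Request(image_url)

Note that ImagesPipeline requires Pillow to be installed; downloaded files end up under IMAGES_STORE/full/, named by the SHA1 hash of their URL.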

items.py

import scrapy
from scrapy.loader.processors import MapCompose

def return_value(value):
    return value  # identity: keep lofter_image_url a list for the images pipeline

class LofterSpiderItem(scrapy.Item):
    lofter_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
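With the three files in place, the crawl runs from the project root with scrapy crawl lofter, and the downloaded pictures land in the image/ directory created next to settings.py.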
