Crawlers: How to Use Scrapy Item Loaders
Posted fqh202
Introduction
Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.
Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules.
Input and Output processors
An Item Loader contains one input processor and one output processor for each (item) field. The input processor processes the extracted data as soon as it’s received (through the add_xpath(), add_css() or add_value() methods) and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data, the ItemLoader.load_item() method is called to populate and get the populated Item object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the item.
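The data flow described above can be sketched with a small stand-in class. This is a simplified model for illustration only, not Scrapy's implementation: `TinyLoader`, `strip_all` and `join_words` are hypothetical names, and the real ItemLoader does considerably more (selectors, contexts, nested loaders).

```python
class TinyLoader:
    """Toy stand-in for ItemLoader: one input and one output processor per field."""

    def __init__(self, input_processors, output_processors):
        self.input_processors = input_processors
        self.output_processors = output_processors
        self.collected = {}  # field name -> list of already processed values

    def add_value(self, field, values):
        # The input processor runs as soon as data arrives;
        # its result is kept inside the loader.
        process = self.input_processors.get(field, list)
        self.collected.setdefault(field, []).extend(process(values))

    def load_item(self):
        # The output processor runs once per field, on everything collected.
        item = {}
        for field, values in self.collected.items():
            finish = self.output_processors.get(field, lambda vs: vs)
            item[field] = finish(values)
        return item


def strip_all(values):
    """Hypothetical input processor: trim whitespace from every value."""
    return [value.strip() for value in values]


def join_words(values):
    """Hypothetical output processor: merge the collected values into one string."""
    return " ".join(values)


loader = TinyLoader(
    input_processors={"title": strip_all},
    output_processors={"title": join_words},
)
loader.add_value("title", ["  Scrapy "])
loader.add_value("title", [" Item Loaders  "])
item = loader.load_item()
print(item)
```

Each `add_value()` call stores cleaned values immediately, while the final string is only built when `load_item()` is called, which mirrors the two stages in the paragraph above.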
Usage example
Taking the Jobbole article scraping from part 7 as an example, the code looks like this after switching to an ItemLoader:
items.py
import datetime

import scrapy
from scrapy.loader import ItemLoader
# Newer Scrapy releases provide these processors in itemloaders.processors.
from scrapy.loader.processors import MapCompose, TakeFirst


class ArticleItemLoader(ItemLoader):
    """Custom ItemLoader: every field goes through TakeFirst on output."""
    default_output_processor = TakeFirst()


def transform_date(publish_date):
    """Clean the publish_date text located by XPath before it is assigned to the item."""
    try:
        publish_date = publish_date.strip().split(' ')[0]
        publish_date = datetime.datetime.strptime(publish_date, "%Y/%m/%d")
    except (AttributeError, ValueError):
        # Fall back to the current time when the text is not a valid date.
        publish_date = datetime.datetime.now()
    return publish_date


def get_collect_num(value):
    """Clean the collect_num text located by XPath before it is assigned to the item."""
    try:
        collect_num = int(value.strip().split(' ')[0])
    except (AttributeError, ValueError):
        # No leading number in the text: treat it as zero.
        collect_num = 0
    return collect_num


def return_value(value):
    """Override the default output_processor and return the value unchanged."""
    return value


class JobboleArticleItem(scrapy.Item):
    title = scrapy.Field()
    publish_date = scrapy.Field(
        # Pre-process each value passed to this field with the given function;
        # the current field value is passed in automatically.
        input_processor=MapCompose(transform_date),
    )
    cate = scrapy.Field()
    favor_num = scrapy.Field(
        input_processor=MapCompose(int)
    )
    collect_num = scrapy.Field(
        input_processor=MapCompose(get_collect_num)
    )  # number of bookmarks
    img_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )  # cover image URL
    img_save_path = scrapy.Field()  # local path of the saved cover image
    url = scrapy.Field()  # URL of the current page
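To make the behaviour of the two processors used in items.py more concrete, here is a rough pure-Python sketch of what MapCompose and TakeFirst do to a list of extracted values. The functions `map_compose`, `take_first`, `first_token`, `parse_date` and `to_int_or_none` are hypothetical re-implementations written for this example, and the sample strings are made up; the real processors live in the Scrapy and itemloaders libraries and handle more cases.

```python
import datetime


def map_compose(*functions):
    """Toy MapCompose: apply each function, in order, to every value."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue  # a None result drops the value
                # a list result is flattened into the output
                next_values.extend(result if isinstance(result, list) else [result])
            values = next_values
        return values
    return process


def take_first(values):
    """Toy TakeFirst: return the first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def first_token(text):
    return text.strip().split(" ")[0]


def parse_date(text):
    return datetime.datetime.strptime(text, "%Y/%m/%d")


def to_int_or_none(text):
    text = text.strip()
    return int(text) if text.isdigit() else None


# Two functions chained: trim to the first token, then parse it as a date.
clean_date = map_compose(first_token, parse_date)
dates = clean_date(["\r\n  2017/05/01 ·  "])

# Values that cannot be converted are dropped instead of raising.
numbers = map_compose(to_int_or_none)([" 3 ", "n/a", "12"])

print(dates, take_first(dates))
print(numbers, take_first(numbers))
```

The input processor always yields a list, even for a single match, which is why the loader above sets TakeFirst as the default output processor to get a scalar back.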
jobbole_spider.py
from urllib.parse import urljoin

from ..items import JobboleArticleItem, ArticleItemLoader


# Excerpt: parse_detail is a method of the spider class.
def parse_detail(self, response):
    """Parse the individual fields of an article."""
    img_url = response.meta.get('img_url', '')
    img_url = urljoin(response.url, img_url)

    # 1. Instantiate the ArticleItemLoader
    l = ArticleItemLoader(item=JobboleArticleItem(), response=response)

    # 2. Collect data: give the target field name and the XPath expression
    # l.add_css('title','...')
    l.add_xpath('title', '//*[@class="entry-header"]/h1/text()')
    l.add_xpath('publish_date', '//*[@class="entry-meta-hide-on-mobile"]/text()[1]')
    l.add_xpath('cate', '//p[@class="entry-meta-hide-on-mobile"]/a[1]/text()')
    l.add_xpath('favor_num', '//*[@class="post-adds"]/span[1]/h10/text()')
    l.add_xpath('collect_num', '//*[@class="post-adds"]/span[2]/text()')

    # 3. Add values that are already known to the loader
    # img_url is a special field: it must be a list or tuple so that the
    # default ImagesPipeline can use it.
    l.add_value('img_url', [img_url])
    l.add_value('url', response.url)
    l.add_value('img_save_path', '')

    # 4. Call load_item() to build the final item and return it
    loaded_item = l.load_item()
    return loaded_item
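The img_url field is the one place where the loader's default output processor must not apply, because the image pipeline expects a list of URLs rather than a single string. The sketch below models only that choice between a per-field output processor and the loader-wide default, using plain dictionaries. All names and sample values in it are hypothetical; it is not how Scrapy resolves processors internally.

```python
def take_first(values):
    """Loader-wide default in this sketch: first non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def keep_all(values):
    """Per-field override in this sketch: leave the list untouched."""
    return values


default_output = take_first
field_outputs = {"img_url": keep_all}  # only img_url overrides the default

# What the loader would hold after the add_xpath()/add_value() calls.
collected = {
    "title": ["An article title"],
    "img_url": ["http://example.com/cover.jpg"],
}

# A field-specific processor wins; every other field uses the default.
item = {
    field: field_outputs.get(field, default_output)(values)
    for field, values in collected.items()
}
print(item)
```

Here title collapses to a string while img_url stays a one-element list, which is the shape the spider above passes on for image downloading.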