Scrapy Learning: Lesson 2

Posted 2022-12-10 helenandyoyo


Learning the Python Crawler Framework Scrapy: Lesson 2

Exercise: scrape the bidding announcements of Hubei Chengtao Bidding Company (湖北成套招标公司)


Step 1: Create a new crawler project

scrapy startproject bids

Inside the bids directory, generate a basic spider class

scrapy genspider publicBids "hubeibidding.com"

Decide which fields to scrape and write the items.py file

import scrapy

class BidsItem(scrapy.Item):

    # Bid type
    bidsType = scrapy.Field()

    # Bid project name
    bidsName = scrapy.Field()

    # Link to the bid project
    bidsLink = scrapy.Field()

    # Publication time of the bid
    bidsTime = scrapy.Field()

Write the publicBids spider class to scrape the data

# -*- coding: utf-8 -*-
import scrapy
from bids.items import BidsItem

class PublicbidsSpider(scrapy.Spider):
    """
    Purpose: practice scraping bidding information from Hubei Chengtao
    """
    name = 'publicBids'
    allowed_domains = ['hubeibidding.com']

    url = "http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10330&PageNo="
    offset = 1

    # Start URL
    start_urls = [url + str(offset)]

    def parse(self, response):

        for each in response.xpath("//ul[@class='e2']/li"):
            item = BidsItem()
            item['bidsType'] = each.xpath("./div/b/a/text()").extract()[0]
            item['bidsName'] = each.xpath("./div/a/text()").extract()[0]
            item['bidsLink'] = "http://www.hubeibidding.com/" + each.xpath("./div/a/@href").extract()[0]
            item['bidsTime'] = each.xpath("./span/text()").extract()[1]

            yield item

        # Uncomment to request the following pages as well (up to page 10):
        # if self.offset < 10:
        #     self.offset += 1
        #     yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

The commented-out code above shows how to crawl multiple pages. In this example, only the first page is crawled.
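The paging logic in that commented-out code can be sketched on its own as a small helper. This is only an illustration: the function name `next_page_url` and the `max_page` parameter are hypothetical, and the limit of 10 pages is taken from the commented-out condition.

```python
# Listing URL from the spider; the page number is appended at the end.
BASE_URL = "http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10330&PageNo="


def next_page_url(offset, max_page=10):
    """Return (next_offset, url) for the page after `offset`.

    Returns None once `offset` has reached `max_page`, which is the
    signal to stop scheduling further requests.
    """
    if offset >= max_page:
        return None
    next_offset = offset + 1
    return next_offset, BASE_URL + str(next_offset)


if __name__ == "__main__":
    # Walk through the pages that would be requested after page 1.
    offset = 1
    while (step := next_page_url(offset)) is not None:
        offset, url = step
        print(offset, url)
```

Inside `parse`, the returned URL would be passed to `scrapy.Request(url, callback=self.parse)` and yielded after the items, with `self.offset` updated to the returned offset.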

To save the data in JSON format, modify the pipelines.py file

import json

class BidsPipeline(object):
    def __init__(self):
        self.filename = open("bids.json", "wb")

    def process_item(self, item, spider):
        print("----------item------------\n")
        print(item)
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
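Because this pipeline writes a comma and a newline after every item, the resulting bids.json is a sequence of JSON objects rather than one valid JSON document. One possible alternative, sketched here under assumptions, writes one JSON object per line (the JSON Lines convention) so that each line can be parsed independently. The class name `JsonLinesPipeline` and the file name `bids.jl` are hypothetical; `open_spider`, `process_item` and `close_spider` are the standard Scrapy pipeline hooks.

```python
import json


class JsonLinesPipeline:
    """Hypothetical variant of BidsPipeline: one JSON object per line."""

    def __init__(self, path="bids.jl"):
        # Output path; "bids.jl" is an assumed name, not from the original post.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts; open the file in text mode.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.file.close()
```

To try this variant, the `ITEM_PIPELINES` entry in settings.py would need to point at this class instead of `BidsPipeline`.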

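For the functions defined in pipelines.py to run, the settings.py file must be modified; otherwise the pipeline is never invoked and no JSON file is generated. Add the following at the end of the file: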

ITEM_PIPELINES = {
    'bids.pipelines.BidsPipeline': 300,
}

From the bids directory, run the crawl

scrapy crawl publicBids

The bids.json file is generated in the bids directory
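Since the pipeline writes each item on its own line followed by a comma, the file cannot be loaded with a single `json.load` call. A minimal sketch for reading it back, assuming exactly that line format, is shown here. The helper name `load_bids` and the sample values are made up for illustration.

```python
import json


def load_bids(lines):
    """Parse lines of the form '{...},' into a list of dicts."""
    items = []
    for line in lines:
        # Drop surrounding whitespace, then the trailing comma added by the pipeline.
        line = line.strip().rstrip(",")
        if line:
            items.append(json.loads(line))
    return items


if __name__ == "__main__":
    # Made-up sample lines in the same shape the pipeline produces.
    sample = [
        '{"bidsType": "type A", "bidsName": "project 1"},\n',
        '{"bidsType": "type B", "bidsName": "project 2"},\n',
    ]
    for bid in load_bids(sample):
        print(bid["bidsType"], bid["bidsName"])
```

With the real output, the lines would come from `open("bids.json", encoding="utf-8")`.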
