Scraping data from a Chinese agricultural products information site with a distributed scrapy-redis crawler

Posted by bkwxx




 

A distributed crawler based on scrapy-redis and MongoDB

Project requirements:

1. Automatically scrape the detailed data for every agricultural product listing.

2. Store the scraped data.

 

Step 1:

Create the Scrapy project.

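The original screenshots are not recoverable. As a sketch, the usual command is shown below; the project name nongcpSpider is an assumption, chosen to match the NongcpspiderItem class that Scrapy generates further down:

scrapy startproject nongcpSpider
cd nongcpSpider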

Create the spider file.

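The screenshot is likewise gone; a sketch of the typical command, reusing the spider name and domain that appear in the spider code below:

scrapy genspider nongcp_spider www.nongnet.com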

Define the fields we want to scrape in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NongcpspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # supply/demand type
    supply = scrapy.Field()
    # title
    title = scrapy.Field()
    # publish time
    create_time = scrapy.Field()
    # publishing unit
    unit = scrapy.Field()
    # contact person
    contact = scrapy.Field()
    # phone number
    phone_number = scrapy.Field()
    # address
    address = scrapy.Field()
    # detailed address
    detail_address = scrapy.Field()
    # time to market
    market_time = scrapy.Field()
    # price
    price = scrapy.Field()

 

In settings.py, set ROBOTSTXT_OBEY to False so the crawler does not obey robots.txt.

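The screenshot is not recoverable, but based on the rest of the post the settings it showed would look roughly like the sketch below. The SCHEDULER, DUPEFILTER_CLASS and SCHEDULER_PERSIST keys are the standard scrapy-redis ones; the Redis address and the nongcpSpider module path are assumptions:

# ignore robots.txt (the "change to False" step above)
ROBOTSTXT_OBEY = False

# scrapy-redis: share the request queue and the dupefilter through Redis
# so that several workers can crawl cooperatively
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# keep the queue in Redis between runs so a crawl can be resumed
SCHEDULER_PERSIST = True
REDIS_HOST = '127.0.0.1'  # assumption: local Redis on the default port
REDIS_PORT = 6379

# enable the MongoDB pipeline defined in pipelines.py below
ITEM_PIPELINES = {
    'nongcpSpider.pipelines.MongoPipeline': 300,
}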

Write the spider file nongcp_spider.py; the fields are parsed with XPath and regular expressions.

# -*- coding: utf-8 -*-
import scrapy
import re
from ..items import NongcpspiderItem


class NongcpSpiderSpider(scrapy.Spider):
    name = 'nongcp_spider'
    allowed_domains = ['nongnet.com']
    start_urls = ['http://www.nongnet.com/']

    def parse(self, response):
        """
        Parse the detail-page URLs and the next-page URL.
        :param response:
        :return:
        """
        detail_urls = response.xpath('//li[@class="lileft"]/a/@href').extract()
        for detail_url in detail_urls:
            yield scrapy.Request(url=response.urljoin(detail_url), callback=self.detail_parse)

        next_url = response.xpath('//span[@id="ContentMain_lblPage"]/a/@href').extract()
        if next_url:
            # the second-to-last link in the pager is the "next page" link
            yield scrapy.Request(url=response.urljoin(next_url[-2]), callback=self.parse)

    def detail_parse(self, response):
        """
        Parse the fields of a single product detail page.
        :param response:
        :return:
        """
        items = NongcpspiderItem()
        title_result = response.xpath('//h1[@class="h1class"]/text()').extract_first()
        if title_result:
            # the bracketed marker at the start of the title is the supply/demand flag
            items['supply'] = title_result.strip()[1:2]
            items['title'] = title_result.strip()[3:]
        create_time = re.findall(r"<font color='999999'>时间:(\d+/\d+/\d+ \d+:\d+) &nbsp", response.text)
        if create_time:
            items['create_time'] = create_time[0]
        unit = re.findall(r"发布单位</div><div class='xinxisxr'><a href='.*?\.aspx'>(.*?)</a>", response.text)
        if unit:
            items['unit'] = unit[0]
        price = response.xpath('//div[@class="scdbj1"]//text()').extract()
        if price:
            items['price'] = ''.join(price)
        yield items
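As written this is a plain scrapy.Spider, so every instance would crawl on its own. For the distributed crawl the title promises, the spider can inherit from scrapy-redis's RedisSpider instead, so that all workers pop their start URLs from a shared Redis list. A minimal sketch of the change (the redis_key value follows the scrapy-redis default naming pattern):

# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisSpider


class NongcpSpiderSpider(RedisSpider):
    name = 'nongcp_spider'
    allowed_domains = ['nongnet.com']
    # start URLs are popped from this Redis list instead of start_urls
    redis_key = 'nongcp_spider:start_urls'

    # parse() and detail_parse() stay exactly as above

Each worker then runs scrapy crawl nongcp_spider and blocks until a URL is pushed into the list, for example with redis-cli:

lpush nongcp_spider:start_urls http://www.nongnet.com/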

Write pipelines.py to store the data in MongoDB.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class NongcpspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class MongoPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        db = client['nong']
        # authenticate against the nong database before using it
        db.authenticate('baikai', '8Wxx.ypa')
        self.connection = db['Info']

    def process_item(self, item, spider):
        # insert each scraped item as a document in the Info collection
        self.connection.insert_one(dict(item))
        return item
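After a crawl, a quick pymongo session (a sketch reusing the connection details hard-coded in the pipeline above) can confirm that the data arrived:

import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
db = client['nong']
db.authenticate('baikai', '8Wxx.ypa')
# print a few stored documents and the total count
for doc in db['Info'].find().limit(3):
    print(doc)
print(db['Info'].count())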
