Scrapy in practice: crawling CL forum links whose comment count exceeds a threshold

Posted by hougang


1. Create the Scrapy project

scrapy startproject mytestscrapy

2. Preliminaries

  a. In the spider file, comment out allowed_domains.

  b. In settings.py (line 22), change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.

  c. In settings.py (line 19), set a browser User-Agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

  d. Enable the item pipeline (lines 67-69):

  ITEM_PIPELINES = {
      'mytestscrapy.pipelines.MytestscrapyPipeline': 300,
  }
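As an aside, instead of calling time.sleep() inside the spider (as the code below does), Scrapy can throttle requests itself through two settings. A sketch of the equivalent settings.py fragment:

```python
# Scrapy's built-in throttling, an alternative to sleeping in parse():
# DOWNLOAD_DELAY is the base delay (seconds) between requests to the same site;
# RANDOMIZE_DOWNLOAD_DELAY (default True) multiplies it by a random factor
# between 0.5 and 1.5, so a base of 3 yields roughly 1.5-4.5 second waits.
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
```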

3. cl.py

# -*- coding: utf-8 -*-
import random
import time

import scrapy
from scrapy import Selector

from mytestscrapy.items import MytestscrapyItem


class TestCLSpider(scrapy.Spider):
    name = 'cl'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://cc.yyss.icu/thread0806.php?fid=2&search=&page=1']
    print("Starting page 1")
    url = 'https://cc.yyss.icu/thread0806.php?fid=2&search=&page=%d'
    pageNum = 1

    def parse(self, response):
        # The first page carries extra header rows, so skip the first two.
        if self.pageNum == 1:
            tr_ele = Selector(response=response).xpath(
                '//table[@id="ajaxtable"]/tbody[@style="table-layout:fixed;"]/tr[@class="tr3 t_one tac"]')[2:]
        else:
            tr_ele = Selector(response=response).xpath(
                '//table[@id="ajaxtable"]/tbody[@style="table-layout:fixed;"]/tr[@class="tr3 t_one tac"]')

        for tr in tr_ele:
            count = tr.xpath('./td[4]/text()').extract_first()
            # Skip threads with fewer than 4 comments.
            if int(count) < 4:
                continue
            text = tr.xpath('./td[2]//a/text()').extract_first()
            url = 'https://cc.yyss.icu/' + tr.xpath('./td[2]//a/@href').extract_first()
            item = MytestscrapyItem()
            item['urlname'] = text
            item['urladdr'] = url
            item['commentsNum'] = count
            yield item

        # Crawl pages 1-30.
        if self.pageNum < 30:
            # Wait a random 2-5 seconds after each page.
            time.sleep(random.randint(2, 5))
            self.pageNum += 1
            new_url = self.url % self.pageNum
            print("Starting page %s" % self.pageNum)
            yield scrapy.Request(url=new_url, callback=self.parse)
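The spider's two core decisions, pagination and the comment-count filter, can be sketched without Scrapy as plain functions (the names here are illustrative, not from the project):

```python
# Sketch of the spider's pagination and filtering logic in plain Python.
URL_TEMPLATE = "https://cc.yyss.icu/thread0806.php?fid=2&search=&page=%d"
MIN_COMMENTS = 4

def page_urls(last_page):
    """Yield the listing URL for each page from 1 to last_page inclusive."""
    for page in range(1, last_page + 1):
        yield URL_TEMPLATE % page

def keep_row(comment_count, threshold=MIN_COMMENTS):
    """Return True if a row's comment count (a string from td[4]) meets the threshold."""
    return int(comment_count) >= threshold
```

This is the same `url % pageNum` formatting and `int(count) < 4` skip condition the spider uses, isolated for clarity.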

 

4. items.py

import scrapy


class MytestscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    urlname = scrapy.Field()
    urladdr = scrapy.Field()
    commentsNum = scrapy.Field()

5. pipelines.py (store the data in MySQL; the cl_table table in the cl database has columns urlname, urladdr, commentsNum)

import pymysql


class MytestscrapyPipeline(object):
    connect = None
    cursor = None

    def open_spider(self, spider):
        self.connect = pymysql.Connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='123456',
            db='cl',
            charset='utf8'
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        urlname = item['urlname']
        urladdr = item['urladdr']
        commentsNum = item['commentsNum']
        # Parameterized query: pymysql escapes the values itself,
        # which also avoids SQL injection via thread titles.
        sql = "INSERT INTO cl_table (urlname, urladdr, commentsNum) VALUES (%s, %s, %s)"
        data = (urlname, urladdr, commentsNum)

        try:
            self.cursor.execute(sql, data)
        except Exception as e:
            self.connect.rollback()  # roll back the transaction
            print("Insert failed:", e)
        else:
            self.connect.commit()  # commit the transaction
            print("Insert succeeded, rows affected:", self.cursor.rowcount)
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
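The pipeline assumes the cl database and its cl_table table already exist. A schema matching the three fields might look like this (the column types and sizes are assumptions; the post only names the columns):

```sql
CREATE TABLE IF NOT EXISTS cl_table (
    id INT AUTO_INCREMENT PRIMARY KEY,
    urlname VARCHAR(255),
    urladdr VARCHAR(512),
    commentsNum INT
) CHARACTER SET utf8;
```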

 


