How to follow and crawl URLs with Scrapy

Suppose there are ten posts: after scraping each post's title and author from the list page, you need to follow each post's link to scrape its content. How do you do that? Rather than explaining at length, here is the code:

# -*- coding: utf-8 -*-
# Legacy Scrapy (pre-1.0) / Python 2 code: BaseSpider, HtmlXPathSelector
# and urljoin_rfc have since been removed from Scrapy.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.url import urljoin_rfc
from scrapy.http import Request
from datacrawler.items import bbsItem

class bbsSpider(BaseSpider):
    name = "bbs"
    allowed_domains = ["bbs.nju.edu.cn"]
    start_urls = [""]  # the start URL was left blank in the original post

    def parseContent(self, content):
        # Split the raw post text into author, board, time and body by
        # locating the fixed GBK markers of the BBS post header.
        authorIndex = content.index(unicode('信区', 'gbk'))
        author = content[4:authorIndex-2]
        boardIndex = content.index(unicode('标 题', 'gbk'))
        board = content[authorIndex+4:boardIndex-2]
        timeIndex = content.index(unicode('南京大学小百合站 (', 'gbk'))
        time = content[timeIndex+10:timeIndex+34]
        content = content[timeIndex+38:]
        return (author, board, time, content)

    def parse2(self, response):
        # Second-level callback: retrieve the half-filled item passed
        # along in request meta and complete it with the post content.
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        content = hxs.select('/html/body/center/table[1]//tr[2]/td/textarea/text()').extract()[0]
        author, board, time, body = self.parseContent(content)
        item['author'] = author
        item['board'] = board
        item['time'] = time
        item['content'] = body
        return item

    def parse(self, response):
        # First-level callback: scrape title and link from the list page,
        # then follow each link, carrying the item along in request meta.
        hxs = HtmlXPathSelector(response)
        items = []
        title = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/text()').extract()
        url = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/@href').extract()
        for i in range(0, 10):
            item = bbsItem()
            item['link'] = urljoin_rfc(response.url, url[i])  # resolve the relative href
            item['title'] = title[i][:-1]  # drop the trailing newline
            items.append(item)
        for item in items:
            yield Request(item['link'], meta={'item': item}, callback=self.parse2)
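
The code above targets a long-obsolete Scrapy API. For reference, here is the same "scrape the list page, then follow each link and finish the item" pattern as a minimal sketch against current Scrapy (1.7 or later), assuming the same bbsItem fields and the same XPath expressions as above; the spider name bbs_modern and the callback name parse_content are illustrative, not from the original answer:

# -*- coding: utf-8 -*-
import scrapy
from datacrawler.items import bbsItem  # same item class as above

class BbsSpider(scrapy.Spider):
    name = "bbs_modern"  # illustrative name
    allowed_domains = ["bbs.nju.edu.cn"]
    start_urls = [""]  # start URL omitted, as in the original post

    def parse(self, response):
        # List page: one <a> per post, holding both the title and the relative link
        for a in response.xpath('/html/body/center/table/tr[position()>1]/td[3]/a'):
            item = bbsItem()
            item['title'] = (a.xpath('text()').get() or '').rstrip()
            item['link'] = response.urljoin(a.xpath('@href').get())
            # response.follow resolves the relative href against response.url;
            # cb_kwargs hands the half-built item to the next callback
            yield response.follow(a.xpath('@href').get(),
                                  callback=self.parse_content,
                                  cb_kwargs={'item': item})

    def parse_content(self, response, item):
        # Detail page: complete the item with the post body
        item['content'] = response.xpath(
            '/html/body/center/table[1]//tr[2]/td/textarea/text()').get()
        yield item

Compared with meta, cb_kwargs makes the passed-along item an explicit argument of the callback; on Scrapy versions older than 1.7, meta={'item': item} as in the original answer works the same way.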
