How do I follow URLs when crawling with Scrapy?
Suppose a listing page contains ten articles. After scraping each article's title and author, you need to follow each article's link to fetch its content. How do you do that? Rather than explain at length, here is the code (note that it targets the legacy, pre-1.0 Scrapy API under Python 2):
# -*- coding: utf-8 -*-
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.url import urljoin_rfc
from scrapy.http import Request
from datacrawler.items import bbsItem

class bbsSpider(BaseSpider):
    name = "bbs"
    allowed_domains = ["bbs.nju.edu.cn"]
    start_urls = [""]  # left blank in the original post; set it to the board's listing URL

    def parseContent(self, content):
        # The post body embeds fixed GBK markers; locate them and slice out
        # the author, board, timestamp and body. unicode() is Python 2 only.
        authorIndex = content.index(unicode('信区', 'gbk'))
        author = content[4:authorIndex - 2]
        boardIndex = content.index(unicode('标 题', 'gbk'))
        board = content[authorIndex + 4:boardIndex - 2]
        timeIndex = content.index(unicode('南京大学小百合站 (', 'gbk'))
        time = content[timeIndex + 10:timeIndex + 34]
        content = content[timeIndex + 38:]
        return (author, board, time, content)

    def parse2(self, response):
        # Second-level callback: the partially filled item arrives via
        # response.meta, and the article page supplies the remaining fields.
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        content = hxs.select('/html/body/center/table[1]//tr[2]/td/textarea/text()').extract()[0]
        author, board, time, body = self.parseContent(content)
        item['author'] = author
        item['board'] = board
        item['time'] = time
        item['content'] = body
        return item

    def parse(self, response):
        # First-level callback: collect each article's title and link, then
        # yield one Request per link, carrying the partial item in meta so
        # parse2 can finish filling it in.
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/text()').extract()
        urls = hxs.select('/html/body/center/table/tr[position()>1]/td[3]/a/@href').extract()
        for title, url in zip(titles, urls):
            item = bbsItem()
            item['link'] = urljoin_rfc(response.url, url)  # resolve the relative href against the page URL
            item['title'] = title[:-1]  # drop the trailing character, as in the original
            yield Request(item['link'], meta={'item': item}, callback=self.parse2)
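The spider imports bbsItem from datacrawler.items, but the answer never shows that file. A minimal definition consistent with the fields assigned above might look like this (an assumption reconstructed from the spider, not code from the original post):

from scrapy.item import Item, Field

class bbsItem(Item):
    # assumed definition: one Field per key the spider assigns
    link = Field()
    title = Field()
    author = Field()
    board = Field()
    time = Field()
    content = Field()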
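For reference, the same follow-the-link pattern in current Scrapy (1.7+) replaces BaseSpider and HtmlXPathSelector with scrapy.Spider and response.xpath, and replaces the meta dict with cb_kwargs. The sketch below is illustrative only: the start URL and XPaths are placeholders, not taken from the original post.

import scrapy

class BbsModernSpider(scrapy.Spider):
    name = "bbs_modern"
    # placeholder listing URL; substitute the real board page
    start_urls = ["https://bbs.nju.edu.cn/"]

    def parse(self, response):
        # one <a> per article on the listing page (XPath is illustrative)
        for a in response.xpath('//td[3]/a'):
            item = {'title': a.xpath('text()').get('').strip()}
            # response.follow resolves the relative href and schedules the
            # request; cb_kwargs hands the partial item to the next callback
            yield response.follow(a, callback=self.parse_article,
                                  cb_kwargs={'item': item})

    def parse_article(self, response, item):
        # fill in the remaining fields from the article page, then emit it
        item['content'] = response.xpath('//textarea/text()').get()
        yield item

Passing the partial item through cb_kwargs (or meta in the legacy API) is the standard way to assemble one item from data spread across two pages.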