Crawling an Entire School Website with Scrapy
Posted by longbigbeard
- Crawl the whole site's content by traversing every URL it links to
- Skip file links such as ".doc" for now; each page is fetched whole, with no cleaning
- spider.py
```python
# -*- coding: utf-8 -*-
import datetime
from urllib import parse

import scrapy
from scrapy.http import Request

from WSYU.items import WsyuItem


class WsyuSpider(scrapy.Spider):
    name = 'wsyu'
    allowed_domains = ['wsyu.edu.cn']
    start_urls = ['http://www.wsyu.edu.cn/']

    html_url_set = []   # page URLs already scheduled
    other_url_set = []  # file / mailto links that should not be crawled
    # substrings that mark a link as a file download or a mail address
    wenjian_end = ["@", ".pdf", ".jpg", ".gif", ".png", ".doc", ".xls",
                   ".ppt", ".mp3", ".rar", ".zip"]

    def do_fiter(self, all_urls):
        # Record file/mailto links in other_url_set so parse() can skip
        # them; the URL list itself is returned unchanged.
        for one_url in all_urls:
            if any(u in one_url for u in self.wenjian_end):
                self.other_url_set.append(one_url)
        return all_urls

    def parse(self, response):
        # Collect every link on the page and resolve it to an absolute URL.
        all_urls = response.xpath('//a/@href').extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        all_urls1 = self.do_fiter(all_urls)
        if all_urls1:
            for one_url in all_urls1:
                if one_url not in self.html_url_set and one_url not in self.other_url_set:
                    self.html_url_set.append(one_url)
                    # The default callback would be parse; use download_parse
                    # so the fetched page is stored as an item.
                    yield Request(parse.urljoin(response.url, one_url),
                                  callback=self.download_parse)
        else:
            yield Request(url=self.html_url_set[-2], callback=self.parse)

    def download_parse(self, response):
        # Save the raw page together with its URL and a fetch timestamp.
        item = WsyuItem()
        item['url'] = response.url
        item['content'] = response.text
        item['create_time'] = datetime.datetime.now()
        yield item
        # Hand the page back to parse() so its own links get walked.
        yield self.make_requests_from_url(response.url)
```
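- The spider above imports `WsyuItem` from `WSYU.items`, which the post doesn't include. Below is a minimal sketch of `items.py`, assuming the standard layout that `scrapy startproject WSYU` generates; the three field names are taken from the spider code.

```python
# -*- coding: utf-8 -*-
# WSYU/items.py -- minimal sketch matching the fields used in spider.py
import scrapy


class WsyuItem(scrapy.Item):
    url = scrapy.Field()          # page address
    content = scrapy.Field()      # raw HTML of the page
    create_time = scrapy.Field()  # timestamp of the fetch
```

- With the item defined, the crawl can be started from the project directory with `scrapy crawl wsyu` (add `-o pages.jl` to dump scraped items to a JSON-lines file).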
- The full source is on GitHub: https://github.com/longbigbeard/scrapy_demo/tree/master/WSYU
- That's all.