CrawlSpiders
1. Create a new Scrapy project named tencent
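This step only needs the standard startproject command; a minimal sketch, assuming Scrapy is already installed (the project name tencent matches the imports used in the later steps):

scrapy startproject tencent
cd tencent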
2. Define the fields to scrape in items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # position name
    position_name = scrapy.Field()
    # link to the job detail page
    position_link = scrapy.Field()
    # job category
    position_type = scrapy.Field()
    # number of openings
    people_number = scrapy.Field()
    # work location
    work_location = scrapy.Field()
    # publish date
    publish_time = scrapy.Field()
3. Quickly generate a CrawlSpider template
scrapy genspider -t crawl tencent_spider tencent.com
Note: the spider name here must not be the same as the project name.
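For reference, the generated template looks roughly like this (the exact skeleton varies by Scrapy version, and the placeholder rule is meant to be replaced in the next step):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpiderSpider(CrawlSpider):
    name = 'tencent_spider'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    # placeholder rule; replaced with the real pagination rule below
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass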
4. Open tencent_spider.py and write the spider code
# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor extracts the links that match the given rules
from scrapy.linkextractors import LinkExtractor
# CrawlSpider and Rule drive the link-following logic
from scrapy.spiders import CrawlSpider, Rule
# import the TencentItem class from items.py in the tencent project
from tencent.items import TencentItem


class TencentSpiderSpider(CrawlSpider):
    name = 'tencent_spider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']
    pagelink = LinkExtractor(allow=(r"start=\d+"))  # regex for pagination links

    rules = (
        # extract the matching links, request each one in turn, keep
        # following new links, and call the given callback on each response
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            # position name
            item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
            # link to the detail page
            item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
            # job category
            # item['position_type'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['people_number'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            # item['work_location'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publish_time'] = each.xpath("./td[5]/text()").extract()[0]

            yield item
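To sanity-check the allow pattern on its own, a standalone snippet for illustration (the URL is taken from start_urls above, with a different page offset):

import re

# the same regex passed to LinkExtractor(allow=...)
pattern = re.compile(r"start=\d+")
print(pattern.search("http://hr.tencent.com/position.php?&start=10#a"))  # matches "start=10"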
5. Write the scraped items to a file in pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
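As the comment above notes, the pipeline only runs if it is registered in settings.py; a minimal sketch (the priority 300 is an arbitrary but conventional value):

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}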
6. Run the spider from the command line with the following command
scrapy crawl tencent_spider
One problem came up: the spider in tencent_spider.py only runs when the position_type and work_location lines are commented out...
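A likely cause, though the original post does not confirm it: some table cells for these fields are empty, so .extract()[0] raises IndexError and aborts the item. Under that assumption, a sketch of a more defensive parse_item (a drop-in replacement for the method in TencentSpiderSpider) using extract_first() with a default value:

def parse_item(self, response):
    for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
        item = TencentItem()
        item['position_name'] = each.xpath("./td[1]/a/text()").extract_first("")
        item['position_link'] = each.xpath("./td[1]/a/@href").extract_first("")
        # extract_first() returns the default instead of raising IndexError
        # on an empty cell, so these two fields no longer need commenting out
        item['position_type'] = each.xpath("./td[2]/text()").extract_first("")
        item['people_number'] = each.xpath("./td[3]/text()").extract_first("")
        item['work_location'] = each.xpath("./td[4]/text()").extract_first("")
        item['publish_time'] = each.xpath("./td[5]/text()").extract_first("")
        yield item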