CrawlSpiders


1. Create a new tencent project with Scrapy
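Scrapy ships a startproject command for this; run it from the directory where the project should live, then change into the project:

scrapy startproject tencent
cd tencent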

2. Define the fields to scrape in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # position title
    position_name = scrapy.Field()
    # detail page link
    position_link = scrapy.Field()
    # position category
    position_type = scrapy.Field()
    # number of openings
    people_number = scrapy.Field()
    # work location
    work_location = scrapy.Field()
    # publish date
    publish_time = scrapy.Field()


3. Quickly generate a CrawlSpider template

scrapy genspider -t crawl tencent_spider tencent.com

Note: the spider name given here must not be the same as the project name; Scrapy refuses to generate a spider that would clash with the project module.

4. Open tencent_spider.py and write the code

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor extracts the links that match the given rules
from scrapy.linkextractors import LinkExtractor
# CrawlSpider and Rule drive the link-following behavior
from scrapy.spiders import CrawlSpider, Rule
# TencentItem is defined in items.py of the tencent project
from tencent.items import TencentItem


class TencentSpiderSpider(CrawlSpider):
    name = 'tencent_spider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']
    # regex that matches the pagination links
    pagelink = LinkExtractor(allow=(r"start=\d+",))

    rules = (
        # extract the pagination links, request each one, keep following
        # newly found links, and call the given callback on every response
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            # position title
            item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail page link
            item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
            # position category
            # item['position_type'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['people_number'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            # item['work_location'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publish_time'] = each.xpath("./td[5]/text()").extract()[0]

            yield item
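One caveat worth knowing: a CrawlSpider must not override parse(), because CrawlSpider uses parse() internally to implement the rule-following logic. That is why the callback above is named parse_item.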

5. Write the items to a file in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.filename.close()
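As the comment at the top of the file says, the pipeline only runs once it is enabled in settings.py. A minimal entry (the integer sets the pipeline's order, 0-1000, lower runs first):

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}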

6. Run the spider from the project root (note that scrapy crawl takes the spider name, not a file name):

scrapy crawl tencent_spider

A problem came up: the spider only runs when the position_type and work_location lines in tencent_spider.py stay commented out. The likely cause is that some table rows have an empty category or location cell, so .extract()[0] raises an IndexError on those rows.
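A defensive sketch of those two lines: extract_first() returns a default instead of raising when the XPath matches nothing, so the fields can stay enabled (this assumes the column indexes td[2] and td[4] are otherwise correct):

# position category: empty string when the cell is empty
item['position_type'] = each.xpath("./td[2]/text()").extract_first("")
# work location: empty string when the cell is empty
item['work_location'] = each.xpath("./td[4]/text()").extract_first("")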
