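Learning Python: writing a crawler with the Feapder framework to scrape information on academicians of the Chinese Academy of Engineering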

Posted 2022-01-02 Mitch311


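Task 👇

Learn the Feapder framework and write a crawler that scrapes information about the academicians of the Chinese Academy of Engineering (CAE).

(The data is scraped from http://www.cae.cn/cae/html/main/col48/column_48_1.html)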

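Background 👇

① About the Feapder framework

Like Scrapy, feapder supports lightweight crawlers, distributed crawlers, batch crawlers, and a crawler alerting mechanism.

It ships with 3 built-in spider types:

AirSpider: a lightweight spider, suited to simple scenarios with a small amount of data

Spider: a distributed spider built on Redis, suited to very large data volumes; it also supports resuming an interrupted crawl and automatically storing scraped data

BatchSpider: a distributed batch spider, mainly for crawls that need to run periodically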

② Installing Feapder

First, install the feapder library. With Python 3 you can install it directly from cmd using pip3.

Run in the command line:
pip3 install feapder
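
After installing, it can be useful to confirm that the package is importable from the same interpreter that will run the spider. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of feapder.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "feapder" is the import name used by the spider code in this article.
    print("feapder installed:", is_installed("feapder"))
```

If this prints False even though pip3 reported success, pip3 and the `python` you are running probably belong to different installations.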

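Code example 👇

First, create a crawler project from the cmd command line (replace <project_name> with a name of your choice):

feapder create -p <project_name>

In the command line, change into the project's spiders folder and create a spider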

cd spiders
# Create a lightweight spider
feapder create -s tophub_spider 1

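# 1 is the default and creates a lightweight spider (AirSpider)
# 2 creates a distributed spider (Spider)
# 3 creates a distributed batch spider (BatchSpider)

Then write and run the following code in spider.py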

#coding:utf-8
#author:Mitchell
#date:12.10

#part2: learn the Feapder framework and write a crawler that scrapes CAE academician information
import feapder
import re

# Append text to a file; each record is followed by a blank line
def writer(filename, text):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(text)
        f.write('\n\n')

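# Lightweight spider: AirSpider
class Mitchell(feapder.AirSpider):
    def start_requests(self):
        url = 'https://www.cae.cn/cae/html/main/col48/column_48_1.html'
        # Route the list page to parse_name; without an explicit callback feapder calls parse()
        yield feapder.Request(url, callback=self.parse_name)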
    
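    # Scrape the list of academicians and the link to each profile page
    def parse_name(self, request, response):
        # Select the target elements with XPath
        name_list = response.xpath("//*[@class='name_list']")
        for name in name_list:
            # extract_first() returns the first match as a string (None if nothing matched)
            # Get the link to this academician's page
            href = name.xpath('.//@href').extract_first()
            if href:
                yield feapder.Request(href, callback=self.parse_content)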
    
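    # Scrape one academician's profile
    def parse_content(self, request, response):
        # Select the target element with XPath
        intro = response.xpath("//*[@class='intro']")
        # extract() returns a list of strings
        # '\u2002' is a Unicode escape for the en space; only <p> elements whose text contains it are selected
        t = intro.xpath(".//p[contains(text(),'\u2002')]").extract()
        # Join into a single string for further processing
        tt = ''.join(t)
        # Capture the content inside each <p>...</p>
        intropre = re.compile(r'<p>([\s\S]+?)</p>')
        introlist = intropre.findall(tt)
        ttt = ''.join(introlist)
        # Write to file
        writer('工程院士信息.txt', ttt)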

if __name__ == "__main__":
    # start() is an instance method, so create the spider first
    Mitchell().start()
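
The two string-handling steps in the spider (following a profile link and pulling the text out of the `<p>` elements) can be tried offline without feapder. This is a standard-library sketch, not the spider's actual code path: the helper names `absolute_link` and `extract_paragraphs` and the sample HTML are hypothetical, and whether the real page uses relative links is an assumption.

```python
import re
from urllib.parse import urljoin

LIST_URL = "https://www.cae.cn/cae/html/main/col48/column_48_1.html"

# Same idea as the pattern in parse_content: lazily capture everything inside <p>...</p>.
P_PATTERN = re.compile(r"<p>([\s\S]+?)</p>")


def absolute_link(href, base=LIST_URL):
    """Resolve a possibly relative href against the list page URL."""
    return urljoin(base, href)


def extract_paragraphs(html):
    """Return the stripped inner text of each <p> that contains an en space (U+2002)."""
    return [body.strip() for body in P_PATTERN.findall(html) if "\u2002" in body]


if __name__ == "__main__":
    # Hypothetical markup, shaped like what the spider expects to find.
    sample = "<div class='intro'><p>\u2002\u2002First paragraph.</p><p>no indent</p></div>"
    print(extract_paragraphs(sample))
    print(absolute_link("/cae/html/main/example.html"))
```

Keeping these steps as plain functions makes it possible to check the regular expression against saved HTML before running the full crawl.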
