Distributed Deployment of pyspider on CentOS 7

Posted by 老胡的储物柜


1. Environment:

  • System: `Linux centos-linux` (CentOS 7)

  • Python version: 3.5.1

1.1. Setting up Python 3:

After trying a few approaches, I settled on the Anaconda distribution.

1.1.1. Building Python from source

```bash
# Install build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
# Download the Python release
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# Or use a domestic mirror
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# Unpack
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# Configure, build, and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# Create symlinks
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# Verify python3
python3
# Python 3.5.1 (default, Oct  9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# pip broke for me at this point, so I reinstalled it
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python get-pip.py
```

1.1.2. Anaconda distribution

```bash
# Anaconda distribution (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# Just run the installer
./Anaconda3-4.2.0-Linux-x86_64.sh
# If it fails, the archive may not have been extracted; install bzip2
yum install bzip2
```
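
A quick sanity check after the installer finishes, assuming you let it prepend Anaconda to PATH in ~/.bashrc (the default prompt during installation):

```bash
# Pick up the PATH change made by the installer
source ~/.bashrc
# Both should report the Anaconda-provided versions
conda --version
python --version
```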

1.2. Installing MariaDB

```bash
# Install
yum -y install mariadb mariadb-server
# Start the service
systemctl start mariadb
# Enable it at boot
systemctl enable mariadb
# Set the root password (empty by default)
mysql_secure_installation
# Log in
mysql -u root -p
# Create a user; choose your own account name and password
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;
```
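
The connection strings used later in /pyspider/config.json reference three databases named taskdb, projectdb, and resultdb. If pyspider does not create them automatically on first run, you can create them up front (a sketch; swap in your own user_name/user_pass):

```bash
# Create the three databases referenced later in /pyspider/config.json
# (user_name/user_pass is the account created above)
mysql -u user_name -puser_pass -e "
  CREATE DATABASE IF NOT EXISTS taskdb    DEFAULT CHARACTER SET utf8;
  CREATE DATABASE IF NOT EXISTS projectdb DEFAULT CHARACTER SET utf8;
  CREATE DATABASE IF NOT EXISTS resultdb  DEFAULT CHARACTER SET utf8;
"
```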

1.3. Installing pyspider

I use Anaconda here.

```bash
# Create a virtual environment named sbird with Python 3.*
conda create -n sbird python=3*
# Activate the environment
source activate sbird
# Install pyspider
pip install pyspider
# Error:
# it does not exist.  The exported locale is "en_US.UTF-8" but it is not supported
# Fix by exporting these (they can also go into .bashrc)
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
conda install pycurl
# Deactivate
source deactivate sbird
# If localhost:5000 is unreachable from inside a VM, try stopping the firewall
systemctl stop firewalld.service
######### Run from source ==============
mkdir git;cd git
# Clone the repository
git clone https://github.com/binux/pyspider.git
# Run
/root/anaconda3/envs/sbird/bin/python  /root/git/pyspider/run.py
```

Alternative approach

```bash
# Set up a virtual environment with virtualenv
pip install virtualenv
mkdir python;cd python
# Create a virtual environment named pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# Enter and activate the environment
cd pyenv3/
source ./bin/activate
pip install pyspider
# If pycurl errors out
yum install libcurl-devel
# Then continue
pip install pyspider
# Deactivate
deactivate
```

I recommend the Anaconda-based installation.

If pyspider throws errors while running, refer back to the Anaconda installation notes above. At this point, visiting localhost:5000 should show the web UI.
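
A quick way to confirm the web UI is up, run on the machine where pyspider is running:

```bash
# Should print 200 once the pyspider web UI is serving requests
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:5000
```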

1.4. Installing Supervisor

```bash
# Install
yum install supervisor -y
# If yum cannot find the package, add the Aliyun EPEL repo
vim /etc/yum.repos.d/epel.repo
```

Add the following content:

```ini
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
        http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
        http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
        http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
```

Then install again:

```bash
# Install
yum install supervisor -y
# Confirm the installation worked
echo_supervisord_conf
```
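
The per-program configs used in the next section go into /etc/supervisord.d/. On CentOS 7 the packaged /etc/supervisord.conf normally ends with an [include] section that picks up supervisord.d/*.ini; that is an assumption about the stock file, so it is worth confirming:

```bash
# Should print the [include] section with files = supervisord.d/*.ini
grep -A1 '^\[include\]' /etc/supervisord.conf
```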

1.4.1. Using Supervisor

```bash
supervisord      # start the supervisor server process
supervisorctl    # open supervisor's command-line console
# Suppose we create a program named pyspider01
vim /etc/supervisord.d/pyspider01.ini
```

Write the following content:

```ini
[program:pyspider01]

command      = /root/anaconda3/envs/sbird/bin/python  /root/git/pyspider/run.py
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/pyspider01.log
```

```bash
# Reload
supervisorctl reload
# Start the program
supervisorctl start pyspider01
# Or start supervisord directly with the config file
supervisord -c /etc/supervisord.conf
# Check status
supervisorctl status
# output
pyspider01                       RUNNING   pid 4026, uptime 0:02:40
# Shut everything down
supervisorctl shutdown
```
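
One detail worth noting: stdout_logfile above points into /pyspider/supervisor/, and supervisord will not create a missing log directory for you, so make sure it exists before reloading (the path is the one from the config above):

```bash
# Create the log directory referenced by stdout_logfile
mkdir -p /pyspider/supervisor
```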

1.5. Installing redis

```bash
# redis is used as the message queue
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# Or simply install with yum
yum -y install redis
# Start
systemctl start redis.service
# Restart
systemctl restart redis.service
# Stop
systemctl stop redis.service
# Check status
systemctl status redis.service
# Edit /etc/redis.conf
vim /etc/redis.conf
# Changes to make:
#   daemonize no    ->  daemonize yes
#   bind 127.0.0.1  ->  bind 10.211.55.22  (this server's IP)
# Restart redis
systemctl restart redis.service
```
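
A quick connectivity check once redis is listening on the LAN address (the IP is the centos01 address used throughout this post):

```bash
# Should print PONG if redis is reachable over the network
redis-cli -h 10.211.55.22 -p 6379 ping
```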

1.6. Autostart on boot

```bash
# Enable Supervisor at boot
systemctl enable supervisord.service
# Enable redis at boot
systemctl enable redis.service
# Keep the firewall from starting at boot
systemctl disable firewalld.service
```

At this point, the single-server pyspider environment is fully built and deployed; open localhost:5000 to reach the web UI.

You can also drive it from a script (a sketch follows) and watch the run status in /pyspider/supervisor/pyspider01.log.
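
A minimal sketch of such a script, assuming the supervisor program name (pyspider01) and log path configured in section 1.4.1:

```bash
#!/usr/bin/env bash
# Restart pyspider under supervisor and follow its log.
set -e

supervisorctl restart pyspider01
supervisorctl status pyspider01

# Log path comes from stdout_logfile in /etc/supervisord.d/pyspider01.ini
tail -f /pyspider/supervisor/pyspider01.log
```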

2. Distributed deployment

Name the server configured above centos01, then set up two more servers, centos02 and centos03, in the same way.

The layout is as follows:

| Server   | IP           | Roles                                        |
|----------|--------------|----------------------------------------------|
| centos01 | 10.211.55.22 | redis, mariaDB, scheduler                    |
| centos02 | 10.211.55.23 | fetcher, processor, result_worker, phantomjs |
| centos03 | 10.211.55.24 | fetcher, processor, result_worker, webui     |
2.1. centos01

Log in to centos01. With step 1 done, the base environment is ready; start by editing the config file /pyspider/config.json:

```json
{
  "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
  "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
  "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
  "message_queue": "redis://10.211.55.22:6379/db",
  "logging-config": "/pyspider/logging.conf",
  "phantomjs-proxy": "10.211.55.23:25555",
  "webui": {
    "username": "",
    "password": "",
    "need-auth": false,
    "host": "10.211.55.24",
    "port": "5000",
    "scheduler-rpc": "http://10.211.55.22:5002",
    "fetcher-rpc": "http://10.211.55.23:5001"
  },
  "fetcher": {
    "xmlrpc": true,
    "xmlrpc-host": "0.0.0.0",
    "xmlrpc-port": "5001"
  },
  "scheduler": {
    "xmlrpc": true,
    "xmlrpc-host": "0.0.0.0",
    "xmlrpc-port": "5002"
  }
}
```
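
Note that config.json points logging-config at /pyspider/logging.conf. One way to provide that file, assuming the layout of the cloned repository, is to copy pyspider's bundled default:

```bash
# Copy the default logging configuration shipped with the pyspider source
cp /root/git/pyspider/pyspider/logging.conf /pyspider/logging.conf
```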

Try running it:

```bash
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Error:
# ImportError: No module named 'mysql'
# Get mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# Install it into the sbird environment
source activate sbird
cd mysql-connector-python
python setup.py install
# Install the redis client as well
pip install redis
source deactivate
# Run again
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Output looks good:
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0
```

Once that runs successfully, change /etc/supervisord.d/pyspider01.ini to the following:

```ini
[program:pyspider01]

command      = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/pyspider01.log
```

```bash
# Reload
supervisorctl reload
# Check status
supervisorctl status
```

centos01 is now deployed.

2.2. centos02

On centos02 we need to run result_worker, processor, phantomjs, and fetcher.

Create the following files:

/etc/supervisord.d/result_worker.ini

```ini
[program:result_worker]

command      = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/result_worker.log
```

/etc/supervisord.d/processor.ini

```ini
[program:processor]

command      = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/processor.log
```

/etc/supervisord.d/phantomjs.ini

```ini
[program:phantomjs]

command      = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/phantomjs.log
```

/etc/supervisord.d/fetcher.ini

```ini
[program:fetcher]

command      = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/fetcher.log
```

Create pjsconfig.json in the /pyspider directory:

```json
{
  /* --ignore-ssl-errors=true */
  "ignoreSslErrors": true,

  /* --ssl-protocol=any */
  "sslprotocol": "any",

  /* Same as: --output-encoding=utf8 */
  "outputEncoding": "utf8",

  /* persistent cookies */
  /* "cookiesfile": "e:/phontjscookies.txt", */
  "cookiesfile": "pyspider/phontjscookies.txt",

  /* load images */
  "autoLoadImages": false
}
```

Download phantomjs into the /pyspider/ directory, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js; a sketch of those steps follows.
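
The download URL below is an assumption (the official phantomjs 2.1.1 tarball has been mirrored in several places), so substitute whichever mirror you trust:

```bash
cd /pyspider
# Download and unpack phantomjs (URL/mirror is an assumption)
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xjf phantomjs-2.1.1-linux-x86_64.tar.bz2
# The supervisor config above expects the binary at /pyspider/phantomjs
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /pyspider/phantomjs
# Copy the fetcher script from the cloned pyspider repository
cp /root/git/pyspider/pyspider/fetcher/phantomjs_fetcher.js /pyspider/phantomjs_fetcher.js
```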

 
   
   
 
```bash
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher                          RUNNING   pid 3446, uptime 0:00:07
phantomjs                        RUNNING   pid 3448, uptime 0:00:07
processor                        RUNNING   pid 3447, uptime 0:00:07
result_worker                    RUNNING   pid 3445, uptime 0:00:07
```

centos02 is now deployed.

2.3. centos03

Deploy the three processes fetcher, processor, and result_worker exactly as on centos02; on top of that, this server adds webui.

Create the file:

/etc/supervisord.d/webui.ini

```ini
[program:webui]

command      = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory    = /root/git/pyspider
user         = root
process_name = %(program_name)s
autostart    = true
autorestart  = true
startsecs    = 3

redirect_stderr         = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups  = 10
stdout_logfile          = /pyspider/supervisor/webui.log
```

```bash
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher                          RUNNING   pid 2724, uptime 0:00:07
processor                        RUNNING   pid 2725, uptime 0:00:07
result_worker                    RUNNING   pid 2723, uptime 0:00:07
webui                            RUNNING   pid 2726, uptime 0:00:07
```

3. Summary

Visit http://10.211.55.24:5000 and crawl away.
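
To exercise the whole cluster end to end, create a project in the web UI and start from something like pyspider's stock sample handler (the seed URL below is just a placeholder; point it at your own target):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Seed URL is a placeholder; replace it with your own target
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Returned dicts are written by result_worker into resultdb
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```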

