Getting Started with Pyspider

Posted by jackzz

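Installing pyspider: pip3 install pyspider

Starting the service

1. Open cmd and run pyspider --help to see the help text. Running pyspider all starts the service.

2. Once the console shows 0.0.0.0:5000, the service is up. Open 127.0.0.1:5000 or http://localhost:5000/ in a browser to reach the pyspider web UI.

3. Click Create to create a project. Any name will do.

4. The code shown in the editor on the right side of the web page looks like this: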

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-22 23:16:23
# Project: TripAdvisor

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Default options applied to every self.crawl call in this project.
    crawl_config = {
    }

    # Entry point: scheduled to run once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('__START_URL__', callback=self.index_page)

    # A fetched page is treated as fresh for 10 days.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute http(s) link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # The returned dict is stored as the result for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

 

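Replace __START_URL__ with the address of the site you want to crawl and click save. Then click the run button on the left, open follows in the left pane, and use the < and > arrows to step through the tasks.

My first attempt with pyspider failed right away with an HTTP 599. I searched Baidu for a fix for "PySpider HTTP 599: SSL certificate problem" and found others who had hit the same error. This post was helpful: https://blog.csdn.net/asmcvc/article/details/51016485

The full error output (paths will differ depending on where Python is installed):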
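If the web UI from step 2 does not load, it can help to check whether anything is listening on port 5000 at all. The following is a minimal sketch using only the Python standard library. The function name webui_is_listening is made up for this example, and the check only confirms that a TCP connection can be opened, not that the page is served correctly.

```python
import socket


def webui_is_listening(host="127.0.0.1", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (including timeouts) when
        # nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling webui_is_listening() with the defaults while pyspider all is running should return True. A False result suggests the service did not start or is bound to a different port.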

[E 180822 23:51:45 base_handler:203] HTTP 599: SSL certificate problem: self signed certificate in certificate chain
    Traceback (most recent call last):
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\response.py", line 172, in raise_for_status
        six.reraise(Exception, Exception(self.error), Traceback.from_string(self.traceback).as_traceback())
      File "e:\programs\python\python36\lib\site-packages\six.py", line 692, in reraise
        raise value.with_traceback(tb)
      File "e:\programs\python\python36\lib\site-packages\pyspider\fetcher\tornado_fetcher.py", line 378, in http_fetch
        response = yield gen.maybe_future(self.http_client.fetch(request))
      File "e:\programs\python\python36\lib\site-packages\tornado\httpclient.py", line 102, in fetch
        self._async_client.fetch, request, **kwargs))
      File "e:\programs\python\python36\lib\site-packages\tornado\ioloop.py", line 458, in run_sync
        return future_cell[0].result()
      File "e:\programs\python\python36\lib\site-packages\tornado\concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
    Exception: HTTP 599: SSL certificate problem: self signed certificate in certificate chain

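Cause:

This error occurs when requesting URLs that start with https. SSL verification fails because the certificate cannot be validated (here, a self-signed certificate in the chain).

Solution:

Use self.crawl(url, callback=self.index_page, validate_cert=False)

This basically solves the problem. (The browser may need a manual refresh. The 360 Secure Browser seemed to have this small issue, possibly because of my settings. I switched to Chrome and Firefox and did not see it there.)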
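Turning off certificate validation for every request is a blunt fix, so one option is to limit it to the hosts that actually need it. The sketch below is one way to do that. The helper crawl_options and the UNVERIFIED_HOSTS set are hypothetical names introduced for this example, and example.com is a placeholder. The only pyspider option used is validate_cert, as in the fix above.

```python
from urllib.parse import urlsplit

# Hypothetical list: hosts whose certificates you have decided to accept
# without verification (for example, a site using a self-signed certificate).
UNVERIFIED_HOSTS = {"example.com"}


def crawl_options(url, unverified_hosts=UNVERIFIED_HOSTS):
    """Return extra keyword arguments to pass to self.crawl for this URL."""
    parts = urlsplit(url)
    if parts.scheme == "https" and parts.hostname in unverified_hosts:
        # Skip certificate validation only for the listed https hosts.
        return {"validate_cert": False}
    # Every other URL keeps the default certificate validation.
    return {}
```

Inside the handler, the call would then read self.crawl(url, callback=self.index_page, **crawl_options(url)). The same pattern applies to the self.crawl call in index_page, since links followed from there go through the same certificate check.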

For Linux and macOS, see the following link:

https://blog.csdn.net/WebStudy8/article/details/51610953
