python网络爬虫之如何识别验证码

Posted 2020-10-21 一张红枫叶

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了python网络爬虫之如何识别验证码相关的知识，希望对你有一定的参考价值。

有些网站的登录方式是验证码登录的方式，比如今天我们要测试的网站专利检索及分析。

http://www.pss-system.gov.cn/sipopublicsearch/portal/uilogin-forwardLogin.shtml

登录此类网站的关键是识别其中的验证码。那么如何识别验证码呢。我们首先来看下网页源代码。在网页中，验证码的是通过下载一个图片得到的。图片的下载地址是src=/sipopublicsearch/portal/login-showPic.shtml

我们从实际的fiddler抓包来看，也是通过请求上面的图片源地址得到了JPEG的图片并显示在浏览器上

那么在scrapy中我们首先就要将图片下载到本地，然后进行识别

def parse(self,response):

    ret=response.xpath(\'//*[@id="codePic"]/@src\').extract()

    image_source=ret[0]

    image_url=response.urljoin(image_source)

    r=requests.get(image_url)

    with open(\'E://scrapy_project/image2.JPEG\',"wb") as code:

        code.write(r.content)

首先提取src的值出来，然后使用requests的方法进行图片下载并保存。打开文件如下。

下一步就是开始识别图片中的验证码了，这就需要用到pytesser以及PIL库了。

首先是安装Tesseract-OCR，在网上下载后进行安装。默认安装路径是C:\\Program Files\\Tesseract-OCR。将该路径添加到 系统属性的path路径里面。

然后再通过pip安装pytesseract以及PIL。下面来看下如何使用。代码如下：

im=Image.open(\'E:\\\\scrapy_project\\\\image2.JPEG\')

im.convert(\'L\')

ret=image_to_string(im，config=\'-psm 7’)

print ret

结果如下：图片中的验证码已经被识别出来了

image_to_string要配置psm N,参数解释如下，一般我们选择第7个

-psm N

    Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

    0 = Orientation and script detection (OSD) only.

    1 = Automatic page segmentation with OSD.

    2 = Automatic page segmentation, but no OSD, or OCR.

    3 = Fully automatic page segmentation, but no OSD. (Default)

    4 = Assume a single column of text of variable sizes.

    5 = Assume a single uniform block of vertically aligned text.

    6 = Assume a single uniform block of text.

    7 = Treat the image as a single text line.

    8 = Treat the image as a single word.

    9 = Treat the image as a single word in a circle.

    10 = Treat the image as a single character.

E:\\python2.7.11\\python.exe E:/py_prj/test3.py

以上是关于python网络爬虫之如何识别验证码的主要内容，如果未能解决你的问题，请参考以下文章

爬虫遇到头疼的验证码？Python实战讲解弹窗处理和验证码识别

Python3网络爬虫实战-43极验滑动验证码的识别

Python3网络爬虫实战-41图形验证码的识别

Python3网络爬虫实战-42图形验证码的识别

Python3网络爬虫实战-44点触点选验证码的识别

Python3网络爬虫实战-45微博宫格验证码的识别