Python3 Web Crawling: Using urllib to Fetch Web Pages
Posted by WittPeng
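Web Crawlers
A web crawler, also called a web spider or web robot (and, within the FOAF community, more often a web chaser), is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
(Adapted from Baidu Baike; for details see https://baike.baidu.com/item/网络爬虫/5162711?fr=aladdin&fromid=22046949&fromtitle=%E7%88%AC%E8%99%AB)
The code and step-by-step explanations draw on http://cuijiahua.com and https://blog.csdn.net/c406495762/article/details/58716886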
Urllib
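urllib is a package for working with URLs. It bundles the following modules:
- urllib.request: opens and reads URLs
- urllib.error: contains the exceptions raised by urllib.request, which can be caught with try/except
- urllib.parse: contains functions for parsing URLs
- urllib.robotparser: parses robots.txt files. It provides a single class, RobotFileParser, whose can_fetch() method tests whether a crawler is allowed to download a page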
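As a rough illustration of the last two modules, the sketch below resolves links with urllib.parse and checks them against robots.txt rules with RobotFileParser. The robots.txt text, the example.com URLs, the "MyCrawler" user agent, and the helper name build_parser are all made up for this example; the rules are fed in from memory so nothing is downloaded.

```python
from urllib import parse, robotparser

# Hypothetical robots.txt content, kept in memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> robotparser.RobotFileParser:
    """Create a RobotFileParser from robots.txt text instead of a URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)

# Resolve site-relative links against a (hypothetical) base page.
base = "http://example.com/index.html"
allowed_url = parse.urljoin(base, "/public/page.html")
blocked_url = parse.urljoin(base, "/private/data.html")

# urlparse splits a URL into scheme, host, path and so on.
print(parse.urlparse(allowed_url).netloc)

# can_fetch(useragent, url) applies the parsed rules to a URL.
print(rp.can_fetch("MyCrawler", allowed_url))
print(rp.can_fetch("MyCrawler", blocked_url))
```

For a live site, RobotFileParser also offers set_url() and read() to download the rules; parse() is used here only to keep the sketch offline.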
urllib_test01.py
from urllib import request

if __name__ == "__main__":
    # Open the URL and read the raw response body
    response = request.urlopen("http://i.cnblogs.com")
    html = response.read()
    # read() returns bytes, so this prints a bytes literal
    print(html)
Output:
>>> RESTART: C:\Users\DELL\AppData\Local\Programs\Python\Python36\urllib_test01.py
b'\r\n<!DOCTYPE html>\r\n<html>\r\n<head>\r\n <meta charset="utf-8" />\r\n <meta name="viewport" content="width=device-width" />\r\n <title>\xe7\x94\xa8\xe6\x88\xb7\xe7\x99\xbb\xe5\xbd\x95 - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>\r\n <link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" />\r\n <link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" />\r\n <link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" /> \r\n <script src="/scripts/jquery.min.js"></script>\r\n <script src="/scripts/bootstrap/js/bootstrap.min.js"></script>\r\n <script src="/scripts/ladda/spin.min.js"></script>\r\n <script src="/scripts/ladda/ladda.min.js"></script>\r\n <script src="/scripts/jsencrypt.min.js"></script>\r\n <script>\r\n var return_url = \'http://i.cnblogs.com/\';\r\n var ajax_url = \'/user\' + \'/signin\';\r\n var enable_captcha = false;\r\n var is_in_progress = false;\r\n </script>\r\n <script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>\r\n\r\n</head>\r\n<body onload="setFocus()">\r\n <div style="width: 100%;">\r\n <div align="center">\r\n <div id="Main">\r\n <noscript>\r\n <div style="font-size:15px;margin-bottom:20px;">\r\n \xe6\x82\xa8\xe7\x9a\x84\xe6\xb5\x8f\xe8\xa7\x88\xe5\x99\xa8\xe6\x9c\xaa\xe5\x90\xaf\xe7\x94\xa8Javascript\xef\xbc\x8c\xe6\x97\xa0\xe6\xb3\x95\xe8\xbf\x9b\xe8\xa1\x8c\xe7\x99\xbb\xe5\xbd\x95\xe3\x80\x82\r\n </div>\r\n <style>\r\n form {\r\n display: none;\r\n }\r\n </style>\r\n </noscript>\r\n <form method="post" onsubmit="return false;">\r\n <div id="Heading">\xe7\x99\xbb\xe5\xbd\x95\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad - \xe4\xbb\xa3\xe7\xa0\x81\xe6\x94\xb9\xe5\x8f\x98\xe4\xb8\x96\xe7\x95\x8c</div>\r\n <div class="block">\r\n <label class="label-line">\xe7\x99\xbb\xe5\xbd\x95\xe7\x94\xa8\xe6\x88\xb7\xe5\x90\x8d(<a href="/GetUsername.aspx" tabindex="-1" class="tb_right">\xe6\x89\xbe\xe5\x9b\x9e</a>)</label>\r\n <input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span>\r\n </div>\r\n <div class="block">\r\n <label class="label-line">\xe5\xaf\x86\xe7\xa0\x81(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">\xe9\x87\x8d\xe7\xbd\xae</a>)</label>\r\n <input type="password" id="input2" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span>\r\n </div>\r\n\r\n <div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true">\r\n <div class="modal-dialog">\r\n <div class="modal-content center-block">\r\n <div class="modal-header">\r\n <button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">×</span><span class="sr-only">Close</span></button>\r\n <h4 class="modal-title">\r\n \xe8\xaf\xb7\xe5\xae\x8c\xe6\x88\x90\xe4\xba\xba\xe6\x9c\xba\xe8\xaf\x86\xe5\x88\xab\xe9\xaa\x8c\xe8\xaf\x81\r\n </h4>\r\n </div>\r\n <div class="modal-body">\r\n <div id="showLoading" class="ladda-button" data-style="zoom-in"></div>\r\n <div id="captchaBox" class="center-block">\r\n <span id="geetestLoading"> \xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81\xe7\xbb\x84\xe4\xbb\xb6\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad,\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e...</span>\r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <div class="block">\r\n <input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">\xe4\xb8\x8b\xe6\xac\xa1\xe8\x87\xaa\xe5\x8a\xa8\xe7\x99\xbb\xe5\xbd\x95</label>\r\n </div>\r\n <div class="block">\r\n <input type="submit" id="signin" class="button" value="\xe5\x8a\xa0\xe8\xbd\xbd\xe4\xb8\xad..." /> <span id="tip_btn" class="tip"></span>\r\n </div>\r\n <div class="block nav">\r\n » <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="\xe6\xb3\xa8\xe5\x86\x8c\xe6\x88\x90\xe4\xb8\xba\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe7\x94\xa8\xe6\x88\xb7">\xe7\xab\x8b\xe5\x8d\xb3\xe6\xb3\xa8\xe5\x86\x8c</a><br />\r\n » <a href="http://www.cnblogs.com/ContactUs.aspx">\xe5\x8f\x8d\xe9\xa6\x88\xe9\x97\xae\xe9\xa2\x98</a>\r\n </div>\r\n </form>\r\n <div style="clear: both" />\r\n </div>\r\n </div>\r\n </div>\r\n</body>\r\n</html>\r\n'
>>>
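After fetching the page, what we get back is a string of raw bytes. Normally a browser parses the data it receives from the server before displaying it to us. Here we can turn those bytes into readable text with a simple decode() call. The updated code is:
from urllib import request

if __name__ == "__main__":
    response = request.urlopen("http://i.cnblogs.com")
    html = response.read()
    # Decode the bytes into a str; the page declares charset="utf-8"
    html = html.decode("utf-8")
    print(html)
Output: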
Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width" />
<title>用户登录 - 博客园</title>
<link rel="stylesheet" href="/scripts/bootstrap/css/bootstrap.min.css" />
<link href="/scripts/ladda/ladda-themeless.min.css" rel="stylesheet" />
<link href="/css/signin_bundle.css?v=L6jW_dned1XSxz8ohN2oMp1Q1fPUq1W5sWqqw6HNaH01" type="text/css" rel="stylesheet" />
<script src="/scripts/jquery.min.js"></script>
<script src="/scripts/bootstrap/js/bootstrap.min.js"></script>
<script src="/scripts/ladda/spin.min.js"></script>
<script src="/scripts/ladda/ladda.min.js"></script>
<script src="/scripts/jsencrypt.min.js"></script>
<script>
var return_url = 'http://i.cnblogs.com/';
var ajax_url = '/user' + '/signin';
var enable_captcha = false;
var is_in_progress = false;
</script>
<script src="/scripts/signin_bundle.js?v=1spnpY8gb0K9MfNetxJoLoPjd7dN7PIKB8kMqcak-RQ1"></script>

</head>
<body onload="setFocus()">
<div style="width: 100%;">
<div align="center">
<div id="Main">
<noscript>
<div style="font-size:15px;margin-bottom:20px;">
您的浏览器未启用Javascript,无法进行登录。
</div>
<style>
form {
display: none;
}
</style>
</noscript>
<form method="post" onsubmit="return false;">
<div id="Heading">登录博客园 - 代码改变世界</div>
<div class="block">
<label class="label-line">登录用户名(<a href="/GetUsername.aspx" tabindex="-1" class="tb_right">找回</a>)</label>
<input type="text" id="input1" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input1" class="tip"></span>
</div>
<div class="block">
<label class="label-line">密码(<a href="/GetMyPassword.aspx" tabindex="-1" class="tb_right">重置</a>)</label>
<input type="password" id="input2" value="" class="input-text" onkeydown="check_enter(event)" /> <span id="tip_input2" class="tip"></span>
</div>

<div class="modal fade" id="checkWay" tabindex="-1" role="dialog" aria-hidden="true">
<div class="modal-dialog">
<div class="modal-content center-block">
<div class="modal-header">
<button type="button" class="close" data-dismiss="modal"><span aria-hidden="true">×</span><span class="sr-only">Close</span></button>
<h4 class="modal-title">
请完成人机识别验证
</h4>
</div>
<div class="modal-body">
<div id="showLoading" class="ladda-button" data-style="zoom-in"></div>
<div id="captchaBox" class="center-block">
<span id="geetestLoading"> 验证码组件加载中,请稍后...</span>
</div>
</div>
</div>
</div>
</div>

<div class="block">
<input id="remember_me" type="checkbox" name="remember_me" onkeydown="check_enter(event)" /><label for="remember_me" onkeydown="check_enter(event)">下次自动登录</label>
</div>
<div class="block">
<input type="submit" id="signin" class="button" value="加载中..." /> <span id="tip_btn" class="tip"></span>
</div>
<div class="block nav">
» <a href="/register.aspx?ReturnUrl=http://i.cnblogs.com/" title="注册成为博客园用户">立即注册</a><br />
» <a href="http://www.cnblogs.com/ContactUs.aspx">反馈问题</a>
</div>
</form>
<div style="clear: both" />
</div>
</div>
</div>
</body>
</html>
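The fetch-and-decode step can be made a little more defensive. The following is only a sketch, not the original author's code: the helper names decode_body and fetch_text are invented here, and the 10-second timeout is an arbitrary assumption. It catches the exceptions from urllib.error and uses the charset declared in the response headers when one is present, falling back to UTF-8 otherwise.

```python
from typing import Optional
from urllib import error, request


def decode_body(raw: bytes, declared_charset: Optional[str]) -> str:
    """Decode bytes with the declared charset, or UTF-8 if none was declared."""
    encoding = declared_charset or "utf-8"
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded
    return raw.decode(encoding, errors="replace")


def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a URL and return its body as text, or None on failure."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Charset from the Content-Type header; None if not declared
            charset = response.headers.get_content_charset()
    except error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it is caught first
        print("HTTP error:", exc.code, exc.reason)
        return None
    except error.URLError as exc:
        print("Could not reach the server:", exc.reason)
        return None
    return decode_body(raw, charset)


# Offline demonstration of the decoding step with in-memory bytes.
sample = "用户登录 - 博客园".encode("utf-8")
print(decode_body(sample, None))
```

Calling fetch_text("http://i.cnblogs.com") would perform the real request; it is left out of the demonstration so the sketch runs without network access.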
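Detecting a Page's Encoding Automatically
Install the third-party library chardet, a module for detecting character encodings. Open cmd and enter the command: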
pip install chardet
This downloads and installs the package.
The new code:
# -*- coding: UTF-8 -*-
from urllib import request

import chardet

if __name__ == "__main__":
    response = request.urlopen("http://i.cnblogs.com/")
    html = response.read()
    # detect() inspects the raw bytes and guesses their encoding
    charset = chardet.detect(html)
    print(charset)
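The result is a dictionary describing the page's encoding: the "encoding" key holds the name of the detected encoding, and the "confidence" key holds how certain the detector is of that guess.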
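One way to put that dictionary to use is to feed the detected encoding straight into decode(). The sketch below assumes a hypothetical helper, decode_with_detection, with an arbitrary 0.5 confidence threshold and a UTF-8 fallback. To stay runnable without chardet installed, it uses a hand-written dictionary in the same shape as a detection result rather than a real call to chardet.detect().

```python
from typing import Mapping


def decode_with_detection(
    raw: bytes,
    detection: Mapping[str, object],
    min_confidence: float = 0.5,
    fallback: str = "utf-8",
) -> str:
    """Decode raw bytes using a chardet-style detection result."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    # The detector may report no encoding at all, or only a weak guess;
    # in both cases use the fallback encoding instead.
    if not isinstance(encoding, str) or float(confidence) < min_confidence:
        encoding = fallback
    return raw.decode(encoding, errors="replace")


raw = "登录博客园".encode("utf-8")

# Hand-written stand-in for chardet.detect(raw); the values are illustrative,
# not real detector output.
detection = {"encoding": "utf-8", "confidence": 0.99, "language": ""}
print(decode_with_detection(raw, detection))

# With no usable guess, the helper falls back to UTF-8.
print(decode_with_detection(raw, {"encoding": None, "confidence": 0.0}))
```

In the script above, the same helper could be called as decode_with_detection(html, chardet.detect(html)).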