python3: web scraping with urllib and beautifulsoup
Posted by 穿越王子
I've been studying web scraping in the evenings lately, starting with the basics.
Python 3 merged urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can then extract the parts we need from the messy HTML.
Note: BeautifulSoup is a Python library for extracting data from HTML or XML documents.
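Before the full examples, this division of labor can be seen in a minimal offline sketch (the HTML snippet and URLs below are made up for illustration): BeautifulSoup pulls links out of markup that urllib would normally have downloaded.

```python
from bs4 import BeautifulSoup

# A small, hardcoded HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <h3><a href="https://example.com/post1">First post</a></h3>
  <h3><a href="https://example.com/post2">Second post</a></h3>
</body></html>
"""

# Parse with the standard-library parser and collect (href, text) pairs.
soup = BeautifulSoup(html, "html.parser")
links = [(a["href"], a.string) for a in soup.find_all("a")]
print(links)
# -> [('https://example.com/post1', 'First post'), ('https://example.com/post2', 'Second post')]
```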
Example 1:
```python
from urllib import request
from bs4 import BeautifulSoup as bs

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}

def download():
    """Fetch pages while presenting a browser User-Agent."""
    for pageIdx in range(1, 3):
        url = "https://www.cnblogs.com/#p%s" % str(pageIdx)
        print(url)
        req = request.Request(url, headers=header)
        rep = request.urlopen(req).read()
        data = rep.decode('utf-8')
        content = bs(data, 'html.parser')  # always name a parser explicitly
        for link in content.find_all('h3'):
            content1 = bs(str(link), 'html.parser')
            # Skip headings without a usable link or title.
            if content1.a is None or content1.a.string is None:
                continue
            print(content1.a['href'], content1.a.string)
            curhtmlcontent = request.urlopen(
                request.Request(content1.a['href'], headers=header)).read()
            # Use a context manager so the file is closed even on error.
            with open('%s.html' % content1.a.string, 'w', encoding='utf-8') as f:
                f.write(curhtmlcontent.decode('utf-8'))

if __name__ == "__main__":
    download()
```
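One caveat with Example 1: it names output files after post titles, which may contain characters such as `/` or `:` that are illegal in filenames. A small helper can sanitize titles first (`safe_filename` is a hypothetical name, just a sketch, not part of any library):

```python
import re

def safe_filename(title, ext=".html"):
    # Replace characters that are illegal or risky in filenames with "_".
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    # Fall back to a default name if nothing usable remains.
    return (cleaned or "untitled") + ext

print(safe_filename('C++ / Python: tips?'))  # -> C++ _ Python_ tips_.html
print(safe_filename('   '))                  # -> untitled.html
```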
Example 2:
```python
# -*- coding: utf-8 -*-
import unittest

import requests
from bs4 import BeautifulSoup as bs

def school():
    for index in range(2, 34):
        try:
            url = ("http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/"
                   "20170615/1611254988-%s.html" % str(index))
            r = requests.get(url=url)
            soup = bs(r.content, 'lxml')  # requires the lxml package
            city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
            with open("%s.txt" % city, "w", encoding="utf-8") as fp:
                content1 = soup.find_all(name="tr", attrs={"height": "29"})
                for content2 in content1:
                    try:
                        contentTemp = bs(str(content2), "lxml")
                        soup_content = contentTemp.find_all(name="td")[1].string
                        fp.write(soup_content + "\n")
                        print(soup_content)
                    except IndexError:
                        pass
        except IndexError:
            pass

class MyTestCase(unittest.TestCase):
    def test_something(self):
        school()

if __name__ == '__main__':
    unittest.main()
```
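The `find_all(name=..., attrs=...)` pattern used in Example 2 can be tried offline against a hardcoded table of the same shape (the rows and city name below are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hardcoded table mimicking the structure Example 2 scrapes:
# data rows carry height="29"; the city cell spans seven columns.
html = """
<table>
  <tr height="29"><td>1</td><td>Peking University</td></tr>
  <tr height="29"><td>2</td><td>Tsinghua University</td></tr>
  <tr><td colspan="7">Beijing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
names = [row.find_all("td")[1].string
         for row in soup.find_all(name="tr", attrs={"height": "29"})]
print(city, names)
# -> Beijing ['Peking University', 'Tsinghua University']
```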
BeautifulSoup supports several HTML parsers; the main ones are listed below:
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | (1) built into Python (2) moderate speed (3) tolerant of bad markup | less tolerant in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | (1) fast (2) tolerant of bad markup | requires a C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | (1) fast (2) the only parser that supports XML | requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | (1) best fault tolerance (2) parses documents the way a browser does (3) produces valid HTML5 | (1) slow (2) depends on an external extension |
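Since lxml and html5lib are optional installs, one way to exercise the table above is a small fallback helper (`parse_with` is a hypothetical name, not a BeautifulSoup API) that drops back to the standard-library parser when the requested one is missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately sloppy markup: the second <p> is never closed.
markup = "<p>hello<p>world"

def parse_with(markup, parser):
    try:
        return BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Requested parser isn't installed; fall back to the built-in one.
        return BeautifulSoup(markup, "html.parser")

# Different parsers may repair the broken markup differently.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = parse_with(markup, parser)
    print(parser, [p.get_text() for p in soup.find_all("p")])
```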