python3: web scraping with urllib and beautifulsoup
Posted by 穿越王子
I've been studying web scraping in the evenings lately, starting with the basics.
Python 3 merged urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can then extract the parts we need from the messy HTML.
Note: BeautifulSoup is a Python library for extracting data from HTML or XML documents.
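Before the full examples, this division of labor can be seen in a minimal offline sketch (the HTML snippet and URLs below are made up for illustration): BeautifulSoup pulls links out of markup that urllib would normally have downloaded.

```python
from bs4 import BeautifulSoup

# A small, hardcoded HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <h3><a href="https://example.com/post1">First post</a></h3>
  <h3><a href="https://example.com/post2">Second post</a></h3>
</body></html>
"""

# Parse with the standard-library parser and collect (href, text) pairs.
soup = BeautifulSoup(html, "html.parser")
links = [(a["href"], a.string) for a in soup.find_all("a")]
print(links)
# -> [('https://example.com/post1', 'First post'), ('https://example.com/post2', 'Second post')]
```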
Example 1:
```python
from urllib import request
from bs4 import BeautifulSoup as bs

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}

def download():
    """Fetch pages while presenting a browser User-Agent."""
    for pageIdx in range(1, 3):
        url = "https://www.cnblogs.com/#p%s" % str(pageIdx)
        print(url)
        req = request.Request(url, headers=header)
        rep = request.urlopen(req).read()
        data = rep.decode('utf-8')
        content = bs(data, 'html.parser')  # always name a parser explicitly
        for link in content.find_all('h3'):
            content1 = bs(str(link), 'html.parser')
            # Skip headings without a usable link or title.
            if content1.a is None or content1.a.string is None:
                continue
            print(content1.a['href'], content1.a.string)
            curhtmlcontent = request.urlopen(
                request.Request(content1.a['href'], headers=header)).read()
            # Use a context manager so the file is closed even on error.
            with open('%s.html' % content1.a.string, 'w', encoding='utf-8') as f:
                f.write(curhtmlcontent.decode('utf-8'))

if __name__ == "__main__":
    download()
```
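One caveat with Example 1: it names output files after post titles, which may contain characters such as `/` or `:` that are illegal in filenames. A small helper can sanitize titles first (`safe_filename` is a hypothetical name, just a sketch, not part of any library):

```python
import re

def safe_filename(title, ext=".html"):
    # Replace characters that are illegal or risky in filenames with "_".
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    # Fall back to a default name if nothing usable remains.
    return (cleaned or "untitled") + ext

print(safe_filename('C++ / Python: tips?'))  # -> C++ _ Python_ tips_.html
print(safe_filename('   '))                  # -> untitled.html
```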
Example 2:
```python
# -*- coding: utf-8 -*-
import unittest

import requests
from bs4 import BeautifulSoup as bs

def school():
    for index in range(2, 34):
        try:
            url = ("http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/"
                   "20170615/1611254988-%s.html" % str(index))
            r = requests.get(url=url)
            soup = bs(r.content, 'lxml')  # requires the lxml package
            city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
            with open("%s.txt" % city, "w", encoding="utf-8") as fp:
                content1 = soup.find_all(name="tr", attrs={"height": "29"})
                for content2 in content1:
                    try:
                        contentTemp = bs(str(content2), "lxml")
                        soup_content = contentTemp.find_all(name="td")[1].string
                        fp.write(soup_content + "\n")
                        print(soup_content)
                    except IndexError:
                        pass
        except IndexError:
            pass

class MyTestCase(unittest.TestCase):
    def test_something(self):
        school()

if __name__ == '__main__':
    unittest.main()
```
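The `find_all(name=..., attrs=...)` pattern used in Example 2 can be tried offline against a hardcoded table of the same shape (the rows and city name below are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hardcoded table mimicking the structure Example 2 scrapes:
# data rows carry height="29"; the city cell spans seven columns.
html = """
<table>
  <tr height="29"><td>1</td><td>Peking University</td></tr>
  <tr height="29"><td>2</td><td>Tsinghua University</td></tr>
  <tr><td colspan="7">Beijing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
names = [row.find_all("td")[1].string
         for row in soup.find_all(name="tr", attrs={"height": "29"})]
print(city, names)
# -> Beijing ['Peking University', 'Tsinghua University']
```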
BeautifulSoup supports several HTML parsers; the main ones are listed below:
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | (1) built into Python (2) moderate speed (3) tolerant of bad markup | less tolerant in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | (1) fast (2) tolerant of bad markup | requires a C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | (1) fast (2) the only parser that supports XML | requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | (1) best fault tolerance (2) parses documents the way a browser does (3) produces valid HTML5 | (1) slow (2) depends on an external extension |
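Since lxml and html5lib are optional installs, one way to exercise the table above is a small fallback helper (`parse_with` is a hypothetical name, not a BeautifulSoup API) that drops back to the standard-library parser when the requested one is missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately sloppy markup: the second <p> is never closed.
markup = "<p>hello<p>world"

def parse_with(markup, parser):
    try:
        return BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Requested parser isn't installed; fall back to the built-in one.
        return BeautifulSoup(markup, "html.parser")

# Different parsers may repair the broken markup differently.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = parse_with(markup, parser)
    print(parser, [p.get_text() for p in soup.find_all("p")])
```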