python爬虫入门

Posted 临风而眠

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了python爬虫入门相关的知识,希望对你有一定的参考价值。

python爬虫入门(3)

​ Beautiful Soup 库入门

1.安装及测试

安装

​ 在cmd命令行中输入pip install beautifulsoup4,即可安装

测验

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")#获取网页源代码
demo = r.text
soup = BeautifulSoup(demo, "html.parser")#解析demo的解释器,解析得到一个变量
print(soup.prettify())

运行结果如下:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

基本用法

from bs4 import BeautifulSoup
#bs4是beautifulsoup4  BeautifulSoup是一个类
soup = BeautifulSoup('<p>data</p>','html.parser')
#'<p>data</p>'是需要解析的html信息,  'html.parser'是解析器

2.BeautifulSoup库基本元素

①html标签

​ html有各种标签,可以看作"标签树"

标签基本结构如下:

image-20210710111338219

​ 属性由键值对构成

​ 而Beautiful Soup库正是解析、遍历、维护"标签树"的功能库

​ 如HTML,XML

②库的引用

​ Beautiful Soup库,也叫beautifulsoup4或bs4

引用方式

from bs4 import BeautifulSoup
#or
import bs4

③Beautiful Soup类

可认为下面三者等价

image-20210710164459245

image-20210710164623175

Beautiful Soup库解析器

image-20210710164740310

Beautiful Soup类的基本元素

image-20210710164913975

获取Tag
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
#print(r.text)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")#解析demo的解释器
# print(soup.prettify())
print(soup.title)
tag = soup.a
print(tag)

运行结果

<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

​ html中a标签

​ 标签定义超链接,用于从一张页面链接到另一张页面。

​ 元素最重要的属性是 href 属性,它指示链接的目标。

获取Name

获取标签名字

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo , "html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)

运行结果

a
p
body
获取Attributes

获取标签属性

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo , "html.parser")
tag = soup.a
print(tag.attrs)#标签属性.attrs
print(type(tag.attrs))#标签属性的类型 (字典类型)
print(tag.attrs['class'])#是字典,可以提取每一个属性的值
print(tag.attrs['href'])
print(type(tag))#标签的类型

运行结果

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
<class 'dict'>
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'bs4.element.Tag'>

​ tag是bs4.element.Tag类型,是bs4给标签定义的一种特殊类型

获取NavigableString

navigablestring元素

​ 字符串类型,尖括号标签之间的字符串,可跨越多个标签层次

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup= BeautifulSoup(demo,"html.parser")
tag=soup.a
print(soup.a)
print(soup.a.string)
print(soup.p)
print(soup.p.string)
print(type(soup.p.string))

运行结果

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
<class 'bs4.element.NavigableString'>

​ Navigablestring也是bs4定义的类型

获取Comment

Comment类型

from bs4 import BeautifulSoup
soup2 = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(soup2.b.string)
print(type(soup2.b.string))
print(soup2.p.string)
print(type(soup2.p.string))

运行结果

This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>

​ 虽然是注释,但并没有标明是注释,(还是会提取注释)所以依据类型来判断比较好(bs4.element.Comment)

image-20210710235952774

​ 如果不希望提取注释,就要对类型进行相关判断

3.基于bs4库的HTML格式输出

​ 之前没有用bs4之前,爬下来的html都乱成一团,没有标签树的层次感,

​ 而bs4可以让其更好地显示

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")

demo = r.text
soup = BeautifulSoup(demo, "html.parser")#解析demo的解释器
print(soup.prettify())

运行结果

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

​ prettify()可以给html的标签和文本增加换行符

​ 还可以对标签进行处理

print(soup.a.prettify())

运行结果

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

​ bs4库将读入的html等都转化为了utf-8编码

image-20210711002012931

4.基于bs4库的HTML内容遍历方法

image-20210711002358562 image-20210711002233873

​ 树形结构

①下行遍历

image-20210711002502469

.contents

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)#节点不仅包括tag,还包括字符串等
print(soup.body.contents)#body的儿子节点的数量
print(soup.body.contents[1])#利用列表下标

运行结果

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\\n']
['\\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\\n']
<p class="title"><b>The demo python introduces several python courses.</b></p>

.children

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
for child in soup.body.children:
    print(child)

运行结果

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

②上行遍历

image-20210711003351306
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)#html是最高级别的,其parent就是自己
print(soup.parent)#是空的,无parent

运行结果

<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
None

遍历一个标签的所有先辈标签

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

运行结果

p
body
html
[document]

③平行遍历

image-20210711004004467

​ 其中后两个是迭代类型,只能用在for in 中

平行遍历的条件:必须发生在同一个父节点下的各节点间

image-20210711004336530

​ 标签树的navigablestring也构成了标签树的节点

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)#标签树的navigablestring也构成了标签树的节点
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)#空信息

运行结果

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

None

用循环来遍历

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo =r.text
soup = BeautifulSoup(demo,"html.parser")
for sibling in soup.a.next_siblings:
    print(sibling)

运行结果

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

​ 回顾

image-20210711005834258 image-20210711000523697

5.遇到的问题

①回顾上次

image-20210710103752870

上一篇文章里爬取IP的代码居然失效了🤔❓❓❓

​ 把try 和 except去掉,搜了一下报错,看到了这篇文章,解决了

原来是因为我把文件名重命名为了requests.py

​ 而奇怪的事情又发生了🤔,我重新改成requests_learnin

以上是关于python爬虫入门的主要内容,如果未能解决你的问题,请参考以下文章

python爬虫入门

如何入门 Python 爬虫?

python网络爬虫入门

python爬虫入门

如何入门 Python 爬虫

python爬虫如何入门