BeautifulSoup Notes

Posted by 河南骏


Basic usage of BeautifulSoup

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Pretty-print the parsed document; the parser has filled in the missing closing tags
print(bs4.prettify())
# Print the text inside the title tag
print(bs4.title.string)
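
The sample document above never closes its body and html tags, yet the pretty-printed output is complete. A minimal sketch of that repair behaviour follows. It assumes the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies, and the short fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two unclosed tags (not taken from the notes above).
broken = "<p>Hello<b>world"

# 'html.parser' ships with Python; 'lxml' would additionally wrap the
# fragment in <html> and <body> tags.
soup = BeautifulSoup(broken, "html.parser")

# Serialising the tree shows the closing tags the parser added.
repaired = str(soup)
print(repaired)
```

The repair happens while parsing, so prettify() only changes the layout of an already complete tree.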

3. Using BeautifulSoup tag selectors

3.1. Selecting elements

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the title tag:  <title>The Dormouse's story</title>
print(bs4.title)
# Print the type of the title tag:  <class 'bs4.element.Tag'>
print(type(bs4.title))
# Print the head tag
print(bs4.head)
# Print the type of the head tag:  <class 'bs4.element.Tag'>
print(type(bs4.head))
# Get the title tag inside the head tag
print(bs4.head.title)
# Print the p tag (only the first match is returned)
print(bs4.p)

As the code above shows, a tag selected from the parsed document is itself a BeautifulSoup Tag object, so it can be selected from and filtered again.
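
A small sketch of that point: every selection step returns a Tag, so the steps can be chained. The shortened HTML string and the variable names are illustrative, and 'html.parser' is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup, Tag

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

head = soup.head        # the first selection returns a Tag
title = head.title      # a Tag can be selected from again
bold = soup.body.p.b    # chaining can go several levels deep

print(type(head), type(title))
print(title.string)
print(bold.name)
```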

3.2. Getting the tag name

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the name of the selected tag:  title
print(bs4.title.name)

3.3. Getting attributes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the value of the p tag's name attribute
print(bs4.p['name'])
# Same value, read through the attrs dictionary
print(bs4.p.attrs['name'])
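
Two details of attribute access are easy to miss, so here is a hedged sketch of them: the class attribute comes back as a list, and get() avoids a KeyError for an attribute that is absent. The one-line HTML string is a made-up variant of the sample document, and 'html.parser' is an assumption so the snippet needs no extra packages.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: the p tag carries two classes and a name attribute.
html = """<p class="title story" name="dromouse"><b>The Dormouse's story</b></p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_value = p["name"]     # indexing raises KeyError if the attribute is missing
class_value = p["class"]   # class is multi-valued, so a list is returned
missing = p.get("id")      # get() returns None instead of raising
all_attrs = p.attrs        # plain dict of every attribute on the tag

print(name_value, class_value, missing)
print(all_attrs)
```

Note that p["name"] reads the HTML attribute called name, while p.name is the tag name ("p").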

3.4. Getting content

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

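bs4 = BeautifulSoup(html, 'lxml')
# Print the text inside the title tag
print(bs4.title.string)
# Print the content of the first a tag; its only child is a comment,
# so .string returns the comment text without the <!-- --> markers
print(bs4.a.string)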
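
Because .string hands back the comment text as if it were ordinary text, it can help to check the type of the result. The sketch below shows one way to do that. The trimmed HTML and variable names are illustrative, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup, Comment

html = """<p class="story">
<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

# The first link contains only a comment, the second only plain text.
first_is_comment = isinstance(first.string, Comment)
second_is_comment = isinstance(second.string, Comment)

# A tag with more than one child has no single string, so .string is None.
p_string = soup.p.string

print(first_is_comment, second_is_comment, p_string)
```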

3.5. Nested selection

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the text of the title tag inside the head tag
print(bs4.head.title.string)

3.6. Children and descendants

3.6.1. contents

from bs4 import BeautifulSoup

html = """
<html>
   <head>
       <title>The Dormouse's story</title>
   </head>
   <body>
       <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a href="http://example.com/elsie" class="sister" id="link1">
               <span>Elsie</span>
           </a>
           <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
           and
           <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
           and they lived at the bottom of a well.
       </p>
       <p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the direct children of the p tag as a list
print(bs4.p.contents)
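
The list returned by contents also holds the text between tags, including whitespace-only strings from the indentation. A minimal sketch of filtering it down to tags, using a shortened, illustrative version of the HTML and assuming 'html.parser':

```python
from bs4 import BeautifulSoup, Tag

html = """<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

contents = soup.p.contents

# Keep only element nodes; text and whitespace nodes are dropped.
tag_children = [node for node in contents if isinstance(node, Tag)]
tag_ids = [tag["id"] for tag in tag_children]

print(len(contents), tag_ids)
```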

3.6.2. children

from bs4 import BeautifulSoup

html = """
<html>
   <head>
       <title>The Dormouse's story</title>
   </head>
   <body>
       <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a href="http://example.com/elsie" class="sister" id="link1">
               <span>Elsie</span>
           </a>
           <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
           and
           <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
           and they lived at the bottom of a well.
       </p>
       <p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Get the direct children of the p tag; .children returns an iterator, not a list
print(bs4.p.children)
# Loop over the children
for i, child in enumerate(bs4.p.children):
    print(i, child)

3.6.3. descendants

from bs4 import BeautifulSoup

html = """
<html>
   <head>
       <title>The Dormouse's story</title>
   </head>
   <body>
       <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a href="http://example.com/elsie" class="sister" id="link1">
               <span>Elsie</span>
           </a>
           <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
           and
           <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
           and they lived at the bottom of a well.
       </p>
       <p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Get every node below the p tag (children and deeper descendants); .descendants returns a generator
print(bs4.p.descendants)
# Loop over the descendants
for i, child in enumerate(bs4.p.descendants):
    print(i, child)
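
To make the difference between children and descendants concrete, the sketch below collects the tag names seen by each. The compact HTML string is illustrative and 'html.parser' is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

html = """<p><a id="link1"><span>Elsie</span></a><a id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# children stops at the first level below the p tag.
child_names = [n.name for n in soup.p.children if isinstance(n, Tag)]

# descendants walks the whole subtree in document order.
descendant_names = [n.name for n in soup.p.descendants if isinstance(n, Tag)]

print(child_names)
print(descendant_names)
```

The span only shows up in the second list because it sits one level deeper, inside the first a tag.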

3.7. Parents and ancestors

3.7.1. parent

from bs4 import BeautifulSoup

html = """
<html>
   <head>
       <title>The Dormouse's story</title>
   </head>
   <body>
       <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a href="http://example.com/elsie" class="sister" id="link1">
               <span>Elsie</span>
           </a>
           <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
           and
           <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
           and they lived at the bottom of a well.
       </p>
       <p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Print the parent of the first a tag
print(bs4.a.parent)

3.7.2. parents

from bs4 import BeautifulSoup

html = """
<html>
   <head>
       <title>The Dormouse's story</title>
   </head>
   <body>
       <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a href="http://example.com/elsie" class="sister" id="link1">
               <span>Elsie</span>
           </a>
           <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
           and
           <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
           and they lived at the bottom of a well.
       </p>
       <p class="story">...</p>
"""

bs4 = BeautifulSoup(html, 'lxml')
# Loop over all ancestors of the first a tag (the a tag is assumed here, as in 3.7.1)
for i, parent in enumerate(bs4.a.parents):
    print(i, parent)
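
Printing whole ancestor nodes produces a lot of output, so a shorter sketch that lists only their names may be easier to read. The HTML string is a trimmed, illustrative version of the sample, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

html = """<html><body><p class="story"><a id="link1"><span>Elsie</span></a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# parent is the single enclosing tag; parents walks up to the document root.
direct_parent = soup.a.parent.name
ancestor_names = [parent.name for parent in soup.a.parents]

print(direct_parent)
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is '[document]'.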
