Python Crawler Learning Notes: Using Parsing Libraries (BeautifulSoup)

Posted by 玛卡巴卡巴巴亚卡

I. Overview

BeautifulSoup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages.

BeautifulSoup relies on an underlying parser to do the actual parsing. Besides the HTML parser in the Python standard library (html.parser), it also supports lxml (for both HTML and XML) and html5lib.

Usage:

BeautifulSoup(markup, "html.parser")  # or "lxml", "xml", "html5lib"
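
As a minimal sketch of how the parser choice is passed in (assuming lxml and html5lib are installed; the built-in html.parser needs no extra installation):

from bs4 import BeautifulSoup

broken_html = '<ul><li>Foo<li>Bar'   # deliberately unclosed tags
# Each parser repairs broken markup a little differently:
# html.parser is always available, lxml is fast, html5lib is the most
# lenient (browser-like) but also the slowest.
print(BeautifulSoup(broken_html, 'html.parser').prettify())
print(BeautifulSoup(broken_html, 'lxml').prettify())
print(BeautifulSoup(broken_html, 'html5lib').prettify())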

II. Basic Usage

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

Note that the HTML string is incomplete (the closing body and html tags are missing); BeautifulSoup completes and corrects it during initialization. The first argument is the HTML string and the second is the parser type; the result is assigned to soup. The prettify() method outputs the parsed document in a standard indented format, and soup.title.string outputs the text content of the title node.

III. Node Selectors

1. Selecting Elements

Simply accessing a tag name as an attribute selects that node element, and accessing string on it gives the text inside the node.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)




<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

soup.title selects the title node; its type is bs4.element.Tag, an important data structure in BeautifulSoup. soup.title.string outputs the text inside the node, soup.head returns the whole head node, and soup.p returns the first p node. When multiple nodes match, only the first one is returned.

2. Extracting Information

Ways to extract information:

(1) Getting the name

Use the name attribute to get the node's name:

print(soup.title.name)

This outputs title.

(2) Getting attributes

Each node can have multiple attributes. After selecting the node element, call attrs to get all of them.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs)
print(soup.p.attrs['name'])

{'class': ['title'], 'name': 'dromouse'}
dromouse

attrs returns a dictionary. You can also skip attrs and simply write soup.p['name'] or soup.p['class'].

If an attribute can take only a single value (such as id or name), a string is returned; if it is a multi-valued attribute such as class, a list is returned.
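
For example, against the first p node above (a quick check added for illustration), name comes back as a plain string while the multi-valued class comes back as a list:

print(soup.p['name'], type(soup.p['name']))    # dromouse <class 'str'>
print(soup.p['class'], type(soup.p['class']))  # ['title'] <class 'list'>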

(3) Getting content

Use the string attribute to get the text contained in a node element, for example the text of the first p node:

print(soup.p.string)

The Dormouse's story
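
One caveat: string only returns text when the node has a single child (or a single chain of children). On a node that mixes several children, such as the second p here, it returns None, and get_text() returns the concatenated text of all descendants instead. A quick check (find_all() is introduced in section 5 below):

story_p = soup.find_all('p')[1]   # the second <p>, which mixes text and <a> tags
print(story_p.string)             # None, because the node has several children
print(story_p.get_text())         # concatenated text of all descendants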

3. Nested Selection

Every returned result is of type bs4.element.Tag, so you can keep calling node selections on it to drill down further.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

4. Relational Selection

(1) Children and descendants

After selecting a node element, you can get its direct children via the contents attribute:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

The result contains all the tags and text inside the p tag, returned as a list of its direct child nodes.

Using the children attribute gives the same nodes, this time as an iterator:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x000001C2C7F7C700>
0
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
            and they lived at the bottom of a well.

To get all descendant nodes (children, grandchildren, and so on), use the descendants attribute.
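
The loop is presumably the same as the children example above, with descendants swapped in; run against the same html it produces the output below:

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)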

<generator object Tag.descendants at 0x0000025ACDEB5A50>
0
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <span>Elsie</span>
4 Elsie
5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
            and they lived at the bottom of a well.

(2) Parent and ancestor nodes

To get the parent node of an element, call the parent attribute:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.parent)

<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>

Since we selected the first p node, its parent is the body node, and printing it gives the entire content of body (the parent node is printed together with everything inside it). Note that parent returns only the direct parent.

To get all ancestor nodes, use the parents attribute:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

<class 'generator'>
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>)]

The result is a generator; turning it into an enumerated list prints each ancestor's index and content. The elements of the list are the ancestor nodes of the a node, ending with the document itself.

(3) Sibling nodes

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
soup = BeautifulSoup(html,'lxml')
print('Next Sibling',soup.a.next_sibling)
print('Prev Sibling',soup.a.previous_sibling)
print('Next Siblings',list(enumerate(soup.a.next_siblings)))
print('Prev Siblings',list(enumerate(soup.a.previous_siblings)))

Next Sibling
            Hello
            
Prev Sibling
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

next_sibling and previous_sibling return the next and previous sibling of a node, respectively, while next_siblings and previous_siblings return generators over all following and all preceding siblings.

(4) Extracting information from related nodes

The same techniques can be used to extract the key information, text or attributes, from these related nodes:

from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
soup = BeautifulSoup(html,'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Next Sibling:
<class 'bs4.element.NavigableString'>

            Hello
            

            Hello
            
Parent:
<class 'generator'>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
            Hello
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
['story']

If the result is a single node, you can call string, attrs, and similar properties on it directly to get its text and attributes. If the result is a generator yielding multiple nodes, convert it to a list first, pick out an element, and then call string or attrs on that element.

5. Method Selectors

(1) find_all()

find_all() queries for all elements that match the given criteria. Pass it a tag name, attributes, or text, and it returns every matching element.

find_all(name, attrs, recursive, text, **kwargs)

a. name

Query elements by node (tag) name:

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

Passing name='ul' finds all ul nodes; the result is returned as a list.

Because each element is of type Tag, you can still nest queries, here finding all li nodes and their text content:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

b. attrs

You can also query by attributes:

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[]

The attrs argument takes a dictionary. To query for the node whose id is list-1, pass attrs={'id': 'list-1'}; the result is again a list.

For commonly used attributes such as id and class, you can skip attrs and pass them directly as keyword arguments. Because class is a reserved word in Python, it must be written with a trailing underscore, e.g. class_='element'. The result is still a list of Tags:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

c. text

The text parameter matches against node text; you can pass a plain string or a compiled regular expression. (In newer versions of Beautiful Soup this parameter is also available under the name string.)

import re

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text=re.compile('link')))

['Hello, this is a link', 'Hello, this is a link, too']
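
d. recursive

recursive controls whether find_all() searches the whole subtree below a node (the default, recursive=True) or only its direct children. A minimal sketch, using a trimmed-down version of the panel HTML from the name and attrs examples above:

from bs4 import BeautifulSoup

html = """
<div class="panel-body">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
        <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
body_div = soup.find(name='div')
print(len(body_div.find_all(name='li')))                   # 5: the whole subtree is searched
print(len(body_div.find_all(name='li', recursive=False)))  # 0: <li> nodes are not direct children of the div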

(2) find()

find() returns only the first matching element, whereas find_all() returns all of them. The return value is no longer a list but the first matching node element itself, still of type Tag.

import re

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

Besides find() and find_all(), there are several similar query methods:

find_parents and find_parent: the former returns all ancestor nodes, the latter returns the direct parent node.

find_next_siblings and find_next_sibling: the former returns all following sibling nodes, the latter returns the first following sibling.

find_previous_siblings and find_previous_sibling: the former returns all preceding sibling nodes, the latter returns the first preceding sibling.

find_all_next and find_next: the former returns all matching nodes after the current node, the latter returns the first matching node after it.

find_all_previous and find_previous: the former returns all matching nodes before the current node, the latter returns the first matching node before it.
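
These variants take the same arguments as find() and find_all(). A minimal sketch of a few of them, reusing the "three little sisters" snippet from the sibling-node example above:

from bs4 import BeautifulSoup

html = """
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
    Hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
"""
soup = BeautifulSoup(html, 'lxml')
first_a = soup.find('a', id='link1')
print(first_a.find_parent('p')['class'])      # ['story'] -- the nearest enclosing <p>
print(first_a.find_next_sibling('a').string)  # Lacie -- the next <a> among its siblings
print(first_a.find_all_next('a'))             # the two <a> nodes that follow it in the document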

6. CSS Selectors

Just call the select() method and pass in the appropriate CSS selector.

from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))  # select all li nodes under every ul node
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

(1) Nested selection

The select() method also supports nested selection.

First select all ul nodes, then iterate over each ul and select its li nodes:

soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

(2) Getting attributes

for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2

An attribute value can be retrieved either with square brackets directly or via the attrs attribute.

(3) Getting text

You can use string, or call get_text():

for li in soup.select('li'):
    print('Get Text:',li.get_text())
    print('String:',li.string)

Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar

 
