9.3.4 BeautifulSoup4
Posted Avention
BeautifulSoup is an excellent Python extension library for extracting the data you are interested in from HTML or XML files, and it lets you specify which parser to use.
Install the module directly with pip install beautifulsoup4. Once it is installed, import it with from bs4 import BeautifulSoup.
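Because the parser is chosen by name in the second argument, the same input can yield different trees. The following is a minimal sketch, assuming only the html.parser backend from Python's standard library; the variable names are illustrative. Other backends such as lxml are separate packages and may repair the same markup differently, for example by wrapping it in html and body tags.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: an <a> that is never closed, then a stray </p>.
broken = "<a></p>"

# html.parser is bundled with Python, so no extra package is required.
repaired = str(BeautifulSoup(broken, "html.parser"))
print(repaired)

# A bare string stays bare with html.parser: no html/body/p wrapper is added.
plain = str(BeautifulSoup("hello world", "html.parser"))
print(plain)
```

If the result has to be identical across machines, it is safer to always pass the parser name explicitly rather than rely on whichever parser happens to be installed.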
The session below gives a brief demonstration of what BeautifulSoup4 can do; for more detailed and complete learning material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
>>> from bs4 import BeautifulSoup
>>>
>>> # Automatically add and complete missing tags (the 'lxml' parser requires the lxml package)
>>> BeautifulSoup('hello world', 'lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> # Define the content of an HTML document
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
>>>
>>> # Parse the HTML document and display it in a nicely formatted way
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> # Tag name
>>> soup.title.name
'title'
>>>
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>>
>>> # Parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> # Treat the whole BeautifulSoup object as a tag object
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class')  # Attributes can also be read this way
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> # View all attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a', 'b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> # Find tags whose href contains a given keyword
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a', id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
        print(link.text, ':', link.get('href'))


Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text())  # Return all text

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...

>>>
>>> # Modify a tag attribute
>>> soup.a['id'] = 'test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> # Modify a tag's text (replace_with returns the string that was replaced)
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Iterate over the children of body (the whitespace between tags counts as a child too)
>>> for child in soup.body.children:
        print(child)




<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>>
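The interactive steps above can also be collected into a standalone script. The following is a minimal sketch rather than a definitive implementation: the function name extract_links and the constant HTML_DOC are hypothetical names chosen for illustration, and html.parser is assumed as the parser so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# The same sample document used in the session above.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that carries an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the tags that actually have an href attribute,
    # so indexing a["href"] below cannot raise KeyError.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for text, href in extract_links(HTML_DOC):
        print(text, ":", href)
```

Returning plain tuples instead of Tag objects keeps the result independent of the soup, which makes it easier to store or compare later.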