Python Web Scraping, Lesson 3
Posted by helenandyoyo
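Python Web Scraping: The BeautifulSoup Module
Installing BeautifulSoup
Installed on Python 3.7.0. The exercises below pass 'lxml' as the parser, which is a separate package, so install it alongside BeautifulSoup.
pip install beautifulsoup4 lxml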
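To confirm the installation worked, a quick check along the following lines can help. This is a minimal sketch: `pick_parser` is a hypothetical helper name (not part of BeautifulSoup) that falls back to the standard library's `html.parser` when lxml is not installed.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Hypothetical helper: return 'lxml' if it is installed, else 'html.parser'."""
    try:
        import lxml  # noqa: F401  (only checking that the import succeeds)
    except ImportError:
        return "html.parser"
    return "lxml"


parser = pick_parser()
soup = BeautifulSoup("<p>ok</p>", parser)
print(parser, soup.p.string)
```

If this runs without an ImportError, BeautifulSoup is installed and at least one parser is available.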
BeautifulSoup Exercises
BeautifulSoup Exercise 1
#---------------------BeautifulSoup Exercise 1-------------------------
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="pname"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
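soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())              # pretty-printed document
print(soup.title)                   # the whole <title> tag
print(soup.title.name)              # tag name: title
print(soup.title.string)            # text inside the tag
print(soup.title.parent.name)       # name of the enclosing tag: head
print(soup.p)                       # first <p> tag only
print(soup.p["class"])              # ['title']; class is multi-valued, so it is a list
print(soup.a)                       # first <a> tag only
print(soup.find_all('a'))           # list of all <a> tags
print(soup.find(id="link3"))        # first tag whose id is link3
for link in soup.find_all('a'):
    print(link.get('href'))         # href of each link
print(soup.get_text())              # all text with the tags stripped
print(soup.p.attrs['name'])         # pname
print(type(soup.p.attrs['class']))  # <class 'list'>
print(type(soup.p['class']))        # <class 'list'>; tag[...] is shorthand for tag.attrs[...]
print(soup.a.string)                # Elsie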
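The attribute lookups in Exercise 1 behave differently depending on the attribute, which the short sketch below tries to make explicit. It uses a cut-down snippet with an extra class (`big`) added purely for illustration, and the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the second class "big" is not in the original exercise.
snippet = '<p class="title big" name="pname"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list of class names
name_attr = p["name"]     # ordinary attribute: a plain string
tag_name = p.name         # the tag's own name, not its "name" attribute
missing = p.get("id")     # .get() returns None instead of raising KeyError
text = p.b.string         # text of the nested <b> tag

print(classes, name_attr, tag_name, missing, text)
```

Using `.get()` rather than `[...]` is the safer choice when an attribute may be absent, as with `href` on some links.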
BeautifulSoup Exercise 2
#---------------------BeautifulSoup Exercise 2-------------------------
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)        # direct children as a list
print(soup.p.children)        # the same children as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)
print(soup.a.parent.name)     # p
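Printing `.contents` in Exercise 2 shows whitespace strings mixed in with the tags. The sketch below, on a shortened version of the same markup, shows one way to keep only the tag children and to walk down (`.descendants`) and up (`.parents`) the tree. It uses the built-in `html.parser`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = """<p class="story">
<a id="link1"><span>Elsie</span></a>
<a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .children yields text nodes (including newlines) as well as tags,
# so filter on Tag to keep only the elements.
child_tags = [c for c in p.children if isinstance(c, Tag)]
ids = [t["id"] for t in child_tags]

# .descendants walks the whole subtree in document order.
descendant_names = [d.name for d in p.descendants if isinstance(d, Tag)]

# .parents walks upward from a tag toward the document root.
ancestor_names = [t.name for t in soup.span.parents]

print(len(p.contents), ids, descendant_names, ancestor_names)
```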
BeautifulSoup Exercise 3
#---------------------BeautifulSoup Exercise 3-------------------------
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
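print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))  # [] here: no tag in this HTML has a name attribute
print(soup.find_all(string='Foo'))  # string= replaces the older text= argument (Beautiful Soup 4.4.0+)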
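Beyond the `attrs` dictionary used in Exercise 3, the same lookups can be written with the `class_` keyword or a CSS selector. The following is a sketch on a trimmed copy of the exercise's markup, parsed with the built-in `html.parser`; `select()` relies on the soupsieve package, which pip normally installs together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# attrs takes a dict mapping attribute names to the values to match.
by_id = soup.find_all(attrs={"id": "list-1"})

# class_ matches any tag that has "list" among its classes, so both <ul> tags.
by_class = soup.find_all("ul", class_="list")

# A CSS selector narrows the search to items inside the small list.
small_items = [li.get_text() for li in soup.select("ul.list-small li.element")]

# string= matches text nodes rather than tags.
foo_strings = soup.find_all(string="Foo")

print(len(by_id), len(by_class), small_items, len(foo_strings))
```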
BeautifulSoup Exercise 4
#---------------------BeautifulSoup Exercise 4-------------------------
# Scrape book information from Douban Books (https://book.douban.com/)
from bs4 import BeautifulSoup
import requests

response = requests.get('https://book.douban.com/')
soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')
# Query the page once instead of on every loop iteration
titles = soup.find_all('div', attrs={'class': 'title'})
authors = soup.find_all('div', attrs={'class': 'author'})
book_list = []
for i in range(len(titles)):
    book_dict = {}
    book_dict['title'] = titles[i].a.string
    book_dict['link'] = titles[i].a.get('href')
    book_dict['author'] = authors[i].string
    if book_dict['author'] is not None:
        book_dict['author'] = book_dict['author'].strip()
    else:
        book_dict['author'] = 'Unknown'
    book_list.append(book_dict)
print(book_list)
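Exercise 4 pairs titles and authors by list index, which breaks if the page has a title without a matching author block. One possible alternative, sketched below, looks up each author relative to its own title. The sample markup, the example.com links, and the function name `parse_books` are all hypothetical stand-ins, not Douban's real page structure, so the parsing step can be tried without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the div.title / div.author pattern.
SAMPLE_HTML = """
<ul>
  <li><div class="title"><a href="https://example.com/book/1">Book One</a></div>
      <div class="author"> Author A </div></li>
  <li><div class="title"><a href="https://example.com/book/2">Book Two</a></div>
      <div class="author"><span></span></div></li>
</ul>
"""


def parse_books(html):
    """Return a list of dicts with title, link, and author for each book."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for title_div in soup.find_all("div", attrs={"class": "title"}):
        link = title_div.a
        if link is None:
            continue  # skip title blocks that contain no link
        # Find the author block that follows this particular title.
        author_div = title_div.find_next_sibling("div", attrs={"class": "author"})
        author = author_div.get_text(strip=True) if author_div else ""
        books.append({
            "title": link.get_text(strip=True),
            "link": link.get("href"),
            "author": author or "Unknown",
        })
    return books


books = parse_books(SAMPLE_HTML)
print(books)
```

The same function could be given the decoded body of a `requests` response instead of `SAMPLE_HTML`, provided the real page places each author block as a sibling of its title block, which is an assumption here.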
Note: study material adapted from https://www.cnblogs.com/zhaof/p/6930955.html