Parsing Web Pages with BeautifulSoup
from bs4 import BeautifulSoup
import requests

url = 'http://dangjian.gmw.cn/node_11940.htm'
html = requests.get(url).content

# prettify() returns the parsed document as an indented string, handy for inspection
soup = BeautifulSoup(html, 'lxml')
# print(soup.prettify())
# print(soup.find_all('span', class_="channel-newsTime"))

# Collect the article links from every news list on the index page
resultSet = soup.find_all('ul', class_="channel-newsGroup")
urls = set()
for rs in resultSet:
    # url = rs.a['href']
    # href=True skips <a> tags that have no href attribute
    hrefs = rs.find_all('a', href=True)
    for href in hrefs:
        url = href['href']
        if url.startswith("http"):
            urls.add(url)
        else:
            # Relative link: prefix it with the site root
            urls.add("http://dangjian.gmw.cn/" + url)
print(urls)

# Fetch each article and print its title and body text
for url in urls:
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find(id="articleTitle").string
    # parts = soup.find(id="contentMain")
    parts = soup.select("div #contentMain > p")
    content = ""
    for part in parts:
        # get_text() also handles paragraphs with nested tags, where
        # .string is None and would be appended as the text "None"
        content = content + part.get_text()
    print(title)
    print(content)
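
The script above turns relative links into absolute ones by prefixing them with the site root. A minimal sketch of an alternative, assuming the standard library's urllib.parse.urljoin is acceptable here: it resolves each href against the URL of the page it was found on. The helper name absolutize and the sample href values are hypothetical and only serve as illustration; they are not taken from the real site.

```python
from urllib.parse import urljoin

# URL of the index page the links were found on (same URL as in the script)
BASE_URL = "http://dangjian.gmw.cn/node_11940.htm"


def absolutize(base_url, hrefs):
    """Resolve each href against base_url and return the unique absolute URLs."""
    return {urljoin(base_url, href) for href in hrefs}


# Hypothetical href values, for illustration only
sample_hrefs = [
    "2018-01/01/content_0001.htm",            # relative to the current directory
    "/2018-01/01/content_0002.htm",           # relative to the site root
    "http://dangjian.gmw.cn/node_11941.htm",  # already absolute
]

for link in sorted(absolutize(BASE_URL, sample_hrefs)):
    print(link)
```

One reason to consider this: plain concatenation would turn a root-relative href such as "/2018-01/01/content_0002.htm" into a URL with a double slash after the host, while urljoin handles relative, root-relative, and absolute hrefs in the same call.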