Removing tags when using BeautifulSoup
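I'm trying to scrape a website in Kodi for a personal script. My code runs, but when BeautifulSoup renders the content it still contains the tags. I'm new to Python, so I'm looking for answers that are easy to understand.
Current output: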
<li>
<span style="font-family:trebuchet ms,helvetica,sans-serif;">
<span style="font-size:16px;color:#EFEFEF;">
04:30 - 05:30 The Tonight Show Starring Jimmy Fallon
<span style="color:#999999;">
- Channel 34
</span>
</span>
</span>
</li>
Desired output:
04:30 - 05:30 The Tonight Show Starring Jimmy Fallon - Channel 34
My code:
import xbmcgui
import xbmcaddon
import urllib, urllib2, re, HTMLParser, os
from bs4 import BeautifulSoup
pg_source = ''
req = urllib2.Request('http://rushmore.tv/schedule')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')
try:
    response = urllib2.urlopen(req)
    pg_source = response.read().decode('utf-8', 'ignore')
    response.close()
except:
    pass
content = []
soup = BeautifulSoup(pg_source)
content = BeautifulSoup(soup.find('ul', { 'id' : 'myUL' }).prettify())
xbmcgui.Dialog().textviewer(str(content), str(content))
xbmcgui.Window
Thanks.
Answer
This may help. I use the find_next method to get the second span:
from bs4 import BeautifulSoup
d = """<li>
<span style="font-family:trebuchet ms,helvetica,sans-serif;">
<span style="font-size:16px;color:#EFEFEF;">
04:30 - 05:30 The Tonight Show Starring Jimmy Fallon
<span style="color:#999999;">
- Channel 34
</span>
</span>
</span>
</li>"""
soup = BeautifulSoup(d, "html.parser")
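# The first span's next span in document order is the nested one that wraps the whole entry
print(soup.find("li").span.find_next('span').get_text())
# or match that span directly by its style attribute
print(soup.find("li").find("span", {"style": "font-size:16px;color:#EFEFEF;"}).text)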
Output (apart from the line breaks that get_text() keeps from the markup):
04:30 - 05:30 The Tonight Show Starring Jimmy Fallon - Channel 34
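Called with no arguments, `get_text()` keeps the newlines that sit between the nested spans, so the text can come out spread over several lines. Here is a minimal sketch, assuming the same markup as in the question, that collapses it to a single line by passing a separator and `strip=True`. The helper name `one_line_text` is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<li>
<span style="font-family:trebuchet ms,helvetica,sans-serif;">
<span style="font-size:16px;color:#EFEFEF;">
04:30 - 05:30 The Tonight Show Starring Jimmy Fallon
<span style="color:#999999;">
- Channel 34
</span>
</span>
</span>
</li>"""


def one_line_text(tag):
    # Hypothetical helper: strip every text fragment, drop the empty ones,
    # and join what is left with a single space.
    return tag.get_text(" ", strip=True)


soup = BeautifulSoup(html, "html.parser")
line = one_line_text(soup.find("li"))
print(line)
```

Because the tags themselves are never part of `get_text()`, this also works on the whole `<li>` without first locating a particular span.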
Another answer
Try the following approach. It should get you the result you want.
import requests
from bs4 import BeautifulSoup
res = requests.get("http://rushmore.tv/schedule")
soup = BeautifulSoup(res.text,"lxml")
for content in soup.select("#myUL span[style*='#EFEFEF']"):
    print(content.text)
Partial output:
16:00 - 20:00 Tennis: ATP Dubai - 13 (720P / US) & 21 (720P / CA)
16:00 - 20:00 Tennis: ATP Dubai - 13 (720P / US)
17:00 - 21:00 Golic and Wingo - Channel 03
18:45 - 23:00 Snooker: Welsh Open - Channel 105
20:00 - 23:30 The Dan Patrick Show - Channel 11
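The selector above relies on `[style*='#EFEFEF']`, which matches any span whose `style` attribute contains that substring. Of the three nested spans, only the one wrapping a whole entry qualifies. The sketch below tries the same idea offline. The sample markup is an assumption modeled on the `<li>` from the question (the live page may differ), the helper name `schedule_lines` is hypothetical, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the <li> shown in the question;
# the live page's markup may differ.
SAMPLE = """<ul id="myUL">
<li><span style="font-family:trebuchet ms,helvetica,sans-serif;">
<span style="font-size:16px;color:#EFEFEF;">
17:00 - 21:00 Golic and Wingo
<span style="color:#999999;">- Channel 03</span>
</span></span></li>
<li><span style="font-family:trebuchet ms,helvetica,sans-serif;">
<span style="font-size:16px;color:#EFEFEF;">
18:45 - 23:00 Snooker: Welsh Open
<span style="color:#999999;">- Channel 105</span>
</span></span></li>
</ul>"""


def schedule_lines(page_html):
    """Return one plain-text line per schedule entry (illustrative helper)."""
    soup = BeautifulSoup(page_html, "html.parser")
    # [style*='#EFEFEF'] keeps spans whose style attribute contains that
    # substring, so only the span wrapping a whole entry is selected.
    spans = soup.select("#myUL span[style*='#EFEFEF']")
    return [span.get_text(" ", strip=True) for span in spans]


lines = schedule_lines(SAMPLE)
print("\n".join(lines))
```

In the Kodi script from the question, the joined string could then be passed as the text argument of `xbmcgui.Dialog().textviewer` in place of `str(content)`.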