Python: extract data from HTML tags [duplicate]
[Posted]: 2018-05-07 00:14:24
[Question]: I want to extract the text (the paragraph) inside the HTML tags below, using Python:
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
My code is:
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup
x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
p1 = HTMLParser()
p1.unescape(x)
bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
print(bdy_soup)
This code does not return anything. Please help me with this; any help would be greatly appreciated.
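[Comments]:
Are you reading from an HTML page or from a text file?
@prakash-palnati --- I am reading it from a SQL table.
@s.s You can use BeautifulSoup to extract exactly the data you need. First do import html >>> html.unescape(x).
@manoj jadhav Can you explain the code?
@s.s See my post.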
[Answer 1]:
- Use html.unescape to convert HTML character entities (such as &lt; or &quot;) back into plain characters.
- Use bs4.BeautifulSoup(html_content).text to extract the content.
>>> x = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
>>> import html
>>> xx = html.unescape(x)
>>> xx
'<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'
>>> import bs4
>>> bs4.BeautifulSoup(xx, "html").text
' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '
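The two steps above can be combined into one small helper. This is only a minimal sketch, assuming beautifulsoup4 is installed; the function name extract_text and the shortened sample string are made up for illustration, and it names the built-in "html.parser" backend explicitly so that no extra parser package is needed.

```python
import html

from bs4 import BeautifulSoup


def extract_text(raw):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    # Step 1: turn entities such as &lt; and &quot; back into real characters.
    decoded = html.unescape(raw)
    # Step 2: parse the markup and keep only the text nodes.
    soup = BeautifulSoup(decoded, "html.parser")
    # strip=True trims the whitespace around each text node.
    return soup.get_text(separator=" ", strip=True)


# Shortened, fully escaped sample (an assumption about how the data may be stored).
escaped = "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt; Irrespective of the kind of small business you own. &lt;/span&gt;&lt;/p&gt;"
print(extract_text(escaped))
```

Calling html.unescape on markup that contains no entities leaves it unchanged, so the helper can be applied to both escaped and unescaped input.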
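[Comments]:
This doesn't work. Can you explain the code?
I have modified it.
@s.s I have edited my question, please take a look.
Thanks for your help, I have posted my answer.
[Answer 2]:
You can do it like this. Install beautifulsoup4 first; HTMLParser is part of the Python 2 standard library (it is html.parser in Python 3), so it needs no separate install.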
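# Python 2 code: HTMLParser.unescape() was removed in Python 3.9; use html.unescape() there.
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

# Triple quotes let the string contain the double quotes used by the HTML attributes.
p = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
p1 = HTMLParser()
# Decode HTML entities, then parse the markup and keep only the text.
bdy_soup = BeautifulSoup(p1.unescape(p), "html.parser").get_text(separator="\n")
print(bdy_soup)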
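Because this answer targets Python 2 while the other answers use Python 3's html module, a version-tolerant import may help. The following is a sketch rather than part of the original answer; the sample string is made up. HTMLParser.unescape was deprecated in Python 3.4, when html.unescape was added, and removed in Python 3.9.

```python
try:
    # Python 3.4+: unescape() is a plain function in the html module.
    from html import unescape
except ImportError:
    # Python 2: fall back to the HTMLParser method used in this answer.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

escaped = "&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;"  # made-up sample
print(unescape(escaped))
```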
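[Comments]:
The text in p = "<p ......... gt;</p>" shows an error.
@s.s What is the exact input? Can you post the full snippet?
Exactly this === <p style="text-align: justify;><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.</span></p>
Your code runs but does not return any output... what do I need to print...?
Please add print bdy_soup. What do you get in bdy_soup?
[Answer 3]: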
You can use a regular expression to extract the data between two HTML tags:
r'<title[^>]*>([^<]+)</title>'
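As a rough illustration of how this pattern might be applied with Python's re module (the sample page string is made up, and the name TITLE_RE is hypothetical):

```python
import re

# The pattern from this answer: capture the text between <title ...> and </title>.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Made-up sample input for illustration.
page = '<html><head><title lang="en">Small business marketing</title></head></html>'

match = TITLE_RE.search(page)
if match:
    # group(1) holds the text captured between the opening and closing tag.
    print(match.group(1))
```

Note that ([^<]+) stops at the first <, so the pattern finds nothing when the element contains nested tags, as with the <p><span>...</span></p> markup in the question. For that kind of input an HTML parser such as BeautifulSoup is usually the more robust choice.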
[Comments]:
[Answer 4]:
The code worked after installing the lxml parser. Thank you, everyone, for your help.
import html
import bs4  # the "lxml" parser used below also requires the lxml package to be installed

x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
# Decode HTML entities such as &quot; before parsing.
p1 = html.unescape(x)
# Parse with lxml and join the text nodes with newlines.
bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")
print(bdy_soup)
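Since this answer depends on lxml being installed, one possible variation is to fall back to Python's built-in parser when lxml is missing. This is a sketch under that assumption, not part of the original answer; parse_with_fallback is a hypothetical helper name and the sample string is shortened.

```python
import html

from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(markup):
    """Parse with lxml when available, else with the built-in parser (hypothetical helper)."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Shortened sample that mixes literal tags with escaped quotes, as in the question.
x = "<p style=&quot;text-align: justify;&quot;><span> Irrespective of the kind of small business you own. </span></p>"
text = parse_with_fallback(html.unescape(x)).get_text(separator="\n", strip=True)
print(text)
```

The two parsers can build slightly different trees for malformed markup, so results may differ on messy input even though they agree on this simple fragment.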
[Comments]: