How to get HTML text inside a tag using BeautifulSoup

Posted: 2021-12-23 00:31:44

Question:

How can I extract data from this sample HTML with BeautifulSoup?

<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>

I tried .find_all and .get_text, but I could not extract the text value from the htmlText element.

Expected output:

some thing ORget exact data from here

Answer 1:

You can use BeautifulSoup twice: first to extract the htmlText element, then to parse its content. For example:

from bs4 import BeautifulSoup
# No "import lxml" is needed: BeautifulSoup loads the lxml parser itself when "lxml" is requested.

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""
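# The "lxml" HTML parser lowercases tag names, so <Tag1> is matched as "tag1".
soup = BeautifulSoup(html, "lxml")

for tag1 in soup.find_all("tag1"):
    # First pass: the escaped markup comes back as a plain string.
    cdata_html = tag1.htmltext.text
    # Second pass: parse that string so <p> becomes a real element.
    cdata_soup = BeautifulSoup(cdata_html, "lxml")

    print(cdata_soup.p.text)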

This displays:

some thing ORget exact data from here

Note: lxml must also be installed with pip install lxml. BeautifulSoup imports it automatically when the "lxml" parser is requested.
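
If installing lxml is not an option, the same two-pass idea can be sketched with Python's built-in "html.parser". This is only a sketch under assumptions, not part of the answer above: the names extract_html_text, XML_SNIPPET and CDATA_RE are hypothetical, and the CDATA wrapper is stripped by hand before the second parse because parsers may treat a CDATA marker inside HTML differently.

```python
import re
from bs4 import BeautifulSoup

XML_SNIPPET = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

# Hypothetical pattern: captures the payload of a CDATA wrapper, if present.
CDATA_RE = re.compile(r"^\s*<!\[CDATA\[(.*?)\]\]>\s*$", re.DOTALL)


def extract_html_text(document):
    """Return the visible text of every htmlText payload in the document."""
    # First pass: html.parser lowercases tag names, hence "htmltext".
    outer = BeautifulSoup(document, "html.parser")
    results = []
    for node in outer.find_all("htmltext"):
        raw = node.get_text()  # entities are decoded, markup is still a string
        match = CDATA_RE.match(raw)
        inner_markup = match.group(1) if match else raw
        # Second pass: parse the decoded markup and collect its text.
        inner = BeautifulSoup(inner_markup, "html.parser")
        results.append(inner.get_text())
    return results


print(extract_html_text(XML_SNIPPET))
```

Using get_text() on the whole inner document, rather than .p.text, means the helper does not depend on the payload containing a <p> element.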

Answer 2:

Here are the steps you need to follow:

# firstly, select all "htmlText" elements
soup.select("htmlText")


# secondly, iterate over all of them
for result in soup.select("htmlText"):
    ...  # further code


# thirdly, use another BeautifulSoup() object to parse the data
# otherwise you can't access <p>, <lite> elements data
# since they are unreachable to first BeautifulSoup() object
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml")


# fourthly, grab all <p> elements AND their .text -> "p.text"
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml").p.text
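
To see why the second BeautifulSoup() object is needed, the small check below may help. It is a simplified sketch, not the answer's own code: it assumes the built-in "html.parser" instead of lxml, leaves out the CDATA wrapper, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

escaped = (
    "<htmlText>&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;"
    "get exact data from here&lt;/p&gt;</htmlText>"
)

outer = BeautifulSoup(escaped, "html.parser")
node = outer.find("htmltext")  # html.parser lowercases tag names

# The escaped markup is decoded into one text node, so there is no <p> child yet.
first_pass_p = node.find("p")
decoded = node.get_text()

# Parsing the decoded string turns <p> and <lite> into real elements.
inner = BeautifulSoup(decoded, "html.parser")
second_pass_p = inner.find("p")

print(first_pass_p)
print(decoded)
print(second_pass_p.get_text())
```

The first lookup finds nothing because &lt;p&gt; is character data to the first parser; only the second parse sees tags.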

Code and an example in the online IDE (use whichever version is most readable):

from bs4 import BeautifulSoup
# No "import lxml" is needed: BeautifulSoup loads the lxml parser itself when "lxml" is requested.

html = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

soup = BeautifulSoup(html, "lxml")


# BeautifulSoup inside BeautifulSoup
unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(unreadable_soup)


example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(example_1)


# without hardcoded list slices
for result in soup.select("htmlText"):
    example_2 = BeautifulSoup(result.text, "lxml").p.text
    print(example_2)


# or one liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)


# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''
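
One caveat with the .p.text pattern: if a payload contains no <p> element, .p is None and .text raises AttributeError. A guarded variant could look like the sketch below. It is an assumption-laden illustration, not the answer's code: paragraph_texts and sample are hypothetical names, and the built-in "html.parser" is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup


def paragraph_texts(document):
    """Return the <p> text of each htmlText payload, skipping payloads without <p>."""
    outer = BeautifulSoup(document, "html.parser")
    texts = []
    for node in outer.find_all("htmltext"):
        inner = BeautifulSoup(node.get_text(), "html.parser")
        paragraph = inner.find("p")
        if paragraph is not None:  # guard instead of calling .text on None
            texts.append(paragraph.get_text())
    return texts


# Hypothetical input with three payloads, one of which has no <p>.
sample = (
    "<Tag1>"
    "<htmlText>&lt;p&gt;first&lt;/p&gt;</htmlText>"
    "<htmlText>no paragraph here</htmlText>"
    "<htmlText>&lt;p&gt;second &lt;lite&gt;one&lt;/lite&gt;&lt;/p&gt;</htmlText>"
    "</Tag1>"
)
print(paragraph_texts(sample))
```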
