How to extract the text inside a tag together with its inner tags?

Posted 2023-05-07

【Title】: how to extract text inside a tag with its tags? 【Posted】: 2020-02-08 17:36:33 【Question】:

I want to parse an HTML page with BeautifulSoup. I want to extract the text inside a tag without removing the inner HTML tags. For example, sample input:

<a class="fl" href="https://***.com/questio...">
    Angular2 <b>Router link not working</b>
</a>

Sample output:

'Angular2 <b>Router link not working</b>'

I tried this:

from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://***.com/questio...">
         Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
print(soup.text)

But it gives:

'Angular2 Router link not working'

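How can I extract the text without removing the inner tags?

【Comments】:

Have you tried not passing a parser to the BeautifulSoup constructor and then converting the result to a string?

Already answered here: ***.com/questions/8112922/beautifulsoup-innerhtml

@helenej Thanks for your reply. I tried it, but it didn't work. It again gives <a class...>An...</a>.

【Answer 1】: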

The first answer from here works fine. For this example:

from bs4 import BeautifulSoup
string = '''<a class="fl" href="https://***.com/questio...">
             Angular2 <b>Router link not working</b>
         </a>'''
soup = BeautifulSoup(string, 'html.parser')
soup.find('a').encode_contents().decode('utf-8')

It gives:

'Angular2 <b>Router link not working</b>'
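
As a minimal sketch of the same idea, `Tag.decode_contents()` returns the inner markup directly as a `str`, so the encode-then-decode round trip is not needed. The helper name `inner_html` and the `example.com` href are placeholders introduced for illustration only; `.strip()` is added here to drop the whitespace that surrounds the content inside the tag.

```python
from bs4 import BeautifulSoup

# Placeholder markup modelled on the question; the href is hypothetical.
html = '''<a class="fl" href="https://example.com/">
    Angular2 <b>Router link not working</b>
</a>'''


def inner_html(tag):
    """Return the markup inside `tag`, without the tag itself."""
    # decode_contents() renders the children of the tag as a str.
    return tag.decode_contents().strip()


soup = BeautifulSoup(html, 'html.parser')
result = inner_html(soup.find('a'))
print(result)
```

Calling the method on `soup.find('a')` rather than on `soup` matters: the contents of the whole document would include the outer `<a>` tag itself.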

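【Comments】:

Nice work @hamid. I was trying to use .encode_contents(), but it also gave back the outer tag. I see that you have to specify .find('a') to get what you need. Thanks for posting the solution to your own question, that is really helpful!

【Answer 2】:

When you write print(soup.text), you extract all the text from the "a" tag, including the text of every tag nested inside it, with the markup stripped. If you only want the "b" tag object, try the following: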

soup = BeautifulSoup(string, 'html.parser')
b = soup.find('b')
print(b)
print(type(b))

Or

soup = BeautifulSoup(string, 'html.parser')
b = soup.find('a', class_="fl").find('b')
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'bs4.element.Tag'>

As you can see, it returns your "b" tag as a BeautifulSoup Tag object.

If you need the data as a string, you can write:

b = soup.find('a', class_="fl").find('b')
b = str(b)
print(b)
print(type(b))

Output:

<b>Router link not working</b>
<class 'str'>
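
To make the difference between these conversions concrete, here is a small sketch comparing three views of the same tag: `str(tag)` (the markup including the tag itself), `get_text()` (text only), and `decode_contents()` (inner markup only). The single-line markup and the `example.com` href are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Placeholder markup; the href is hypothetical.
html = '<a class="fl" href="https://example.com/">Angular2 <b>Router link not working</b></a>'
a = BeautifulSoup(html, 'html.parser').find('a')

outer = str(a)               # the <a> tag itself plus everything inside it
text_only = a.get_text()     # same as a.text: all markup is dropped
inner = a.decode_contents()  # inner tags are kept, the <a> wrapper is dropped

print(outer)
print(text_only)
print(inner)
```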

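【Comments】:

This answer returns only the <b> element and drops the first part of the text, Angular2 in this example. I want to keep the whole text together with its inner tags.

【Answer 3】:

As Den said, you need to get the inner tag and store it as type str so that the tag itself is included. Den's solution fetches the <b> tag specifically; it does not return the parent tag's text, and it will not pick up other styling tags if there are any. If other tags may be present, you can be more general and have it find the child elements of the <a> tag instead of looking specifically for <b>.

So basically, this finds the <a> tag and grabs its whole text. It then goes through the children of that <a> tag, converts each one to a string, and replaces the matching text in the parent text with that string (tags included).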

string = '''<a class="fl" href="https://***.com/questio...">
     Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
     </a>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(string, 'html.parser')

for item in soup.find_all('a'):
    # Start from the plain text of the <a> tag
    parent = item.text.strip()
    # Swap each child's text for its full markup, tags included
    for child_ele in item.findChildren():
        child_text = child_ele.text
        child_str = str(child_ele)
        parent = parent.replace(child_text, child_str)

print(parent)

Output:

Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
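
The text-replacement loop above could misfire when a child's text also occurs elsewhere in the parent, because `str.replace` substitutes every occurrence. A hedged alternative sketch is to concatenate the string form of each direct child of the tag via `.contents`, which avoids the replacement step. The helper name `join_children` and the `example.com` href are placeholders for illustration. One assumption to note: plain text nodes are emitted as-is by `str()`, so characters such as `&` are not re-escaped the way `decode_contents()` would escape them.

```python
from bs4 import BeautifulSoup

# Placeholder markup modelled on the example above; the href is hypothetical.
html = '''<a class="fl" href="https://example.com/">
    Angular2 <b>Router link not working</b> and then this is in <i>italics</i> and this is in <b>bold</b>
</a>'''


def join_children(tag):
    """Concatenate the direct children of `tag`: text nodes and child tags."""
    # str() of a text node is its text; str() of a Tag is its full markup.
    return ''.join(str(child) for child in tag.contents).strip()


soup = BeautifulSoup(html, 'html.parser')
results = [join_children(a) for a in soup.find_all('a')]
print(results[0])
```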
