Python - Converting HTML hyperlinks to formatted plain text

Posted 2023-02-24


【Title】: Python - Converting HTML hyperlinks to formatted plain text 【Published】: 2022-01-18 10:15:33 【Question】:

How can I use Python to convert HTML hyperlinks to plain text, given input like this:

<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>

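My current code looks like this, but the package on its own doesn't seem to do the job: it only converts the main HTML text elements to plain text, without the links: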

from html2text import html2text

# Triple quotes let the HTML contain both ' and " without escaping
text = html2text("""<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>""")
print(text)

# Result I wanted: "Hello world, it's foo bar time - https://google.com/"
# Result I got: "Hello world, it's foo bar time"

Any solution would be a real help.

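【Comments】:

Aizak, this looks like a fun little Python puzzle: have you considered implementing the solution yourself? There are many possible approaches using only Python builtins and the stdlib. For example, you could iterate over all the characters in the HTML string and, when you hit the marker "[…]", keep the url in a separate variable. For a different approach, you could use the re package to capture and transform the input string in a similar way.

I did try a bunch of regular expressions with multiple types of elements, but I had no idea how to bring the two things together: the text and the link.

【Answer 1】: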

Take a look at html.parser; this library can definitely do what you need.

Example from the documentation:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()
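
The documentation example only prints parser events. As a minimal sketch of how the same callbacks could produce the "text - url" output asked for in the question, the subclass below collects text fragments and appends each link's href when its closing tag is seen. The names LinkTextParser and html_links_to_text are hypothetical, chosen for illustration; links without an href are assumed to be rendered as plain text.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect document text, appending ' - <href>' after each link's text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments in document order
        self.href_stack = []  # href values of the currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f" - {href}")

    def handle_data(self, data):
        self.parts.append(data)


def html_links_to_text(html):
    """Hypothetical helper: return the text of `html` with links expanded."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


print(html_links_to_text(
    '<p>Hello world, it\'s <a href="https://google.com">foo bar time</a></p>'
))
# prints: Hello world, it's foo bar time - https://google.com
```

This uses only the standard library, so it avoids an extra dependency, at the cost of handling whitespace and block-level tags less gracefully than a dedicated package would.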

【Comments】:

【Answer 2】:

You can use Beautiful Soup (the bs4 package):

from bs4 import BeautifulSoup

spam = """<p>Hello world, it's <a href="https://google.com">foo bar time</a></p>
<p>Hello world, it's <a href="https://***.com">spam eggs</a></p>"""

soup = BeautifulSoup(spam, 'html.parser')

for a_tag in soup.find_all('a'):
    # Replace each <a> tag with "link text - URL"
    a_tag.replace_with(f"{a_tag.text} - {a_tag.get('href')}")

print(soup.text)

Output:

Hello world, it's foo bar time - https://google.com
Hello world, it's spam eggs - https://***.com

Note that you can take it from here. Have a look at tag.replace_with() and tag.unwrap() in the Beautiful Soup docs.

【Comments】:

【Answer 3】:

You can use the BeautifulSoup module.

from bs4 import BeautifulSoup

html = "<p>Hello world, it's <a href='https://google.com'>foo bar time</a></p>"
soup = BeautifulSoup(html, features="html.parser")

text = soup.get_text()
url_part = soup.find('a')
url_str = url_part['href']

# Format in one string so the separator is exactly " - "
print(f"{text} - {url_str}")

To import the module, you need to install it first:

pip install beautifulsoup4
