How do I unescape HTML entities in a string in Python 3.1? [duplicate]

Posted 2023-02-23


【Posted】2011-01-22 13:37:38 【Question】:

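I have looked all around and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.x. (I only have access to a Win7 box.)

I must be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to command-prompt curl (that is how I get the source code of pages). Unfortunately, curl does not decode HTML entities as far as I know, and I could not find a command to decode them in its documentation.

Yes, I have tried to get Beautiful Soup to work, many times, without success in 3.x. If you can provide explicit instructions on how to get it working in Python 3 in an MS Windows environment, I would be very grateful.

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".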

【Question comments】:

【Answer 1】:

You can use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python 3.3 or earlier:

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
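If the same script has to run on both older and newer Python 3 releases, the two Python 3 variants above can be combined with an import fallback. This is only a sketch: the module-level name `unescape` is my own choice, and the fallback branch relies on the `HTMLParser().unescape` method shown above, which was removed in Python 3.9, so it is only reached on interpreters older than 3.4.

```python
try:
    from html import unescape  # available since Python 3.4
except ImportError:
    # Python 3.0 to 3.3: fall back to the HTMLParser method
    from html.parser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape("Suzy &amp; John"))                     # Suzy & John
print(unescape("&quot;Suzy&quot; &#38; John &euml;"))  # "Suzy" & John ë
```

Named entities, decimal references such as `&#38;`, and quotes are all handled by the same call.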

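【Comments】:

This does not unescape &quot;, for example.

@moose: Thanks for the heads-up. I have changed the answer to something that handles more HTML entities, including &quot;.

Thanks a lot! I gave your answer +1.

Exposed as html.unescape() since Python 3.4.

@SaurabhYadav: the html package is part of the Python standard library. It does not need to be installed separately. If import html raises an error, your Python distribution is not installed correctly.

【Answer 2】:

You can use xml.sax.saxutils.unescape for this purpose. The module is part of the Python standard library and is portable between Python 2.x and Python 3.x.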

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
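By default saxutils.unescape only handles `&amp;`, `&lt;` and `&gt;`. Its optional second argument takes a dictionary of extra replacements, so a rough way to cover quotes as well might look like the sketch below. The `EXTRA_ENTITIES` name and its contents are assumptions for illustration; anything not listed there is left untouched.

```python
from xml.sax import saxutils

# Extra replacements: an assumption about which entities the input contains.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}

plain = saxutils.unescape("Suzy &amp; John")
quoted = saxutils.unescape("&quot;Suzy&quot; &amp; John", EXTRA_ENTITIES)
untouched = saxutils.unescape("No&euml;l")  # not an XML entity, so it stays as-is

print(plain)      # Suzy & John
print(quoted)     # "Suzy" & John
print(untouched)  # No&euml;l
```

This keeps the approach portable, but it is not a general HTML entity decoder.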

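【Comments】:

Seems incomplete: '&euml;' is not decoded by this, although HTMLParser does decode it. It also does not unescape decimal character references.

【Answer 3】:

Apparently I do not have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotation marks. The only thing I found that does is this function: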

import re
from htmlentitydefs import name2codepoint as n2cp  # Python 2 module name

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Numeric character reference, e.g. &#34;
            return unichr(int(ent))
        else:
            # Named entity, e.g. &quot;
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                # Unknown name: leave the matched text unchanged
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]

I got it from this page.
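The function above is written for Python 2 (htmlentitydefs, unichr). A rough Python 3 adaptation could look like the following sketch. The name `decode_html_entities`, the regular expression, and the added handling of hexadecimal references such as `&#x41;` are my own choices, not part of the original function.

```python
import re
from html.entities import name2codepoint

# Decimal reference, hexadecimal reference, or named entity, each ending in ';'.
ENTITY_RE = re.compile(r"&(?:#(\d{1,7})|#[xX]([0-9A-Fa-f]{1,6})|(\w{1,8}));")

def decode_html_entities(text):
    def substitute(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return chr(int(decimal))
            if hexadecimal:
                return chr(int(hexadecimal, 16))
            return chr(name2codepoint[name])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: keep the original text
            return match.group()
    return ENTITY_RE.sub(substitute, text)

print(decode_html_entities("&quot;Suzy&quot; &amp; John"))  # "Suzy" & John
```

The substitution is done in a single pass, so text such as `&amp;quot;` becomes `&quot;` rather than being decoded twice.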

【Comments】:

【Answer 4】:

Python 3.x also has html.entities
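For reference, html.entities only provides lookup tables; it does not decode a string by itself. A small sketch of what the tables contain (the variable names are just for illustration, and the html5 dictionary needs Python 3.3 or later):

```python
from html.entities import name2codepoint, codepoint2name, html5

quote_char = chr(name2codepoint["quot"])  # entity name -> code point -> character
amp_name = codepoint2name[38]             # code point -> entity name
e_umlaut = html5["euml;"]                 # HTML5 table, keys include the trailing ';'

print(quote_char)  # "
print(amp_name)    # amp
print(e_umlaut)    # ë
```

To decode a whole string you still need something that finds the entities in the text, such as a regular expression or html.unescape.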

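【Comments】:

【Answer 5】:

In my case I had an HTML string that had been escaped with the AS3 escape function. After an hour of googling I had not found anything useful, so I wrote this recursive function to serve my needs. Here it is: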

def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    else:
        #if it is escaped unicode character do different decoding
        if string[index+1:index+2] == 'u':
            replace_with = ("\\"+string[index+1:index+6]).decode('unicode_escape')
            string = string.replace(string[index:index+6],replace_with)
        else:
            replace_with = string[index+1:index+3].decode('hex')
            string = string.replace(string[index:index+3],replace_with)
        return unescape(string)

Edit 1: added functionality to handle Unicode characters.
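Note that the function above relies on str.decode, which only exists in Python 2. A possible Python 3 sketch of the same idea, decoding %XX and %uXXXX sequences, is shown below. The name `unescape_percent` and the regular expression are my own, and treating %XX as a single code point is an assumption about how the input was escaped.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
ESCAPED_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def unescape_percent(text):
    def decode(match):
        # Exactly one of the two groups is set for any match
        return chr(int(match.group(1) or match.group(2), 16))
    return ESCAPED_RE.sub(decode, text)

print(unescape_percent("Suzy%20%26%20John"))  # Suzy & John
```

Because re.sub works in one pass, a decoded "%" (from %25) is not interpreted again, and malformed sequences are left as they are.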

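【Comments】:

【Answer 6】:

I am not sure whether this is a built-in library, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}) Unescape '&amp;', '&lt;', and '&gt;' in a string of data.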

【讨论】：
