Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

Posted 2023-02-22

【Asked】: 2010-10-22 16:55:21

【Question】:

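I'm wondering what the best way is, or whether the standard library offers a simple way, to convert a URL with Unicode characters in the domain name and path to the equivalent ASCII URL, with the domain encoded as IDNA and the path %-encoded as per RFC 3986.

I get a UTF-8 encoded URL from the user. So if they have typed in http://➡.ws/♥, I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. What I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.

What I'm doing at the moment is splitting the URL into parts with a regex, then manually IDNA-encoding the domain and separately encoding the path and query string with different urllib.quote() calls.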

# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''

path = urllib.quote(path)

if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'

Is this correct? Any better suggestions? Is there a simple standard-library function to do this?

【Answer 1】:

Code:

import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url,unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass,at,hostport = parsed.netloc.rpartition('@')
    user,colon1,pass_ = userpass.partition(':')
    host,colon2,port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'),'')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'),'=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user,colon1,pass_,at,host,colon2,port))
    return urlparse.urlunsplit((scheme,netloc,path,query,fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')

Output:

http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin

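Edit:

Fixed the casing of already-quoted characters in the string. Changed from urlparse/urlunparse to urlsplit/urlunsplit. Don't encode user and port information together with the hostname. (Thanks Jehiah) When the "@" is missing, don't treat the host/port as user/password! (Thanks hupf)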
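The per-segment path handling in the code above can be sketched for Python 3 as follows. This is only an illustration, not the answer's original code: it assumes Python 3's urllib.parse, the helper name requote_path is made up for this example, and it round-trips through bytes (unquote_to_bytes) so that escapes which are not valid UTF-8, such as %FF, come out unchanged.

```python
from urllib.parse import quote, unquote_to_bytes


def requote_path(path):
    """Percent-encode each path segment on its own (hypothetical helper).

    Every segment is unquoted to bytes and quoted again with no safe
    characters, so a literal '/' stays a separator while an encoded
    '%2F' stays encoded, and lowercase escapes are normalised to uppercase.
    """
    return '/'.join(
        quote(unquote_to_bytes(segment), safe='')
        for segment in path.split('/')
    )


print(requote_path('/♥'))              # /%E2%99%A5
print(requote_path('/%e2%99%a5/%2F'))  # /%E2%99%A5/%2F
```

Quoting segment by segment is what keeps '/' and '%2F' distinct; unquoting the whole path in one go would merge them.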

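【Comments】:

Nice solution, thanks. Good call on using urlparse/unparse, and on noting the case of already-quoted characters in the input. But I'm not sure why you need the split('/') logic, since urllib.quote() already considers slashes safe. See also my new, doctested, more complete solution below.

The problem is that '/' is considered a path delimiter, while '%2F' is not. If I just unquoted the string, they would become one and the same. Maybe it would be better to never unquote the path, and encode every existing '%' as '%25'..?

netloc != domain, so you should first parse the domain out of user:pass@domain:port and then convert it to IDNA.

The edited version with user/port support no longer works for URLs without user or port information. Username, password and port should be parsed out conditionally. I use the following regex to do so: (?:(?P<user>[^:@]+)(?::(?P<password>[^:@]+))?@)?(?P<host>[^:]+)(?::(?P<port>[0-9]+))?, and then access the values with groupdict: p.match(parsed.netloc).groupdict()

It is worth noting that urllib/2/requests do not support credentials embedded in the "netloc", because that is not standard for http URLs. They therefore have to be stripped out and passed as separate parameters when the URL is used.

【Answer 2】: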

You could use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there.

protocol, domain, path, query, fragment = urlparse.urlsplit(url)

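(You can access the domain and port separately through the named attributes of the returned value, but since port syntax is always ASCII, it is unaffected by the IDNA encoding process.)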
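As a rough sketch of the named-attribute approach mentioned above, assuming Python 3 (where urlsplit lives in urllib.parse rather than urlparse): the helper name ascii_netloc is hypothetical, and user/password information is deliberately ignored here.

```python
from urllib.parse import urlsplit


def ascii_netloc(url):
    """Return host[:port] of `url` with only the host IDNA-encoded.

    Hypothetical helper: any user:password part of the netloc is dropped.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError('URL has no host: %r' % url)
    # Only the host goes through IDNA; the port is plain ASCII digits.
    host = parts.hostname.encode('idna').decode('ascii')
    return host if parts.port is None else '%s:%d' % (host, parts.port)


print(ascii_netloc('http://➡.ws:81/admin'))  # xn--hgi.ws:81
```

Reading hostname and port separately avoids passing the ':81' suffix through the IDNA codec at all.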

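【Answer 3】:

There is some RFC 3986 URL parsing work underway (for example, as part of the Summer of Code), but as far as I know nothing is in the standard library yet, and, again as far as I know, there is not much on the URI encoding side of things either. So you might as well go with MizardX's elegant approach.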

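【Answer 4】:

Okay, with these comments and some bug fixes in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function, which returns a canonical, ASCII form of the URL: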

import re
import urllib
import urlparse

def canonurl(url):
    r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
    if the URL looks invalid.

    >>> canonurl('    ')
    ''
    >>> canonurl('www.google.com')
    'http://www.google.com/'
    >>> canonurl('bad-utf8.com/path\xff/file')
    ''
    >>> canonurl('svn://blah.com/path/file')
    'svn://blah.com/path/file'
    >>> canonurl('1234://badscheme.com')
    ''
    >>> canonurl('bad$scheme://google.com')
    ''
    >>> canonurl('site.badtopleveldomain')
    ''
    >>> canonurl('site.com:badport')
    ''
    >>> canonurl('http://123.24.8.240/blah')
    'http://123.24.8.240/blah'
    >>> canonurl('http://123.24.8.240:1234/blah?q#f')
    'http://123.24.8.240:1234/blah?q#f'
    >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
    'http://xn--hgi.ws/'
    >>> canonurl('  http://www.google.com:80/path/file;params?query#fragment  ')
    'http://www.google.com:80/path/file;params?query#fragment'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
    'http://xn--hgi.ws/%E2%99%A5/pa/th'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
    'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
    'http://badutf8pcokay.com/%FF?%FE#%FF'
    >>> len(canonurl('google.com/' + 'a' * 16384))
    4096
    """
    # strip spaces at the ends and ensure it's prefixed with 'scheme://'
    url = url.strip()
    if not url:
        return ''
    if not urlparse.urlsplit(url).scheme:
        url = 'http://' + url

    # turn it into Unicode
    try:
        url = unicode(url, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in URL

    # parse the URL into its components
    parsed = urlparse.urlsplit(url)
    scheme, netloc, path, query, fragment = parsed

    # ensure scheme is a letter followed by letters, digits, and '+-.' chars
    if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
        return ''
    scheme = str(scheme)

    # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
    match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
    if not match:
        return ''
    domain, port = match.groups()
    netloc = domain + (port if port else '')
    netloc = netloc.encode('idna')

    # ensure path is valid and convert Unicode chars to %-encoded
    if not path:
        path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
    path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

    # ensure query is valid
    query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

    # ensure fragment is valid
    fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

    # piece it all back together, truncating it to a maximum of 4KB
    url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
    return url[:4096]

if __name__ == '__main__':
    import doctest
    doctest.testmod()
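One caveat with the final url[:4096] step: cutting a percent-encoded string at a fixed length can leave half an escape ('%' or '%E') at the end. A minimal sketch of one way to trim that, using the regex r'%.?$' suggested in the discussion of this answer; the function name truncate_escaped is made up for this example, and the code is written for Python 3.

```python
import re


def truncate_escaped(url, limit=4096):
    """Cut `url` to at most `limit` chars without a dangling %-escape."""
    if len(url) <= limit:
        return url
    # After slicing, drop a trailing '%' or '%X' left by the cut.
    return re.sub(r'%.?$', '', url[:limit])


print(truncate_escaped('abc%E2%99%A5', 5))  # abc
```

This only removes an incomplete escape. It can still cut a multi-byte UTF-8 sequence between two complete escapes (for example keeping '%E2%99' without '%A5'), which would need extra handling.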

【Comments】:

Truncating at exactly 4096 characters may leave a partially quoted character. You could use the regex r'%.?$' to match any trailing partial escape.

【Answer 5】:

The code given by MizardX isn't 100% correct. This example won't work:

example.com/folder/?page=2

Take a look at django.utils.encoding.iri_to_uri() for converting a Unicode URL to an ASCII URL.

http://docs.djangoproject.com/en/dev/ref/unicode/

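【Comments】:

The question and answers are almost 10 years old, and this still works and is still the best answer I've seen. If you look at the Django 2.0 code, it does from urllib.parse import quote and returns quote(iri, safe="/#%[]=:;$&()+,!?*@'~"), so that is probably the direction any modern reader will want to take.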
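A minimal Python 3 sketch of the quote() call mentioned in the comment above. It is an approximation for illustration, not Django's actual implementation: the safe string is copied from that comment, the name iri_to_uri_sketch is hypothetical, and the host is not IDNA-encoded.

```python
from urllib.parse import quote

# Safe characters as quoted in the comment above; '%' is included,
# so escapes already present in the input are left untouched.
SAFE_CHARS = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode the non-ASCII characters of `iri` as UTF-8."""
    return quote(iri, safe=SAFE_CHARS)


print(iri_to_uri_sketch('http://example.com/♥?q=é#frag'))
# http://example.com/%E2%99%A5?q=%C3%A9#frag
```

Because this quotes the whole string in one pass, a Unicode host such as ➡.ws is percent-encoded rather than converted to xn--hgi.ws, so the host still needs a separate IDNA step if that form is required.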
