Treat an emoji as one character in a regex [duplicate]

Posted 2023-02-23


【Title】Treat an emoji as one character in a regex [duplicate] 【Posted】2018-06-24 18:52:27 【Question】:

Here is a small example:

reg = ur"((?P<initial>[+\-👍])(?P<rest>.+?))$"

(In both cases, the file has -*- coding: utf-8 -*-)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this

While in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}

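The Python 3 behaviour is exactly what I want, but switching to Python 3 is not an option right now. What is the best way to reproduce the Python 3 result in Python 2, in a way that works on both narrow and wide Python builds? The 👍 seems to reach me in the form "\ud83d\udc4d", which is what makes this tricky.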
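For reference, the narrow/wide distinction mentioned above can be checked at runtime. The following is a minimal sketch, not part of the original question; the helper name `is_narrow_build` is made up for illustration. It relies on `sys.maxunicode`, which is 0xFFFF on a narrow build and 0x10FFFF on a wide build (and on Python 3.3 or later).

```python
import sys


def is_narrow_build():
    """Return True when non-BMP characters are stored as surrogate pairs."""
    # Narrow builds cap sys.maxunicode at 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


# On a narrow build the thumbs-up emoji has length 2, otherwise length 1.
print(is_narrow_build(), len(u"\U0001F44D"))
```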

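【Comments】:

It looks like your Python 2 installation is a narrow build, so it has to split Unicode characters with code points >= 0x10000. Does unichr(0x10000) raise an error, or does it return u'\U00010000'?

Although it won't solve your problem, here is some information about narrow vs. wide builds: ***.com/questions/29109944/…

See also ***.com/questions/35404144/…

A reminder that some emoji consist of several Unicode code points (involving combining characters and zero-width joiners). It should be possible to write a regex that captures them, but it won't be trivial (I wouldn't even attempt the feat).

@MariusGedminas Yes, that is exactly the problem I'm having here - the 👍 character is made up of a string with two "\u" code points.

【Answer 1】: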

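In a Python 2 narrow build, non-BMP characters are stored as two surrogate code points, so you can't use them correctly inside [] syntax. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match either \ud83d or \udc4d". Python 2.7 example:

>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')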
[u'\ud83d', u'\udc4d']
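The pair \ud83d\udc4d is not arbitrary: it is the UTF-16 surrogate encoding of U+1F44D. A small sketch of that arithmetic follows; the function name `to_surrogate_pair` is an illustrative choice, not something from the answer.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print(hex(high), hex(low))  # 0xd83d 0xdc4d
```

On a narrow build each of those two values is a separate element of the string, which is why a character class sees two members instead of one.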

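To fix this in both Python 2 and 3, match u'👍' OR [+-] using alternation instead of putting the emoji inside the character class. This returns the correct result in both Python 2 and 3: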

#coding:utf8
from __future__ import print_function
import re

# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed.  In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"

tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)

Output (Python 2.7):

👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None

Output (Python 3.6):

👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None
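If the set of allowed prefixes grows, the same alternation idea can be generated rather than typed by hand. This is only a sketch under that assumption; the helper `alternation` and the list `PREFIXES` are hypothetical names, not part of the answer. It uses `re.escape` so that characters such as + are treated literally.

```python
import re


def alternation(chars):
    """Build a non-capturing alternation that matches any one of `chars`."""
    # Longest alternatives first, so a longer sequence is never shadowed
    # by a shorter one that happens to be its prefix.
    ordered = sorted(chars, key=len, reverse=True)
    return u"(?:" + u"|".join(re.escape(c) for c in ordered) + u")"


PREFIXES = [u"+", u"-", u"\U0001F44D"]  # illustrative list of prefixes
reg = re.compile(
    u"((?P<initial>" + alternation(PREFIXES) + u")(?P<rest>.+?))$"
)

for test in (u"\U0001F44Dhello", u"-hello", u"+hello", u"\\hello"):
    m = reg.match(test)
    print(m.groupdict() if m else None)
```

Because each alternative is matched as a whole sequence, a surrogate pair on a narrow build is consumed as one unit instead of being split.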

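【Comments】:

【Answer 2】:

This is because a plain (unprefixed) string literal in Python 2 is a byte string, not a Unicode string.

Note that the Python 2.7 interpreter represents the character as 4 bytes (its UTF-8 encoding). To get the same behaviour in Python 3, you have to explicitly convert the unicode string to a bytes object.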

# Python 2.7
>>> s = "👍hello"
>>> s
'\xf0\x9f\x91\x8dhello'

# Python 3.5
>>> s = "👍hello"
>>> s
'👍hello'

So for Python 2, just use the hex (byte) representation of that character as the search pattern, including the length to match.

>>> reg = "((?P<initial>[+\-\xf0\x9f\x91\x8d]{4})(?P<rest>.+?))$"
>>> re.match(reg, s).groupdict()
{'initial': '\xf0\x9f\x91\x8d', 'rest': 'hello'}
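A variant of this byte-level approach replaces the class-plus-length with an alternation, so that a single + or - still matches. The sketch below is an illustration, not the answerer's code; it assumes the input is UTF-8 encoded bytes, and it runs unchanged on Python 2 and 3 because the pattern is a bytes literal.

```python
import re

# Byte-level pattern: either one ASCII sign, or the four UTF-8 bytes of U+1F44D.
reg = re.compile(b"((?P<initial>[+-]|\xf0\x9f\x91\x8d)(?P<rest>.+?))$")

samples = [
    u"\U0001F44Dhello".encode("utf-8"),
    b"-hello",
    b"+hello",
    b"\\hello",
]
for s in samples:
    m = reg.match(s)
    print(m.groupdict() if m else None)
```

The captured groups are bytes, so they need `.decode("utf-8")` before being mixed with Unicode text.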

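【Comments】:

This works for the emoji, but seems to break the regular + and - cases.

@naiveai [+\-\xf0\x9f\x91\x8d]{4} => [+-]|\xf0\x9f\x91\x8d

@WiktorStribiżew This does work in the REPL, but in my use case I receive the string with \u escapes in it, which I'm fairly sure is my fault - is there any way I can get them to match? Simply using \ud83d\udc4d doesn't seem to work.

You need to use the \u escape code for the character you want to match. \ud83d\udc4d comes from a 4-byte sequence being wrongly interpreted as two 2-byte unicode characters. The escape sequence for the thumbs-up emoji is u"\U0001F44D".

@UnoriginalNick OK, something else must be wrong, because even that doesn't work. Thanks a lot for your answer, it gets me a bit closer.

【Answer 3】:

Just use the u prefix on its own.

In Python 2.7:

>>> reg = u"((?P<initial>[+\-👍])(?P<rest>.+?))$"
>>> re.match(reg, u"👍hello").groupdict()
{'initial': '👍', 'rest': 'hello'}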

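【Comments】:

When I try your example with python 2.7.13, I get {'initial': '\xf0', 'rest': '\x9f\x91\x8dhello'}

I don't understand, I get exactly the same error as before.

No. Putting Unicode into plain Python 2 strings is not a good idea. Use only u strings for Unicode.

Wait, actually, when I run this in my real use case, I get the same error. In the REPL, I get the same result as @GarbageCollector

This probably only works on wide unicode builds of Python 2. Those are the default on Linux, but other platforms tend to default to narrow builds.

【Answer 4】:

In python 2.7 there is a way to convert that escaped unicode into the emoji: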

b = dict['vote'] # assign that unicode value to b 
print b.decode('unicode-escape')

I'm not sure this is exactly what you're looking for, but I think you can use it to solve the problem in some way.
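For comparison, a rough Python 3 counterpart of the same idea is sketched here. It assumes the incoming value is ASCII text containing literal \uXXXX escapes (for example from an unparsed payload); the function name `decode_escaped` and the sample value are hypothetical. In Python 3 the `unicode-escape` step leaves two lone surrogates, so a second step is needed to join them into one character.

```python
def decode_escaped(raw):
    """Turn ASCII text with literal \\uXXXX escapes into real characters."""
    # Step 1: interpret the escapes; a surrogate pair comes out as two
    # separate lone surrogates at this point.
    text = raw.encode("ascii").decode("unicode-escape")
    # Step 2: round-trip through UTF-16 so the two halves are joined
    # into a single non-BMP code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


vote = "\\ud83d\\udc4dhello"  # literal backslash-u escapes, as received
decoded = decode_escaped(vote)
print(len(decoded), decoded == "\U0001F44Dhello")
```

Once the value is a real character, the alternation-based pattern can match it as one unit.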

【Comments】:
