Comparing a Unicode input string with Unicode data in a file

Posted 2023-03-29


【Asked】2015-11-08 14:14:11 【Question】:

string1=" म नेपाली  हुँ"
string1=string1.split()
string1[0]
'\xe0\xa4\xae'

import codecs

with codecs.open('nepaliwords.txt', 'r', 'utf-8') as f:
    for line in f:
        if string1[0] in line:
            print "matched string found in file"

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

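The text file contains a large number of Nepali words stored as Unicode.

Am I doing something wrong in how I compare the two Unicode strings?

How can I print the matched Unicode string?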


【Answer 1】:

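Your string1 is a byte string, encoded as UTF-8. It is not a Unicode string. However, you opened the file with codecs.open(), which makes Python decode the file contents to unicode. Using the byte string in a containment test against that unicode data causes Python 2 to implicitly decode the byte string to unicode so that the types match. That implicit decoding uses ASCII, so it fails on the first non-ASCII byte.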
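The implicit step can be made visible by performing it by hand. The following is a minimal sketch, not part of the original answer: it takes the three bytes shown in the question, decodes them as ASCII explicitly (which is what the implicit coercion amounts to in Python 2), and then decodes them as UTF-8. The variable names are illustrative assumptions, and the snippet is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of the first word, exactly as echoed in the question.
raw = b'\xe0\xa4\xae'

# Do explicitly what Python 2 attempts implicitly when a byte string
# is compared against a unicode string: decode the bytes as ASCII.
try:
    raw.decode('ascii')
    implicit_ok = True
except UnicodeDecodeError as exc:
    implicit_ok = False
    error_text = str(exc)

# Decoding with the codec the bytes were actually written in succeeds
# and yields a single Devanagari character.
word = raw.decode('utf-8')
```

Here error_text carries the same message as the traceback in the question, while word is a one-character unicode string that can safely be compared with lines read through codecs.open().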

Decode string1 to unicode first:

string1 = " म नेपाली  हुँ"
string1 = string1.decode('utf8').split()[0]

Or use a Unicode string literal instead:

string1 = u" म नेपाली  हुँ"
string1 = string1.split()[0]

Note the leading u.
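Once both sides are unicode, the containment test works and the matching lines themselves can be collected, which addresses the question of how to print the match. The sketch below is one possible way to do it, under a few assumptions: find_matching_lines is a hypothetical helper name, the temporary file is a made-up stand-in for nepaliwords.txt, and io.open is used instead of codecs.open so that the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile


def find_matching_lines(path, word):
    """Return the lines of a UTF-8 file that contain the unicode string `word`."""
    matches = []
    with io.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Both operands are unicode, so no implicit ASCII decoding occurs.
            if word in line:
                matches.append(line.rstrip('\n'))
    return matches


# Hypothetical stand-in for nepaliwords.txt: one word per line.
sample = "म\nनेपाली\nहुँ\nघर\n"
fd, path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as f:
    f.write(sample)

# First word of the input, as a unicode string.
word = " म नेपाली  हुँ".split()[0]
matches = find_matching_lines(path, word)

os.remove(path)
```

Each entry in matches is a unicode string, so it can be printed directly (print(line) for each line in matches), provided the terminal encoding can represent Devanagari.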

【Comments】:

Thanks, string1 = u" म नेपाली हुँ" solved my problem. With string1 = string1.split()[0], the [0] created a problem. Thanks. Could you help me print the matched string?
