Encoding error when opening an Excel file with xlrd

Posted 2023-05-07

Asked: 2015-02-05 02:01:49

Question:

I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:

import xlrd
workbook = xlrd.open_workbook('thefile.xls')

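This works for most files, but fails for files I get from one particular organization. Below is the error I get when I try to open an Excel file from that organization.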

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

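It looks as though xlrd is trying to open an Excel file that is encoded in something other than UTF-16. How can I avoid this error? Was the file written in a flawed way, or is there just one specific character causing the problem? If I open the Excel file and re-save it, xlrd opens it without a problem.

I have tried opening the workbook with different encoding overrides, but that does not work either.

The file I am trying to open is available here: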

https://dl.dropboxusercontent.com/u/6779408/***/AEPUsageHistoryDetail_RequestID_00183816.xls

Issue reported here: https://github.com/python-excel/xlrd/issues/128

Answer 1:

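What are they using to generate that file?

They are using some Java Excel API (see below), possibly on an IBM mainframe or similar.

From the stack trace, the write access information cannot be decoded as Unicode because of the @ characters.

For more information on the write access information in the XLS file format, see 5.112 WRITEACCESS or Page 277.

This field contains the username of the user who saved the file.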

import xlrd
# dump() prints the raw BIFF records to stdout; it does not return them
xlrd.dump('thefile.xls')

Running xlrd.dump on the original file gives

   36: 005c WRITEACCESS len = 0070 (112)
   40:      d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40  ????@?????@???@@
   56:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   72:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   88:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  104:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  120:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  136:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
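The bytes in this dump appear to account for the exact position reported in the traceback. The following is a minimal sketch, assuming Python 3 and assuming the standard BIFF8 Unicode string layout (a 2-byte character count, one flags byte, then optional rich-text and phonetic fields before the character data). It is my own reconstruction of the parsing steps, not xlrd's actual code, and the variable names are illustrative.

```python
import struct

# The 112-byte WRITEACCESS payload from the dump above:
# 14 distinct leading bytes followed by 98 bytes of 0x40.
record = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9") + b"\x40" * 98

# Assumed BIFF8 string header: little-endian character count, then a flags byte.
nchars = struct.unpack_from("<H", record, 0)[0]
flags = record[2]

pos = 3
if flags & 0x08:   # rich-text flag: a 2-byte run count follows
    pos += 2
if flags & 0x04:   # phonetic flag: a 4-byte size follows
    pos += 4

# Flag bit 0x01 is set here, so the characters are treated as UTF-16-LE.
raw = record[pos:pos + 2 * nchars]

try:
    raw.decode("utf_16_le")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(hex(nchars), hex(flags), pos, len(raw))
print(error)
```

Read this way, the EBCDIC text is misinterpreted as a header announcing far more characters than the record holds, so the slice ends on an odd byte count and the final 0x40 cannot form a complete UTF-16 code unit.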

After re-saving with Excel, or in my case LibreOffice Calc, the write access information is overwritten with something like

 36: 005c WRITEACCESS len = 0070 (112)
 40:      04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20  ?~~Calc         
 56:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 72:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 88:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
104:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
120:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
136:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

Since the spaces are encoded as 40, I believe the encoding is EBCDIC, and when we decode d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 as EBCDIC we get Java Excel API.
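The EBCDIC reading can be checked with Python's built-in cp037 codec. This is a small sketch assuming Python 3 and assuming code page 037 is the EBCDIC variant in use; the variable names are illustrative.

```python
# First 16 bytes of the WRITEACCESS payload from the dump of the original file.
writeaccess_prefix = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40")

# cp037 is Python's codec name for EBCDIC code page 037 (US/Canada).
decoded = writeaccess_prefix.decode("cp037")

print(repr(decoded))
```

The trailing 0x40 bytes decode to spaces, which is consistent with a fixed-width, space-padded user name field.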

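So yes, the file was written in a flawed way. In BIFF8 and later this field should be a Unicode string; in BIFF3 to BIFF5 it should be a byte string in the encoding given by the CODEPAGE record, which is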

 152: 0042 CODEPAGE len = 0002 (2)
 156:      12 52                                            ?R

1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.

The fact that xlrd tried to use Unicode means that it determined the file version to be BIFF8.

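In this case, you have two options:

1. Fix the file before opening it with xlrd. You could use dump to check whether a file has this non-standard output, and if so, overwrite the write access information using xlutils.save or another library.

2. Patch xlrd to handle your special case: in handle_writeaccess, add a try block and set strg to an empty string when unpack_unicode fails.

The following snippet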

# Patched method in xlrd/book.py
def handle_writeaccess(self, data):
    DEBUG = 0
    if self.biff_version < 80:
        if not self.encoding:
            self.raw_user_name = True
            self.user_name = data
            return
        strg = unpack_string(data, 0, self.encoding, lenlen=1)
    else:
        try:
            strg = unpack_unicode(data, 0, lenlen=2)
        except UnicodeDecodeError:
            # Malformed WRITEACCESS record: fall back to an empty user name
            strg = ""
    if DEBUG:
        fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n",
                len(data), self.raw_user_name, strg)
    strg = strg.rstrip()
    self.user_name = strg

together with

workbook=xlrd.open_workbook('thefile.xls',encoding_override="cp1252")

seems to open the file successfully.

Without the encoding override, it complains: ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
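The value 21010 in that message seems to come straight from the CODEPAGE record shown earlier. The sketch below assumes Python 3 and assumes the two payload bytes are read as a little-endian 16-bit integer, which is how BIFF records normally store integers. It suggests the writer stored the decimal digits "1252" as hex nibbles instead of encoding the number 1252.

```python
import struct

# Payload of the CODEPAGE record as it appears in the dump: 12 52
codepage_payload = bytes.fromhex("12 52")

# Read as a little-endian unsigned 16-bit integer: 0x5212
as_read = struct.unpack("<H", codepage_payload)[0]

# What a correctly written record for code page 1252 would contain.
expected_payload = struct.pack("<H", 1252)

print(as_read)
print(expected_payload.hex())
```

This would explain why encoding_override="cp1252" is needed even though the file apparently meant to declare code page 1252.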

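Comments:

I am not sure what the organization uses to write its Excel files, although I have asked them. I will try your second option, since you say it worked for you, and then post my results back here. I am hopeful. Thank you for the excellent response. I cannot award the bounty for another 15 hours, but I will do so. Thanks again.

Answer 2: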

This worked for me.

import xlrd

my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")
