python-docx - lxml.etree.XMLSyntaxError: AttValue length too long

I'm writing a program to check whether certain words appear in a batch of .docx files (we're talking about roughly 2,500 .docx files).

Here is the juicy part of the code:

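from docx import Document  # python-docx

# directorylist, destinationlist, destinationcount and destinationcountnobool
# are defined earlier in the script (not shown).
for filename in directorylist:
    if filename.endswith(".docx"):
        i = Document(filename)  # the traceback below is raised here
        print(filename)

        # Flag each destination with 1 if any paragraph mentions it, else 0.
        for destination in destinationlist:
            destinationcount[destination] = 0
            for paragraph in i.paragraphs:
                if destination in paragraph.text:
                    destinationcount[destination] = 1
                    break

        # Add this file's flags to the running totals.
        for destination in destinationcount:
            destinationcountnobool[destination] += destinationcount[destination]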

Now, I know what you're thinking: this is a bunch of nasty loops and generally bad programming, but it's a quick-and-dirty job, so bear with me.
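The counting step can be written more compactly. The following is only a sketch: `count_documents_containing` is a hypothetical helper, not part of the original script, and it assumes each file has already been reduced to a list of paragraph strings (with python-docx, for example, `[p.text for p in Document(filename).paragraphs]`).

```python
from collections import Counter


def count_documents_containing(documents, destinations):
    """Count, per destination, how many documents mention it at least once.

    `documents` is an iterable of paragraph-text lists, one list per file.
    (Hypothetical helper for illustration; the names are assumptions.)
    """
    # Start every destination at zero so unmatched ones still appear.
    counts = Counter({destination: 0 for destination in destinations})
    for paragraphs in documents:
        for destination in destinations:
            # any() stops at the first matching paragraph, like `break`.
            if any(destination in text for text in paragraphs):
                counts[destination] += 1
    return counts
```

Each document contributes at most 1 per destination, which matches the 0/1 flag used in the loop above.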

Here is the error I get:

Traceback (most recent call last):
  File "ICrunchMeSomeFiles.py", line 27, in <module>
    i = Document(filename)
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 130, in open
    Unmarshaller.unmarshal(pkg_reader, package, PartFactory)
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 199, in unmarshal
    pkg_reader, package, part_factory
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\package.py", line 216, in _unmarshal_parts
    partname, content_type, reltype, blob, package
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\part.py", line 191, in __new__
    return PartClass.load(partname, content_type, blob, package)
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\opc\part.py", line 231, in load
    element = parse_xml(blob)
  File "C:\Users\User\Anaconda3\lib\site-packages\docx\oxml\__init__.py", line 28, in parse_xml
    root_element = etree.fromstring(xml, oxml_parser)
  File "src\lxml\etree.pyx", line 3236, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: AttValue length too long, line 2, column 11011745

The program works on smaller samples, so I assume this is a memory issue. Any help would be much appreciated.
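The traceback reports a parse failure at a specific position (line 2, column 11011745) in one XML part, which suggests a problem with particular files rather than with the total number of files. A first step is therefore to find out which files fail. A minimal sketch, assuming a hypothetical helper `load_all` and a `loader` argument such as `docx.Document`; neither name comes from the original script:

```python
def load_all(filenames, loader):
    """Try to load every file; collect failures instead of stopping.

    `loader` is any callable that takes a filename, e.g. docx.Document.
    (Hypothetical helper for illustration only.)
    """
    loaded = {}
    failed = {}
    for filename in filenames:
        try:
            loaded[filename] = loader(filename)
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            failed[filename] = exc
    return loaded, failed
```

Printing `failed` after the run shows whether the error is tied to a handful of documents or appears everywhere.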

Answer

I think this will solve your problem:
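`AttValue length too long` is raised by libxml2, the parser underneath lxml, when an attribute value exceeds its default size limit. The reported column (11011745) is consistent with a single attribute of more than 10 million characters. In lxml that limit is lifted by creating the parser with `etree.XMLParser(huge_tree=True)`, but python-docx builds its parser internally, so applying the option there means patching the library. If only the paragraph text is needed, one alternative is to read `word/document.xml` from the .docx archive directly with the standard library. The sketch below assumes a hypothetical helper `paragraph_texts`; it also assumes the standard-library parser does not apply libxml2's limit. Unlike `Document.paragraphs`, it includes paragraphs nested in tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(docx_path):
    """Return the text of every paragraph in the main document part.

    Hypothetical helper: reads only word/document.xml, so oversized
    attributes in other parts of the package are never parsed.
    """
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    texts = []
    for paragraph in root.iter(W_NS + "p"):
        # Join the text runs (w:t elements) that make up the paragraph.
        runs = (t.text or "" for t in paragraph.iter(W_NS + "t"))
        texts.append("".join(runs))
    return texts
```

The word check then becomes `any(destination in text for text in paragraph_texts(filename))`.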
