Convert HTML with PhpWord: error - DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity

Posted 2023-03-22

Asked: 2019-02-28 04:46:01

Question:

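I am trying to convert HTML-formatted text into a Word document using PhpWord.

I created an HTML form with Summernote. Summernote lets the user format text, and that text is saved to the database together with its HTML tags.

Next, using PhpWord, I want to output the captured information to a Word document. Please see the code below: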

$rational = DB::table('rationals')->where('qualificationheader_id', $qualId)->value('rational');

$wordTest = new \PhpOffice\PhpWord\PhpWord();
$newSection = $wordTest->addSection();
$newSection->getStyle()->setPageNumberingStart(1);

// Parse the stored HTML into the section (this is the call that triggers the warning)
\PhpOffice\PhpWord\Shared\Html::addHtml($newSection, $rational);
$footer = $newSection->addFooter();
$footer->addText($curriculum->curriculum_code.'-'.$curriculum->curriculum_title);

$objectWriter = \PhpOffice\PhpWord\IOFactory::createWriter($wordTest, 'Word2007');
try {
    $objectWriter->save(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));
} catch (Exception $e) {
    // The exception is silently ignored here
}

return response()->download(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));

The text saved in the database looks like this:

<p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial;"><span style="font-family: Arial;">The want for this qualification originated from the energy crisis in
South Africa in 2008 together with the fact that no existing qualifications
currently focuses on energy efficiency as one of the primary solutions.  </span><span style="font-family: Arial;">The fact that energy supply remains under
severe pressure demands the development of skills sets that can deliver the
necessary solutions.</span><span style="font-family: Arial;">  </span><o:p></o:p></span></p><p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; font-family: Arial;">This qualification addresses the need from Industry to acquire credible
and certified professionals with specialised skill sets in the energy
efficiency field. The need for this skill set has been confirmed as a global
requirement in few of the International commitment to the reduction of carbon

I get the following error:

ErrorException (E_WARNING) DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity, line: 1

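Answer 1:

Problem

The parser is complaining that your text contains namespaces in its element tags, more specifically a prefix on the tag <o:p> (where o: is the prefix). It appears to be some kind of formatting for Word.

Reproducing the problem

To reproduce the problem I had to dig a little, because it is not PHPWord that raises the exception but DOMDocument, which PHPWord uses. The code below uses the same parsing method that PHPWord uses, and should output all warnings and notices about the code.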

# Make sure to display all errors
ini_set("display_errors", "1");
error_reporting(E_ALL);

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';

# Set up and parse the code
$doc = new DOMDocument();
$doc->loadXML($html); # This is the line that's causing the warning.
# Print it back
echo $doc->saveXML();
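
To inspect what the parser reports without relying on PHP's warning output, one option is to collect the libxml errors yourself. This is a minimal sketch that assumes only PHP's bundled DOM and libxml extensions; the variable names are illustrative and not part of the original answer.

```php
<?php
// Collect libxml errors in an internal buffer instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadXML('<o:p>Foo <o:b>Bar</o:b></o:p>');

// Gather the messages reported while parsing.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}

// Empty the buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($messages);
```

The collected messages should mention the undefined namespace prefix, which confirms that the complaint comes from libxml rather than from PHPWord itself.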

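Analysis

In a well-formed HTML structure, the namespaces can be included in the declaration, which tells the parser what these prefixes actually are. But since this appears to be only a fragment of the HTML code to be parsed, that is not possible here.

It would be possible to supply DOMXPath with the namespace so that PHPWord can use it. Unfortunately, the DOMXPath isn't public in the API, so that is not possible either.

Instead, the best approach seems to be to strip the prefixes from the tags so that the warning goes away.

Edit 2018-10-04: I have since found a way to keep the prefixes in the tags and still make the error go away, but the implementation is not optimal. If anyone can come up with a better solution, feel free to edit my post or leave a comment.

Solution

Based on the analysis, the solution is to remove the prefixes, which in turn means we have to pre-parse the code. Since PHPWord is using DOMDocument, we can use it as well and be sure that we do not need to install any (extra) dependencies.

PHPWord parses the HTML with loadXML, which is the function that complains about the formatting. The error messages can be suppressed in this method, and we will have to do so in both solutions. This is done by passing an additional parameter to the loadXML and loadHTML functions.

Solution 1: pre-parse as XML and strip the prefixes

The first approach parses the HTML code as XML, walks the tree recursively, and removes any prefix that occurs on a tag name.

I created a class that takes care of this.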

class TagPrefixFixer
{
    /**
     * Removes all prefixes from tags.
     *
     * @param string $xml The XML code to clean.
     * @return string The XML code with no prefixes in the tags.
     */
    public static function Clean(string $xml)
    {
        $doc = new DOMDocument();
        /* Load the XML. The LIBXML_HTML_* flags only apply to loadHTML(),
         * so only the error-suppression flags are passed here. */
        $doc->loadXML($xml,
            LIBXML_NOERROR |        # Suppress any errors
            LIBXML_NOWARNING        # or warnings about prefixes.
        );
        /* Run the code */
        self::removeTagPrefixes($doc);
        /* Return only the XML */
        return $doc->saveXML();
    }

    private static function removeTagPrefixes(DOMNode $domNode)
    {
        /* Iterate over a snapshot of the children, because renaming
         * replaces nodes in the live child list. */
        foreach (iterator_to_array($domNode->childNodes) as $node) {
            /* Only element nodes can be renamed */
            if ($node->nodeType === XML_ELEMENT_NODE) {
                /* Iterate recursively over the children.
                 * This is done before the renaming on purpose.
                 * If we rename this element, then the children, the element
                 * would need to be moved a lot more times due to how
                 * renameNode works. */
                if ($node->hasChildNodes()) {
                    self::removeTagPrefixes($node);
                }
                /* Check if the tag contains a ':' */
                if (strpos($node->tagName, ':') !== false) {
                    /* Get the last part of the tag name */
                    $parts = explode(':', $node->tagName);
                    $newTagName = end($parts);
                    /* Change the name of the tag */
                    self::renameNode($node, $newTagName);
                }
            }
        }
    }

    private static function renameNode($node, $newName)
    {
        /* Create a new node with the new name */
        $newNode = $node->ownerDocument->createElement($newName);
        /* Copy over every attribute from the old node to the new one */
        foreach ($node->attributes as $attribute) {
            $newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
        }
        /* Move every child node to the new node */
        while ($node->firstChild) {
            $newNode->appendChild($node->firstChild);
        }
        /* Replace the old node with the new one */
        $node->parentNode->replaceChild($newNode, $node);
    }
}

To use the code, just call the TagPrefixFixer::Clean function.

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print TagPrefixFixer::Clean($xml);

Output

<?xml version="1.0"?>
<p>Foo <b>Bar</b></p>

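Solution 2: pre-parse as HTML

I noticed that if you use loadHTML instead of the loadXML that PHPWord is using, the prefixes are removed automatically when the HTML is loaded into the class.

This code is significantly shorter.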

function cleanHTML($html)
{
    $doc = new DOMDocument();
    /* Load the HTML */
    $doc->loadHTML($html,
        LIBXML_HTML_NOIMPLIED | # Make sure no extra BODY
        LIBXML_HTML_NODEFDTD |  # or DOCTYPE is created
        LIBXML_NOERROR |        # Suppress any errors
        LIBXML_NOWARNING        # or warnings about prefixes.
    );
    /* Immediately save the HTML and return it. */
    return $doc->saveHTML();
}

To use this code, just call the cleanHTML function.

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print cleanHTML($html);

Output

<p>Foo <b>Bar</b></p>

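Solution 3: keep the prefixes and add the namespaces

I tried wrapping the code with the given Microsoft Office namespaces before feeding the data into the parser, which also solves the problem. Ironically, I have not found a way to add the namespaces using the DOMDocument parser without actually raising the original warning. So the implementation of this solution is a bit clumsy, and I would not recommend using it as is but rather building your own. But you get the idea: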

function addNamespaces($xml)
{
    /* Root element that declares the prefixes used by Word markup */
    $root = '<w:wordDocument
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
        xmlns:o="urn:schemas-microsoft-com:office:office">';
    $root .= $xml;
    $root .= '</w:wordDocument>';
    return $root;
}

To use this code, just call the addNamespaces function.

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print addNamespaces($xml);

Output

<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:o="urn:schemas-microsoft-com:office:office">
    <o:p>Foo <o:b>Bar</o:b></o:p>
</w:wordDocument>

This code can then be fed to the PHPWord function addHtml without causing any warnings.
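
As a quick check of that claim, the wrapped fragment can be parsed with loadXML while collecting libxml errors. This is a minimal sketch under stated assumptions: wrapWithOfficeNamespaces is a hypothetical helper that declares only the w and o prefixes (all that the example fragment needs), and the PHPWord call is guarded with class_exists because it assumes the library is installed and autoloaded through Composer.

```php
<?php
// Hypothetical helper: wrap a fragment in a root element that declares
// the "o" prefix, using the same namespace URI as the function above.
function wrapWithOfficeNamespaces(string $xml): string
{
    return '<w:wordDocument'
        . ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
        . ' xmlns:o="urn:schemas-microsoft-com:office:office">'
        . $xml
        . '</w:wordDocument>';
}

$wrapped = wrapWithOfficeNamespaces('<o:p>Foo <o:b>Bar</o:b></o:p>');

// Parse the wrapped fragment and collect any libxml errors.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$loaded = $doc->loadXML($wrapped);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo ($loaded && count($errors) === 0)
    ? "parsed cleanly\n"
    : "parser reported problems\n";

// Only attempted when PHPWord is available (assumption: Composer autoloading).
if (class_exists('\PhpOffice\PhpWord\PhpWord')) {
    $phpWord = new \PhpOffice\PhpWord\PhpWord();
    $section = $phpWord->addSection();
    \PhpOffice\PhpWord\Shared\Html::addHtml($section, $wrapped);
}
```

Because the prefix is declared on the root element, the parser can resolve o:p and o:b, so no error suppression flags are needed in this variant.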

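Optional solutions (deprecated)

In the previous version of this answer these were offered as (optional) solutions, and I am leaving them below for reference. Keep in mind that none of them are recommended and they should be used with caution.

Turning off warnings

Since it is "only" a warning and not a fatal exception, you can turn warnings off. You can do this by including this code at the top of your script. However, this will still slow down your application, and the best approach is always to make sure there are no warnings or errors.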

// Show the default reporting except from warnings
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED & ~E_WARNING);

These settings are taken from the default reporting level.
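
If warnings are turned off at all, it may be preferable to limit the change to the single parsing call instead of the whole script. The following is a sketch only: withoutWarnings is a hypothetical helper, and it assumes the active error handler honours error_reporting() (a custom handler registered with set_error_handler has to check that level itself).

```php
<?php
// Hypothetical helper: run a callback with E_WARNING masked, then restore
// the previous error_reporting level even if the callback throws.
function withoutWarnings(callable $callback)
{
    $previous = error_reporting();
    error_reporting($previous & ~E_WARNING);
    try {
        return $callback();
    } finally {
        error_reporting($previous);
    }
}

$levelBefore = error_reporting();

// Parse the prefixed markup while warnings are masked.
$doc = withoutWarnings(function () {
    $doc = new DOMDocument();
    $doc->loadXML('<o:p>Foo <o:b>Bar</o:b></o:p>');
    return $doc;
});

echo $doc->saveXML();
```

The same wrapper could be placed around the addHtml call, so the rest of the application keeps reporting warnings as before.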

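Using regex

It is (probably) possible to remove (most of) the namespaces with regex on your text, either before saving it to the database or after fetching it for use in this function. Since the text is already stored in the database, it is better to apply the code below after fetching it. The regex may miss some occurrences or, in the worst case, mess up the HTML.

Regex: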

$text_after = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text_before);

Example:

$text = '<o:p>Foo <o:b>Bar</o:b></o:p>';
$text = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text);
echo $text; // Outputs '<p>Foo <b>Bar</b></p>'
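
To illustrate how the regex can mess up the HTML, here is a small sketch using the same pattern on a tag that carries an attribute. The input string is made up for the example.

```php
<?php
$pattern = '/[a-zA-Z]+:([a-zA-Z]+[=>])/';

// The opening tag has an attribute, so "o:p" is followed by a space and the
// pattern does not match it. The closing tag still matches and is rewritten.
$text = '<o:p class="x">Foo</o:p>';
$result = preg_replace($pattern, '$1', $text);

echo $result; // Outputs '<o:p class="x">Foo</p>'
```

The opening and closing tags no longer match, which is exactly the kind of damage that a DOM-based approach avoids.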

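Comments:

Regex on HTML? No! ***.com/questions/1732348/…

You are right, @delboy1978uk. I have reworked the whole solution with another, more sustainable approach.

To anyone who has read this far down: I will test adding namespace tags before parsing the data, to see whether that solves the problem without suppressing any warnings, but I will not have time to do so until later today.

Following up on my previous comment: it is possible to keep the prefixes and still parse the code without any warnings/errors. I have added the result as my third solution, even though its implementation is not optimal.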
