How to skip invalid characters in an XML file using PHP

Posted 2023-02-24

【Title】: How to skip invalid characters in XML file using PHP 【Posted】: 2011-03-28 19:39:46 【Question】:

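I'm trying to parse an XML file using PHP, but I get an error message:

parser error : Char 0x0 out of allowed range

I think it's caused by the content of the XML; I believe it contains a special symbol "☆". Any ideas how I can fix it?

I also get:

parser error : Premature end of data in tag item line

What could be causing that error?

I'm using simplexml_load_file.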

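Update:

I tried locating the line that triggers the error and pasting its content into a separate XML file, and that file parses fine! So I still can't figure out what makes parsing the full file fail. PS: the XML file is huge, over 100 MB. Could the size be causing the parse error?

【Comments】: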

【Answer 1】:

I use this to clean strings:

// Static method of a utility class (the enclosing class is not shown).
public static function Clean($inputName)
{
    $strName = trim($inputName);

    if ($strName != "") {
        // Drop all non utf-8 characters.
        $strName = iconv("UTF-8", "UTF-8//IGNORE", $strName);

        // Replace characters that are reserved in Windows file names.
        $strName = str_replace(array('\\', '/', ':', '*', '?', '"', '<', '>', '|'), '@', $strName);

        // Remove control characters, DEL and the non-breaking space.
        // [\x00-\x1F]  control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        $strName = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $strName);

        // Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
        // UTF-16 surrogates: \xED[\xA0-\xBF].
        // Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]
        // Invalid characters are replaced with the replacement character U+FFFD
        $strName = preg_replace(
            '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
            "\xEF\xBF\xBD",
            $strName
        );

        // Reduce all multiple whitespace to a single space
        // $strName = preg_replace('/\s+/', ' ', $strName);

        if (trim($strName) == "") {
            $strName = "@" . "empty-name";
        }
    } else {
        $strName = " ";
    }

    return $strName;
}

【Comments】:

【Answer 2】:

Do you have control over the XML? If so, make sure the data is wrapped in <![CDATA[ .. ]]> blocks.

You also need to clear out the invalid characters:

/**
 * Removes invalid XML
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    if (empty($value)) {
        return $ret;
    }

    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // ord() returns a single byte value (0-255), so the ranges above 0xFF
        // never come into play: every byte >= 0x20 is kept as-is.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        } else {
            // Replace disallowed control bytes with a space.
            $ret .= " ";
        }
    }
    return $ret;
}
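As a comment on this answer points out, ord() only sees single bytes, so the function above cannot catch U+FFFE, U+FFFF or surrogates. A minimal sketch of a code-point-aware variant follows, assuming the input is meant to be UTF-8. The name stripInvalidXmlChars is hypothetical, and throwing on malformed UTF-8 is my own choice rather than anything from the answer.

```php
<?php
// Hypothetical helper (not from the original answer): replaces every code
// point outside the XML 1.0 "Char" ranges with a single space.
function stripInvalidXmlChars(string $value): string
{
    // With the /u modifier PCRE works on code points, so \x{...} escapes
    // can express the same ranges the byte-wise loop was aiming for.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        ' ',
        $value
    );

    // preg_replace() returns null when the subject is not valid UTF-8
    // (this includes UTF-8-encoded surrogates).
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

If the input may contain broken byte sequences, it has to be converted or scrubbed to valid UTF-8 before calling this.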

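【Comments】:

I have no control over the XML, but I can ask... but is this the solution?? Let me check.
Not sure this helps in this case. You can't fix encoding issues with CDATA, only things like having "&" instead of "&amp;".
Yes, I agree. Dominic has the way.
user315396: Sorry, you can't fix "out of allowed range" with CData sections.
This function is broken. ord() only operates on single bytes.

【Answer 3】: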

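Certain Unicode characters must not appear in XML 1.0:

C0 control codes (U+0000 - U+001F), except for tab, CR and LF.
UTF-16 surrogates (U+D800 - U+DFFF). These are also invalid in UTF-8 and indicate more serious problems when encountered.
U+FFFE and U+FFFF.

In practice, though, you'll often have to handle XML that was carelessly generated from other sources containing such characters. If you want to handle this special case of invalid XML in UTF-8 encoded strings, I'd suggest: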

$str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $str
);

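This doesn't use the u Unicode regex modifier but works directly on the UTF-8 encoded bytes for extra performance. The parts of the pattern are:

Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
UTF-16 surrogates: \xED[\xA0-\xBF].
Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]

Invalid characters are replaced with the replacement character U+FFFD (�) instead of simply being stripped. This makes it easier to diagnose invalid characters and can even prevent security issues.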
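To show how this fits the question, here is a small sketch that applies the same pattern to a string before handing it to simplexml_load_string. The sample document is made up for illustration; for a file you would read the contents first (for example with file_get_contents) and then clean them the same way.

```php
<?php
// Made-up sample: a NUL byte sits inside the <item> text next to the star.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<items><item>a\x00b ☆</item></items>";

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The question reports a parser error for input like this.
$before = simplexml_load_string($xml);
libxml_clear_errors();

// Same byte-level pattern as above: each invalid character becomes U+FFFD.
$clean = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $xml
);

$doc = simplexml_load_string($clean);
echo $doc->item, "\n";
```

The star character itself is valid XML, so it passes through untouched; only the NUL byte is replaced.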

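【Comments】:

【Answer 4】:

If you have control over the data, make sure it is encoded correctly, i.e. in the encoding you promised in the XML declaration. For example, if you have: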

<?xml version="1.0" encoding="UTF-8"?>

then you need to make sure your data really is UTF-8.
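A rough sketch of that check in PHP, assuming the mbstring extension is available. The helper name ensureUtf8 is hypothetical, and the source encoding is something you have to know or find out from whoever produces the data; it cannot be reliably guessed.

```php
<?php
// Hypothetical helper: returns $data unchanged when it is already valid
// UTF-8, otherwise converts it from the encoding the caller says it is in.
function ensureUtf8(string $data, string $sourceEncoding): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // Assumption: $sourceEncoding names the real encoding of the data.
    return mb_convert_encoding($data, 'UTF-8', $sourceEncoding);
}

// "caf\xE9" is "café" in ISO-8859-1, which is not valid UTF-8 as-is.
echo ensureUtf8("caf\xE9", 'ISO-8859-1'), "\n";
```

Note that this only fixes the encoding. Control characters such as NUL are valid UTF-8 but still invalid in XML, so they need separate handling.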

If you don't have control over the data, yell at those who do.

You can use a tool like xmllint to check which parts of the data are invalid.
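If you'd rather stay inside PHP, libxml can report the line and column of each problem as well. This is only a sketch: the helper name xmlParseErrors is hypothetical, and for a file on disk you would call simplexml_load_file instead of simplexml_load_string.

```php
<?php
// Hypothetical helper: returns one "line:column message" string for every
// problem libxml reports while parsing $xml.
function xmlParseErrors(string $xml): array
{
    // Switch to internal error collection and remember the old setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            '%d:%d %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Malformed sample: <item> is never closed.
foreach (xmlParseErrors("<items>\n<item>unterminated</items>") as $message) {
    echo $message, "\n";
}
```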

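【Comments】:

I tried locating the line that triggers the error and pasting its content into a separate XML file, and that file parses fine! So I still can't figure out what makes parsing the XML file fail.
In that case, that just reinforces what Dominic said.
OK... I think some of the data is not UTF-8. If I open the XML in FF, an error message appears pointing at a bad character. IE... well... the file is huge, I waited a long time but got no response.
@DominicRodger Thanks! xmllint let me find the invalid characters and remove them.

【Answer 5】: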

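Not a PHP solution, but it works:

Download Notepad++ https://notepad-plus-plus.org/

Open your .xml file in Notepad++

From the main menu: Search -> Search Mode, set it to: Extended

Then,

Replace -> Find: \x00; Replace with: (leave empty)

Then, Replace All

Rob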
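The same NUL-stripping step can be sketched in PHP for cases where opening a 100 MB file in an editor is not practical. The function name stripNulBytes and the default chunk size are my own assumptions. Reading in chunks is safe here because NUL is a single byte, so a chunk boundary can never split it.

```php
<?php
// Hypothetical helper: copies $source to $target with every NUL byte
// removed, reading fixed-size chunks so the whole file never has to fit
// in memory at once.
function stripNulBytes(string $source, string $target, int $chunkSize = 1048576): void
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop copying
        }
        fwrite($out, str_replace("\0", '', $chunk));
    }

    fclose($in);
    fclose($out);
}
```

A call such as stripNulBytes('feed.xml', 'feed.clean.xml') (both file names are placeholders) writes the cleaned copy, which can then be passed to simplexml_load_file.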

【Comments】:

【Answer 6】:

My problem was the "&" character (hex 0x26), so I changed the function to:

function stripInvalidXml($value)
{
    $ret = "";
    if (empty($value)) {
        return $ret;
    }

    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            // Lower bound raised from 0x20 to 0x28: bytes 0x21-0x27
            // (! " # $ % & ') are now replaced with a space as well.
            (($current >= 0x28) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        } else {
            $ret .= " ";
        }
    }
    return $ret;
}
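Raising the lower bound also throws away quotes, "#", "$" and "%", and it blanks the "&" of entities that were already correct. A narrower alternative, sketched here under my own assumptions (the helper name escapeBareAmpersands is hypothetical, and this is a different technique from the one in the answer), is to escape only ampersands that do not start an entity or character reference:

```php
<?php
// Hypothetical helper: turns a bare "&" into "&amp;" while leaving things
// that already look like references ("&amp;", "&#38;", "&#x26;") alone.
function escapeBareAmpersands(string $xml): string
{
    return preg_replace(
        // "&" not followed by a name, decimal or hex reference ending in ";"
        '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

echo escapeBareAmpersands('<a>Tom & Jerry &amp; co</a>'), "\n";
```

Two limitations to keep in mind: the pattern does not check that a named entity is actually declared, and it would also rewrite ampersands inside CDATA sections, where they are legal as-is.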

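【Comments】:

【Answer 7】:

I decided to test all UTF-8 values (0-1114111) to make sure everything works as it should. When testing all UTF-8 values, preg_replace() returns NULL because of an error. This is the solution I came up with.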

$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);


/**
 * Removes invalid XML
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input)
{
  // Convert input to UTF-8.
  $old_setting = ini_set('mbstring.substitute_character', 'none');
  $input = mb_convert_encoding($input, 'UTF-8', 'auto');
  ini_set('mbstring.substitute_character', $old_setting);

  // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
  // Code points need the \x{...} form; a bare \x only takes two hex digits.
  $output = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
  if (is_null($output)) {
    // Convert to ints.
    // Convert ints back into a string.
    $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
  }
  return $output;
}

/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8')
{
  // Turn a string of unicode characters into UCS-4BE, which is a Unicode
  // encoding that stores each character as a 4 byte integer. This accounts for
  // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
  // big-endian order. The reason for this encoding is that each character is a
  // fixed size, making iterating over the string simpler.
  $input = mb_convert_encoding($input, "UCS-4BE", $encoding);

  // Visit each unicode character.
  $ords = array();
  $length = mb_strlen($input, "UCS-4BE");
  for ($i = 0; $i < $length; $i++) {
    // Now we have 4 bytes. Find their total numeric value.
    $s2 = mb_substr($input, $i, 1, "UCS-4BE");
    $val = unpack("N", $s2);
    $ords[] = $val[1];
  }
  return $ords;
}


/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE)
{
  $output = '';
  foreach ($ords as $ord) {
    // 0: Negative numbers.
    // 55296 - 57343: Surrogate Range.
    // 65279: BOM (byte order mark).
    // 1114111: Out of range.
    if (   $ord < 0
        || ($ord >= 0xD800 && $ord <= 0xDFFF)
        || $ord == 0xFEFF
        || $ord > 0x10ffff) {
      // Skip non valid UTF-8 values.
      continue;
    }
    // 9: Anything Below 9.
    // 11: Vertical Tab.
    // 12: Form Feed.
    // 14-31: Unprintable control codes.
    // 65534, 65535: Unicode noncharacters.
    elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
      // Skip non valid XML values.
      continue;
    }
    // 127: 1 Byte char.
    elseif ($ord <= 0x007f) {
      $output .= chr($ord);
      continue;
    }
    // 2047: 2 Byte char.
    elseif ($ord <= 0x07ff) {
      $output .= chr(0xc0 | ($ord >> 6));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 65535: 3 Byte char.
    elseif ($ord <= 0xffff) {
      $output .= chr(0xe0 | ($ord >> 12));
      $output .= chr(0x80 | (($ord >> 6) & 0x003f));
      $output .= chr(0x80 | ($ord & 0x003f));
      continue;
    }
    // 1114111: 4 Byte char.
    elseif ($ord <= 0x10ffff) {
      $output .= chr(0xf0 | ($ord >> 18));
      $output .= chr(0x80 | (($ord >> 12) & 0x3f));
      $output .= chr(0x80 | (($ord >> 6) & 0x3f));
      $output .= chr(0x80 | ($ord & 0x3f));
      continue;
    }
  }
  return $output;
}

And to do this on a simple object or array:

// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input)
{
  if (is_null($input) || is_bool($input) || is_numeric($input)) {
    return;
  }
  if (!is_array($input) && !is_object($input)) {
    $input = sanitize_for_xml($input);
  }
  else {
    foreach ($input as &$value) {
      recursive_sanitize_for_xml($value);
    }
  }
}
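On the NULL return mentioned at the top of this answer: one case where it certainly happens is a subject string that is not valid UTF-8 while the pattern uses the u modifier. The sketch below shows how to detect that with preg_last_error(). The function name stripWithUnicodeRegex is hypothetical, and throwing instead of falling back to the slower path is only for illustration.

```php
<?php
// Hypothetical wrapper: runs a /u pattern and reports why it failed
// instead of silently passing null along.
function stripWithUnicodeRegex(string $input): string
{
    $output = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $input
    );

    if ($output === null) {
        // PREG_BAD_UTF8_ERROR means the subject was not valid UTF-8.
        $reason = preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'input is not valid UTF-8'
            : 'PCRE error code ' . preg_last_error();
        throw new RuntimeException('preg_replace failed: ' . $reason);
    }
    return $output;
}

try {
    stripWithUnicodeRegex("abc\xFFdef"); // 0xFF never occurs in UTF-8
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```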

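【Comments】:

This was helpful! Though I haven't found an input that triggers the "long" version. Do you have an example where it is needed?
@mcfedr preg_replace may have been fixed in later versions of PHP. I believe this was PHP 5.4.

【Answer 8】: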

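For a non-destructive way of loading this kind of input into a SimpleXMLElement, see my answer on How to handle invalid unicode with simplexml

【Comments】:

【Answer 9】:

Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_html_character_entity_references

【Comments】:

The XML is valid; the encoding is UTF-8 but there is a Big5 char in it. I found the char "".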
