Better way to detect XML?

Posted 2023-02-24

【Title】: Better way to detect XML? 【Posted】: 2010-10-27 20:00:45 【Question】:

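Currently, I have the following C# code to extract a value out of text. If it's XML, I want the value inside it; otherwise, if it's not XML, it can just return the text itself.

String data = "...";
try
{
    return XElement.Parse(data).Value;
}
catch (System.Xml.XmlException)
{
    return data;
}

I know exceptions are expensive in C#, so I was wondering whether there is a better way to determine if the text I'm dealing with is XML or not?

I thought of a regex test, but I don't see that as a cheaper alternative. Note that I'm asking for a less expensive way of doing this.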

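【Comments】:

Exceptions are free, I throw them all the time. Unless you have proven there is a problem, there is nothing wrong with the code above; this is really just a code smell. Has anyone tested that the methods below are actually faster, and is that speed needed?

@JustEngland Actually, exceptions are slow in most C++ implementations. But C# might be a different story. I haven't used C#, so I can't comment on exception performance in C#. I can loop 400 million iterations per second in C++, but if every iteration throws an exception it drops to less than a million iterations per second.

Wow, what a great thread, and I don't even write C# code anymore :) The best advice I can give today is to compare all the different parsers in the framework. You can also cheat with some basic checks. There is also a 3rd-party XML parser available with better performance.

Even if performance isn't an issue, it's best to avoid throwing exceptions in non-exceptional cases. Our tools are built to surface exceptions, and they can get in the way when you're debugging something else. I think this is an example of a "vexing exception", though not as severe as the int.Parse example Eric gave.

【Answer 1】: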

You could do a preliminary check for a leading < before parsing.

(Written freehand.)

// Has to have length to be XML
if (!string.IsNullOrEmpty(data))
{
    // If it starts with a < after trimming then it probably is XML
    // Need to do an empty check again in case the string is all white space.
    var trimmedData = data.TrimStart();
    if (string.IsNullOrEmpty(trimmedData))
    {
        return data;
    }

    if (trimmedData[0] == '<')
    {
        try
        {
            return XElement.Parse(data).Value;
        }
        catch (System.Xml.XmlException)
        {
            return data;
        }
    }
}

// Empty input, or the first non-whitespace character is not '<': treat as plain text
return data;

I originally used a regex, but Trim()[0] does the same thing that regex would do.
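The same pre-check can be written without allocating a trimmed copy of the string, by scanning for the first non-whitespace character. This is a minimal sketch, not part of the original answer; the class name `XmlSniffer` and the method names `LooksLikeXml` and `ValueOrText` are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlSniffer
{
    // Hypothetical helper: true when the first non-whitespace character is '<'.
    public static bool LooksLikeXml(string data)
    {
        if (string.IsNullOrEmpty(data))
            return false;

        foreach (char c in data)
        {
            if (char.IsWhiteSpace(c))
                continue;      // skip leading whitespace without copying the string
            return c == '<';   // decide on the first real character
        }
        return false;          // the string was all whitespace
    }

    // Returns the element value when the text parses as XML, otherwise the text itself.
    public static string ValueOrText(string data)
    {
        if (!LooksLikeXml(data))
            return data;       // cheap rejection, no exception involved

        try
        {
            return XElement.Parse(data).Value;
        }
        catch (XmlException)
        {
            return data;       // started with '<' but was not well-formed
        }
    }
}
```

The exception path is still there, but it is only reached for text that starts with `<` and then fails to parse, which should be rare for typical plain-text input.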

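【Comments】:

+1 for the concept, since it weeds out 99% of the exceptions, but I don't think a regex is needed here. StartsWith or IndexOf would be better and faster.

Hmm, StartsWith won't work because whitespace is allowed, and IndexOf would need to know that everything before the index is whitespace. IndexOf could be used though, so I'll revise the answer for that.

【Answer 2】: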

The code given below matches all of the following XML formats:

<text />
<text/>
<text   />
<text>xml data1</text>
<text attr='2'>data2</text>
<text attr='2' attr='4' >data3 </text>
<text attr>data4</text>
<text attr1 attr2>data5</text>

Here is the code:

public class XmlExpresssion
{
    // EXPLANATION OF EXPRESSION
    // <        :   \<{1}
    // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
    // >        :   >{1}
    // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data
    // </       :   <{1}/{1}
    // text     :   \k<xmlTag>
    // >        :   >{1}
    // (\w|\W)* :   Matches attributes if any

    // Sample match and pattern egs
    // Just to show how I incrementally made the patterns so that the final pattern is well-understood
    // <text>data</text>
    // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

    //<text />
    // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

    //<text>data</text> or <text />
    // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
    // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

    // Checks if the string is in xml format
    private static bool IsXml(string value)
    {
        return Regex.IsMatch(value, XML_PATTERN);
    }

    /// <summary>
    /// Assigns the element value to result if the string is xml
    /// </summary>
    /// <returns>true if success, false otherwise</returns>
    public static bool TryParse(string s, out string result)
    {
        if (XmlExpresssion.IsXml(s))
        {
            Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
            result = r.Match(s).Result("${data}");
            return true;
        }
        else
        {
            result = null;
            return false;
        }
    }
}

Calling code:

if (!XmlExpresssion.TryParse(s, out result))
    result = s;
Console.WriteLine(result);

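【Comments】:

I'm somewhat doubtful about this, since XML is not a regular language and so you can't parse XML with regular expressions: ***.com/questions/6751105/… ...however, for merely recognizing XML, maybe it can work as long as you're not doing a full parse.

If the string starts with an XML declaration, i.e. <?xml version="1.0" encoding="UTF-8" ?>

【Answer 3】: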

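Update: (original post is below) Colin had the brilliant idea of moving the regex instantiation outside of the calls so that each regex is created only once. Here is the new program: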

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        // Checks the function once for correctness, then times 'iterations' calls
        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");
            benchmark(xmltest5, "<tag>Custom</tag>", "Custom");
            benchmark(xmltest5, " <tag>Custom</tag> ", "Custom");
            benchmark(xmltest5, "Custom", "Custom");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        // Test 1: the code from the question (parse, fall back on exception)
        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        // Test 2: Colin Burnett's pre-check, regex now created once
        static Regex xmltest2regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || xmltest2regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        // Test 3: Si's regex, now created once
        static Regex xmltest3regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
        public static String xmltest3(String data)
        {
            Match m = xmltest3regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        // Test 4: Rashmi Pandit's XmlExpresssion class
        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

        // Test 5: custom test that tries Trim()[0] before falling back to the regex
        static Regex xmltest5regex = new Regex("^[ \t\r\n]*<");
        public static String xmltest5(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || data.Trim()[0] == '<' || xmltest5regex.Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
            return data;
        }
    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private static string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";
        private static Regex regex = new Regex(XML_PATTERN, RegexOptions.Compiled);

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return regex.IsMatch(value);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                result = regex.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }
    }
}

Here are the new results (seconds per 1 million calls):

<tag>base</tag>: 3.667
 <tag>base</tag> : 3.707
base: 40.737
<tag>ColinBurnett</tag>: 3.707
 <tag>ColinBurnett</tag> : 4.784
ColinBurnett: 0.413
<tag>Si</tag>: 2.016
 <tag>Si</tag> : 2.141
Si: 0.087
<tag>RashmiPandit</tag>: 12.305
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0.131
<tag>Custom</tag>: 3.761
 <tag>Custom</tag> : 3.866
Custom: 0.329
Done.

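There you have it. Precompiled regexes are the way to go, and pretty efficient to boot.

(Original post)

I threw together the following program to benchmark the code samples provided for this question, to demonstrate the reasoning behind my post as well as to evaluate the speed of the previous answers.

Without further ado, here is the program.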

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    delegate String xmltestFunc(String data);

    class Program
    {
        static readonly int iterations = 1000000;

        // Checks the function once for correctness, then times 'iterations' calls
        private static void benchmark(xmltestFunc func, String data, String expectedResult)
        {
            if (!func(data).Equals(expectedResult))
            {
                Console.WriteLine(data + ": fail");
                return;
            }
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i)
                func(data);
            sw.Stop();
            Console.WriteLine(data + ": " + (float)((float)sw.ElapsedMilliseconds / 1000));
        }

        static void Main(string[] args)
        {
            benchmark(xmltest1, "<tag>base</tag>", "base");
            benchmark(xmltest1, " <tag>base</tag> ", "base");
            benchmark(xmltest1, "base", "base");
            benchmark(xmltest2, "<tag>ColinBurnett</tag>", "ColinBurnett");
            benchmark(xmltest2, " <tag>ColinBurnett</tag> ", "ColinBurnett");
            benchmark(xmltest2, "ColinBurnett", "ColinBurnett");
            benchmark(xmltest3, "<tag>Si</tag>", "Si");
            benchmark(xmltest3, " <tag>Si</tag> ", "Si" );
            benchmark(xmltest3, "Si", "Si");
            benchmark(xmltest4, "<tag>RashmiPandit</tag>", "RashmiPandit");
            benchmark(xmltest4, " <tag>RashmiPandit</tag> ", "RashmiPandit");
            benchmark(xmltest4, "RashmiPandit", "RashmiPandit");

            // "press any key to continue"
            Console.WriteLine("Done.");
            Console.ReadLine();
        }

        // Test 1: the code from the question (parse, fall back on exception)
        public static String xmltest1(String data)
        {
            try
            {
                return XElement.Parse(data).Value;
            }
            catch (System.Xml.XmlException)
            {
                return data;
            }
        }

        // Test 2: Colin Burnett's pre-check; the regex is created on every call here
        public static String xmltest2(String data)
        {
            // Has to have length to be XML
            if (!string.IsNullOrEmpty(data))
            {
                // If it starts with a < then it probably is XML
                // But also cover the case where there is indeterminate whitespace before the <
                if (data[0] == '<' || new Regex("^[ \t\r\n]*<").Match(data).Success)
                {
                    try
                    {
                        return XElement.Parse(data).Value;
                    }
                    catch (System.Xml.XmlException)
                    {
                        return data;
                    }
                }
            }
           return data;
        }

        // Test 3: Si's regex; created on every call here
        public static String xmltest3(String data)
        {
            Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
            Match m = regex.Match(data);
            if (m.Success)
            {
                GroupCollection gc = m.Groups;
                if (gc.Count > 0)
                {
                    return gc["text"].Value;
                }
            }
            return data;
        }

        // Test 4: Rashmi Pandit's XmlExpresssion class
        public static String xmltest4(String data)
        {
            String result;
            if (!XmlExpresssion.TryParse(data, out result))
                result = data;

            return result;
        }

    }

    public class XmlExpresssion
    {
        // EXPLANATION OF EXPRESSION
        // <        :   \<{1}
        // text     :   (?<xmlTag>\w+)  : xmlTag is a backreference so that the start and end tags match
        // >        :   >{1}
        // xml data :   (?<data>.*)     : data is a backreference used for the regex to return the element data
        // </       :   <{1}/{1}
        // text     :   \k<xmlTag>
        // >        :   >{1}
        // (\w|\W)* :   Matches attributes if any

        // Sample match and pattern egs
        // Just to show how I incrementally made the patterns so that the final pattern is well-understood
        // <text>data</text>
        // @"^\<{1}(?<xmlTag>\w+)\>{1}.*\<{1}/{1}\k<xmlTag>\>{1}$";

        //<text />
        // @"^\<{1}(?<xmlTag>\w+)\s*/{1}\>{1}$";

        //<text>data</text> or <text />
        // @"^\<{1}(?<xmlTag>\w+)((\>{1}.*\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        //<text>data</text> or <text /> or <text attr='2'>xml data</text> or <text attr='2' attr2 >data</text>
        // @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        private const string XML_PATTERN = @"^\<{1}(?<xmlTag>\w+)(((\w|\W)*\>{1}(?<data>.*)\<{1}/{1}\k<xmlTag>)|(\s*/{1}))\>{1}$";

        // Checks if the string is in xml format
        private static bool IsXml(string value)
        {
            return Regex.IsMatch(value, XML_PATTERN);
        }

        /// <summary>
        /// Assigns the element value to result if the string is xml
        /// </summary>
        /// <returns>true if success, false otherwise</returns>
        public static bool TryParse(string s, out string result)
        {
            if (XmlExpresssion.IsXml(s))
            {
                Regex r = new Regex(XML_PATTERN, RegexOptions.Compiled);
                result = r.Match(s).Result("${data}");
                return true;
            }
            else
            {
                result = null;
                return false;
            }
        }
    }
}

Here are the results, in seconds. Each one was executed 1 million times.

<tag>base</tag>: 3.531
 <tag>base</tag> : 3.624
base: 41.422
<tag>ColinBurnett</tag>: 3.622
 <tag>ColinBurnett</tag> : 16.467
ColinBurnett: 7.995
<tag>Si</tag>: 19.014
 <tag>Si</tag> : 19.201
Si: 15.567

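Test 4 took too long: after 30 minutes it was deemed too slow to finish. To demonstrate how slow it is, here is the same test with each case run only 1000 times.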

<tag>base</tag>: 0.004
 <tag>base</tag> : 0.004
base: 0.047
<tag>ColinBurnett</tag>: 0.003
 <tag>ColinBurnett</tag> : 0.016
ColinBurnett: 0.008
<tag>Si</tag>: 0.021
 <tag>Si</tag> : 0.017
Si: 0.014
<tag>RashmiPandit</tag>: 3.456
 <tag>RashmiPandit</tag> : fail
RashmiPandit: 0
Done.

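Extrapolating to a million executions (3.456 s per 1000 calls), it would have taken 3456 seconds, or just over 57 minutes.

This is a good example of why complex regexes are a bad idea if you're looking for efficient code. However, it also shows that simple regexes can still be a good answer in some cases: the small "pre-test" for XML in Colin Burnett's answer creates a possibly more expensive base case (the regex is created on each call in test 2), but also a much shorter else case, because the exception is avoided.

【Comments】:

Try it again using static fields to hold regex instances that are created only once. I would venture to say that a good share of that time is spent instantiating and compiling the regular expressions over and over.

Actually, if (data[0] == '<' || data.TrimStart()[0] == '<') removes the need for the regex in my example entirely. The latter is exactly what "^[ \t\r\n]*<" does.

Added a custom test making that change.

【Answer 4】: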

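I find that a perfectly acceptable way of handling your situation (it's probably the way I'd handle it as well). I couldn't find any sort of "XElement.TryParse(string)" in MSDN, so the way you have it would work just fine.

【Comments】:

【Answer 5】:

There is no way of validating that the text is XML other than doing something like XElement.Parse. If, for example, the very last closing angle bracket is missing from the text field, then it's not valid XML, and it's very unlikely that you'll spot this with RegEx or text parsing. There are a number of illegal characters, illegal sequences and so on that RegEx parsing will most likely miss.

All you can do is shortcut your failure cases.

So, if you expect to see lots of non-XML data and XML is the less expected case, then using RegEx or substring searches to detect angle brackets might save you a little time, but I'd suggest this is useful only if you're batch processing lots of data in a tight loop.

If instead this is parsing user-entered data from a web form or a WinForms app, then I think paying the cost of the exception might be better than spending the development and test effort to ensure that your shortcut code doesn't generate false positive or false negative results.

It's not clear where you're getting your XML from (file, stream, textbox or somewhere else), but bear in mind that whitespace, comments, byte order marks and other things could get in the way of simple rules such as "it must start with a <".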
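To make the last point concrete, here is a small sketch of a prefix check that tolerates a leading byte order mark and whitespace. It is only an illustration under stated assumptions: the class name `XmlPrefixCheck` and the method `StartsWithAngleBracket` are hypothetical, and the check is a pre-filter, not a validator.

```csharp
using System;

public static class XmlPrefixCheck
{
    // Hypothetical pre-filter: skips an optional BOM (U+FEFF) and leading
    // whitespace, then reports whether the next character is '<'.
    public static bool StartsWithAngleBracket(string data)
    {
        if (string.IsNullOrEmpty(data))
            return false;

        int i = 0;
        if (data[0] == '\uFEFF')
            i = 1;                                   // skip a byte order mark

        while (i < data.Length && char.IsWhiteSpace(data[i]))
            i++;                                     // skip leading whitespace

        return i < data.Length && data[i] == '<';
    }
}
```

Note that this returns true for input such as `<tag>unclosed`, which is not well-formed XML, so a real parse (and its possible exception) is still needed afterwards. The check only shortens the failure path for text that clearly is not XML.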

【Comments】:

【Answer 6】:

Why is regex expensive? Doesn't it kill two birds with one stone (match and parse)?

Here is a simple example that parses all elements; it's even easier if there is just one element!

Regex regex = new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");
MatchCollection matches = regex.Matches(data);
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    string name = groups["tag"].Value;
    string value = groups["text"].Value;
    ...
}
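For experimenting with this pattern, the loop can be wrapped in a small helper that returns the captured pairs. This is a sketch only; the class name `RegexElementExtractor` and the method `Extract` are hypothetical, and the regex is held in a static field so it is created once.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexElementExtractor
{
    // Same pattern as in the snippet above, created once and reused.
    static readonly Regex ElementRegex =
        new Regex(@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>");

    // Hypothetical helper: returns one (tag, text) pair per regex match.
    public static List<KeyValuePair<string, string>> Extract(string data)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match match in ElementRegex.Matches(data))
        {
            pairs.Add(new KeyValuePair<string, string>(
                match.Groups["tag"].Value,
                match.Groups["text"].Value));
        }
        return pairs;
    }
}
```

For `<a>1</a><b>2</b>` this yields the pairs (a, 1) and (b, 2). For nested input such as `<a><b>x</b></a>` it yields a single pair whose text is the raw markup `<b>x</b>`, whereas `XElement.Parse(...).Value` would give `x`, so the two approaches are not interchangeable.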

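【Comments】:

Note that this does not validate that the XML is valid (it could be invalid in the text section), nor does it make sure that all tags are closed (which would make it invalid XML).

【Answer 7】:

As @JustEngland noted in the comments, exceptions are not that expensive. A debugger intercepting them may take time, but normally they perform well and are good practice. See How expensive are exceptions in C#?.

A better approach would be to roll your own TryParse-style function:

[System.Diagnostics.DebuggerNonUserCode]
static class MyXElement
{
    public static bool TryParse(string data, out XElement result)
    {
        try
        {
            result = XElement.Parse(data);
            return true;
        }
        catch (System.Xml.XmlException)
        {
            result = default(XElement);
            return false;
        }
    }
}

The DebuggerNonUserCode attribute makes the debugger skip over the caught exception, which streamlines your debugging experience.

Use it like this:

    static void Main()
    {
        var addressList = "line one~line two~line three~postcode";

        var address = new XElement("Address");
        var addressHtml = "<span>" + addressList.Replace("~", "<br />") + "</span>";

        XElement content;
        if (MyXElement.TryParse(addressHtml, out content))
            address.ReplaceAll(content);
        else
            address.SetValue(addressHtml);

        Console.WriteLine(address.ToString());
        Console.ReadKey();
    }

I would have preferred to create an extension method for TryParse, but you can't create a static method that is called on a type rather than on an instance.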
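One possible workaround, sketched here as an assumption rather than as the answer's own approach, is to hang the extension method on `string` instead of `XElement`, so it is called on the input text. The names `XElementStringExtensions` and `TryParseXElement` are hypothetical.

```csharp
using System.Diagnostics;
using System.Xml;
using System.Xml.Linq;

[DebuggerNonUserCode]
public static class XElementStringExtensions
{
    // Hypothetical extension on string: "text".TryParseXElement(out element)
    public static bool TryParseXElement(this string data, out XElement result)
    {
        result = null;
        if (data == null)
            return false;   // null input would not raise XmlException, so reject it up front

        try
        {
            result = XElement.Parse(data);
            return true;
        }
        catch (XmlException)
        {
            return false;   // not well-formed XML
        }
    }
}
```

A call then reads `if (addressHtml.TryParseXElement(out content)) ...`. The trade-off is that the method shows up on every string, which some codebases would consider too broad.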

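【Comments】:

It would be nicer to have a TryParse in the framework, but this is a useful trick for cleaning up the debugging experience.

【Answer 8】:

Hint: all valid XML must start with "<?xml".

You may have to deal with character set differences, but checking plain ASCII, UTF-8 and Unicode will cover 99.5% of the XML out there.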
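A declaration check along these lines could be sketched as follows. The names `DeclarationCheck`, `HasXmlDeclaration` and `ParsesAsElement` are hypothetical. One thing to keep in mind when using it: the benchmark inputs elsewhere in this thread, such as `<tag>base</tag>`, carry no declaration and are still accepted by `XElement.Parse`, so this test rejects fragments that the parser would accept.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class DeclarationCheck
{
    // Hypothetical check: does the text begin with an XML declaration?
    public static bool HasXmlDeclaration(string data)
    {
        return data != null && data.StartsWith("<?xml", StringComparison.Ordinal);
    }

    // For comparison: does the parser accept the text as an element?
    public static bool ParsesAsElement(string data)
    {
        try
        {
            XElement.Parse(data);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Whether the declaration test is appropriate therefore depends on whether the input is expected to be full documents or bare fragments.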

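【Comments】:

What Colin said, plus the OP is most likely dealing with XML fragments.

【Answer 9】:

If you use it in a loop where most of the XML is not valid, the way you suggest will be expensive. If the XML is valid, your code will behave as if there were no exception handling at all. So if your XML is valid in most cases, or you are not using it in a loop, your code will work fine.

【Comments】:

【Answer 10】:

If you want to know whether it's valid, why not use the built-in .NetFX object rather than writing one from scratch?

Hope this helps,

Bill

【Comments】:

【Answer 11】:

A variation on Colin Burnett's technique: you could do a simple regex at the beginning to see whether the text starts with a tag, then try to parse it. Probably more than 99% of the strings you'll deal with that start with a valid element are XML. That way you can skip the regex processing for full-blown valid XML and also skip the exception-based processing in almost every case.

Something like ^<[^>]+> would probably do the trick.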
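Put together, the idea might look like the sketch below. The class name `TagPrefixFilter` and the method `ValueOrText` are hypothetical; the regex is the one suggested above, held in a static field so it is compiled once.

```csharp
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class TagPrefixFilter
{
    // The suggested pattern: the text must begin with something tag-shaped.
    static readonly Regex StartsWithTag =
        new Regex(@"^<[^>]+>", RegexOptions.Compiled);

    // Returns the element value when the text parses as XML, otherwise the text itself.
    public static string ValueOrText(string data)
    {
        if (string.IsNullOrEmpty(data) || !StartsWithTag.IsMatch(data))
            return data;    // does not start with a tag: skip the parser entirely

        try
        {
            return XElement.Parse(data).Value;
        }
        catch (XmlException)
        {
            return data;    // started with a tag but was not well-formed
        }
    }
}
```

Because the pattern is anchored with `^` and does not allow leading whitespace, input such as ` <tag>base</tag> ` is returned unchanged by this sketch; the pattern would need a whitespace prefix such as `^\s*<[^>]+>` if that case matters.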

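【Comments】:

【Answer 12】:

I'm not sure whether your requirement takes the file format into account, and since this question was asked a long time ago and I happened to be searching for something similar, I'd like to share what worked for me, in case it helps anyone who comes here :)

We can use Path.GetExtension(filePath), check whether the extension is XML, and then do whatever is needed with it.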
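A minimal sketch of that check, assuming the data comes from a file path. The class name `XmlFileCheck` and the method `HasXmlExtension` are hypothetical; `Path.GetExtension` is the framework call mentioned above.

```csharp
using System;
using System.IO;

public static class XmlFileCheck
{
    // Hypothetical helper: true when the path ends in ".xml" (case-insensitive).
    public static bool HasXmlExtension(string filePath)
    {
        // GetExtension returns the extension including the dot,
        // an empty string when there is none, and null for a null path.
        string extension = Path.GetExtension(filePath);
        return string.Equals(extension, ".xml", StringComparison.OrdinalIgnoreCase);
    }
}
```

This only inspects the file name, so it says nothing about whether the contents are well-formed; it applies when the input is a file and the extension can be trusted.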

【Comments】:

【Answer 13】:

How about throwing your string or object into a new XDocument or XElement? Everything resolves using ToString().

【Comments】:
