What is the best way to parse HTML in C#? [closed]

Posted 2023-02-15

【Title】: What is the best way to parse HTML in C#? [closed]
【Posted】: 2010-09-08 12:55:54
【Question】:

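I'm looking for a library or method for parsing HTML files that offers more HTML-specific features than a generic XML parsing library.

【Question discussion】:

【Answer 1】:

You can use the HTML DTD together with a generic XML parsing library.

【Discussion】:

Very few real-world HTML pages will survive an XML parsing library.

【Answer 2】:

The problem with parsing HTML is that it is not an exact science. If you were parsing XHTML, things would be much easier (as you mention, you could use a generic XML parser). Because HTML is not necessarily well-formed XML, you will run into many problems when parsing it. It almost has to be handled site by site.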
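To make the well-formedness point concrete, here is a minimal sketch using only the .NET base class library. The class and method names (`XmlStrictnessDemo`, `IsWellFormedXml`) are made up for illustration. It feeds the same content to `XDocument.Parse` twice: once as ordinary HTML with optional end tags omitted, and once as XHTML with every element closed.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlStrictnessDemo
{
    // Returns true when the markup parses as well-formed XML.
    // (Hypothetical helper name, used only for this illustration.)
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // The XML parser rejects unclosed or mismatched tags.
            return false;
        }
    }

    public static void Main()
    {
        // Legal HTML 4/5: </p> is optional and <br> is a void element.
        string html = "<html><body><p>first<p>second<br></body></html>";

        // The same content as XHTML, with every element explicitly closed.
        string xhtml = "<html><body><p>first</p><p>second<br /></p></body></html>";

        Console.WriteLine(IsWellFormedXml(html));   // prints False
        Console.WriteLine(IsWellFormedXml(xhtml));  // prints True
    }
}
```

The first string is the kind of markup browsers accept without complaint, which is why a generic XML parser alone is rarely enough for pages found in the wild.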

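【Discussion】:

Isn't well-formed HTML, as specified by the W3C, just as exact as XHTML?

It should be, but people don't write it that way.

@J. Pablo It's not nearly as easy, though (hence the reason for a library :p)... for example, <p> tags do not need to be explicitly closed under HTML4/5. Yikes!

【Answer 3】:

You can use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.

Another option is to use the built-in engine, mshtml: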

using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);

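This lets you use JavaScript-like functions such as getElementById().

【Discussion】:

Call me crazy, but I can't figure out how to use mshtml. Do you have any good links?

@Alex You need to include Microsoft.mshtml. More information can be found here: msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx

I have a blog post about Tidy.Net and ManagedTidy; both can parse and validate (X)HTML files. If you don't need to validate anything, I would go with htmlagilitypack. jphellemons.nl/post/…

【Answer 4】:

I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck with this simple library:

SgmlReader

【Discussion】:

【Answer 5】:

You can do a lot without going crazy over third-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can call "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interact with the browser (to simulate button clicks, for example), you can use a little reflection (IMO a lesser evil than interop) to do it: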

var wb = new WebBrowser();

... tell the browser to navigate (tangential to this question). Then, in the Document_Completed event handler, you can simulate a click like this.

var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);

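You can do similar reflection work to submit forms, and so on.

Enjoy.

【Discussion】:

【Answer 6】:

In the past I have used ZetaHtmlTidy to load random websites and then reach various parts of the content with XPath (for example /html/body//p[@class='textblock']). It worked well, but it had problems with a few unusual sites, so I don't know whether it is the absolute best solution.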
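The XPath step of that workflow can be sketched with the base class library alone, assuming the tidy step has already produced well-formed XHTML. The class and method names below are hypothetical, and the sample markup deliberately has no default `xmlns` declaration; real XHTML usually declares one, in which case the unprefixed XPath would need a namespace manager to match anything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XhtmlXPathDemo
{
    // Returns the text of every <p class="textblock"> under /html/body.
    // Assumes the input is already well-formed and namespace-free.
    public static List<string> TextBlocks(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements("/html/body//p[@class='textblock']")
                  .Select(p => p.Value)
                  .ToList();
    }

    public static void Main()
    {
        string xhtml =
            "<html><body>" +
            "<div><p class=\"textblock\">Hello</p><p class=\"other\">skip</p></div>" +
            "<p class=\"textblock\">World</p>" +
            "</body></html>";

        // Prints "Hello" and then "World".
        foreach (string text in TextBlocks(xhtml))
            Console.WriteLine(text);
    }
}
```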

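【Discussion】:

【Answer 7】:

Html Agility Pack

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).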
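As a rough illustration of that object model, the sketch below collects link targets from deliberately sloppy markup. It assumes the HtmlAgilityPack NuGet package is referenced; `AgilityPackDemo` and `ExtractLinks` are names invented for this example, while `HtmlDocument.LoadHtml`, `DocumentNode.SelectNodes`, and `GetAttributeValue` are the library's own members.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackDemo
{
    // Collects the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", string.Empty));

        return links;
    }

    public static void Main()
    {
        // Unclosed <p>, bare <br>, mixed quote styles, missing </html>.
        string html = "<html><body><p>unclosed paragraph<a href=\"/one\">one</a>" +
                      "<br><a href='/two'>two</a></body>";

        foreach (string href in ExtractLinks(html))
            Console.WriteLine(href);
    }
}
```

The null check matters in practice: iterating the result of `SelectNodes` directly fails on pages that contain no matching element.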

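【Discussion】:

【Answer 8】:

The Html Agility Pack has been mentioned already. If you are after speed, you might also want to look at the Majestic-12 HTML parser. It is rather clunky to work with, but it parses very fast.

【Discussion】:

【Answer 9】:

I wrote some code that provides "LINQ to HTML" functionality, and I thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 results and produces LINQ to XML elements. At that point you can use all of the LINQ to XML tools against the HTML. For example: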

        IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);

        foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {

            if (anchorTag.Attribute("href") == null)
                continue;

            Console.WriteLine(anchorTag.Attribute("href").Value);
        }

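I wanted to use Majestic-12 because I know it has a lot of built-in knowledge about HTML found in the wild. What I found, though, is that mapping the Majestic-12 results to something LINQ will accept as XML takes additional work. The code I'm including does a lot of that cleanup, but as you use it you will find pages that are rejected, and you will need to fix the code to handle them. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling HTML gracefully is sometimes far from trivial...

So now that expectations are realistically low, here is the code :)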

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Majestic12ToXml {
public class Majestic12ToXml {

    static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {

        HTMLparser parser = OpenParser();
        parser.Init(htmlAsBytes);

        XElement currentNode = new XElement("document");

        HTMLchunk m12chunk = null;

        int xmlnsAttributeIndex = 0;
        string originalHtml = "";

        while ((m12chunk = parser.ParseNext()) != null) {

            try {

                Debug.Assert(!m12chunk.bHashMode);  // popular default for Majestic-12 setting

                XNode newNode = null;
                XElement newNodesParent = null;

                switch (m12chunk.oType) {
                    case HTMLchunkType.OpenTag:

                        // Tags are added as a child to the current tag, 
                        // except when the new tag implies the closure of 
                        // some number of ancestor tags.

                        newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                        if (newNode != null) {
                            currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                            newNodesParent = currentNode;

                            newNodesParent.Add(newNode);

                            currentNode = newNode as XElement;
                        }

                        break;

                    case HTMLchunkType.CloseTag:

                        if (m12chunk.bEndClosure) {

                            newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);

                            if (newNode != null) {
                                currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);

                                newNodesParent = currentNode;
                                newNodesParent.Add(newNode);
                            }
                        }
                        else {
                            XElement nodeToClose = currentNode;

                            string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

                            // Walk up the tree to the nearest open element with this name.
                            while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                nodeToClose = nodeToClose.Parent;

                            if (nodeToClose != null)
                                currentNode = nodeToClose.Parent;

                            Debug.Assert(currentNode != null);
                        }

                        break;

                    case HTMLchunkType.Script:

                        newNode = new XElement("script", "REMOVED");
                        newNodesParent = currentNode;
                        newNodesParent.Add(newNode);
                        break;

                    case HTMLchunkType.Comment:

                        newNodesParent = currentNode;

                        if (m12chunk.sTag == "!--")
                            newNode = new XComment(m12chunk.oHTML);
                        else if (m12chunk.sTag == "![CDATA[")
                            newNode = new XCData(m12chunk.oHTML);
                        else
                            throw new Exception("Unrecognized comment sTag");

                        newNodesParent.Add(newNode);

                        break;

                    case HTMLchunkType.Text:

                        currentNode.Add(m12chunk.oHTML);
                        break;

                    default:
                        break;
                }
            }
            catch (Exception e) {
                var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);

                // the original html is copied for tracing/debugging purposes
                originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                    .Take(m12chunk.iChunkLength)
                    .Select(B => (char)B).ToArray()); 

                wrappedE.Data.Add("source", originalHtml);

                throw wrappedE;
            }
        }

        // Climb back to the synthetic "document" root and return its children.
        while (currentNode.Parent != null)
            currentNode = currentNode.Parent;

        return currentNode.Nodes();
    }

    static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {

        string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement discoveredParent = null;

        // Get a list of all ancestors
        List<XElement> ancestors = new List<XElement>();
        XElement ancestor = nextPotentialParent;
        while (ancestor != null) {
            ancestors.Add(ancestor);
            ancestor = ancestor.Parent;
        }

        // Check if the new tag implies a previous tag was closed.
        if ("form" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("td" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => "tr" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("tr" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => !("table" == XE.Name
                                    || "thead" == XE.Name
                                    || "tbody" == XE.Name
                                    || "tfoot" == XE.Name))
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }
        else if ("thead" == m12chunkCleanedTag
                  || "tbody" == m12chunkCleanedTag
                  || "tfoot" == m12chunkCleanedTag) {

            discoveredParent = ancestors
                .TakeWhile(XE => "table" != XE.Name)
                .Where(XE => m12chunkCleanedTag == XE.Name)
                .Take(1)
                .Select(XE => XE.Parent)
                .FirstOrDefault();
        }

        return discoveredParent ?? nextPotentialParent;
    }

    static string CleanupTagName(string originalName, string originalHtml) {

        string tagName = originalName;

        tagName = tagName.TrimStart(new char[] { '?' });  // for nodes <?xml >

        // Drop any namespace prefix, keeping only the local name.
        if (tagName.Contains(':'))
            tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);

        return tagName;
    }

    static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);

    static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {

        result = null;
        string attributeName = originalName;

        if (string.IsNullOrEmpty(originalName))
            return false;

        if (_startsAsNumeric.IsMatch(originalName))
            return false;

        //
        // transform xmlns attributes so they don't actually create any XML namespaces
        //
        if (attributeName.ToLower().Equals("xmlns")) {

            attributeName = "xmlns_" + xmlnsIndex.ToString();
            xmlnsIndex++;
        }
        else {
            if (attributeName.ToLower().StartsWith("xmlns:")) {
                attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
            }

            //
            // trim trailing \"
            //
            attributeName = attributeName.TrimEnd(new char[] { '\"' });

            attributeName = attributeName.Replace(":", "_");
        }

        result = attributeName;

        return true;
    }

    static Regex _weirdTag = new Regex(@"^<!\[.*\]>$");       // matches "<![if !supportEmptyParas]>"
    static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
    static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"

    static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {

        if (string.IsNullOrEmpty(m12chunk.sTag)) {

            if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                return new XElement("doctype");

            if (_weirdTag.IsMatch(originalHtml))
                return new XElement("REMOVED_weirdBlockParenthesisTag");

            if (_aspnetPrecompiled.IsMatch(originalHtml))
                return new XElement("REMOVED_ASPNET_PrecompiledDirective");

            if (_shortHtmlComment.IsMatch(originalHtml))
                return new XElement("REMOVED_ShortHtmlComment");

            // Nodes like "<br <br>" will end up with a m12chunk.sTag==""...  We discard these nodes.
            return null;
        }

        string tagName = CleanupTagName(m12chunk.sTag, originalHtml);

        XElement result = new XElement(tagName);

        List<XAttribute> attributes = new List<XAttribute>();

        for (int i = 0; i < m12chunk.iParams; i++) {

            if (m12chunk.sParams[i] == "<!--") {

                // an HTML comment was embedded within a tag.  This comment and its contents
                // will be interpreted as attributes by Majestic-12... skip these attributes
                for (; i < m12chunk.iParams; i++) {

                    // stop at the parameter that ends the embedded comment
                    if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                        break;
                }

                continue;
            }

            if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                continue;

            string attributeName = m12chunk.sParams[i];

            if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                continue;

            attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
        }

        // If attributes are duplicated with different values, we complain.
        // If attributes are duplicated with the same value, we remove all but 1.
        var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);

        foreach (var duplicatedAttribute in duplicatedAttributes) {

            if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                throw new Exception("Attribute value was given different values");

            attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
            attributes.Add(duplicatedAttribute.First());
        }

        result.Add(attributes);

        return result;
    }

    static HTMLparser OpenParser() {
        HTMLparser oP = new HTMLparser();

        // The code+comments in this function are from the Majestic-12 sample documentation.

        // ...

        // This is optional, but if you want high performance then you may
        // want to set chunk hash mode to FALSE. This would result in tag params
        // being added to string arrays in HTMLchunk object called sParams and sValues, with number
        // of actual params being in iParams. See code below for details.
        //
        // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
        oP.SetChunkHashMode(false);

        // if you set this to true then original parsed HTML for given chunk will be kept - 
        // this will reduce performance somewhat, but may be desirable in some cases where
        // reconstruction of HTML may be necessary
        oP.bKeepRawHTML = false;

        // if set to true (it is false by default), then entities will be decoded: this is essential
        // if you want to get strings that contain final representation of the data in HTML, however
        // you should be aware that if you want to use such strings into output HTML string then you will
        // need to do Entity encoding or same string may fail later
        oP.bDecodeEntities = true;

        // we have option to keep most entities as is - only replace stuff like &nbsp; 
        // this is called Mini Entities mode - it is handy when HTML will need
        // to be re-created after it was parsed, though in this case really
        // entities should not be parsed at all
        oP.bDecodeMiniEntities = true;

        if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
            oP.InitMiniEntities();

        // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
        // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
        // this only works if auto extraction is enabled
        oP.bAutoExtractBetweenTagsOnly = true;

        // if true then comments will be extracted automatically
        oP.bAutoKeepComments = true;

        // if true then scripts will be extracted automatically: 
        oP.bAutoKeepScripts = true;

        // if this option is true then whitespace before start of tag will be compressed to single
        // space character in string: " ", if false then full whitespace before tag will be returned (slower)
        // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
        // a waste of CPU cycles
        oP.bCompressWhiteSpaceBeforeTag = true;

        // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
        // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
        // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
        // or open
        oP.bAutoMarkClosedTagsWithParamsAsOpen = false;

        return oP;
    }
}
}
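【Discussion】:

By the way, HtmlAgilityPack has worked well for me in the past; I just prefer LINQ.

How is the performance once you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack?

I never did a performance comparison. These days I use HtmlAgilityPack, which is a lot less hassle. Unfortunately, the code above has many special cases that I never bothered to write tests for, so I can't really maintain it.

【Answer 10】:

If you need to see the effect of JS on the page [and you are prepared to start a browser], use WatiN.

【Discussion】:

【Answer 11】:

I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It is based on the HTML Agility Pack. It is currently in beta and supports only a subset of CSS selectors, but using CSS selectors instead of nasty XPath is very cool and refreshing.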

http://code.google.com/p/fizzler/

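【Discussion】:

Thanks, this looks interesting! I'm surprised that, with jQuery being so popular, it is so hard to find a C# project inspired by it. Now if only I could find something where document manipulation and more advanced traversal were also part of the package... :)

I just used this today and have to say that, if you know jQuery, it is very easy to use.

【Answer 12】:

Depending on your needs, you might go for a more feature-rich library. I tried most or all of the solutions suggested here, but the one that stood out was the Html Agility Pack. It is a very forgiving and flexible parser.

【Discussion】:

【Answer 13】:

Try this script.

http://www.biterscripting.com/SS_URLs.html

When I run it against this URL,

script SS_URLs.txt URL("http://***.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")

it lists all the links on this thread's page.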

http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.

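You can modify that script to check for images, variables, and so on.

【Discussion】:

【Answer 14】:

I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.

You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.

There is also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.

【Discussion】:

【Answer 15】:

No third-party library: a WebBrowser class solution that can run in console and ASP.NET applications.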

using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;

class ParseHTML
{
    public ParseHTML() { }
    private string ReturnString;

    public string doParsing(string html)
    {
        // The WebBrowser control must be created on a single-threaded apartment (STA) thread.
        Thread t = new Thread(TParseMain);
        t.SetApartmentState(ApartmentState.STA);
        t.Start((object)html);
        t.Join();
        return ReturnString;
    }

    private void TParseMain(object html)
    {
        WebBrowser wbc = new WebBrowser();
        wbc.DocumentText = "feces of a dummy";        //;magic words        
        HtmlDocument doc = wbc.Document.OpenNew(true);
        doc.Write((string)html);
        this.ReturnString = doc.Body.InnerHtml + " do here something";
        return;
    }
}

Usage:

string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);

【Discussion】:
