Using XPath and WebBrowser Control to select multiple nodes

Posted: 2014-10-10 21:34:36

Question:

In a C# WinForms sample application, I use the WebBrowser control together with the javascript-xpath library to select a single node and change its innerHTML with the following code:

    private void MainForm_Load(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = @"
            <html>
            <head>
                <script src=""http://svn.coderepos.org/share/lang/javascript/javascript-xpath/trunk/release/javascript-xpath-latest-cmp.js""></script>
            </head>
            <body>
            <img alt=""0764547763 Product Details"" 
                src=""http://ecx.images-amazon.com/images/I/51AK1MRIi7L._AA160_.jpg"">
            <hr/>
            <h2>Product Details</h2>
            <ul>
            <li><b>Paperback:</b> 648 pages</li>
            <li><b>Publisher:</b> Wiley; Unlimited Edition edition (October 15, 2001)</li>
            <li><b>Language:</b> English</li>
            <li><b>ISBN-10:</b> 0764547763</li>
            </ul>
            </body>
            </html>
        ";
    }

    private void cmdTest_Click(object sender, EventArgs e)
    {
        string xPath = "//li";
        // FIRST_ORDERED_NODE_TYPE returns a single node through singleNodeValue
        string code = string.Format("document.evaluate('{0}', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;", xPath);
        var li = webBrowser1.Document.InvokeScript("eval", new object[] { code }) as mshtml.IHTMLElement;

        li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:green;'>{0}</span>", li.innerText);
    }

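This code works and produces the following result: [screenshot: the first list item rendered in green uppercase text]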

Now I want to use the same technique to select multiple <li> nodes under the <ul> node, so I wrote:

        xPath = "//ul//*";
        code = string.Format("document.evaluate('{0}', document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);", xPath);
        var allLI = webBrowser1.Document.InvokeScript("eval", new object[] { code }) as mshtml.IHTMLElementCollection;

But the value returned in the allLI variable is null.

If I instead write

        xPath = "//ul//*";
        code = string.Format("document.evaluate('{0}', document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);", xPath);
        var allLI = webBrowser1.Document.InvokeScript("eval", new object[] { code });

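then the returned allLI variable is not null and its type is a COM object, but it is not clear to me which more specific type this COM object can be cast to.

Is there a way to select multiple nodes using the technique shown here?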
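A note on why the cast comes back null: with ORDERED_NODE_ITERATOR_TYPE, document.evaluate returns an XPathResult that hands out nodes one at a time through iterateNext(). It is not an element collection, which is presumably why `as mshtml.IHTMLElementCollection` yields null. The JavaScript sketch below drains such an iterator into an array. `collectNodes` and `fakeResult` are hypothetical names introduced here, and the fake iterator only stands in for a real XPathResult so that the snippet runs outside a browser.

```javascript
// Drain an XPathResult-style iterator into a plain array.
// `result` is anything exposing iterateNext(), which returns null when exhausted.
function collectNodes(result) {
  var nodes = [];
  var node = result.iterateNext();
  while (node) {
    nodes.push(node);
    node = result.iterateNext();
  }
  return nodes;
}

// Hypothetical stand-in for an XPathResult, so the sketch runs without a DOM.
function fakeResult(items) {
  var i = 0;
  return {
    iterateNext: function () {
      return i < items.length ? items[i++] : null;
    }
  };
}

var nodes = collectNodes(fakeResult(["li-1", "li-2", "li-3"]));
console.log(nodes.length); // 3
```

Inside the page, the same function would be fed the real result of document.evaluate(...) instead of the fake one.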

[Edited]

Corrected the XPath in the snippets above from

    xPath = "ul//*";

to

    xPath = "//ul//*";

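[Addition]

I added two JavaScript functions to the sample HTML (the quotes are doubled because the HTML lives inside a C# verbatim string):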

<script type=""text/javascript"">
    // Concatenates the innerText of every node matched by the XPath query
    function GetElementsText (XPath) {
            var xPathRes = document.evaluate ( XPath, document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
            var nextElement = xPathRes.iterateNext ();
            var text = """";
            while (nextElement) {
               text += nextElement.innerText;
               nextElement = xPathRes.iterateNext ();
            }
        return text;
    }

    // Collects every node matched by the XPath query under 1-based keys
    function GetElements (XPath) {
            var xPathRes = document.evaluate ( XPath, document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
            var nextElement = xPathRes.iterateNext ();
            var elements = new Object();
            var elementIndex = 1;
            while (nextElement) {
               elements[elementIndex++] = nextElement;
               nextElement = xPathRes.iterateNext ();
            }
        return elements;
    }
</script>

Now, when I run the following line of C# code in the cmdTest_Click method:

var text = webBrowser1.Document.InvokeScript("eval", new object[] { "GetElementsText('//ul')" });

I get the text of all the li elements:

"Paperback: 648 pages \r\nPublisher: Wiley; Unlimited Edition edition (October 15, 2001) \r\nLanguage: English \r\nISBN-10: 0764547763 "

And when I run the following line of C# code in the cmdTest_Click method:

var elements = webBrowser1.Document.InvokeScript("eval", new object[] { "GetElements('//ul')" });

I get a COM object that I cannot cast to IEnumerable<mshtml.IHTMLElement>.

Is there a way, in C# code, to work with the JavaScript collection of HTML nodes returned by

var elements = webBrowser1.Document.InvokeScript("eval", new object[] { "GetElements('//ul')" });

?

Comments:

Does this help? ***.com/a/20783420/1768303

@Noseratio: I would like to avoid the HTML Agility Pack. I want to manipulate the WebBrowser control's DOM content directly through mshtml.IHTMLElement and/or mshtml.IHTMLElementCollection.

Answer 1:

I found a solution. Here is the code:

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Windows.Forms;

namespace myTest.WinFormsApp
{
public partial class MainForm : Form
{
    public MainForm()
    {
        InitializeComponent();
    }

    private void MainForm_Load(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = @"
            <html>
            <body>
            <img alt=""0764547763 Product Details"" 
                src=""http://ecx.images-amazon.com/images/I/51AK1MRIi7L._AA160_.jpg"">
            <hr/>
            <h2>Product Details</h2>
            <ul>
            <li><b>Paperback:</b> 648 pages</li>
            <li><b>Publisher:</b> Wiley; Unlimited Edition edition (October 15, 2001)</li>
            <li><b>Language:</b> English</li>
            <li><b>ISBN-10:</b> 0764547763</li>
            </ul>
            </body>
            </html>
        ";
    }

    private void cmdTest_Click(object sender, EventArgs e)
    {
        var processor = new WebBrowserControlXPathQueriesProcessor(webBrowser1);

        // change attributes of the first element of the list
        {
            var li = processor.GetHtmlElement("//li");
            li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:green;'>{0}</span>", li.innerText);
        }

        // change attributes of the second and subsequent elements of the list
        var list = processor.GetHtmlElements("//ul//li");
        int index = 1;
        foreach (var li in list)
        {
            if (index++ == 1) continue;
            li.innerHTML = string.Format("<span style='text-transform: uppercase;font-family:verdana;color:blue;'>{0}</span>", li.innerText);
        }
    }

    /// <summary>
    /// Enables IE WebBrowser control to evaluate XPath queries 
    /// by injecting http://svn.coderepos.org/share/lang/javascript/javascript-xpath/trunk/release/javascript-xpath-latest-cmp.js
    /// and to return XPath queries results to the calling C# code as strongly typed
    /// mshtml.IHTMLElement and IEnumerable<mshtml.IHTMLElement>
    /// </summary>
    public class WebBrowserControlXPathQueriesProcessor
    {
        private System.Windows.Forms.WebBrowser _webBrowser;
        public WebBrowserControlXPathQueriesProcessor(System.Windows.Forms.WebBrowser webBrowser)
        {
            _webBrowser = webBrowser;
            injectScripts();
        }

        private void injectScripts()
        {
            // Thanks to: http://***.com/questions/7998996/how-to-inject-javascript-in-webbrowser-control

            // First script element: loads the javascript-xpath library
            HtmlElement head = _webBrowser.Document.GetElementsByTagName("head")[0];
            HtmlElement scriptEl = _webBrowser.Document.CreateElement("script");
            mshtml.IHTMLScriptElement element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
            element.src = "http://svn.coderepos.org/share/lang/javascript/javascript-xpath/trunk/release/javascript-xpath-latest-cmp.js";
            head.AppendChild(scriptEl);

            // Second script element: helper that returns matches as an
            // array-like object with 1-based keys and a length property
            string javaScriptText = @"
                    function GetElements (XPath) {
                            var xPathRes = document.evaluate ( XPath, document, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
                            var nextElement = xPathRes.iterateNext ();
                            var elements = new Object();
                            var elementIndex = 1;
                            while (nextElement) {
                                elements[elementIndex++] = nextElement;
                                nextElement = xPathRes.iterateNext ();
                            }
                        elements.length = elementIndex - 1;
                        return elements;
                    }
                   ";
            scriptEl = _webBrowser.Document.CreateElement("script");
            element = (mshtml.IHTMLScriptElement)scriptEl.DomElement;
            element.text = javaScriptText;
            head.AppendChild(scriptEl);
        }

        /// <summary>
        /// Gets Html element's mshtml.IHTMLElement object instance using XPath query
        /// </summary>
        public mshtml.IHTMLElement GetHtmlElement(string xPathQuery)
        {
            string code = string.Format("document.evaluate('{0}', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;", xPathQuery);
            return _webBrowser.Document.InvokeScript("eval", new object[] { code }) as mshtml.IHTMLElement;
        }

        /// <summary>
        /// Gets Html elements' IEnumerable<mshtml.IHTMLElement> object instance using XPath query
        /// </summary>
        public IEnumerable<mshtml.IHTMLElement> GetHtmlElements(string xPathQuery)
        {
            // Thanks to: http://***.com/questions/5278275/accessing-properties-of-javascript-objects-using-type-dynamic-in-c-sharp-4
            var comObject = _webBrowser.Document.InvokeScript("eval", new object[] { string.Format("GetElements('{0}')", xPathQuery) });
            Type type = comObject.GetType();
            // Read the JavaScript object's properties by name through late binding
            int length = (int)type.InvokeMember("length", BindingFlags.GetProperty, null, comObject, null);

            for (int i = 1; i <= length; i++)
            {
                yield return type.InvokeMember(i.ToString(), BindingFlags.GetProperty, null, comObject, null) as mshtml.IHTMLElement;
            }
        }
    }
}
}

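Running the code gives the following result: [screenshot: the first list item in green uppercase text, the remaining items in blue uppercase text]

I have embedded credits as comments in my code. If you notice any that I missed, please point them out in the comments and I will add them.

If you know a better solution (shorter or more efficient code), please comment and/or post your own answer.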
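On the request for shorter code, one option might be to ask for XPathResult.ORDERED_NODE_SNAPSHOT_TYPE instead of an iterator. In the standard DOM XPath API a snapshot result exposes snapshotLength and snapshotItem(i), so the helper no longer needs its own counter. The sketch below has not been tested against javascript-xpath inside the WebBrowser control. `GetElementsSnapshot`, `stubDoc`, and `stubTypes` are hypothetical names, and the stubs exist only so that the snippet runs outside a browser.

```javascript
// Build the same 1-based, array-like object that GetElements returns,
// but from a snapshot result (snapshotLength / snapshotItem).
function GetElementsSnapshot(xPath, doc, resultTypes) {
  var res = doc.evaluate(xPath, doc, null, resultTypes.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var elements = {};
  for (var i = 0; i < res.snapshotLength; i++) {
    elements[i + 1] = res.snapshotItem(i); // keep the 1-based keys the C# loop expects
  }
  elements.length = res.snapshotLength;
  return elements;
}

// Stubs standing in for `document` and `XPathResult` (assumptions for this demo only).
var stubTypes = { ORDERED_NODE_SNAPSHOT_TYPE: 7 };
var stubDoc = {
  evaluate: function (xPath, contextNode, resolver, type, result) {
    var items = ["li-1", "li-2", "li-3", "li-4"];
    return {
      snapshotLength: items.length,
      snapshotItem: function (i) { return items[i]; }
    };
  }
};

var found = GetElementsSnapshot("//ul//li", stubDoc, stubTypes);
console.log(found.length, found[1], found[4]); // 4 li-1 li-4
```

In the page itself the call would be GetElementsSnapshot('//ul//li', document, XPathResult). A snapshot is also unaffected by later DOM changes, whereas an iterator result becomes invalid once the document is modified.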

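Comments:

This JS that fills the array with elements does not work for the google.com/… site: for the XPath //div[@class='_pl _ki']/descendant-or-self::text()[1] it returns truncated business names, for example only Broadway instead of Broadway Chiropractic & Wellness.

For one specific example, this solution returns null while the same JavaScript executed in Chrome returns the correct element.

Also, you may want to wrap the XPath in double quotes, since an XPath is more likely to contain single quotes than double quotes: string code = string.Format("document.evaluate(\"{0}\", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;", xPathQuery); instead of document.evaluate('{0}' ...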
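The quoting concern in the last comment can be handled more generally by escaping the XPath before splicing it into the script text, rather than betting on one kind of quote. Here is a minimal JavaScript sketch of the idea, assuming JSON.stringify is available to produce a valid string literal. `buildEvaluateCode` is a hypothetical helper; in the C# code the equivalent escaping would have to happen before the string is passed to InvokeScript.

```javascript
// Build the script text with the XPath embedded as a properly escaped
// JavaScript string literal (JSON.stringify adds the quotes and escapes).
function buildEvaluateCode(xPath) {
  return "document.evaluate(" + JSON.stringify(xPath) +
    ", document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;";
}

// An XPath containing both single and double quotes.
var sampleXPath = "//div[@class='_pl _ki' and @title=\"x\"]";
var code = buildEvaluateCode(sampleXPath);
console.log(code);
```

With this approach an XPath that mixes both quote styles still yields script text that parses, which neither the '{0}' nor the \"{0}\" template guarantees.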
