How to Timeout a request using Html Agility Pack

Posted: 2011-09-28 06:46:04

Question:

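I'm making a request to a remote web server that is currently offline (on purpose).

I'd like to find the best way to time out the request. Basically, if the request runs longer than "X" milliseconds, abandon it and return a null response.

Currently the web request just sits there waiting for a response.

How would I best approach this?

Here is the current code snippet: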

    public JsonpResult About(string HomePageUrl)
    {
        Models.Pocos.About about = null;
        if (HomePageUrl.RemoteFileExists())
        {
            // Using the Html Agility Pack, we want to extract only the
            // appropriate data from the remote page.
            HtmlWeb hw = new HtmlWeb();
            HtmlDocument doc = hw.Load(HomePageUrl);
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='wrapper1-border']");

            if (node != null)
            {
                about = new Models.Pocos.About { html = node.InnerHtml };
            }
            //todo: look into whether this else statement is necessary
            else
            {
                about = null;
            }
        }

        return this.Jsonp(about);
    }

Answer 1:

Retrieve the web page at your URL with this method:

    private static string retrieveData(string url)
    {
        // used to build entire input
        StringBuilder sb = new StringBuilder();

        // used on each read operation
        byte[] buf = new byte[8192];

        // prepare the web page we will be asking for
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10; // timeout in milliseconds (10 ms here)

        // execute the request
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();

        // we will read data via the response stream
        Stream resStream = response.GetResponseStream();

        string tempString = null;
        int count = 0;

        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                tempString = Encoding.ASCII.GetString(buf, 0, count);

                // continue building the string
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?

        return sb.ToString();
    }

Then use the Html Agility Pack to retrieve the HTML markup like this:

    public static string htmlRetrieveInfo()
    {
        string htmlSource = retrieveData("http://example.com/test.html");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // SelectSingleNode returns null when nothing matches, so check
        // before reading InnerHtml.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//body");
        return node != null ? node.InnerHtml : null;
    }

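Comments:

+1 Thanks for the reply, it put me on the right path. Instead of reading the HTML via HttpWebRequest, I simply added a timeout to RemoteFileExists - see my answer.

@reggie: Note that a production version of this code should use using to handle things like IDisposable.

Answer 2:

Html Agility Pack is open source, so you can modify the source code yourself. First, add this code to the HtmlWeb class: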

    private int _timeout = 20000;

    public int Timeout
    {
        get { return _timeout; }
        set
        {
            // validate the incoming value, not the current field
            if (value < 1)
                throw new ArgumentException("Timeout must be greater than zero.");
            _timeout = value;
        }
    }

Then find this method:

    private HttpStatusCode Get(Uri uri, string method, string path, HtmlDocument doc, IWebProxy proxy, ICredentials creds)

and modify it:

    req = WebRequest.Create(uri) as HttpWebRequest;
    req.Method = method;
    req.UserAgent = UserAgent;
    req.Timeout = Timeout; //add this

Or something like this, without modifying the source:

    htmlWeb.PreRequest = request =>
    {
        request.Timeout = 15000;
        return true;
    };

Answer 3:

I had to make a small adjustment to the code I originally posted:

    public JsonpResult About(string HomePageUrl)
    {
        Models.Pocos.About about = null;
        // ************* CHANGE HERE - added "timeout in milliseconds" to RemoteFileExists extension method.
        if (HomePageUrl.RemoteFileExists(1000))
        {
            // Using the Html Agility Pack, we want to extract only the
            // appropriate data from the remote page.
            HtmlWeb hw = new HtmlWeb();
            HtmlDocument doc = hw.Load(HomePageUrl);
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='wrapper1-border']");

            if (node != null)
            {
                about = new Models.Pocos.About { html = node.InnerHtml };
            }
            //todo: look into whether this else statement is necessary
            else
            {
                about = null;
            }
        }

        return this.Jsonp(about);
    }

I then modified my RemoteFileExists extension method to set a timeout:

    public static bool RemoteFileExists(this string url, int timeout)
    {
        try
        {
            // Create the HttpWebRequest.
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;

            // ************ ADDED HERE
            // time out the request after x milliseconds
            request.Timeout = timeout;
            // ************

            // Use the HEAD method; GET would also work.
            request.Method = "HEAD";

            // Get the response and dispose it once the status is read.
            using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
            {
                // Returns true if the status code is 200.
                return (response.StatusCode == HttpStatusCode.OK);
            }
        }
        catch
        {
            // Any exception, including a timeout, returns false.
            return false;
        }
    }

With this approach, if my timeout fires before RemoteFileExists can determine the header response, my bool returns false.

Answer 4:

You can use a standard HttpWebRequest to fetch the remote resource and set its Timeout property. If the request succeeds, feed the resulting HTML to the Html Agility Pack for parsing.
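The approach described above could be sketched roughly as follows. The class name FetchExample and the method FetchDocument are hypothetical; the sketch relies on HtmlDocument.Load accepting a Stream, and it treats any WebException (including a timeout) as "no document".

```csharp
using System.IO;
using System.Net;
using HtmlAgilityPack;

public static class FetchExample
{
    // Hypothetical helper: fetches url with a timeout and parses the
    // response body, or returns null if the request fails or times out.
    public static HtmlDocument FetchDocument(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = timeoutMs; // milliseconds, applies to GetResponse

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            {
                // Parse directly from the response stream.
                var doc = new HtmlDocument();
                doc.Load(stream);
                return doc;
            }
        }
        catch (WebException)
        {
            // Timeout, connection failure, or HTTP error status.
            return null;
        }
    }
}
```

Because the response and stream are wrapped in using blocks, the connection is released even when parsing throws.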

Comments:

What is the correct way to convert a System.Net.WebRequest into an HtmlAgilityPack.HtmlDocument?
