How can I download HTML source in C#

Posted 2023-03-05

【Title】: How can I download HTML source in C# 【Posted】: 2010-10-10 14:25:10 【Question】:

How can I get the HTML source of a given URL in C#?

【Question comments】:

【Answer 1】:

You can download files with the WebClient class:

using System.Net;

using (WebClient client = new WebClient()) // WebClient implements IDisposable
{
    client.DownloadFile("http://yoursite.com/page.html", @"C:\localfile.html");

    // Or you can get the file content without saving it
    string htmlCode = client.DownloadString("http://yoursite.com/page.html");
}
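
The snippet above leaves out error handling, and the discussion on this answer agrees that catching WebException is prudent in real code. The following is a minimal sketch of one way to do that, not a definitive pattern: the class name SafeDownloader and the method TryDownload are hypothetical names introduced here for illustration. Note also that WebClient is marked obsolete in .NET 6 and later, so this compiles there with a warning.

```csharp
using System;
using System.Net;

public static class SafeDownloader
{
    // Hypothetical helper: returns true and the page source on success,
    // or false (with html set to null) when the request fails.
    public static bool TryDownload(string url, out string html)
    {
        try
        {
            using (WebClient client = new WebClient())
            {
                html = client.DownloadString(url);
                return true;
            }
        }
        catch (WebException ex)
        {
            // ex.Status tells DNS failures, timeouts and protocol errors
            // (such as 404 or 500 responses) apart.
            Console.Error.WriteLine("Download failed: " + ex.Status);
            html = null;
            return false;
        }
    }
}
```

A caller would write `if (SafeDownloader.TryDownload(url, out string html)) { ... }`. Only WebException is caught here; a null or otherwise unusable address can still throw ArgumentNullException or NotSupportedException, which this sketch deliberately lets propagate.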

【Comments】:

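Should note: if you need more control, look at the HttpWebRequest class (for example, to be able to specify authentication).

Yes, HttpWebRequest gives you more control, although you can also make POST requests with WebClient, using client.UploadData(uriString,"POST",postParamsByteArray);

Wouldn't it be prudent to catch WebException around this? Perhaps that is assumed. Are there any other exceptions or errors that need to be caught when using this method?

@JohnWasham - yes, it would be prudent to catch exceptions here. Thankfully, though, most *** respondents keep their sample code as concise and to the point as possible. Making sample code closer to "real life" would only add noise.

The problem I'm facing is that when I download the page source and extract data from it, the values are missing from the downloaded source if the website is in a language other than mine.

【Answer 2】: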

Basically:

using System;           // Console
using System.IO;        // StreamReader
using System.Net;
using System.Net.Http;  // in LINQPad, also add a reference to System.Net.Http.dll

WebRequest req = WebRequest.Create("http://google.com");
req.Method = "GET";

string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    // Read the whole response body into a string
    source = reader.ReadToEnd();
}

Console.WriteLine(source);

【Comments】:

【Answer 3】:

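The newest, most up-to-date answer. This post is really old (it was 7 years old when I answered), so none of the other answers use the new, recommended way, which is the HttpClient class.

HttpClient is considered the new API, and it should replace the old ones (WebClient and WebRequest).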

string url = "page url";
HttpClient client = new HttpClient();
using (HttpResponseMessage response = client.GetAsync(url).Result) // .Result blocks the calling thread
{
    using (HttpContent content = response.Content)
    {
        string result = content.ReadAsStringAsync().Result;
    }
}

For more information on how to use the HttpClient class (especially in async scenarios), you can refer to this question.

Note 1: If you want to use async/await

string url = "page url";
HttpClient client = new HttpClient();   // ideally only one instance is created per application
using (HttpResponseMessage response = await client.GetAsync(url))
{
    using (HttpContent content = response.Content)
    {
        string result = await content.ReadAsStringAsync();
    }
}

Note 2: If you use C# 8 features (using declarations)

string url = "page url";
HttpClient client = new HttpClient();
using HttpResponseMessage response = await client.GetAsync(url);
using HttpContent content = response.Content;
string result = await content.ReadAsStringAsync();
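
The comment in Note 1 says that only one HttpClient should be created per application. A minimal sketch of that idea follows, assuming a static shared instance is acceptable for your program; the class name HtmlDownloader and the method GetHtmlAsync are hypothetical names used only for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HtmlDownloader
{
    // One shared instance for the whole application, reused across requests.
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical helper: downloads the page at url and returns its source.
    public static async Task<string> GetHtmlAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            // Throws HttpRequestException for non-success status codes (4xx, 5xx).
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

From an async method this would be called as `string html = await HtmlDownloader.GetHtmlAsync(url);`. The response is still disposed after each request; only the client itself is kept alive.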

【Comments】:

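Suggestion: await the async methods.

@Maarten The following link shows how to use it with async/await ***.com/questions/33020657/…

What is the benefit of using async calls here?

I think it is always recommended to use async wherever possible, because the request may take a while and you don't want to block the thread with a Wait() call.

Thanks. Using HttpClient is much faster than WebClient.

【Answer 4】: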

You can get the HTML source with:

var html = new System.Net.WebClient().DownloadString(siteUrl);

【Comments】:

Short and sweet!

I found your suggestion after reading Joe Albahari's examples. LINQPad > Help > What's New, then search for cache. var html = new System.Net.WebClient().DownloadString(siteUrl); // need to new up your client!

Is the WebClient disposed?

【Answer 5】:

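@cms's approach is the newer one and is the one recommended on the MS website, but I ran into a hard-to-solve problem with both of the methods posted here, so I'm posting the solution for everyone!

Problem: if you use a URL like www.somesite.it/?p=1500, in some cases you get an internal server error (500), even though www.somesite.it/?p=1500 works perfectly in a web browser.

Solution: you have to move the parameters out of the URL. The working code is: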

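using System.Net;
//...
using (WebClient client = new WebClient())
{
    client.QueryString.Add("p", "1500"); // add the parameters here instead of in the URL
    // The address needs a scheme; without one, WebClient treats it as a local file path
    string htmlCode = client.DownloadString("http://www.somesite.it");
    //...
}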

See the official documentation.

【Comments】:

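Be careful when using DownloadString, because it will break the encoding if the website doesn't use UTF-8. Use the DownloadData method instead and handle the encoding yourself.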
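
One way to follow this advice is sketched below: download the raw bytes with DownloadData, read the charset from the Content-Type response header, and decode with that encoding. This is a rough illustration under stated assumptions, not a complete solution: the class EncodingAwareDownloader and its methods are hypothetical names, pages that declare their charset only in a meta tag are not handled, and on .NET Core and later, legacy code pages are only available after registering CodePagesEncodingProvider.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingAwareDownloader
{
    // Extracts the charset parameter from a Content-Type header value,
    // for example "text/html; charset=ISO-8859-1". Returns null if absent.
    public static string GetCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                return p.Substring("charset=".Length).Trim('"', ' ');
        }
        return null;
    }

    // Decodes raw bytes with the named charset, falling back to UTF-8
    // when the charset is missing or not recognized.
    public static string Decode(byte[] data, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try { encoding = Encoding.GetEncoding(charset); }
            catch (ArgumentException) { /* unknown charset: keep UTF-8 */ }
        }
        return encoding.GetString(data);
    }

    // Downloads the page as bytes and decodes it with the declared charset.
    public static string DownloadHtml(string url)
    {
        using (WebClient client = new WebClient())
        {
            byte[] data = client.DownloadData(url);
            string contentType = client.ResponseHeaders?["Content-Type"];
            return Decode(data, GetCharset(contentType));
        }
    }
}
```

Splitting the header parsing and the decoding into separate methods keeps them usable without a network call, for instance when the bytes come from a cache.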
