How can XPath be used to dig through (someone else's) poorly coded HTML?
【Title】How can XPath be used to dig through (someone else's) poorly coded HTML? 【Posted】2013-05-28 14:42:58 【Question】When I execute this C# code...
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.IO;
using System.Net;
using System.Dynamic;
using HtmlAgilityPack;

namespace entropedizer
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        public String postRequest(string url, string eventTarget)
        {
            // A "pre-request," sent to gather SessionID and POST data parameters for the main request
            HttpWebRequest prequest = (HttpWebRequest)WebRequest.Create("http://www.entropedia.info/Chart.aspx?chart=Chart");
            HttpWebResponse presponse = (HttpWebResponse)prequest.GetResponse();
            Stream pstream = presponse.GetResponseStream();
            StreamReader psr = new StreamReader(pstream);
            string phtml = psr.ReadToEnd();
            Match viewstate = Regex.Match(phtml, "id=\"__VIEWSTATE\".+/>");
            Match eventvalidation = Regex.Match(phtml, "id=\"__EVENTVALIDATION\".+/>");
            ASCIIEncoding encoding = new ASCIIEncoding();
            string postData = "__EVENTTARGET=" + eventTarget + "&__VIEWSTATE=" + Uri.EscapeDataString(viewstate.ToString().Substring(24, viewstate.Length - 28)) + "&__EVENTVALIDATION=" + Uri.EscapeDataString(eventvalidation.ToString().Substring(30, eventvalidation.Length - 34));
            byte[] data = encoding.GetBytes(postData);

            // The main request, intended to retrieve the desired HTML
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.entropedia.info/Chart.aspx?chart=Chart");
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = new CookieContainer();
            Cookie sessionId = new Cookie("ASP.NET_SessionId", Regex.Match(presponse.Headers.ToString(), "ASP.NET_SessionId=.+ d").ToString().Substring(18, Regex.Match(presponse.Headers.ToString(), "ASP.NET_SessionId=.+ d").Length - 21), "/", ".entropedia.info");
            request.CookieContainer.Add(new Uri("http://www.entropedia.info/Chart.aspx?chart=Chart"), sessionId);
            Stream stream = request.GetRequestStream();
            stream.Write(data, 0, data.Length);
            stream.Close();
            HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            stream = response.GetResponseStream();
            StreamReader sr = new StreamReader(stream);
            return sr.ReadToEnd();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            System.Net.ServicePointManager.Expect100Continue = false;
            HtmlAgilityPack.HtmlDocument hChart = new HtmlAgilityPack.HtmlDocument();
            hChart.LoadHtml(postRequest("http://www.entropedia.info/Chart.aspx?chart=Chart", "ctl00%24ContentPlaceHolder1%24DG1%24ctl19%24ctl05"));
            HtmlNodeCollection chartStrings = hChart.DocumentNode.SelectNodes("/");
            if (chartStrings != null)
            {
                foreach (HtmlNode i in chartStrings)
                {
                    System.IO.File.WriteAllText("C:/Users/Admin/Desktop/WholeDocument.txt", i.OuterHtml);
                }
            }
            else
            {
                MessageBox.Show("Error: Null item list.");
            }
        }
    }
}
...the following HTML is written to the text file.
http://pastebin.com/FALerBWR
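When I change the line in the C# code to
HtmlNodeCollection chartStrings = hChart.DocumentNode.SelectNodes("/html/body");
the 400-plus lines of HTML inside the body are written to the text file.
When I change the line to
HtmlNodeCollection chartStrings = hChart.DocumentNode.SelectNodes("/html/body/form");
only one line (the form's opening tag with its attributes) is written to the text file. It should write many lines (most of the document). I believe HtmlAgilityPack is getting confused by malformed HTML tags. Is there a way to work around this programmatically? I don't want to correct the HTML by hand every time I run the program!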
【Answer 1】: If you think errors in the HTML are the cause, tidy the HTML first. Here is something you can use: https://github.com/markbeaton/TidyManaged
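A minimal sketch of that approach, modeled on the usage shown in the TidyManaged README as best I recall it; verify the property names against the version you install. It assumes the TidyManaged assembly is referenced and that the native libtidy library it wraps is available at runtime. The names `TidyFirst` and `Repair` are made up for illustration.

```csharp
using System;
using TidyManaged;

public static class TidyFirst
{
    // Runs raw markup through HTML Tidy and returns the repaired document as a string.
    public static string Repair(string rawHtml)
    {
        using (Document tidy = Document.FromString(rawHtml))
        {
            tidy.ShowWarnings = false; // keep Tidy's diagnostics out of the way
            tidy.Quiet = true;
            tidy.OutputXhtml = true;   // ask for well-formed XHTML output
            tidy.CleanAndRepair();
            return tidy.Save();        // the repaired markup
        }
    }
}
```

In the question's code, this would wrap the download step, for example `hChart.LoadHtml(TidyFirst.Repair(postRequest(url, eventTarget)));`.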
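【Answer 2】: This behavior is "by design". By default, FORM is treated as an empty HTML element. The reason is explained here on SO (check my answer): HtmlAgilityPack -- Does <form> close itself for some reason?
But this is also configurable; you just need to instruct the parser to behave differently, like this: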
HtmlAgilityPack.HtmlDocument hChart = new HtmlAgilityPack.HtmlDocument();
// remove all specific behaviors for the `FORM` element
HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("form");
hChart.LoadHtml(postRequest("http://www.entropedia.info/Chart.aspx?chart=Chart", "ctl00%24ContentPlaceHolder1%24DG1%24ctl19%24ctl05"));
HtmlNodeCollection chartStrings = hChart.DocumentNode.SelectNodes("/");
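To see the effect in isolation, here is a minimal sketch that runs the same XPath against a small inline string instead of the live page. The sample markup and the names `FormFlagDemo` and `CountFormParagraphs` are hypothetical; it assumes only that the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagDemo
{
    // Hypothetical sample markup, not taken from the entropedia page.
    private const string SampleHtml =
        "<html><body><form id=\"f\"><p>first</p><p>second</p></form></body></html>";

    // Parses the sample and counts the <p> elements that end up as children of <form>.
    public static int CountFormParagraphs()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("/html/body/form/p");
        return paragraphs == null ? 0 : paragraphs.Count;
    }

    public static void Main()
    {
        // With the parser's default flags; the result depends on the installed version.
        Console.WriteLine("before: " + CountFormParagraphs());

        // Treat <form> like any other container element from here on.
        HtmlNode.ElementsFlags.Remove("form");
        Console.WriteLine("after: " + CountFormParagraphs());
    }
}
```

`ElementsFlags` is a static member, so removing the entry affects every document parsed afterwards in the same process, not just one `HtmlDocument` instance. Calling it once at startup, before any `LoadHtml`, keeps the behavior consistent.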