How to scrape a particular part of an HTML page using regular expressions or HtmlAgilityPack in Visual Studio 2010 with C#?
【Posted】2012-04-27 03:02:59
【Question】I have
var source="<p><a href="http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html"><img src="http://l.yimg.com/bt/api/res/1.2/TRLtYhdbTvFcX_GOU_0S4g--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2012-04-14T023232Z_5_CBRE83B1MAL00_RTROPTP_2_USA.JPG" width="130" height="86" alt="People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York" align="left" title="People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York" border="0" /></a>(Reuters) - An unusual stock split designed to preserve Google Inc founders' control of the Web search leader raised questions and some grumbling on Wall Street, even as investors focused on the company's short-term business concerns. Shares of Google closed 4 percent lower at $624.60 on Friday, driven by deepening worries about its search ad rates and payments to partners. The declining search trends underscored investor uncertainty about Google's growth prospects and unease about the company's pending $12.5 billion acquisition of Motorola Mobility. ...</p><br clear="all"/>"
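Now I need to parse/scrape this to get the link address into a variable, i.e. http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html, and the image src into a separate variable. I also need the description text between </a> and </p>. Please help, I am badly stuck...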
【Comments】
I did it with HtmlAgilityPack... thanks!!!
【Answer 1】Try the following code snippet, which uses HtmlAgilityPack:
// Requires a reference to HtmlAgilityPack and `using HtmlAgilityPack;` at the top of the file.
var source = @"<p><a href=""http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html""><img src=""http://l.yimg.com/bt/api/res/1.2/TRLtYhdbTvFcX_GOU_0S4g--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2012-04-14T023232Z_5_CBRE83B1MAL00_RTROPTP_2_USA.JPG"" width=""130"" height=""86"" alt=""People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York"" align=""left"" title=""People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York"" border=""0"" /></a>(Reuters) - An unusual stock split designed to preserve Google Inc founders' control of the Web search leader raised questions and some grumbling on Wall Street, even as investors focused on the company's short-term business concerns. Shares of Google closed 4 percent lower at $624.60 on Friday, driven by deepening worries about its search ad rates and payments to partners. The declining search trends underscored investor uncertainty about Google's growth prospects and unease about the company's pending $12.5 billion acquisition of Motorola Mobility. ...</p><br clear=""all""/>";
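// Load the HTML fragment into an HtmlAgilityPack document.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
// The <img> carries no text of its own, so the paragraph's InnerText is just the description.
var paraNode = doc.DocumentNode.SelectSingleNode("//p");
var desc = paraNode.InnerText;
// The href of the link that wraps the image (null if the attribute is missing).
var anchorNode = doc.DocumentNode.SelectSingleNode("//p/a");
var link = anchorNode.GetAttributeValue("href", null);
// The src of the image inside that link (null if the attribute is missing).
var imgNode = doc.DocumentNode.SelectSingleNode("//p/a/img");
var src = imgNode.GetAttributeValue("src", null);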
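There are many ways to do this; this is just one that gets the job done. It should give you an idea of how to use HtmlAgilityPack. XPath gives you a lot of power when parsing this kind of content.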
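Since the question also mentions regular expressions, here is a minimal regex-only sketch of the same three extractions, for comparison with the XPath approach above. It assumes the fragment always has the shape shown in the question: double-quoted attributes, one link wrapping one image, and the description sitting between </a> and </p>. The class name SnippetScraper and its methods (FirstGroup, Link, ImageSrc, Description) are hypothetical names chosen for illustration; only System.Text.RegularExpressions from the .NET base class library is used.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class SnippetScraper
{
    // Returns the first capture group of the pattern, or null when nothing matches.
    public static string FirstGroup(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    // href of the first <a> tag (assumes a double-quoted attribute value).
    public static string Link(string html)
    {
        return FirstGroup(html, @"<a\b[^>]*?\bhref\s*=\s*""([^""]*)""");
    }

    // src of the first <img> tag (assumes a double-quoted attribute value).
    public static string ImageSrc(string html)
    {
        return FirstGroup(html, @"<img\b[^>]*?\bsrc\s*=\s*""([^""]*)""");
    }

    // Text between the first </a> and the next </p>, trimmed.
    public static string Description(string html)
    {
        string text = FirstGroup(html, @"</a>(.*?)</p>");
        return text == null ? null : text.Trim();
    }
}
```

With the source string from the answer, SnippetScraper.Link(source), SnippetScraper.ImageSrc(source) and SnippetScraper.Description(source) would fill the three variables. This is likely to break as soon as the markup varies (single-quoted attributes, nested tags in the description, a > inside an attribute value), which is why a real parser such as HtmlAgilityPack is usually the safer choice.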