Using C# to extract some content and its corresponding XPath from an HTML page
Posted: 2019-08-31 04:39:21
Question: I have an HTML file with the following content:
</div><div class="\"more-detail-caption\"">More Numbers :</div><div id="\"moreHLNumbers\"" title="\"HSBC" bank="" helpline="" number\"="" class="\"more-detail-text\""><a href='tel:18605002277'>1860 500 2277 </a><a class='cchlOtherNoDescription'>( Credit Card - From India )</a><br><a href='tel:18602662667'>1860 266 2667 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18605002255'>1860 500 2255 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18004192266'>1800 419 2266 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18001026922'>1800 102 6922 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18002673456'>1800 267 3456 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18001022208'>1800 102 2208 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18002663456'>1800 266 3456 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:18001034722'>1800 103 4722 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:+912266800001'>022 66800001 </a><a class='cchlOtherNoDescription'>( Credit Card - From Overseas )
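I want to use a regular expression to extract these numbers together with their descriptions, for example: "1860 266 2667 ( Personal Banking - From India )", along with the corresponding XPath of each one, using C#. So far I have worked out the following code, which only removes the extra tags and defines the regex used to extract the numbers.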
    using System.IO;
    using System.Linq;
    using HtmlAgilityPack;
    using System.Text.RegularExpressions;

    namespace ConsoleApp1
    {
        public class Program
        {
            private static string phoneReg = @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
            private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

            public static void Main()
            {
                HtmlDocument doc = new HtmlDocument();
                doc.Load(@"C:\htmldoc\htmlsample.html");
                doc.DocumentNode.Descendants()
                    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "svg" || n.Name == "button"
                        || n.Name == "li" || n.Name == "link" || n.Name == "img" || n.Name == "head" || n.Name == "header" || n.Name == "input")
                    .ToList()
                    .ForEach(n => n.Remove());
                var phoneMatches = phoneRegex.Matches(doc.DocumentNode.InnerText);
                File.WriteAllText(@"C:\htmldoc\new.html", doc.DocumentNode.InnerHtml.Replace(@"\t", ""));
            }
        }
    }
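However, I am also running into some problems extracting the numbers. Can someone help me solve this?
Thanks in advance.
Comments:
Hi, why don't you want to use an HTML parser such as Html Agility Pack for this job: html-agility-pack.net/?z=codeplex. That sounds much easier to me.
I have already done that; I need the descriptions as well as the phone numbers. I am using HtmlAgilityPack.
Answer 1:
I'm not sure my solution matches your exact needs, but I think it should be close...
If you prefer, you can use the "ForEach" defined in MoreLinq (instead of my ApplyForEachItem).
For reference, I used https://regex101.com/ to test my regexes, and it seems great.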
    using System.IO;
    using System.Linq;
    using HtmlAgilityPack;
    using System.Text.RegularExpressions;
    using System.Diagnostics;
    using System.Collections.Generic;
    using System;

    namespace SoQuestion
    {
        class Program
        {
            // private static string phoneReg = @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
            // Matches a line of digits and spaces followed by a description line (relies on CRLF line breaks in the text).
            private static string phoneReg = @"\s+\d[ \d]+\r\n.+\r\n";
            private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

            public static void Main(string[] args)
            {
                HtmlDocument doc = new HtmlDocument();
                doc.Load(@"C:\temp\HTMLPage1.html");
                doc.DocumentNode.Descendants()
                    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "svg" || n.Name == "button"
                        || n.Name == "li" || n.Name == "link" || n.Name == "img" || n.Name == "head" || n.Name == "header" || n.Name == "input")
                    .ToList()
                    .ForEach(n => n.Remove());
                var phoneMatches = phoneRegex.Matches(doc.DocumentNode.InnerText);
                List<Tuple<string, string>> data = new List<Tuple<string, string>>();
                // Cast<Match>() lets T be inferred on .NET Framework, where MatchCollection is only a non-generic IEnumerable.
                ApplyForEachItem(phoneMatches.Cast<Match>(), match =>
                {
                    int indexFirstDigit = match.Value.IndexOfAny(new char[] { '1', '2', '3', '4', '5', '6', '7', '8', '9', '0' });
                    // This Split overload exists on both .NET Framework and .NET Core.
                    string[] phoneAndDesc = match.Value.Substring(indexFirstDigit).Split(new[] { "\r\n" }, StringSplitOptions.None);
                    data.Add(new Tuple<string, string>(phoneAndDesc[0].Trim(), phoneAndDesc[1].Trim()));
                });
                ApplyForEachItem(data, item => Debug.Print($"Phone: '{item.Item1}', Desc = '{item.Item2}' \r\n"));
            }

            public static void ApplyForEachItem<T>(IEnumerable<T> enumerable, Action<T> action)
            {
                if (enumerable == null)
                    return;
                foreach (T t in enumerable)
                    action(t);
            }
        }
    }
Result:
Phone: '1860 500 2277', Desc = '( Credit Card - From India )'
Phone: '1860 266 2667', Desc = '( Personal Banking - From India )'
Phone: '1860 500 2255', Desc = '( Personal Banking - From India )'
Phone: '1800 419 2266', Desc = '( Corporate Cards - From India )'
Phone: '1800 102 6922', Desc = '( Corporate Cards - From India )'
Phone: '1800 267 3456', Desc = '( HSBC Advance - From India )'
Phone: '1800 102 2208', Desc = '( HSBC Advance - From India )'
Phone: '1800 266 3456', Desc = '( HSBC Premier - From India )'
Phone: '1800 103 4722', Desc = '( HSBC Premier - From India )'
Phone: '022 66800001', Desc = '( Credit Card - From Overseas )'
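The question also asks for the XPath of each number, which a regex over InnerText cannot provide because the node structure is gone by then. Below is a minimal sketch of one way to get it with HtmlAgilityPack alone, not a definitive implementation. It assumes the markup looks like the sample in the question (each `tel:` link followed by a sibling `a` with class `cchlOtherNoDescription`); the embedded HTML is a simplified two-entry copy of that sample, and the names `XPathSketch`, `SampleHtml` and `Extract` are hypothetical. `SelectNodes`, `SelectSingleNode` and the `HtmlNode.XPath` property are existing HtmlAgilityPack members.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Simplified copy of two entries from the sample in the question (assumption: real pages follow this shape).
    public const string SampleHtml =
        "<div id='moreHLNumbers' class='more-detail-text'>" +
        "<a href='tel:18605002277'>1860 500 2277 </a>" +
        "<a class='cchlOtherNoDescription'>( Credit Card - From India )</a><br>" +
        "<a href='tel:18602662667'>1860 266 2667 </a>" +
        "<a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br>" +
        "</div>";

    // Returns (number, description, XPath of the number's <a> node) for every tel: link.
    public static List<Tuple<string, string, string>> Extract(string html)
    {
        var result = new List<Tuple<string, string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection phoneNodes = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'tel:')]");
        if (phoneNodes == null)
            return result;

        foreach (HtmlNode phone in phoneNodes)
        {
            // Assumption: the description is the nearest following <a> sibling carrying this class.
            HtmlNode desc = phone.SelectSingleNode("following-sibling::a[@class='cchlOtherNoDescription'][1]");
            string number = phone.InnerText.Trim();
            string description = desc == null ? "" : desc.InnerText.Trim();

            // HtmlNode.XPath is the position-based path of the node in the parsed document.
            result.Add(Tuple.Create(number, description, phone.XPath));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var item in Extract(SampleHtml))
            Console.WriteLine($"Phone: '{item.Item1}', Desc = '{item.Item2}', XPath = {item.Item3}");
    }
}
```

Working on the nodes directly keeps each number tied to its element, so the XPath comes for free. If tags are removed from the document first (as in the code above), the reported paths describe the modified tree rather than the original file.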
Comments:
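Extract the description and the number (e.g. 1860 266 2667:- Personal Banking - From India). I also get the error "cannot be inferred from the usage. Try specifying the type arguments explicitly".
That should be a solution closer to your needs. I'm wondering... it looks like you are trying to steal or hack... aren't you?
Regarding the "cannot be inferred" error, which version of Visual Studio and .NET Framework are you using?
I am not stealing or hacking. Don't worry. :)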
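The "cannot be inferred from the usage" error is consistent with running on .NET Framework, where `MatchCollection` implements only the non-generic `IEnumerable`, so a generic helper such as `ApplyForEachItem<T>` has nothing to infer `T` from (on .NET Core 2.0 and later it also implements `IEnumerable<Match>`). A minimal sketch of the usual workaround, `Cast<Match>()`, follows. The class name `PhoneSketch`, the method `Extract`, and the pattern are hypothetical; the pattern is an assumption of mine that pairs a number with its parenthesised description without relying on line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhoneSketch
{
    // Hypothetical pattern: digits and spaces (optionally starting with '+'),
    // followed by a description in parentheses. Group 2 excludes the parentheses.
    private static readonly Regex PhoneWithDesc =
        new Regex(@"(\+?\d[\d ]*\d)\s*\(\s*([^)]*?)\s*\)");

    public static List<Tuple<string, string>> Extract(string text)
    {
        return PhoneWithDesc.Matches(text)
            .Cast<Match>() // yields IEnumerable<Match> regardless of framework version
            .Select(m => Tuple.Create(m.Groups[1].Value, m.Groups[2].Value))
            .ToList();
    }

    public static void Main()
    {
        // Shape of InnerText when the page has no line breaks between entries.
        string text = "1860 500 2277 ( Credit Card - From India )022 66800001 ( Credit Card - From Overseas )";
        foreach (var item in Extract(text))
            Console.WriteLine($"Phone: '{item.Item1}', Desc = '{item.Item2}'");
        // Prints:
        // Phone: '1860 500 2277', Desc = 'Credit Card - From India'
        // Phone: '022 66800001', Desc = 'Credit Card - From Overseas'
    }
}
```

With the answer's helper, the same idea reads `ApplyForEachItem(phoneMatches.Cast<Match>(), match => ...)`; specifying the type argument explicitly, as the compiler message suggests, does not help on its own because the argument still is not an `IEnumerable<Match>`.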