Using C# to extract some content and its corresponding XPath from an HTML page

Posted 2023-04-13

[Published]: 2019-08-31 04:39:21

[Question]:

I have an HTML file with the following content:

</div><div class="more-detail-caption">More Numbers :</div><div id="moreHLNumbers" title="HSBC bank helpline number" class="more-detail-text"><a href='tel:18605002277'>1860 500 2277 </a><a class='cchlOtherNoDescription'>( Credit Card - From India )</a><br><a href='tel:18602662667'>1860 266 2667 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18605002255'>1860 500 2255 </a><a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br><a href='tel:18004192266'>1800 419 2266 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18001026922'>1800 102 6922 </a><a class='cchlOtherNoDescription'>( Corporate Cards - From India )</a><br><a href='tel:18002673456'>1800 267 3456 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18001022208'>1800 102 2208 </a><a class='cchlOtherNoDescription'>( HSBC Advance - From India )</a><br><a href='tel:18002663456'>1800 266 3456 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:18001034722'>1800 103 4722 </a><a class='cchlOtherNoDescription'>( HSBC Premier - From India )</a><br><a href='tel:+912266800001'>022 66800001 </a><a class='cchlOtherNoDescription'>( Credit Card - From Overseas )

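I want to use a regular expression to extract these numbers together with their descriptions, for example "1860 266 2667 ( Personal Banking - From India )", along with the corresponding XPath of each one, using C#. So far I have come up with the following code, which only removes the extra tags and defines the regular expression for extracting the numbers.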

using System.IO;
using System.Linq;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    public class Program
    {
        private static string phoneReg = @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
        private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

        public static void Main()
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(@"C:\htmldoc\htmlsample.html");
            // Remove elements that are not needed for the extraction.
            doc.DocumentNode.Descendants()
                            .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "svg" || n.Name == "button"
                                  || n.Name == "li" || n.Name == "link" || n.Name == "img" || n.Name == "head" || n.Name == "header" || n.Name == "input")
                            .ToList()
                            .ForEach(n => n.Remove());
            var phoneMatches = phoneRegex.Matches(doc.DocumentNode.InnerText);
            File.WriteAllText(@"C:\htmldoc\new.html", doc.DocumentNode.InnerHtml.Replace(@"\t", ""));
        }
    }
}

However, I am also running into some problems extracting the numbers. Can someone help me with this?

Thanks in advance.

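[Comments]:

Hi, why don't you use an HTML parser such as Html Agility Pack for this job: html-agility-pack.net/?z=codeplex. That sounds much easier to me.

I have already done that; I need the description as well as the phone number. I am using HtmlAgilityPack.

[Answer 1]: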

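I'm not sure my solution fits your exact needs, but I think it should be close...

If you prefer, you can use the "ForEach" defined in MoreLinq (instead of my ApplyForEachItem).

For reference, I used https://regex101.com/ to test my regex, and it works great.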

using System.IO;
using System.Linq;
using HtmlAgilityPack;
using System.Text.RegularExpressions;
using System.Diagnostics;
using System.Collections.Generic;
using System;

namespace SoQuestion
{
    class Program
    {
        // private static string phoneReg = @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";

        // A line of digits and spaces followed by one more line (the description).
        // This relies on InnerText containing "\r\n" between a number and its description.
        private static string phoneReg = @"\s+\d[ \d]+\r\n.+\r\n";

        private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

        public static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(@"C:\temp\HTMLPage1.html");
            // Remove elements that are not needed for the extraction.
            doc.DocumentNode.Descendants()
                            .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "svg" || n.Name == "button"
                                  || n.Name == "li" || n.Name == "link" || n.Name == "img" || n.Name == "head" || n.Name == "header" || n.Name == "input")
                            .ToList()
                            .ForEach(n => n.Remove());
            var phoneMatches = phoneRegex.Matches(doc.DocumentNode.InnerText);

            List<Tuple<string, string>> data = new List<Tuple<string, string>>();

            // Cast<Match>() lets T be inferred on .NET Framework, where
            // MatchCollection only implements the non-generic IEnumerable.
            ApplyForEachItem(phoneMatches.Cast<Match>(), match =>
            {
                int indexFirstDigit = match.Value.IndexOfAny(new char[] { '1', '2', '3', '4', '5', '6', '7', '8', '9', '0' });

                // The array overload of Split is available on both .NET Framework and .NET Core.
                string[] phoneAndDesc = match.Value.Substring(indexFirstDigit).Split(new[] { "\r\n" }, StringSplitOptions.None);
                data.Add(new Tuple<string, string>(phoneAndDesc[0].Trim(), phoneAndDesc[1].Trim()));
            });

            ApplyForEachItem(data, item => Debug.Print($"Phone: '{item.Item1}', Desc = '{item.Item2}' \r\n"));
        }

        // Runs the action on every item; does nothing for a null sequence.
        public static void ApplyForEachItem<T>(IEnumerable<T> enumerable, Action<T> action)
        {
            if (enumerable == null)
            {
                return;
            }

            foreach (T t in enumerable)
            {
                action(t);
            }
        }
    }
}

Result:

Phone: '1860 500 2277', Desc = '( Credit Card - From India )' 
Phone: '1860 266 2667', Desc = '( Personal Banking - From India )' 
Phone: '1860 500 2255', Desc = '( Personal Banking - From India )' 
Phone: '1800 419 2266', Desc = '( Corporate Cards - From India )' 
Phone: '1800 102 6922', Desc = '( Corporate Cards - From India )' 
Phone: '1800 267 3456', Desc = '( HSBC Advance - From India )' 
Phone: '1800 102 2208', Desc = '( HSBC Advance - From India )' 
Phone: '1800 266 3456', Desc = '( HSBC Premier - From India )' 
Phone: '1800 103 4722', Desc = '( HSBC Premier - From India )' 
Phone: '022 66800001', Desc = '( Credit Card - From Overseas )'
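
The question also asked for the XPath of each number, which the regex approach above does not return. The following is only a minimal sketch of one way to get it, assuming the HtmlAgilityPack NuGet package is referenced and that the markup follows the sample in the question (each number is an <a> whose href starts with "tel:", immediately followed by an <a class='cchlOtherNoDescription'> holding the description). The PhoneEntry and PhoneExtractor names are hypothetical and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for one extracted entry (illustration only).
public class PhoneEntry
{
    public string Phone { get; set; }
    public string Description { get; set; }
    public string XPath { get; set; }
}

public static class PhoneExtractor
{
    public static List<PhoneEntry> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var entries = new List<PhoneEntry>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'tel:')]");
        if (links == null)
        {
            return entries;
        }

        foreach (HtmlNode link in links)
        {
            // Assumption: the description is the very next <a> sibling and
            // carries the class used in the sample markup.
            HtmlNode desc = link.SelectSingleNode("following-sibling::a[1][@class='cchlOtherNoDescription']");

            entries.Add(new PhoneEntry
            {
                Phone = HtmlEntity.DeEntitize(link.InnerText).Trim(),
                Description = desc == null ? string.Empty : HtmlEntity.DeEntitize(desc.InnerText).Trim(),
                // HtmlNode.XPath is the position-based path of the node in the parsed document.
                XPath = link.XPath
            });
        }

        return entries;
    }

    public static void Main()
    {
        // Shortened version of the sample markup from the question.
        string html =
            "<div id='moreHLNumbers'>" +
            "<a href='tel:18605002277'>1860 500 2277 </a>" +
            "<a class='cchlOtherNoDescription'>( Credit Card - From India )</a><br>" +
            "<a href='tel:18602662667'>1860 266 2667 </a>" +
            "<a class='cchlOtherNoDescription'>( Personal Banking - From India )</a><br>" +
            "</div>";

        foreach (PhoneEntry e in Extract(html))
        {
            Console.WriteLine($"Phone: '{e.Phone}', Desc = '{e.Description}', XPath = '{e.XPath}'");
        }
    }
}
```

Selecting the elements directly avoids depending on line breaks in InnerText, and the XPath comes from the parser instead of being rebuilt by hand. If the real page nests or orders the elements differently, the two XPath expressions would need to be adjusted.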

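[Comments]:

I need to extract the description along with the number (e.g. 1860 266 2667: Personal Banking - From India). I also get the error "The type arguments cannot be inferred from the usage. Try specifying the type arguments explicitly."

This should be a solution closer to your needs. I wonder... it looks like you are trying to steal or hack something... aren't you? As for the inference error, which versions of Visual Studio and the .NET Framework are you using?

I'm not stealing or hacking. Don't worry. :)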
