Regex - find various strings from an HTML file [duplicate]

Posted 2023-02-24

【Asked】2012-03-18 02:16:30 【Question】:

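I have an HTML file named basic.html, and my task is to write a small Java program that uses regular expressions to output various strings. The program should display the line numbers of every occurrence of each of the following:

div tags
div class="menuItem" tags
span tags
class="emph"
Any string ending with …, i.e. all tags
The content of the body tag
The content of all divs
All divs that make up the menu

I also have to use the start and end methods to display the index values.

My code so far is as follows: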

import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHTML {
   public static void main(String[] args) throws IOException {

      // Input for matching the regex pattern
      String file_name = "basic.html";

      // ReadFile is a helper class that is not shown in the question
      ReadFile file = new ReadFile(file_name);
      String[] aryLines = file.OpenFile();
      String asString = Arrays.toString(aryLines);

      // Regex to be matched
      String regexe = "<div>";

      // Print every line of the file
      for (int i = 0; i < aryLines.length; i++) {
         System.out.println(aryLines[i]);
      }

      // Step 1: Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexe);
      //Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE);  // case-insensitive matching

      // Step 2: Allocate a Matcher object from the compiled regex pattern,
      //         and provide the input to the Matcher
      Matcher matcher = pattern.matcher(asString);

      // Step 3: Perform the matching and process the matching result
      int count = 0;
      // Use method find()
      while (matcher.find()) {     // find the next match
         System.out.println("find() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
         count++;
      }
      System.out.println("\nFound the pattern " + count + " times.\n");

      // Use method matches()
      if (matcher.matches()) {
         System.out.println("matches() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("matches() found nothing");
      }

      // Use method lookingAt()
      if (matcher.lookingAt()) {
         System.out.println("lookingAt() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("lookingAt() found nothing");
      }
   }
}

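My biggest problem is how exactly to display all of these occurrences. So far my code only gives me the index values for the div tag, but I want every item listed above to appear in the output. My second problem is of course how to display the line on which each string occurs, but I haven't really looked into that yet because I'm focusing on the first problem for now. Still, if you can also give me a hint about where to start, I would appreciate it.

【Comments】:

HTML is not a regular language that can be parsed with regular expressions. ***.com/questions/590747/…

I take it you haven't read what my code is for...

Some of the things you want to do can be done with regular expressions. For the others, regex is not the right tool. "The contents of all divs" is difficult or impossible when there are nested div tags. Use a parser for that.

The arbitrary constraint (i.e. use RegEx) prompts me to ask: is this homework?

【Answer 1】: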

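One approach is to apply each regex separately to every line in String[] aryLines. The array index then gives the line number (add 1 for 1-based numbering).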
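A minimal sketch of that per-line idea, assuming the file has already been read into a String[] as in the question. The class name LineRegexSearch, the labels, and the four patterns are illustrative assumptions, not part of the original code, and they only cover the simpler single-line items from the list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineRegexSearch {

    // Hypothetical labelled patterns; adjust them to the strings you need.
    static Map<String, Pattern> defaultPatterns() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("div tag", Pattern.compile("<div\\b[^>]*>"));
        patterns.put("menuItem div", Pattern.compile("<div\\b[^>]*class=\"menuItem\"[^>]*>"));
        patterns.put("span tag", Pattern.compile("<span\\b[^>]*>"));
        patterns.put("class=\"emph\"", Pattern.compile("class=\"emph\""));
        return patterns;
    }

    // Returns one report entry per match: label, 1-based line number,
    // and the start/end index of the match within that line.
    static List<String> search(String[] lines, Map<String, Pattern> patterns) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            for (int i = 0; i < lines.length; i++) {
                Matcher m = entry.getValue().matcher(lines[i]);
                while (m.find()) {
                    report.add(entry.getKey() + ": line " + (i + 1)
                            + ", start " + m.start() + ", end " + m.end());
                }
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of basic.html
        String[] lines = {
            "<body>",
            "<div class=\"menuItem\">Home</div>",
            "<span class=\"emph\">Hi</span>"
        };
        search(lines, defaultPatterns()).forEach(System.out::println);
    }
}
```

Keeping the patterns in a map means one loop handles every item, so adding another string to look for is a single extra entry. Note that start() and end() here are relative to the individual line, not to the whole file.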

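What would you do if the phrase you're looking for spans multiple lines? That is valid in HTML... Also, let me be the first to say that regex won't solve this problem in the general case.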
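For matches that span several lines, one possible workaround is to search the whole text at once and convert each match offset back into a line number. The sketch below assumes the lines are joined with "\n" (rather than with Arrays.toString, which adds brackets and ", " separators and so shifts the indices). The class name WholeTextSearch and the helper lineOf are hypothetical. It still does not handle nested tags: the reluctant `.*?` stops at the first closing tag it meets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeTextSearch {

    // Converts a character offset into a 1-based line number
    // by counting the newlines that precede it.
    static int lineOf(String text, int offset) {
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    // Reports the starting line plus the start/end index (within the whole text) of each match.
    static List<String> search(String text, Pattern pattern) {
        List<String> report = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            report.add("line " + lineOf(text, m.start())
                    + ", start " + m.start() + ", end " + m.end());
        }
        return report;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of basic.html, one array element per line
        String text = String.join("\n",
                "<html>", "<body>", "<p>Hello</p>", "</body>", "</html>");

        // DOTALL lets '.' match line terminators, so the body content may span lines
        Pattern body = Pattern.compile("<body\\b[^>]*>(.*?)</body>", Pattern.DOTALL);
        search(text, body).forEach(System.out::println);
    }
}
```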

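【Answer 2】:

You really shouldn't use regular expressions to parse HTML; try an existing library such as JSoup instead. I'm sure you don't want to spend your time reinventing HTML parsing!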
