用于在 HTML 标记中提取特定文本内容的正则表达式

Posted 2023-02-23

技术标签:

【中文标题】用于在 HTML 标记中提取特定文本内容的正则表达式【英文标题】：RegEx for extracting specific textContent in HTML tags 【发布时间】：2019-10-06 23:23:37 【问题描述】：

我需要创建一个 Python 程序，该程序从标准输入接收 html 文件，并使用 regext 将显示在 Mammals 下的物种名称逐行输出到标准输出。我也不需要输出显示为“#sequence_only”的项目。

用于标准输入的文件是这样的：

   <!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>

我的逻辑如下。我想解析href的值。如果该行以

开头，href的值以“#”开头-->是物种名，我需要提取>

我尝试创建用于提取哺乳动物名称的正则表达式。

#!usr/bin/env python3

import sys
import re

html = sys.stdin.readlines()

for line in html:

    mammal_name = re.search(r'\"\>(.*?)\<', line)

if mammal_name:

    print(mammal_name.group())

我想要这样的输出：

Alpaca
Armadillo
Baboon

我得到如下输出：

">Human<
">Alpaca<
">Armadillo<
">Armadillo<
">Baboon<

我不希望 Human 出现在输出中，因为它所在的行不以开头。此外，我不希望在我的输出中重复，但为此我需要访问 href 的值，但我正在努力解决这部分问题。

更新：我的评分员向我显示这样的消息：“如果您在标签中包含物种名称，它将在许多浏览器中以斜体显示，所以想要以斜体显示科学名称的工作人员可能使用了标签。无论如何，它作为物种名称不合适，所以请删除它“。我猜它是关于 >(species name)

【问题讨论】：

不要使用正则表达式***.com/questions/1732348/…解析HTML 我已经阅读了这篇文章，但我需要使用正则表达式来完成我的作业。为什么你的班级教你用正则表达式解析 HTML？？ @DariObukhova，看看我的回答，应该对你有帮助。 @Olvin Rogth，非常感谢！此外，就我在提取物种名称之前从我的分级员那里得到的信息而言，我需要将它们之间的 > 【参考方案1】：

这里，我们只想添加两个左边界（<li><a.+?>）和右边界（<\/.+>），然后滑动我们想要的输出并将其保存在$1捕获组()中：

<li><a.+?>(.+)?<\/.+>

测试

# -*- coding: UTF-8 -*-
import re

string = """
!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>
"""
expression = r'<li><a.+?>(.+)?<\/.+>'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match ??? ")
else: 
    print('? Sorry! No matches!')

输出

YAAAY! "Alpaca" is a match ???

正则表达式

如果不需要此表达式，可以在 regex101.com 中修改或更改。

正则表达式电路

jex.im 也有助于将表达式可视化。

编辑：

要排除sequence_only，我们可以将表达式修改为：

<li.+?#[^s].+?>(.+)?<\/.+>

Demo

Python

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

test_str = '''

<!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>

'''
regex = r"<li.+?#[^s].+?>(.+)?<\/.+>"
find_matches = re.findall(regex, test_str)
for matches in find_matches:
    print(matches)

输出

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque

【讨论】：

非常感谢您的解释。有效。剩下的唯一问题是我不需要输出非常感谢。我得到了这个正则表达式的架构。很抱歉这么麻烦，但这也是一种替换 > 我已经更新了我的问题。分级机的输出确实很奇怪，但我的猜测是我需要这样做才能在某些浏览器中转义斜体输出...【参考方案2】：

你应该在你的正则表达式中添加一些细节来解析正确的字符串。 Regex test website.

输入：

string = '''   <!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>'''

如果您想在一个表达式中处理所有文本，您应该使用findall()。 代码：

results = re.findall("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", string)
for s in results:
    print(s)

如果要逐行查看，可以使用search()。 代码：

strings = string.splitlines()
for s in strings:
    substring = re.search("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", s)
    if substring:
        print(substring.group(1))

输出：

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque

【讨论】：

【参考方案3】：

使用re.findall 获取所有标签文本像这样

pattern = r'<li><a.*>(.*)</a>'
find = re.findall(pattern, string)
if find:
    print(find)

输出

['Alpaca', 'Armadillo', 'Armadillo', 'Baboon', 'Bison', 'Bonobo', 'Brown kiwi', 
'Bushbaby', 'Bushbaby', 'Cat', 'Chimpanzee', 'Chinese hamster', 'Chinese pangolin', 
'Cow', 'Crab-eating_macaque']

【讨论】：

【参考方案4】：

使用BeautifulSoup，它是一个强大的html解析包：

import re
import codecs

from bs4 import BeautifulSoup as soup
from lxml import html

# Change with your input file 
input_html = "D:\/input.html"

with codecs.open(input_html, 'r', "utf-8") as f :
    page = f.read()
f.close()
#html parsing
page_soup = soup(page, "html.parser")

#extract document seperator:
divTag = page_soup.find_all("div", "id": "mammals")

for tag in divTag:
    mammals = tag.find_all("a", href = re.compile(r'#(?!sequence_only$)'))
    for tag in mammals:
        print(tag.text)

输出：

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque

【讨论】：

谢谢。我现在的问题是如何不输出 href value = #sequence_only 的名称。例如，犰狳和布什宝宝只需要输出一次.. 对不起，我忘记了这个标准，我只是编辑了它！快乐编码:)

以上是关于用于在 HTML 标记中提取特定文本内容的正则表达式的主要内容，如果未能解决你的问题，请参考以下文章