How to remove or replace a multiline text between two patterns

Posted 2023-03-15

【Title】: How to remove or replace a multiline text between two patterns

【Published】: 2021-07-02 06:21:34

【Question】:

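I want to add some customer flags to some of my scripts so that a shell script can parse them before packaging.

For example, delete all multiline text between

^([#]|[//]){0,1}[_]+NOT_FOR_CUSTOMER_BEGIN[_]+\n

and

^([#]|[//]){0,1}[_]+NOT_FOR_CUSTOMER_END[_]+\n

I want it to be fault-tolerant (with respect to the number of "_"), which is why I use a regular expression.

For example: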

before.foo

i want this
#____NOT_FOR_CUSTOMER_BEGIN________
not this
nor this
#________NOT_FOR_CUSTOMER_END____
and this
//____NOT_FOR_CUSTOMER_BEGIN__
not this again
nor this again
//__________NOT_FOR_CUSTOMER_END____
and this again

would become:

after.foo

i want this
and this
and this again

I would prefer to use sed, but any clever solution is welcome :)

Something like this:

cat before.foo |  tr '\n' '\a' | sed -r 's/([#]|[//]){0,1}[_]+NOT_FOR_CUSTOMER_BEGIN[_]+\a.*\a([#]|[//]){0,1}[_]+NOT_FOR_CUSTOMER_END[_]+\a/\a/g' | tr '\a' '\n' > after.foo

【Question comments】:

Which tool/programming language?

shell script, thanks

Not shell, but ^(?:#|//)_+NOT_FOR_CUSTOMER_BEGIN_+(?:\s.+)*?\R(?:#|//)_+NOT_FOR_CUSTOMER_END_+\s* (regex101.com/r/Qj2T59/1)

It does work, but how do I call it?

【Answer 1】:

sed is the simplest tool for this, because it can delete every line from a start pattern through an end pattern:

sed -E '/_+NOT_FOR_CUSTOMER_BEGIN_+/,/_+NOT_FOR_CUSTOMER_END_+/d' file

i want this
and this
and this again

If you are looking for an awk solution, here is a simple one:

awk '/_+NOT_FOR_CUSTOMER_BEGIN_+/,/_+NOT_FOR_CUSTOMER_END_+/{next} 1' file

【Comments】:

The prettiest solution. I knew sed could do the job :)

【Answer 2】:

Here is an awk solution, written and tested with the samples you showed.

awk '
/^([#]|[/][/])__+NOT_FOR_CUSTOMER_BEGIN/{ found=1        }
/^([#]|[/][/])__+NOT_FOR_CUSTOMER_END/  { found=""; next }
!found
'  Input_file

With the samples shown, the output will be as follows.

i want this
and this
and this again

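Explanation: whenever the start string is found (by regex), set a flag that suppresses printing. Whenever the end string is found (by regex), clear the flag and skip that line, so printing resumes from the next line. Any line seen while the flag is unset is printed.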
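The same flag idea can be sketched in Python for readers who want to run it outside awk. This is only an illustration under stated assumptions: the names `strip_blocks`, `BEGIN` and `END` are hypothetical, and the comment prefix is treated as optional, as the `{0,1}` in the question suggests.

```python
import re

# Marker patterns: optional "#" or "//" prefix, then underscores around the tag.
# (Assumption: the prefix is optional, following the question's {0,1}.)
BEGIN = re.compile(r'^(?:#|//)?_+NOT_FOR_CUSTOMER_BEGIN_+\s*$')
END = re.compile(r'^(?:#|//)?_+NOT_FOR_CUSTOMER_END_+\s*$')


def strip_blocks(lines):
    """Yield only the lines that lie outside BEGIN/END marker blocks."""
    skipping = False
    for line in lines:
        if not skipping and BEGIN.match(line):
            skipping = True          # start marker: stop emitting lines
        elif skipping and END.match(line):
            skipping = False         # end marker: resume with the next line
        elif not skipping:
            yield line


sample = """i want this
#____NOT_FOR_CUSTOMER_BEGIN________
not this
nor this
#________NOT_FOR_CUSTOMER_END____
and this
//____NOT_FOR_CUSTOMER_BEGIN__
not this again
nor this again
//__________NOT_FOR_CUSTOMER_END____
and this again
"""

kept = list(strip_blocks(sample.splitlines(keepends=True)))
print("".join(kept), end="")
```

Because the function takes any iterable of lines, an open file object could be passed in place of the sample list.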

【Comments】:

【Answer 3】:

You can use a Python script:

import re

data = """
i want this
#____NOT_FOR_CUSTOMER_BEGIN________
not this
nor this
#________NOT_FOR_CUSTOMER_END____
and this
//____NOT_FOR_CUSTOMER_BEGIN__
not this again
nor this again
//__________NOT_FOR_CUSTOMER_END____
and this again
"""

rx = re.compile(r'^(#|//)(?:.+\n)+^\1.+\n?', re.MULTILINE)
data = rx.sub('', data)
print(data)

This yields

i want this
and this
and this again

See a demo on regex101.com.

【Comments】:

【Answer 4】:

You can match the lines from NOT_FOR_CUSTOMER_BEGIN_ to NOT_FOR_CUSTOMER_END_ lazily, that is, as few lines as possible.

Note that [//] matches a single / rather than //.
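To illustrate the note about `[//]`, here is a small Python check comparing the character class from the question with an alternation. The variable names and the third test marker are made up for this sketch.

```python
import re

# "[//]" is a character class containing "/" (listed twice), so it matches ONE character.
char_class = re.compile(r'^([#]|[//]){0,1}[_]+NOT_FOR_CUSTOMER_BEGIN[_]+$')
# "(?:#|//)" is an alternation: it matches "#" or the two-character string "//".
alternation = re.compile(r'^(?:#|//)?_+NOT_FOR_CUSTOMER_BEGIN_+$')

markers = [
    "#____NOT_FOR_CUSTOMER_BEGIN________",
    "//____NOT_FOR_CUSTOMER_BEGIN__",
    "/_NOT_FOR_CUSTOMER_BEGIN_",          # hypothetical marker with a single slash
]

results = [(m, bool(char_class.match(m)), bool(alternation.match(m))) for m in markers]
for marker, with_class, with_alternation in results:
    print(f"{marker!r}: class={with_class} alternation={with_alternation}")
```

The character-class version accepts the single-slash marker but rejects the real `//` marker, which is the opposite of what the question intended.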

^(?:#|//)_+NOT_FOR_CUSTOMER_BEGIN_+(?:\n.*)*?\n(?:#|//)_+NOT_FOR_CUSTOMER_END_+\n*

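^ Start of the string (or of a line in multiline mode)
(?:#|//) Match either # or //
_+NOT_FOR_CUSTOMER_BEGIN_+ Match NOT_FOR_CUSTOMER_BEGIN between 1 or more underscores
(?:\n.*)*? Repeat lines as few times as possible
\n(?:#|//)_+NOT_FOR_CUSTOMER_END_+ Match a newline, then # or // and NOT_FOR_CUSTOMER_END between 1 or more underscores
\n* Remove optional trailing newlines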
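The breakdown above can also be kept next to the pattern itself with Python's `re.VERBOSE` flag. This is a sketch, not part of the original answer; the names `BLOCK`, `text` and `cleaned` are arbitrary. One caveat: in verbose mode an unescaped `#` starts a comment, so the literal is written as `[#]`.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part carries a comment.
# An unescaped "#" would start a comment in verbose mode, hence "[#]".
BLOCK = re.compile(
    r"""
    ^(?:[#]|//)                      # line starts with "#" or "//"
    _+NOT_FOR_CUSTOMER_BEGIN_+       # begin tag between one or more underscores
    (?:\n.*)*?                       # as few following lines as possible
    \n(?:[#]|//)                     # newline, then "#" or "//"
    _+NOT_FOR_CUSTOMER_END_+         # end tag between one or more underscores
    \n*                              # optional trailing newlines
    """,
    re.MULTILINE | re.VERBOSE,
)

text = (
    "i want this\n"
    "#____NOT_FOR_CUSTOMER_BEGIN________\n"
    "not this\n"
    "nor this\n"
    "#________NOT_FOR_CUSTOMER_END____\n"
    "and this\n"
    "//____NOT_FOR_CUSTOMER_BEGIN__\n"
    "not this again\n"
    "nor this again\n"
    "//__________NOT_FOR_CUSTOMER_END____\n"
    "and this again"
)

cleaned = BLOCK.sub("", text)
print(cleaned)
```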

Regex demo

Another option, using it in Python:

import re

regex = r"^(?:#|//)_+NOT_FOR_CUSTOMER_BEGIN_+(?:\n.+)*?\n(?:#|//)_+NOT_FOR_CUSTOMER_END_+\n*"

s = ("i want this\n"
            "#____NOT_FOR_CUSTOMER_BEGIN________\n"
            "not this\n"
            "nor this\n"
            "#________NOT_FOR_CUSTOMER_END____\n"
            "and this\n"
            "//____NOT_FOR_CUSTOMER_BEGIN__\n"
            "not this again\n"
            "nor this again\n"
            "//__________NOT_FOR_CUSTOMER_END____\n"
            "and this again")

subst = ""
# Pass count and flags by keyword so they cannot be confused with each other
result = re.sub(regex, subst, s, count=0, flags=re.MULTILINE)

if result:
    print(result)

Output

i want this
and this
and this again
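The title also asks about replacing the block rather than deleting it. A minimal sketch of that variant, assuming a placeholder line is wanted: the placeholder text and the names `BLOCK`, `text` and `result` are invented for illustration. The pattern stops at the end of the END line (no trailing `\n*`), so the surrounding line breaks stay intact, and the captured prefix is reused so the placeholder remains a comment.

```python
import re

# Group 1 captures the comment prefix ("#" or "//") of the BEGIN marker.
BLOCK = re.compile(
    r"^(#|//)_+NOT_FOR_CUSTOMER_BEGIN_+(?:\n.*)*?\n(?:#|//)_+NOT_FOR_CUSTOMER_END_+$",
    re.MULTILINE,
)

text = (
    "i want this\n"
    "#____NOT_FOR_CUSTOMER_BEGIN________\n"
    "not this\n"
    "nor this\n"
    "#________NOT_FOR_CUSTOMER_END____\n"
    "and this\n"
    "//____NOT_FOR_CUSTOMER_BEGIN__\n"
    "not this again\n"
    "nor this again\n"
    "//__________NOT_FOR_CUSTOMER_END____\n"
    "and this again"
)

# Hypothetical placeholder; \1 re-inserts the captured comment prefix.
result = BLOCK.sub(r"\1 [removed for customer build]", text)
print(result)
```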

【Comments】:
