sed to replace // with /* */ comments EXCEPT when // comments appear within /* */

Posted 2023-02-21

【Posted】: 2012-08-13 13:26:53

【Question】:

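The problem I'm facing is an ANSI compiler that requires C-style comments.

So I am trying to convert my existing comments to comply with the C standard ISO C89.

I am looking for a sed expression to replace // comments with /* */ comments, EXCEPT when the // comments appear within /* */ comments (which would break the comment).

I have tried this (a range expression) to no avail: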

sed -e '/\/*/,/*\//! s_//\(.*\)_/*\1 */_' > filename

Is it possible to ignore the one-line comments inside a comment like this, but change everything else?

/**********************************
* Some comment
* an example bit of code within the comment followed by a //comment
* some more comment
***********************************/
y = x+7; //this comment must be changed

Thanks!

【Comments】:

Regular expressions are not sufficient. Can you pass everything through the preprocessor of a C99-compatible compiler (e.g. cc -E)?

【Answer 1】:

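Here's a lightly tested filter written in C that should perform the conversion you want. Some comments on what this filter does that is difficult to handle with a regex:

- It ignores comment-like sequences that are enclosed in quotes (since they aren't comments).
- If a C99 comment being converted contains something that would start or end a C89 comment, it munges that sequence so there won't be a nested comment or a premature end of comment (a nested /* or */ is changed to /+ or *|). I wasn't sure whether you needed this (if you don't, it should be easy to remove).
- The munging of nested comments described above only happens inside a C99 comment that is being converted; the contents of comments that are already C89-style are not changed.
- It doesn't handle trigraphs or digraphs (I think this only leaves the possibility of missing an escape sequence or end-of-line continuation that is started by the trigraph ??/).

Of course, you'll need to perform your own testing to determine whether it's suitable for your purposes.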

#include <stdio.h>

char* a = " this is /* a test of \" junk // embedded in a '\' string";
char* b = "it should be left alone//";

// comment /* that should ***////  be converted.
/* leave this alone*/// but fix this one

// and "leave these \' \" quotes in a comment alone*
/****  and these '\' too //
*/


enum states {
    state_normal,
    state_double_quote,
    state_single_quote,
    state_c89_comment,
    state_c99_comment
};

enum states current_state = state_normal;

void handle_char( char ch)
{
    static char last_ch = 0;

    switch (current_state) {
        case state_normal:
            if ((last_ch == '/') && (ch == '/')) {
                putchar( '*');  /* NOTE: changing to C89 style comment */
                current_state = state_c99_comment;
            }
            else if ((last_ch == '/') && (ch == '*')) {
                putchar( ch);
                current_state = state_c89_comment;
            }
            else if (ch == '\"') {
                putchar( ch);
                current_state = state_double_quote;
            }
            else if (ch == '\'') {
                putchar( ch);
                current_state = state_single_quote;
            }
            else {
                putchar( ch);
            }
            break;

        case state_double_quote:
            if ((last_ch == '\\') && (ch == '\\')) {
                /* we want to output this \\ escaped sequence, but we */
                /* don't want to 'remember' the current backslash -   */
                /* otherwise we'll mistakenly treat the next character*/
                /* as being escaped                                   */

                putchar( ch);
                ch = 0;
            }
            else if ((ch == '\"') && (last_ch != '\\')) {
                putchar( ch);
                current_state = state_normal;
            }
            else {
                putchar( ch);
            }
            break;

        case state_single_quote:
            if ((last_ch == '\\') && (ch == '\\')) {
                /* we want to output this \\ escaped sequence, but we */
                /* don't want to 'remember' the current backslash -   */
                /* otherwise we'll mistakenly treat the next character*/
                /* as being escaped                                   */

                putchar( ch);
                ch = 0;
            }
            else if ((ch == '\'') && (last_ch != '\\')) {
                putchar( ch);
                current_state = state_normal;
            }
            else {
                putchar( ch);
            }
            break;

        case state_c89_comment:
            if ((last_ch == '*') && (ch == '/')) {
                putchar( ch);
                ch = 0; /* 'forget' the slash so it doesn't affect a possible slash that immediately follows */
                current_state = state_normal;
            }
            else {
                putchar( ch);
            }
            break;

        case state_c99_comment:
            if ((last_ch == '/') && (ch == '*')) {
                /* we want to change any slash-star sequences inside */
                /* what was a C99 comment to something else to avoid */
                /* nested comments                                   */
                putchar( '+');
            }
            else if ((last_ch == '*') && (ch == '/')) {
                /* similarly for star-slash sequences inside */
                /* what was a C99 comment                    */
                putchar( '|');
            }
            else if (ch == '\n') {
                puts( "*/");
                current_state = state_normal;
            }
            else {
                putchar( ch);
            }
            break;
    }

    last_ch = ch;
}

int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        handle_char( c);
    }

    return 0;
}

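Some indulgent commentary: many years ago, a shop I worked at wanted to impose a coding standard that forbade C99-style comments, on the grounds that even though the compiler we were using at the time had no problem with them, the code might one day have to be ported to a compiler that didn't support them. I (and others) successfully argued that this possibility was so remote as to be essentially non-existent, and that even if it did happen, a conversion routine to make the comments compatible could easily be written. We were permitted to use C99/C++ style comments.

I now consider my oath fulfilled, and any curse that might have been laid on me lifted.

【Comments】:

In theory, you can have backslash-newline combinations in the middle of the comment start or end sequences. Fortunately, in practice you don't have to worry about them; you can simply fire the programmer who wrote comment delimiters split across lines.

I'd say you fulfilled your oath.

FYI, this routine prefers UNIX line endings. (That tripped me up at first.)

【Answer 2】: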

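If you can't use @ephemient's suggestion, then you need to apply your regex across multiple lines, which is not sed's default behavior. sed has a hold buffer, which lets you append multiple lines together and apply the regex to the concatenated string.

The sed expression looks like this: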

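sed '1h;1!H;${;g;s/your-matcher-regex/replacement-regex/g;}'

1h - if it is the first line, put the line in the hold buffer (emptying it first)

1!H - if it is not the first line, append it to the hold buffer

${ ... } - if it is the last line, execute these sed commands (g copies the hold buffer back into the pattern space, then the substitution runs on the joined text)

Now your matcher expression will work even if /* and */ are on different lines.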
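
As a rough illustration of the hold-buffer idiom above, the sketch below joins the whole input and then deletes block comments that span lines. The function name strip_block_comments and the sample regex are made up for this example, and the regex is deliberately naive (it gives up on comments that contain a * in their body). The script also adds -n and a final p so that the joined text is printed exactly once; without them sed would auto-print each line as it is read.

```shell
# Hypothetical helper: slurp stdin into the hold space, then edit it once.
strip_block_comments() {
    sed -n '
1h
1!H
${
  g
  s#/\*[^*]*\*/##g
  p
}'
}

# The comment below starts on one line and ends on the next.
printf 'a /* one\ntwo */ b\n' | strip_block_comments
```

The same skeleton should work for other multi-line rewrites by swapping in a different s command; only the substitution line is specific to this example.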

【Comments】:

【Answer 3】:

awk '{if($0~/\/\//){sub(/\/\//,"/*");$0=$0"*/"};print}' temp

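【Comments】:

【Answer 4】:

Convert the code to colorized HTML using any converter that emits different markup for /* and // comments, process the output with perl/awk/sed/whatever, and then strip the markup.

【Comments】:

【Answer 5】: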

You can do this (almost) entirely in sed; you only need a single call to tr:

translate_comments_prepare.sed

s/\\/\\\\/g  # escape current escape characters
s/\$/\\S/g   # write all occurrences of $ as \S
s/(/\\o/g    # replace open braces with \o
s/)/\\c/g    # replace closing braces with \c
s/$/$/       # add a $ sign to the end of each line
s_/\*_(_g    # replace the start of comments with (
s_\*/_)_g    # replace the end of comments with )

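Then we pipe the result of the "preprocessing" step through tr -d '\n' to join all the lines (I haven't found a good way to do this from within sed).

Then we do the real work:

translate_comments.sed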

s_//\([^$]*\)\$_(\1)$_g  # replace all C++ style comments (even nested ones)
:b                       # while loop
                         # remove nested comment blocks:
                         #   (foo(bar)baz) --> (foobarbaz)
s/(\([^()]*\)(\([^()]*\))\([^()]*\))/(\1\2\3)/
tb                       # EOF loop
s_(_/*_g                 # reverse the steps done by the preparation phase
s_)_*/_g                 # ...
s/\$/\n/g                # split lines that were previously joined
s/\\S/$/g                # replace escaped special characters
s/\\o/(/g                # ...
s/\\c/)/g                # ...
s/\\\(.\)/\1/g           # ...

Then we basically put everything together:

sed -f translate_comments_prepare.sed | tr -d '\n' | sed -f translate_comments.sed

【Comments】:

【Answer 6】:

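This might work for you (GNU sed):

sed ':a;$!{N;ba};s/^/\x00/;tb;:b;s/\x00$//;t;s/\x00\(\/\*[^*]*\*\+\([^/*][^*]*\*\+\)*\/\)/\1\x00/;tb;s/\x00\/\/\([^\n]*\)/\/*\1\*\/\x00/;tb;s/\x00\(.\)/\1\x00/;tb' file

Explanation:

- :a;$!{N;ba} slurps the file into the pattern space.
- s/^/\x00/ sets a marker. N.B. this can be any character that does not occur in the file.
- tb;:b resets the substitution switch by jumping to the placeholder b.
- s/\x00$//;t the marker has reached the end of the file. All done.
- s/\x00\(\/\*[^*]*\*\+\([^/*][^*]*\*\+\)*\/\)/\1\x00/;tb this regex matches C-style comments and, if it matches, bumps the marker past them.
- s/\x00\/\/\([^\n]*\)/\/*\1\*\/\x00/;tb this regex matches single-line comments, replaces them with C-style comments and, if it matches, bumps the marker past them.
- s/\x00\(.\)/\1\x00/;tb this regex matches any single character and, if it matches, bumps the marker past it.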
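
To see the marker technique in action, here is a hedged sketch that follows the one-liner above but wraps it in a shell function and spreads the commands over several lines. The function name convert_line_comments is invented for this example. It assumes GNU sed (\x00, \+ and [^\n] are GNU extensions), and like the one-liner it knows nothing about string literals, so a // inside quotes would be rewritten as well.

```shell
# Hypothetical wrapper around the marker-walking script (GNU sed assumed).
convert_line_comments() {
    sed '
:a
$!{
  N
  ba
}
s/^/\x00/
tb
:b
s/\x00$//
t
s/\x00\(\/\*[^*]*\*\+\([^/*][^*]*\*\+\)*\/\)/\1\x00/
tb
s/\x00\/\/\([^\n]*\)/\/*\1\*\/\x00/
tb
s/\x00\(.\)/\1\x00/
tb
'
}

# The // inside the block comment should survive; the trailing one should not.
printf '/* a //b */\ny = x; //c\n' | convert_line_comments
```

Because the marker advances one character per loop iteration and each substitution rescans from the start of the pattern space, this is likely to be slow on large files; it seems best suited to one-off conversions.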

【Comments】: