Reading a C source file and skipping /**/ comments

Posted 2023-02-21


[Title]: Reading a C source file and skipping /**/ comments [Asked]: 2015-03-07 00:23:25 [Question]:

I managed to write code that skips // comments in C source code:

while (fgets(string, 10000, fin) != NULL)
{
    unsigned int i;
    for (i = 0; i < strlen(string); i++)
    {
        if ((string[i] == '/') && (string[i + 1] == '/'))
        {
            while (string[i += 1] != '\n')
                continue;
        }
        //rest of the code...
    }
}

I tried to do something similar for /**/ comments:

if ((string[i] == '/') && (string[i + 1] == '*'))
{
    while (string[i += 1] != '/')
        continue;
}

if ((string[i] == '*') && (string[i + 1] == '/'))
{
    while (string[i -= 1])
        continue;
}

But it reads the file line by line, so if I have, for example,

/*

text*/

then it counts the text.

How can I fix this?

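[Comments]:

Save the state in a variable and test it on the following iterations.

The string[i += 1] notation is an unconventional way of writing string[++i]. Also, the test for a newline is pointless: fgets() reads one line, and only one, so the comment continues to the end of the string. I won't bore you with all the special cases your code doesn't handle ("/* not a comment */", "// not a comment", backslashes at the ends of lines, trigraphs, etc.). There are other (multiple other) questions on this topic. Finding a good one to mark this as a duplicate of will be harder, too.

The C preprocessor will strip all comments correctly. I have a shell script that uses GCC's C preprocessor to remove comments, but it also reformats the program.

For other questions on this topic, see: Remove comments from C/C++ code and Python snippet to remove C and C++ comments. The second outlines some of the issues that production-strength code needs to handle.

Just for your amusement (or do I mean "angst"), I've found a new horrible trick for "this is not a comment, even though it looks a bit like one". #include <./*some*/header.h> includes the file header.h from the directory ./*some* (at least with GCC 4.9.1 on Mac OS X 10.10.1). Worse is #include <./*some/header.h>, which looks for header.h in the directory ./*some. Both tend to send naive C comment parsers off on the wrong track. You should also be wary of #include <some//header.h>, which does not contain a C++-style comment. I have some fixing to do on my own code!

[Answer 1]: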

Even your supposedly working code has several problems:

/* ... */

//

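Ultimately, C is a stream-oriented language, not a line-oriented one. It should be parsed that way (character by character). To do the job properly, you really need to implement a more sophisticated parser. If you are prepared to learn a new tool, you could consider basing your program on the Flex lexical analyzer.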

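[Comments]:

He doesn't need a full C parser just to strip comments. In fact, comments are usually stripped in the preprocessor phase.

@LuisColorado: No, he doesn't need a full C parser. I didn't say he did. He definitely does need something sophisticated, though: it needs to recognize enough C syntactic constructs to tell when a comment delimiter is in effect and when it is not.

[Answer 2]: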

A simple regular expression for a C comment is:

/\*([^\*]|\*[^\/])*\*\//

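(Sorry about the escape characters.) This allows any sequence inside the comment except */. It translates into the following DFA (four states):

state 0, input /, next state 1, no output
state 0, input other, next state 0, output the character read
state 1, input *, next state 2, no output
state 1, input /, next state 1, output /
state 1, input other, next state 0, output / and the character read
state 2, input *, next state 3, no output
state 2, input other, next state 2, no output
state 3, input /, next state 0, no output
state 3, input *, next state 3, no output
state 3, input other, next state 2, no output

The possible inputs are /, * and any other character. The possible outputs are: output the character read, output / and output *.

This translates into the following code:

File uncomment.c: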

#include <stdio.h>

int main()
{
    int c, st = 0;
    while ((c = getchar()) != EOF) {
        switch (st) {
        case 0: /* initial state */
            switch (c) {
            case '/': st = 1; break;
            default: putchar(c); break;
            } /* switch */
            break;
        case 1: /* we have read "/" */
            switch (c) {
            case '/': putchar('/'); break;
            case '*': st = 2; break;
            default: putchar('/'); putchar(c); st = 0; break;
            } /* switch */
            break;
        case 2: /* we have read "/*" */
            switch (c) {
            case '*': st = 3; break;
            default: break;
            } /* switch */
            break;
        case 3: /* we have read "/* ... *" */
            switch (c) {
            case '/': st = 0; break;
            case '*': break;
            default: st = 2; break;
            } /* switch */
            break;
        } /* switch */
    } /* while */
} /* main */

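If we want to strip both kinds of comment, we need to switch to a fifth state on receiving the second /, which yields the following code:

File uncomment2.c: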

#include <stdio.h>

int main()
{
    int c, st = 0;
    while ((c = getchar()) != EOF) {
        switch (st) {
        case 0: /* initial state */
            switch (c) {
            case '/': st = 1; break;
            default: putchar(c); break;
            } /* switch */
            break;
        case 1: /* we have read "/" */
            switch (c) {
            case '/': st = 4; break;
            case '*': st = 2; break;
            default: putchar('/'); putchar(c); st = 0; break;
            } /* switch */
            break;
        case 2: /* we have read "/*" */
            switch (c) {
            case '*': st = 3; break;
            default: break;
            } /* switch */
            break;
        case 3: /* we have read "/* ... *" */
            switch (c) {
            case '/': st = 0; break;
            case '*': break;
            default: st = 2; break;
            } /* switch */
            break;
        // in the next line we put // inside an `old' comment
        // to illustrate this special case.  The switch has been put
        // after the comment to show it is not being commented out.
        case 4: /* we have read "// ..." */ switch(c) {
            case '\n': st = 0; putchar('\n'); break;
            } // switch  (to illustrate this kind of comment).
        } /* switch */
    } /* while */
} /* main */
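
The automaton above treats every /* or // as the start of a comment, including those inside string and character literals. The following is a minimal sketch of one way to extend it, not part of the original answer: it adds hypothetical states for string literals, character literals and their backslash escapes. The function name strip_comments and the enum names are assumptions made for illustration, and the function works on an in-memory string rather than on stdin. Trigraphs, backslash-newline splicing and #include <...> header names are still not handled.

```c
#include <stddef.h>

/* States of the extended automaton (names are illustrative). */
enum state {
    CODE,        /* ordinary source text */
    SLASH,       /* we have read "/" */
    BLOCK,       /* inside a block comment */
    BLOCK_STAR,  /* inside a block comment, last character was '*' */
    LINE,        /* inside a // comment */
    STR,         /* inside a "..." literal */
    STR_ESC,     /* after a backslash inside a "..." literal */
    CHR,         /* inside a '...' literal */
    CHR_ESC      /* after a backslash inside a '...' literal */
};

/*
 * Copy src to dst without comments. dst must have room for
 * strlen(src) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating '\0'.
 */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = CODE;
    size_t n = 0;

    for (; *src != '\0'; src++) {
        char c = *src;
        switch (st) {
        case CODE:
            if (c == '/') { st = SLASH; break; }  /* hold the '/' back */
            if (c == '"') st = STR;
            else if (c == '\'') st = CHR;
            dst[n++] = c;
            break;
        case SLASH:
            if (c == '*') st = BLOCK;
            else if (c == '/') st = LINE;
            else {
                /* Not a comment: emit the held '/' and this character. */
                dst[n++] = '/';
                dst[n++] = c;
                st = (c == '"') ? STR : (c == '\'') ? CHR : CODE;
            }
            break;
        case BLOCK:
            if (c == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/') st = CODE;
            else if (c != '*') st = BLOCK;
            break;
        case LINE:
            if (c == '\n') { dst[n++] = c; st = CODE; }
            break;
        case STR:
            dst[n++] = c;
            if (c == '\\') st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:
            dst[n++] = c;   /* escaped character, copied verbatim */
            st = STR;
            break;
        case CHR:
            dst[n++] = c;
            if (c == '\\') st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            dst[n++] = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        dst[n++] = '/';     /* input ended right after a lone '/' */
    dst[n] = '\0';
    return n;
}
```

With this sketch, strip_comments("puts(\"/* no */\"); // gone\n", out) leaves the string literal intact and drops only the trailing // comment. Like the code above, it deletes a block comment outright instead of replacing it with a space.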

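[Comments]:

Yes, very nice. But what if the comment delimiters appear in a string literal: puts("/* ... */")? Or inside a multi-character character literal? (Ew.) In any case, you've made the same point I did: the source needs to be parsed character by character, and the parsing needs to be more sophisticated than just scanning for the delimiters.

The last state you listed, "state 3, input other, next state 3, no output", should be "state 3, input other, next state 2, no output", shouldn't it? Otherwise it terminates a comment prematurely, as in /* any * thing / goes */ (because it remembers that it found a *, and then when it gets a /, it terminates the comment). And, indeed, your code implements the corrected version of the last state, so I've edited the specified DFA to match what was implemented.

@JonathanLeffler, thanks for your edit. Fortunately, the code was OK. I checked the code before posting, but could not do the same with the text. Sorry.

@JohnBollinger, you are completely right, we have to check for "-delimited strings. In the case of character literals, I'm afraid none of the /*, */ and // sequences is allowed as a character constant. The string case is complicated, because we also have to deal with escaped \" inside them. Either way, the automaton is not too complicated, and it can be derived from this one as an exercise for the reader :)

[Answer 3]: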

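This simple code ignores /* */ comments (it does not handle every case, for example /* written inside a quoted string assigned to a variable in the C code):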

#include <stdio.h>
#include <string.h>
#include <stdbool.h> // bool, false = 0 and true = 1

int main(int argc, char *argv[])
{
     FILE* file=fopen("file","r"); // open the file
     bool comment=false;
     char str[1001]; // string that will contain portion of the file each time

     if (file!=NULL)
     {
         while (fgets(str,sizeof(str),file)!=NULL)
         {
             int i=0;
             for (i=0;i<strlen(str);i++)
             {
                 if (str[i]=='/' && str[i+1] == '*')
                 {
                     comment=true; // comment true we will ignore till the end of comment
                     i++; // skip the * character
                 }
                 else if (str[i]=='*' && str[i+1] == '/')
                 {
                     comment=false;
                     i++; // skip the / character
                 }
                 else if (comment==false)
                 {
                     printf("%c",str[i]); // if the character not inside comment print it
                 }
             }
         }
         fclose(file);
     }

     return 0;
}

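[Comments]:

"Does not handle every case": which cases?

Note that you should use sizeof(str) as the argument to fgets(); it already knows that if you specify 1001 as the size (via sizeof(str)), it must use the last byte for the terminating null byte.

@WeatherVane: Among other things, it does not handle comment-start characters in string literals (or multi-character literals).

@JonathanLeffler I was hoping Meninx would explain.

@WeatherVane Honestly, I was not aware of that case when I wrote the code, but after reading John Bollinger's answer I realized that there are too many cases to handle, especially if the file contains complex C code :) ! Thanks to you and Jonathan Leffler!

[Answer 4]: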

(It is not yet clear what your program is meant to do.)

Using flex to count the characters outside comments:

%option noyywrap

%%
   int i = 0;

\"([^\\"]|\\.)*\"           i += yyleng ;        // treatment of strings
\/\/.*                                           // C++ comments
\/\*([^*]|\*+[^*/])*\*+\/                        // C  comments
.|\n                        i += yyleng ;        // normal chars

<<EOF>>                     printf("%d\n",i); return;
%%

int main()
{
  yylex();
  return 0;
}

and

$ flex count-non-com.fl
$ cc -o count-non-com lex.yy.c
$ ./count-non-com < input

A final example: flex code that removes comments (thanks @LuisColorado)

%option noyywrap 
%%

\"([^\\"]|\\.)*\"           ECHO;        // treatment of strings
\/\/.*                                   // C++ comments
\/\*([^*]|\*+[^*/])*\*+\/                // C  comments
.|\n                        ECHO;        // normal chars

%%

int main()
{
  yylex();
  return 0;
}

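[Comments]:

@LuisColorado, thank you! If I understand correctly, you edited my code but the edit was rejected. I have seen it now, and it has some good contributions. I have tried to reconcile the two versions.

[Answer 5]: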

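Create an int variable. Scan the characters and, when you get /*, store the index. Continue scanning until you get */. If the variable != 0 at that point, assume this is the closing comment marker and ignore the characters in between.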
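
A rough sketch of this idea, assuming the whole file has already been read into one NUL-terminated buffer (the answer does not say how the text is held). The function name remove_block_comments is hypothetical. It remembers where the opening /* was found with a pointer instead of an int index, which avoids confusing "no comment seen" with a comment that starts at index 0. It knows nothing about string literals or // comments.

```c
#include <string.h>

/*
 * Remove every block comment from the NUL-terminated buffer buf, in place.
 * An unterminated comment is cut off at its opening delimiter.
 */
void remove_block_comments(char *buf)
{
    char *start;

    /* start marks where the opening delimiter was found. */
    while ((start = strstr(buf, "/*")) != NULL) {
        /* Search after the opener so that the text "/*" followed by "/" does not close itself. */
        char *end = strstr(start + 2, "*/");
        if (end == NULL) {
            *start = '\0';          /* no closing marker: drop the rest */
            return;
        }
        end += 2;                   /* first character after the closer */
        memmove(start, end, strlen(end) + 1);  /* close the gap, keep '\0' */
        buf = start;                /* continue scanning from the join */
    }
}
```

Because the buffer holds the whole text, a comment that spans several lines needs no special treatment, which is the case the question struggles with.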

[Comments]:

[Answer 6]:

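As user279599 just said, use an integer variable as a flag: whenever you get '/' and '*' consecutively, set the flag (flag = 1); the flag then stays 1 until you get '*' and '/' consecutively. Ignore every character while the flag is 1.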
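
The flag approach could look something like the sketch below, which keeps the question's line-by-line fgets() loop. The names strip_line and strip_file and the 1024-byte buffers are assumptions for illustration only. The flag lives outside the per-line function, so a comment opened on one line is still open when the next line is read. String literals and // comments are not handled, and a delimiter split across two fgets() chunks (a line longer than the buffer) would be missed.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Copy one line from `line` to `out`, dropping block-comment text.
 * *in_comment is the flag; it carries over from one line to the next.
 * out needs room for strlen(line) + 1 bytes.
 */
void strip_line(const char *line, char *out, int *in_comment)
{
    size_t i = 0, n = 0;

    while (line[i] != '\0') {
        /* line[i + 1] is at worst the terminating '\0', so it is safe to read. */
        if (!*in_comment && line[i] == '/' && line[i + 1] == '*') {
            *in_comment = 1;        /* comment opens: set the flag */
            i += 2;
        } else if (*in_comment && line[i] == '*' && line[i + 1] == '/') {
            *in_comment = 0;        /* comment closes: clear the flag */
            i += 2;
        } else {
            if (!*in_comment)
                out[n++] = line[i]; /* keep only text outside comments */
            i++;
        }
    }
    out[n] = '\0';
}

/* Read fin line by line, as in the question, and write the result to fout. */
void strip_file(FILE *fin, FILE *fout)
{
    char line[1024], out[1024];
    int in_comment = 0;             /* survives across fgets() calls */

    while (fgets(line, sizeof line, fin) != NULL) {
        strip_line(line, out, &in_comment);
        fputs(out, fout);
    }
}
```

For the question's example, the first line "/*" sets the flag, so "text" on the following line is skipped until "*/" clears it.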

[Comments]:
