Using C/C++ to efficiently de-serialize a string comprised of floats, tokens and blank lines

Posted 2023-02-17


[Asked]: 2010-01-14 04:57:54

[Question]:

I have large strings that resemble the following...

some_text_token

24.325973 -20.638823
-1.964366 0.753947
-1.290811 -3.547422
0.813014 -3.547227
0.472015 3.723311
-0.719116 3.676793

other_text_token

24.325973 20.638823
-1.964366 0.753947
-1.290811 -3.547422
-1.996611 -2.877422
0.813014 -3.547227
1.632365 2.083673
0.472015 3.723311
-0.719116 3.676793
...

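...from which I am trying to efficiently grab, in the interleaved order in which they appear in the string...

text tokens
floating-point values
blank lines

...but I am having trouble.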

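I tried strtod and successfully grabbed the floats from the string, but I can't seem to get a loop using strtod to report the interleaved text tokens and blank lines back to me. Given the interleaved tokens and blank lines that I am also interested in, I'm not 100% confident that strtod is the "right track".

The tokens and blank lines are present in the string to give context to the floats, so that my program knows what the float values occurring after each token are to be used for. But strtod seems, understandably, more geared towards just reporting the floats it encounters in a string, with no regard for such silly things as blank lines or tokens.

I know this isn't hard conceptually, but being relatively new to C/C++, I have trouble judging which language features I should focus on to take best advantage of the efficiency C/C++ can offer.

Any suggestions? I am very interested in why various approaches work more or less efficiently. Thanks!!!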

[Comments on the question]:

Try fgets() and sscanf().

[Answer 1]:

Using C, I would do something like this (untested):

#include <stdio.h>

#define MAX 128

char buf[MAX];
while (fgets(buf, sizeof buf, fp) != NULL) {
    double d1, d2;
    if (buf[0] == '\n') {
        /* saw blank line */
    } else if (sscanf(buf, "%lf%lf", &d1, &d2) != 2) {
        /* buf has the next text token, including '\n' */
    } else {
        /* use the two doubles, d1, and d2 */
    }
}

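I test for a blank line first because that check is relatively inexpensive. Depending on your needs:

1. You may need to increase or change MAX.
2. You may need to check whether buf ends with a newline; if it does not, the line was too long for the buffer.
3. You may need a function that reads complete lines from a file, using malloc() and realloc() to allocate the buffer dynamically.
4. You may also want to handle special cases such as a single floating-point value on a line. sscanf() returns the number of input items successfully matched and assigned.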
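For lines that may be longer than a fixed MAX-sized buffer, a line reader built on malloc() and realloc() could look roughly like the sketch below. The name read_line and the starting capacity of 128 are assumptions made for illustration; they do not come from the answer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one complete line from fp, however long.
 * Returns a malloc()ed string that the caller must free(), or NULL at
 * end of file or on allocation failure. A trailing '\n' is kept. */
char *read_line(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;             /* got a complete line */

        /* No newline yet: the line may not fit, so grow the buffer. */
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (len > 0)
        return buf;                 /* last line without a trailing '\n' */
    free(buf);
    return NULL;
}
```

Each returned line can then be handed to the same blank/token/float tests as buf in the loop above. On POSIX systems, getline() offers similar behaviour without hand-written growth logic.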

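I have also assumed that blank lines are truly blank (just the newline character by itself). If not, you will need to skip leading whitespace. isspace() in ctype.h is useful in that case.

fp is a valid FILE * object returned by fopen().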
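If blank lines may contain spaces or tabs, the per-line test could be sketched as below, using isspace() to skip leading whitespace and the return value of sscanf() to tell the cases apart. The enum, the function name classify_line, and the extra " %c" conversion for catching trailing junk are illustrative assumptions, not part of the answer's code.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical names, used only for illustration. */
enum line_kind { LINE_BLANK, LINE_TOKEN, LINE_FLOATS, LINE_MALFORMED };

enum line_kind classify_line(const char *line, double *d1, double *d2)
{
    const char *p = line;
    char extra;

    /* Skip leading whitespace. The cast avoids undefined behaviour
     * when char is signed and the byte value is negative. */
    while (isspace((unsigned char)*p))
        ++p;
    if (*p == '\0')
        return LINE_BLANK;

    /* The space before %c skips whitespace, including the trailing
     * '\n', so %c only matches a leftover non-blank character. */
    switch (sscanf(p, "%lf%lf %c", d1, d2, &extra)) {
    case 2:  return LINE_FLOATS;     /* exactly two numbers */
    case 0:  return LINE_TOKEN;      /* does not start with a number */
    default: return LINE_MALFORMED;  /* one number, or trailing junk */
    }
}
```

One caveat of this approach: a text token that begins with something the number parser accepts, such as "nan" or "inf", would not be reported as a token, so the token vocabulary needs to avoid those prefixes.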

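[Comments]:

5. You might want to detect malformed input (e.g. "1.0 1.0foo"). (If you want to use sscanf rather than strtod, you can use "%lf%lf%c" as the format string and verify that either no character was read or that it is a newline.)

[Answer 2]:

Wow, I don't write many parsers in C anymore.

This has been tested on the OP's input: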

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
  scan_blank, scan_label, scan_float
} tokens;

double f1, f2;

char line[512], string_token[sizeof line];

/* Classify the current contents of line by its first non-blank character. */
tokens scan(void) {
  char *s;
  for(s = line; *s; ++s) {
    switch(*s) {
      case ' ':
      case '\t':
        continue;
      case '\n':
        return scan_blank;
      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8': case '9':
      case '.': case '-':
        sscanf(line, " %lf %lf", &f1, &f2);
        return scan_float;
      default:
        sscanf(line, " %s", string_token);
        return scan_label;
    }
    abort();
  }
  abort();  /* reached only if the line has no trailing newline */
}

int main(void) {
  int n;
  for(n = 1;; ++n) {
    if (fgets(line, sizeof line, stdin) == NULL)
      return 0;
    printf("%2d %-40.*s", n, (int)strlen(line)-1, line);
    switch(scan()) {
      case scan_blank:
        printf("blank\n");
        break;
      case scan_label:
        printf("label [%s]\n", string_token);
        break;
      case scan_float:
        printf("floats [%lf %lf]\n", f1, f2);
        break;
    }
  }
}

[Comments]:

[Answer 3]:

This is somewhat rough and untested, but the general idea is to try to parse each line and see what is in it:

while (!feof (stdin))
{
    char buf [100];
    if (!fgets (buf, sizeof buf, stdin))
        break;  // end of file or error

    // skip leading whitespace
    char *cp = buf;
    while (isspace ((unsigned char) *cp))
         ++cp;

    if (*cp == '\000')  // blank line?
    {
        do_whatever_for_a_blank_line ();
        continue;
    }

    // try reading a float
    double v1, v2;
    char *ep = NULL;
    v1 = strtod (cp, &ep);
    if (ep == cp)   // if nothing parsed
    {
        do_whatever_for_a_text_token (cp);
        continue;
    }

    cp = ep;        // resume just past the first number
    while (isspace ((unsigned char) *cp))
       ++cp;
    ep = NULL;
    v2 = strtod (cp, &ep);
    if (ep == cp)   // if no float parsed
    {
         handle_single_floating_value (v1);
         continue;
    }
    handle_two_floats (v1, v2);
}

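[Comments]:

fgets returns char *, not int. Most of the time, while (!feof(fp)) ... is wrong in C: c-faq.com/stdio/feof.html.

Looking at your code some more, with the fgets() return value fixed, you don't have the bug mentioned in the link above. Still, I would move the fgets() call itself into the condition of the while. (I couldn't edit my last comment, hence a new one.)

Quite right. I have fixed fgets() accordingly. Although feof() is problematic, combining it with fgets as shown above works fine.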
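The suggestion in the comments, putting fgets() in the loop condition instead of testing feof(), could be sketched as follows. The struct tally and the function scan_stream are hypothetical; they stand in for the answer's handler functions so that the loop has something observable to do.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tally, standing in for the answer's handler functions. */
struct tally { int blanks, tokens, singles, pairs; double sum; };

void scan_stream(FILE *in, struct tally *t)
{
    char buf[100];

    /* fgets() returns NULL at end of file or on error, so the loop
     * ends without any separate feof() test. */
    while (fgets(buf, sizeof buf, in) != NULL) {
        char *cp = buf, *ep;
        double v1, v2;

        while (isspace((unsigned char)*cp))
            ++cp;
        if (*cp == '\0') { t->blanks++; continue; }

        v1 = strtod(cp, &ep);
        if (ep == cp) { t->tokens++; continue; }  /* nothing parsed */

        cp = ep;                  /* resume after the first number */
        v2 = strtod(cp, &ep);     /* strtod skips leading whitespace */
        if (ep == cp) {
            t->singles++;
            t->sum += v1;
        } else {
            t->pairs++;
            t->sum += v1 + v2;
        }
    }
}
```

The end pointer is what makes strtod usable here: when it comes back equal to the start pointer, nothing numeric was found at that position, and the line can be treated as a token instead.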
