Parsing a string in C++ with a specific format

Posted 2023-02-16

【Title】: Parsing a string in C++ with a specific format 【Posted】: 2021-12-28 18:26:13 【Question】:

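I have this string, post "ola tudo bem como esta" alghero.jpg, and I want to split it into three parts: post, ola tudo bem como esta (without the quotation marks), and alghero.jpg. I tried it in C, because I am new to C++ and not very good at programming in it, but it does not work. Is there a more efficient way to do this in C++?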

Program:

#include <cstdio>
#include <cstring>

int main()
{
    char* token1 = new char[128];
    char* token2 = new char[128];
    char* token3 = new char[128];
    char str[] = "post \"ola tudo bem como esta\" alghero.jpg";
    char *token;

    /* get the first token */
    token = strtok(str, " ");
    // walk through the other tokens; this splits on every space,
    // so the quoted text is broken into separate words
    while( token != NULL )
    {
        printf( " %s\n", token );

        token = strtok(NULL, " ");
    }
    return(0);
}

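【Comments】:

std::stringstream might help.

With std::string you can use find and substr.

But how exactly?

When you say you need to parse something, it is important to be clear about how you expect the data to be formatted. If you only wanted to parse this one string, you could hard-code the result, so presumably you want to parse other strings with the same format as post "ola tudo bem como esta" alghero.jpg. What can we expect of those strings? Is it always a triple separated by single spaces?

Yes, I want to parse strings of that format, but I don't want the quotes around the quoted string.

【Answer 1】: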

In C++14 and later, you can use std::quoted to read a quoted string from any std::istream, such as a std::istringstream, for example:

#include <iostream>
#include <sstream>
#include <string>
#include <iomanip>

int main()
{
    std::string token1, token2, token3;
    std::string str = "post \"ola tudo bem como esta\" alghero.jpg";

    // std::quoted strips the surrounding quotes while extracting token2
    std::istringstream(str) >> token1 >> std::quoted(token2) >> token3;

    std::cout << token1 << "\n";
    std::cout << token2 << "\n";
    std::cout << token3 << "\n";

    return 0;
}
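
The same extraction can be wrapped in a function that reports failure by checking the stream state afterwards. The following is only a sketch: the names Fields and parse_fields are hypothetical and are not part of the answer above. It relies on std::quoted also unescaping \" inside the quoted text, which is the documented default behaviour.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical result type: the three fields of one input line.
struct Fields {
    std::string command;
    std::string text;
    std::string file;
};

// Hypothetical helper: returns false when any of the three extractions fails.
bool parse_fields(const std::string& line, Fields& out) {
    std::istringstream in(line);
    Fields parsed;
    // The stream converts to false if one of the extractions failed.
    if (in >> parsed.command >> std::quoted(parsed.text) >> parsed.file) {
        out = parsed;
        return true;
    }
    return false;
}
```

For the sample line this fills command with post, text with ola tudo bem como esta and file with alghero.jpg. A line with fewer than three fields makes the function return false and leaves out untouched.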

【Answer 2】:

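Use find to locate the positions of the two quotes. Use substr to take the text from index 0 up to the first quote, the text between the first and second quotes, and the text from the second quote to the end.

#include <iostream>
#include <string>

int main() {
    std::string s = "post \"ola tudo bem como esta\" alghero.jpg";
    auto first = s.find('"');
    if (first != s.npos) {
        auto second = s.find('"', first + 1);
        if (second != s.npos) {
            // Assumes exactly one space before the opening quote and one after the closing quote
            std::cout << s.substr(0, first-1) << '\n';
            std::cout << s.substr(first+1, second-first-1) << '\n';
            std::cout << s.substr(second+2) << '\n';
        }
    }
}

Output: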

post
ola tudo bem como esta
alghero.jpg
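
The fixed offsets first-1 and second+2 only work when exactly one space surrounds the quoted part. As a rough sketch of one way to relax that, the outer parts can be trimmed instead. The helper names trim and split_on_quotes are hypothetical, and the sketch assumes that only space characters need to be trimmed.

```cpp
#include <string>

// Hypothetical helper: strip leading and trailing spaces.
std::string trim(const std::string& s) {
    const auto begin = s.find_first_not_of(' ');
    if (begin == std::string::npos) {
        return "";
    }
    const auto end = s.find_last_not_of(' ');
    return s.substr(begin, end - begin + 1);
}

// Hypothetical helper: splits `s` around its first pair of double quotes.
// Returns false if `s` contains fewer than two quotes.
bool split_on_quotes(const std::string& s, std::string& before,
                     std::string& quoted, std::string& after) {
    const auto first = s.find('"');
    if (first == std::string::npos) {
        return false;
    }
    const auto second = s.find('"', first + 1);
    if (second == std::string::npos) {
        return false;
    }
    before = trim(s.substr(0, first));
    quoted = s.substr(first + 1, second - first - 1);
    after = trim(s.substr(second + 1));
    return true;
}
```

With this variant the amount of spacing around the quoted part no longer matters, and a missing quote is reported to the caller instead of being silently ignored.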

【Answer 3】:

One option for parsing the string is to use a regular expression, for example:

#include <iostream>
#include <regex>
#include <string>

// struct to hold return value of parse function
struct parse_result_t
{
    bool parsed{ false };
    std::string token1;
    std::string token2;
    std::string token3;
};

// the parse function
auto parse(const std::string& string)
{
    // The regex, piece by piece:
    // ^          match start of the string
    // (.*?)      lazily capture any characters (token 1)
    // \s+        match one or more whitespace characters
    // \"(.*?)\"  capture the text between two quotes (token 2)
    // \s+        match one or more whitespace characters
    // (.*)$      capture 0 or more characters until end of string (token 3)
    // Every backslash is doubled and every quote escaped for the C++ string literal.
    // see it live here : https://regex101.com/r/XnkAZV/1
    static std::regex rx{ "^(.*?)\\s+\\\"(.*?)\\\"\\s+(.*)$" };

    std::smatch match;
    parse_result_t result;

    if (std::regex_search(string, match, rx))
    {
        result.parsed = true;
        result.token1 = match[1];
        result.token2 = match[2];
        result.token3 = match[3];
    }

    return result;
}

int main()
{
    auto result = parse("post \"ola tudo bem como esta\" alghero.jpg");

    std::cout << "parse result = " << (result.parsed ? "success" : "failed") << "\n";
    std::cout << "token 1 = " << result.token1 << "\n";
    std::cout << "token 2 = " << result.token2 << "\n";
    std::cout << "token 3 = " << result.token3 << "\n";

    return 0;
}
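
The doubled backslashes make the pattern hard to read. As an illustration only, a raw string literal avoids the extra escaping. The function name parse_with_raw_regex is hypothetical, and the pattern is a stricter variant that assumes the first and last tokens contain no whitespace. The custom delimiter rx is needed because the pattern itself contains the sequence )", which would otherwise end the raw string early.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: fills the three tokens and returns true on a full match.
bool parse_with_raw_regex(const std::string& input, std::string& token1,
                          std::string& token2, std::string& token3) {
    // (\S+)     first token: one or more non-whitespace characters
    // \s+       one or more whitespace characters
    // "([^"]*)" quoted text; the quotes themselves are not captured
    // \s+       one or more whitespace characters
    // (\S+)     last token: one or more non-whitespace characters
    static const std::regex rx{ R"rx((\S+)\s+"([^"]*)"\s+(\S+))rx" };

    std::smatch match;
    // regex_match succeeds only if the whole input matches the pattern.
    if (!std::regex_match(input, match, rx)) {
        return false;
    }
    token1 = match[1].str();
    token2 = match[2].str();
    token3 = match[3].str();
    return true;
}
```

Because std::regex_match has to consume the entire input, no ^ or $ anchors are required here.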

【Answer 4】:

If the parts are always separated by a single space, you can use std::string::find and std::string::rfind to locate the first and the last space, split at those characters, and then unquote the middle string:

#include <iostream>
#include <tuple>
#include <string>

std::string unquote(const std::string& str) {
    // Leave the string unchanged unless it is wrapped in a pair of quotes
    if (str.size() < 2 || str.front() != '"' || str.back() != '"') {
        return str;
    }
    return str.substr(1, str.size() - 2);
}

std::tuple<std::string, std::string, std::string> parse_triple_with_quoted_middle(const std::string& str) {
    // Assumes str contains at least two spaces; the results of find/rfind are not checked
    auto iter1 = str.begin() + str.find(' ');
    auto iter2 = str.begin() + str.rfind(' ');

    auto str1 = std::string(str.begin(), iter1);
    auto str2 = std::string(iter1 + 1, iter2);
    auto str3 = std::string(iter2 + 1, str.end());

    return { str1, unquote(str2), str3 };
}

int main()
{
    std::string test = "post \"ola tudo bem como esta\" alghero.jpg";
    // Structured bindings require C++17
    auto [str1, str2, str3] = parse_triple_with_quoted_middle(test);
    std::cout << str1 << "\n";
    std::cout << str2 << "\n";
    std::cout << str3 << "\n";
}

You should probably add more input validation to the above, though.
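
As a sketch of what such validation might look like, the variant below rejects lines that do not have two distinct separating spaces, a quoted middle part, and non-empty outer parts. The name parse_triple_checked and the choice to report errors through a bool return value are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical validated variant: returns false instead of producing
// garbage when the line does not look like: word "quoted text" word
bool parse_triple_checked(const std::string& str, std::string& first,
                          std::string& middle, std::string& last) {
    const auto pos1 = str.find(' ');
    const auto pos2 = str.rfind(' ');
    // There must be two distinct spaces.
    if (pos1 == std::string::npos || pos1 == pos2) {
        return false;
    }
    // The first and last parts must not be empty.
    if (pos1 == 0 || pos2 + 1 == str.size()) {
        return false;
    }
    const std::string mid = str.substr(pos1 + 1, pos2 - pos1 - 1);
    // The middle part must be wrapped in a pair of quotes.
    if (mid.size() < 2 || mid.front() != '"' || mid.back() != '"') {
        return false;
    }
    first = str.substr(0, pos1);
    middle = mid.substr(1, mid.size() - 2);
    last = str.substr(pos2 + 1);
    return true;
}
```

Inputs such as post alghero.jpg (only one space) or post ola alghero.jpg (unquoted middle) are rejected, while the sample line is split into its three parts as before.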

【Answer 5】:

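You can use a regular expression for this:

The pattern that is searched for repeatedly is: optional leading whitespace, \s*; then ([^\"]*), zero or more characters other than a quote (zero or more because several quotes may follow one another), which we capture as a group (hence the parentheses); and finally either a quote \" or (|) the end of the expression $, in a group that we do not capture, (?:). We use std::regex to store the pattern, wrapping it all in R"()" so that we can write the expression as a raw string. The while loop does a few things: it searches for the next match with regex_search, extracts the captured group, and updates the input line so that the next search starts where the current one finished. Because the pattern can also match an empty string, the loop stops once line is empty. matches is an array whose first element, matches[0], is the part of line that matched the whole pattern, and whose following elements correspond to the pattern's capture groups.

#include <iostream>  // cout
#include <regex>  // regex_search, smatch
#include <string>  // string

int main()
{
    std::string line{"post \"ola tudo bem como esta\" alghero.jpg"};
    std::regex pattern{R"(\s*([^\"]*)(?:\"|$))"};
    std::smatch matches;
    // The pattern also matches an empty string, so stop once the input is consumed
    while (!line.empty() && std::regex_search(line, matches, pattern))
    {
         // Note: the first token keeps the space that precedes the opening quote
         std::cout << matches[1] << "\n";
         line = matches.suffix().str();
    }
}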
