Idiomatically split a string_view

【Title】Idiomatically split a string_view 【Posted】2017-12-28 18:28:33 【Question】:

I read "The most elegant way to iterate the words of a string" and enjoyed the succinctness of the answer. Now I want to do the same for string_view. The problem is that stringstream cannot take a string_view:

#include <iostream>
#include <string>
#include <string_view>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string_view sentence = "And I feel fine...";
    istringstream iss(sentence); // <== error
    copy(istream_iterator<string_view>(iss),
         istream_iterator<string_view>(),
         ostream_iterator<string_view>(cout, "\n"));
}

So is there a way to do this? If not, what is the reason such a thing is not idiomatic?

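【Comments】:

【Answer 1】:

Split on a delimiter and return a vector<string_view>

Designed for fast splitting of lines in a .csv file.

Tested under MSVC 2017 v15.9.6 and Intel Compiler v19.0, compiled with C++17 (which is required for string_view).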

#include <string_view>
#include <vector>

std::vector<std::string_view> Split(const std::string_view str, const char delim = ',')
{
    std::vector<std::string_view> result;

    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;

    for (int i = 0; i < static_cast<int>(str.size()); i++)
    {
        if (str[i] == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            int index = indexCommaToLeftOfColumn + 1;
            int length = indexCommaToRightOfColumn - index;

            // Bounds checking can be omitted as logically, this code can never be invoked
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }
            if (length < 0)
            {
                length = 0;
            }*/

            std::string_view column(str.data() + index, length);
            result.push_back(column);
        }
    }
    // Everything after the last delimiter (or the whole input if there was none).
    const std::string_view finalColumn(str.data() + indexCommaToRightOfColumn + 1, str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
    return result;
}

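A note on lifetimes: a string_view should never outlive the parent string it is a window into. If the parent string goes out of scope, what the string_view points to is invalid. In this particular case, the API design makes it hard to get wrong, because the input and output are both string_views, and all of them are windows into the parent string. This ends up being quite efficient in terms of memory copying and CPU usage.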
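To make the lifetime point concrete, here is a minimal sketch (not part of the original answer; `FirstField` and `ViewAliasesParent` are names made up for illustration). It shows that a sub-view is only a pointer and a length into the parent's buffer, so the parent has to stay alive for as long as the view is used.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: returns the text before the first delimiter.
// The result aliases `line`; no characters are copied.
std::string_view FirstField(std::string_view line, char delim = ',')
{
    const auto pos = line.find(delim);   // npos when there is no delimiter
    return line.substr(0, pos);          // substr clamps the count to size()
}

// Hypothetical check: the view points straight into the parent's buffer.
bool ViewAliasesParent()
{
    const std::string parent = "alpha,beta";   // owner outlives the view below
    const std::string_view field = FirstField(parent);
    return field == "alpha" && field.data() == parent.data();
}

// Unsafe pattern, shown only as a comment because it is undefined behaviour:
//   std::string_view bad = FirstField(std::string("alpha,beta"));
//   // the temporary std::string is destroyed here, so `bad` dangles
```

The safe pattern is simply to keep the owning std::string in a scope that encloses every use of the views derived from it.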

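Note that the one downside of using string_view is that you lose implicit null termination. So use functions that support string_view, such as the lexical_cast functions in Boost for converting strings to numbers.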
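As an illustration that is not from the original answer: one standard-library option that needs no null terminator is std::from_chars from <charconv> (C++17), which works on a [first, last) character range. The helper name `ParseInt` below is hypothetical, and rejecting trailing characters is a design choice of this sketch.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse the whole view as a base-10 int.
// Returns std::nullopt on a parse error or if trailing characters remain.
std::optional<int> ParseInt(std::string_view text)
{
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
        return std::nullopt;
    return value;
}
```

Because only the range is read, a view such as std::string_view("12345", 3) parses as 123 even though the underlying buffer continues past the end of the view.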

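I use this to parse a .csv file quickly. To get each line of the .csv file, I use istringstream and getline(), which is very fast (about 2 GB/second, or 1,200,000 lines per second, on a single core).

Unit tests. Tested with Google Test (which I installed using vcpkg).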
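The line-by-line loop described here might look roughly like the sketch below. It rests on several assumptions: `ParseCsv` and `SplitLine` are hypothetical names, `SplitLine` is a compact find-based stand-in for the Split function above, and fields are copied into std::string because the line buffer is overwritten on every iteration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for Split(): returns views into `line`, valid only while the
// underlying characters are unchanged.
std::vector<std::string_view> SplitLine(std::string_view line, char delim = ',')
{
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            fields.push_back(line.substr(start));   // last (or only) field
            return fields;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
}

// Hypothetical driver: read lines with std::getline and split each one.
std::vector<std::vector<std::string>> ParseCsv(std::istream& in)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;                                // buffer reused per line
    while (std::getline(in, line)) {
        const auto views = SplitLine(line);
        // Copy the fields out before the next getline overwrites `line`.
        rows.emplace_back(views.begin(), views.end());
    }
    return rows;
}
```

If the whole file is already held in one buffer, the copies can be avoided by keeping views into that buffer instead, at the cost of having to keep the buffer alive.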

// Google Test integrates into VS2017 if ReSharper is installed.
#include "gtest/gtest.h" // Can install using vcpkg
// In main(), call:
// ::testing::InitGoogleTest(&argc, argv);return RUN_ALL_TESTS();

TEST(Strings, Split)
{
    {
        const std::string str = "A,B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = ",B,C";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = "A,B,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "");
    }
    {
        const std::string str = "A";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "A");
    }
    {
        const std::string str = ",";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "A,";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",B";
        auto tokens = Split(str, ',');
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
    }
}

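【Comments】:

Your code quality looks very good: it comes with unit tests and uses int to avoid unsigned-subtraction problems.

【Answer 2】:

If you want to use that particular approach, you just need to explicitly convert the string_view to a string: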

istringstream iss{string(sentence)}; // N.B. braces to avoid most vexing parse
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     ostream_iterator<string_view>(cout, "\n"));

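The C++ standard library does not have good string manipulation facilities. You may want to look at what Boost, Abseil, and similar libraries offer. Any of them will be better than this.

【Comments】:

【Answer 3】:

A stringstream owns the string it operates on. That means it creates a copy of the string it is given. It cannot merely reference the string.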
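The copy is easy to observe. In this small sketch (an illustration added here; `StreamKeepsOwnCopy` is a hypothetical name), the source string is modified after the stream is constructed, yet the stream still yields the original words.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true when the stream is unaffected by later changes to `source`,
// which shows that std::istringstream holds its own copy of the characters.
bool StreamKeepsOwnCopy()
{
    std::string source = "And I feel fine";
    std::istringstream iss(source);   // copies `source` into the stream buffer
    source = "changed";               // does not touch the stream's copy

    std::string first, second;
    iss >> first >> second;           // whitespace-delimited extraction
    return first == "And" && second == "I";
}
```

Each extracted word is itself a fresh std::string, so this approach copies the input once up front and then again per token.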

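Even with a proposed string_view-based stream type, streams are still not random-access ranges. They have no way to work with subranges of a string. That is why they extract data from the stream by copying rather than through iterators or something similar.

What you want is best done with a regex-based mechanism, because it works without copying anything. Regexes work just fine with string_views (though you have to construct the string_views manually).

【Comments】:

【Answer 4】: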
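A minimal sketch of the regex approach, assuming whitespace-separated words as in the question. `Words` is a hypothetical helper name, and the pattern `\S+` is my choice rather than something specified in the answer. std::cregex_iterator walks the character range of the view, and each resulting string_view is built by hand from the match's pointer and length.

```cpp
#include <cassert>
#include <regex>
#include <string_view>
#include <vector>

// Hypothetical helper: split on whitespace, returning views into `text`.
std::vector<std::string_view> Words(std::string_view text)
{
    static const std::regex word(R"(\S+)");   // a run of non-whitespace chars
    std::vector<std::string_view> out;

    const char* first = text.data();
    const char* last = text.data() + text.size();
    for (std::cregex_iterator it(first, last, word), end; it != end; ++it) {
        const std::csub_match& m = (*it)[0];  // the whole match
        // Build the view manually from the match's start pointer and length.
        out.emplace_back(m.first, static_cast<std::size_t>(m.length()));
    }
    return out;
}
```

The input characters are never copied; the returned views alias `text`, so the usual lifetime rule applies. The regex engine may still allocate internally, so it is worth measuring before assuming this is the fastest option.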

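Contango's answer is good. To support both string and boost::string_view in my project, I made some changes and tried to get rid of the copy constructor.

The following code splits a string into string_views; you must guarantee that the string is not destroyed while the views are in use.

There are other answers that may be syntactically more elegant; check this out: https://www.bfilipek.com/2018/07/string-view-perf-followup.html. It includes an istringstream version; if the string itself is very long, the copy can be a problem, and you have to deal with that yourself.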


#include <string>
#include <vector>
#include <boost/numeric/conversion/cast.hpp>
#include <boost/utility/string_view.hpp>

typedef boost::string_view StringView; //Or you can just typedef std::string_view StringView;
#if defined(_WIN32) || defined(WIN32)
#pragma warning(push)
#pragma warning(disable:26486 26481)
#endif
void SplitStringToStringView(const std::string& str, const char delim, std::vector<StringView>* outputPointer)
{
    if (outputPointer == nullptr)
        return;

    std::vector<StringView>& result = *outputPointer;

    int indexCommaToLeftOfColumn = 0;
    int indexCommaToRightOfColumn = -1;

    const int end = boost::numeric_cast<int>(str.size());
    for (int i = 0; i < end; i++)
    {
        if (str.at(i) == delim)
        {
            indexCommaToLeftOfColumn = indexCommaToRightOfColumn;
            indexCommaToRightOfColumn = i;
            const int index = indexCommaToLeftOfColumn + 1;
            const int length = indexCommaToRightOfColumn - index;

            // Bounds checking can be omitted as logically, this code can never be invoked
            // Try it: put a breakpoint here and run the unit tests.
            /*if (index + length >= static_cast<int>(str.size()))
            {
                length--;
            }
            if (length < 0)
            {
                length = 0;
            }*/

            result.emplace_back(StringView(str.c_str() + index, length));
        }
    }
    // Everything after the last delimiter (or the whole input if there was none).
    const StringView finalColumn(str.c_str() + indexCommaToRightOfColumn + 1,
        str.size() - indexCommaToRightOfColumn - 1);
    result.push_back(finalColumn);
}
#if defined(_WIN32) || defined(WIN32)
#pragma warning(pop)
#endif

Since Contango provided unit test code, which is very good, I should do the same:

// The enclosing TEST name is an assumption; the original wrapper was lost.
TEST(Strings, SplitStringToStringView)
{
    {
        const std::string str = "A,B,C";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = ",B,C";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "C");
    }
    {
        const std::string str = "A,B,";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "B");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "");
    }
    {
        const std::string str = "A";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 1);
        EXPECT_TRUE(tokens[0] == "A");
    }
    {
        const std::string str = ",";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",,";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 3);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "");
        EXPECT_TRUE(tokens[2] == "");
    }
    {
        const std::string str = "A,";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "A");
        EXPECT_TRUE(tokens[1] == "");
    }
    {
        const std::string str = ",B";
        std::vector<StringView> tokens;
        SplitStringToStringView(str, ',', &tokens);
        EXPECT_TRUE(tokens.size() == 2);
        EXPECT_TRUE(tokens[0] == "");
        EXPECT_TRUE(tokens[1] == "B");
    }
}

【Comments】:
