Algorithm for Counting Sorted Strings (Homebrew "uniq -c")

Posted 2023-02-22


【Title】: Algorithm for Counting Sorted Strings (Homebrew "uniq -c") 【Posted】: 2009-03-11 09:38:10 【Question】:

I have the following sorted data:

AAA
AAA
TCG
TTT
TTT
TTT

I want to count the occurrences of each string:

AAA 2
TCG 1
TTT 3

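I know I can do this with uniq -c, but here I need to do extra processing as part of the larger C++ code I have.

I'm stuck with this construct (modified according to "pgras"'s suggestion):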

#include <cstdlib>
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;


int main  ( int arg_count, char *arg_vec[] ) {
    if (arg_count !=2 ) {
        cerr << "expected one argument" << endl;
        return EXIT_FAILURE;
    }

    string line;
    ifstream myfile (arg_vec[1]);


    if (myfile.is_open())
    {
        int count = 0;
        string lastTag = "";

        while (getline(myfile,line) )
        {
            stringstream ss(line);
            string Tag;

            ss >> Tag; // read first column
            //cout << Tag << endl; 

            if (Tag != lastTag) {
               lastTag = Tag;
               count = 0;
            }
            else {
                count++;
            }

             cout << lastTag << " " << count << endl;
        }
        cout << lastTag << " " << count << endl;
        myfile.close();

    }
    else cout << "Unable to open file";
    return 0;
}

It prints this incorrect result:

AAA 0
AAA 1
TCG 0
TTT 0
TTT 1
TTT 2
TTT 2

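【Question comments】:

This doesn't compile. For example, count is not defined. I'm also not clear on what your "additional processing" is. Can you be more specific?

@John: I need to process each unique tag by supplying some values, and then print the tags again together with their counts, e.g. AAA 2 -40 40 40

Sorry, I'm still not clear. What is the "-40 40 40" in your last example?

@John: It's extra output produced by a function that takes the tag as its argument.

Replace count = 0 with count = 1. Remove the "cout

【Answer 1】: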

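You have to reset the counter when the tag differs from lastTag, and increment it when they are the same... When the tag differs, you can process the previous tag together with its associated count (before resetting the count)...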

【Comments】:

@pgras: I modified it, but maybe I still don't get it.

【Answer 2】:

If you only want to print the result, your algorithm is fine. If you want to pass it on to another function, you can use, for example, an STL map.

map<string, int> dict;

while (getline(myfile, line)) {
    string Tag;
    stringstream ss(line);
    ss >> Tag;
    if (dict.count(Tag) == 0) {
        dict[Tag] = 1;
    } else {
        dict[Tag]++;
    }
}

【Comments】:

You don't need the extra if inside the loop. If the key does not exist, the [] operator creates a default-constructed element.

【Answer 3】:

Use something like this:

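#include <cstddef>
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <utility>


std::ostream& operator << ( std::ostream& out, const std::pair< const std::string, std::size_t >& rhs )
{
    out << rhs.first << ", " << rhs.second;
    return out;
}

int main() 
{
    std::ifstream inp( "mysorted_data.txt" );
    std::string str;
    std::map < std::string, std::size_t > words_count;
    while ( inp >> str )
    {
        words_count[str]++;   // operator[] value-initializes a missing count to 0
    }

    // An explicit loop is used here: std::copy into a std::ostream_iterator of
    // pairs would not find the operator << above, because that operator is not
    // in namespace std and argument-dependent lookup therefore misses it.
    for ( std::map< std::string, std::size_t >::const_iterator it = words_count.begin();
          it != words_count.end(); ++it )
    {
        std::cout << *it << "\n";
    }

    return 0;
}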

【Comments】:

【Answer 4】:

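Assuming your data really consists of DNA strings of length 3 (or, more generally, of length N, where N is very small), you can handle this efficiently with a q-gram table, which is a specialized hash table with table size 4^N and the following hash function: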

#include <string>

// Disregard error codes.
// Maps ASCII codes to 2-bit values: A -> 0, C -> 1, G -> 2, T -> 3.
int char2dna_lookup[] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x0  – 0xF
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10 – 0x1F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x20 – 0x2F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x30 – 0x3F
    0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, // 0x40 – 0x4F ('C' = 0x43, 'G' = 0x47)
    0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x50 – 0x5F ('T' = 0x54)
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x60 – 0x6F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x70 – 0x7F
};

// Interprets the string as a base-4 number, one digit per nucleotide.
unsigned int hash(std::string const& dna) {
    unsigned int ret = 0;

    for (std::string::size_type i = 0; i < dna.length(); ++i)
        // Mask to 7 bits so the index always stays inside the 128-entry table.
        ret = ret * 4 + char2dna_lookup[static_cast<unsigned char>(dna[i]) & 0x7F];

    return ret;
}

You can now index into your array very efficiently.

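// Needs <cstdlib>, <fstream>, <string> and <vector>; hash() is the function above.
std::ifstream ifs("data.txt");
std::string line;

if (!(ifs >> line))
    std::exit(1);

// A q-gram table for strings of length N needs 4^N slots, all starting at zero.
std::vector<unsigned int> frequencies(1u << (2 * line.length()), 0);

frequencies[hash(line)] = 1;

while (ifs >> line)
    ++frequencies[hash(line)];

// Print the frequencies …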

Alternatively, use a library such as SeqAn for this kind of task.

【Comments】:

Note that the code is untested. There may be errors in the lookup table (or elsewhere).

【Answer 5】:

I think all you have to do is replace this

        if (Tag != lastTag) {
           lastTag = Tag;
           count = 0;
        }
        else {
            count++;
        }

        cout << lastTag << " " << count << endl;

with this:

        if (Tag != lastTag) {
            if (lastTag != "") {  // don't print initial empty tag
                cout << lastTag << " " << count << endl;
            }
            lastTag = Tag;
            count = 1; // count current
        } else {
            count++;
        }

【Comments】:

【Answer 6】:

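Your code looks a little broken syntactically (ifstream, ...), but I think the overall algorithm is sound. Read the lines, and increment a counter each time a line is the same as the previous one. There may be some boundary conditions to consider (what if the input has only one line?), but you will find those during testing.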

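【Comments】:

And remember to start the initial item at -1, otherwise the result will be slightly off. ;) That said, none of the other answers so far is as efficient.

【Answer 7】: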

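Using a stringstream to get the tag seems like overkill; I would probably use string::substr. Apart from that, what do you think is wrong with your code? What do you want to improve?

Edit: Next, we'll be downvoted for breathing...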

【Comments】:

【Answer 8】:

#include <map>
#include <string>
#include <algorithm>
#include <iterator>
#include <iostream>

// Functor that increments the count of each word it is called with.
class Counter
{
    private: std::map<std::string,int>&   m_count;
    public:  Counter(std::map<std::string,int>& data) :m_count(data) {}
        void operator()(std::string const& word)
        {
            m_count[word]++;
        }
};

// Functor that prints one "word = count" entry per call.
class Printer
{
    private: std::ostream& m_out;
    public:  Printer(std::ostream& out) :m_out(out) {}
        void operator()(std::map<std::string,int>::value_type const& data)
        {
            m_out << data.first << " = " << data.second << "\n";
        }
};

int main()
{
    std::map<std::string,int>       count;

    std::for_each(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(),
                  Counter(count)
                 );

    std::for_each(count.begin(),count.end(),
                  Printer(std::cout)
                 );
}

【Comments】:
