Problem in removing duplicate reverse lines from file

Posted 2023-02-22


【Title】: Problem in removing duplicate reverse lines from file 【Posted】: 2021-10-07 05:46:58 【Question】:

I have a file containing the following lines:

connection list
current check OK

connect "A" to "B"
connect "A" to "C"
connect "A" to "D"
connect "C" to "A"
connect "A" to "E"

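Here, connect "C" to "A" is the reverse of connect "A" to "C". The requirement is to remove such duplicate reverse connections.

I am new to C++ and vectors. I tried the following:

First I defined a struct, ConnectPair, holding two strings, con1 and con2. Then I created a vector of that struct.

Next, I save the lines of the file into a vector, rawFileLines, and I try to operate on rawFileLines to find the connection components.

I store the connection components in another vector: values.

Here is my code: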

typedef struct {
    std::string con1;
    std::string con2;
} ConnectPair;

void RemoveReversePairs(std::string inputFile) {
    std::vector<std::string> fileData;
    std::string line, scan, token1, token2;
    std::size_t tokenLeft, tokenRight, maxLines, lineNumber = 0, pos = 0;
    std::size_t found = 0, storeCount = 0;

    std::vector<std::string> rawFileLines;

    ConnectPair connectPair = {};
    std::vector<ConnectPair> values;

    // Read the whole file into rawFileLines
    std::ifstream source(inputFile.c_str());
    while (std::getline(source, line)) {
        rawFileLines.push_back(line);
    }
    source.close();
    maxLines = rawFileLines.size();

    for (size_t i = 0; i < maxLines; i++) {
        line = rawFileLines[i];
        pos = 0;
        scan = "\"";
        found = 0;

        // Extract the two quoted names from the line
        while (found < 2) /*line.find(scan, pos) != std::string::npos*/ {
            tokenLeft = line.find(scan, pos);
            tokenRight = line.find(scan, tokenLeft + 1);

            if ((tokenLeft != std::string::npos) && (tokenRight != std::string::npos)) {
                found++;
                if (found == 1) {
                    connectPair.con1 = line.substr(tokenLeft + 1, (tokenRight - tokenLeft) - 1);
                }
                else if (found == 2) {
                    connectPair.con2 = line.substr(tokenLeft + 1, (tokenRight - tokenLeft) - 1);
                    values.push_back(connectPair);
                    storeCount++;
                }
                pos = tokenRight + 1;
            }
            else {
                // Not a connection line: store a placeholder pair and keep the raw line
                connectPair.con1 = "++";
                connectPair.con2 = "++";
                values.push_back(connectPair);

                fileData.push_back(line);
                break;
            }
        }
    }
    // The posted code stops here; the comparison step is what the question asks about.
}

Now I am unable to compare the connections. Please suggest how to proceed.

Thanks.
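
For reference, the parsing step that the loop above performs (pulling two quoted names out of each line) can be isolated in a small function. This is only a minimal sketch, assuming C++17 for std::optional; parseConnection is a hypothetical name and is not part of the question's code.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper (not from the question): returns the two quoted names
// found on a line such as: connect "A" to "B".
// Returns std::nullopt when the line does not hold two complete quoted names.
std::optional<std::pair<std::string, std::string>>
parseConnection(const std::string& line) {
    std::string names[2];
    std::size_t pos = 0;
    for (std::string& name : names) {
        const std::size_t left = line.find('"', pos);
        if (left == std::string::npos) return std::nullopt;
        const std::size_t right = line.find('"', left + 1);
        if (right == std::string::npos) return std::nullopt;
        name = line.substr(left + 1, right - left - 1);
        pos = right + 1;  // continue searching after the closing quote
    }
    return std::make_pair(names[0], names[1]);
}
```

Header lines such as "connection list" yield std::nullopt here, so no placeholder pair is needed for them.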

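【Comments】:

One problem at a time. Can you read in all the connections correctly?

Yes, I can read all the connections.

Also, note that there is no need to define ConnectPair when the STL provides std::pair.

The if (found == 2) check will not be executed, because while (found < 2) causes the loop to end when found equals 2.

If you need to check a large amount of data, using std::map and std::set may help.

【Answer 1】: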

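Keeping the code that reads in your connections, which you mentioned in the comments works for you, consider an STL solution that uses find_if from the <algorithm> header together with a lambda.

For simplicity, I have a std::vector<std::pair<std::string, std::string>> populated with your sample connection data.

I use a loop to print it, to make sure the data is what I expect. This loop uses destructuring (structured bindings, C++17) to cut out a lot of annoying boilerplate.

Then comes the actual solution. We use an explicit iterator to loop over the vector, and std::find_if to check whether the rest of the vector contains a connection that is either identical or identical when reversed. If std::find_if returns the end iterator, it found nothing, and we can push that pair onto the map2 vector. If an equivalent does exist in the rest of the vector, the current pair is not pushed onto map2.

In the lambda, it is important that we capture the current iter so that we can compare it against the rest (represented by the lambda's parameter b).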

[&iter](auto b) {
    return (iter->first == b.first  && iter->second == b.second) ||
           (iter->first == b.second && iter->second == b.first );
}

#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <algorithm>

int main() {
    std::vector<std::pair<std::string, std::string>> map, map2;

    map.push_back({"A", "B"});
    map.push_back({"A", "C"});
    map.push_back({"A", "D"});
    map.push_back({"C", "A"});
    map.push_back({"A", "E"});

    std::cout << "Before:" << std::endl;

    for (auto &[k, v] : map) {
        std::cout << k << " -> " << v << std::endl;
    }

    auto end = map.end();

    // Keep a pair only if no equal or reversed pair follows it
    for (auto iter = map.begin(); iter != end; iter++) {
        if (std::find_if(iter + 1, end,
                         [&iter](auto b) {
                             return (iter->first == b.first  && iter->second == b.second) ||
                                    (iter->first == b.second && iter->second == b.first );
                         }) == end) {
            map2.push_back(*iter);
        }
    }

    std::cout << "After: " << std::endl;

    for (auto &[k, v] : map2) {
        std::cout << k << " -> " << v << std::endl;
    }
}

Result:

Before:
A -> B
A -> C
A -> D
C -> A
A -> E
After:
A -> B
A -> D
C -> A
A -> E

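Better

After thinking about my previous example, I realized it would be simpler to use the same comparison logic (the lambda is unchanged) to check whether each connection already exists in map2.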

#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <algorithm>

int main() {
    std::vector<std::pair<std::string, std::string>> map, map2;

    map.push_back({"A", "B"});
    map.push_back({"A", "C"});
    map.push_back({"A", "D"});
    map.push_back({"C", "A"});
    map.push_back({"A", "E"});

    std::cout << "Before:" << std::endl;

    for (auto &[k, v] : map) {
        std::cout << k << " -> " << v << std::endl;
    }

    auto end = map.end();

    // Keep a pair only if no equal or reversed pair is already in map2
    for (auto iter = map.begin(); iter != end; iter++) {
        if (std::find_if(map2.begin(), map2.end(),
                         [&iter](auto b) {
                             return (iter->first == b.first  && iter->second == b.second) ||
                                    (iter->first == b.second && iter->second == b.first );
                         }) == map2.end()) {
            map2.push_back(*iter);
        }
    }

    std::cout << "After: " << std::endl;

    for (auto &[k, v] : map2) {
        std::cout << k << " -> " << v << std::endl;
    }
}

An added benefit of this is that map2 now holds the first of the "duplicates" rather than the last.

Before:
A -> B
A -> C
A -> D
C -> A
A -> E
After:
A -> B
A -> C
A -> D
A -> E
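
Both variants above compare each pair against many others, which is fine for small inputs. As a hedged alternative for larger files, the same "keep the first occurrence" behaviour can be sketched with a std::set of normalized keys. This swaps std::find_if for a different technique; the names Connection, normalized and removeReverseDuplicates are assumptions made for this illustration and do not appear in the answer.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Connection = std::pair<std::string, std::string>;

// Hypothetical helper: order the two names so that (C, A) and (A, C)
// produce the same key.
Connection normalized(const Connection& c) {
    return (c.second < c.first) ? Connection{c.second, c.first} : c;
}

// Keeps the first occurrence of every undirected connection, in input order.
std::vector<Connection> removeReverseDuplicates(const std::vector<Connection>& input) {
    std::set<Connection> seen;
    std::vector<Connection> result;
    for (const Connection& c : input) {
        // insert() reports in .second whether the key was new
        if (seen.insert(normalized(c)).second) {
            result.push_back(c);
        }
    }
    return result;
}
```

The result vector keeps each pair exactly as it was written; only the lookup key is sorted.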

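【Comments】:

Very nice! However, maybe a bit complicated. See my approach below. It is best to use existing containers that implicitly do what you want. Below, most of the lines are for IO. In the end, we have a one-liner in main. . .

@Armin Montigny, my intention was to mix STL containers/functions with enough "manual" logic to keep it digestible rather than "magic".

【Answer 2】: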

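Let me first say that I think your code is a bit complicated.

Now, on to the problem. To remove duplicates, you can use the erase / remove_if idiom.

The code fragment that you would need to put at the end of your function could be: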

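    std::size_t i = 0;
    while (i < values.size()) {
        // Copy the reference pair: remove_if moves elements, so values[i]
        // must not be read inside the predicate.
        const ConnectPair ref = values[i];
        // Search only behind position i, so that values[i] itself is kept.
        values.erase(std::remove_if(values.begin() + i + 1, values.end(),
            [&ref](const ConnectPair& cp) -> bool
            { return ((cp.con1 == ref.con1) && (cp.con2 == ref.con2)) || ((cp.con1 == ref.con2) && (cp.con2 == ref.con1)); }),
            values.end());
        ++i;
    }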

The important part here is the comparison function. You do a 1-to-1 comparison and, in addition, you compare con1 with con2 and vice versa.
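
To make that comparison easier to read and reuse, it could be pulled out into a named function. The following is a minimal sketch rather than the answer's own code: sameConnection and eraseLaterDuplicatesOf are hypothetical names, and ConnectPair is redeclared only so that the fragment is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Same layout as the ConnectPair struct in the question.
struct ConnectPair {
    std::string con1;
    std::string con2;
};

// Hypothetical helper: true when a and b join the same two names,
// in either direction.
bool sameConnection(const ConnectPair& a, const ConnectPair& b) {
    return (a.con1 == b.con1 && a.con2 == b.con2) ||
           (a.con1 == b.con2 && a.con2 == b.con1);
}

// Erase every element behind position `keep` that is equivalent to values[keep].
void eraseLaterDuplicatesOf(std::vector<ConnectPair>& values, std::size_t keep) {
    assert(keep < values.size());
    const ConnectPair ref = values[keep];  // copy: remove_if moves elements around
    values.erase(std::remove_if(values.begin() + keep + 1, values.end(),
                                [&ref](const ConnectPair& cp) {
                                    return sameConnection(cp, ref);
                                }),
                 values.end());
}
```

Calling eraseLaterDuplicatesOf for every index from the front of the vector removes all reverse duplicates while keeping the first occurrence of each connection.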

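But let me say: life can be easier. You can add a comparison function to your struct. That would be the more object-oriented approach. Then you can use your struct in a fitting container, such as std::set, which does not allow duplicates.

Because we are not interested in the direction, only in the connection, we can simply sort the first and second element. This makes the comparison very simple.

And the whole job, reading the data and doing all the work, can be done in a single line of code in main.

Please see: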

#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <regex>
#include <utility>
#include <set>
#include <iterator>
#include <algorithm>

// Requires C++17 (class template argument deduction, if with initializer)
const std::regex re{ R"(\"(\w+)\")" };

struct Terminal {
    // Store undirected connection in a normal std::pair
    std::pair<std::string, std::string> end{};

    // Read new connection from stream
    friend std::istream& operator >> (std::istream& is, Terminal& t) {
        bool found{};
        // Read a line, until we found a connection or until eof
        for (std::string line{}; not found and std::getline(is, line);)
            // Get connection end names
            if (std::vector ends(std::sregex_token_iterator(line.begin(), line.end(), re), {}); found = (ends.size() == 2))
                t.end = std::minmax(ends[0], ends[1]);
        return is;
    }
    // Ordering used by std::set; the pair is already sorted, so reversed connections compare equal
    bool operator < (const Terminal& other) const { return end < other.end; }
};

int main() {
    // Open file and check, if it could be opened
    if (std::ifstream inputFileStream{ "r:\\data.txt" }; inputFileStream) {

        // Read complete data, without doubles into our container
        std::set data(std::istream_iterator<Terminal>(inputFileStream), {});

        // Debug output
        for (const auto& d : data) std::cout << d.end.first << " <-> " << d.end.second << '\n';
    }
}

Please note that if you need the original data, you can add one line to the struct for the original pair.

【Comments】:

【Answer 3】:
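
A hedged sketch of that idea: the struct keeps the sorted pair for comparison and a second member with the names in their original order. The struct name TerminalWithOriginal, the member name original and the constructor are assumptions made for this illustration; stream input is left out to keep it short.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Sketch of a Terminal-like struct that remembers the pair as it was written.
struct TerminalWithOriginal {
    std::pair<std::string, std::string> end;       // sorted names, used for comparison
    std::pair<std::string, std::string> original;  // names in file order

    // `end` is declared first, so it is built from copies before the
    // arguments are moved into `original`.
    TerminalWithOriginal(std::string from, std::string to)
        : end(std::min(from, to), std::max(from, to)),
          original(std::move(from), std::move(to)) {}

    // Only the sorted pair takes part in the ordering, so a reversed
    // connection compares equivalent and std::set rejects it.
    bool operator<(const TerminalWithOriginal& other) const { return end < other.end; }
};
```

Inserting ("A", "C") and then ("C", "A") into a std::set<TerminalWithOriginal> keeps only the first one, and its original member still reads A, C.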

Your connections are implicitly undirected: the direction in which a connection is written does not matter.

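If the data contains a large number of connections, I suggest using std::unordered_map<std::string, std::unordered_set<std::string>>.

Both unordered_map and unordered_set offer constant-time lookup on average, although insertion takes longer.

I borrowed Chris's code to construct the data.

Note that Chris's example is already good enough when your data is not large.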


#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <algorithm>
#include <unordered_map>
#include <unordered_set>

int main() {
    std::vector<std::pair<std::string, std::string>> map;

    map.push_back({"A", "B"});
    map.push_back({"A", "C"});
    map.push_back({"A", "D"});
    map.push_back({"D", "A"});
    map.push_back({"C", "A"});
    map.push_back({"A", "E"});
    map.push_back({"E", "A"});

    std::cout << "Before:" << std::endl;

    for (auto &[k, v] : map) {
        std::cout << k << " -> " << v << std::endl;
    }
    std::unordered_map<std::string, std::unordered_set<std::string>> connection;
    for (auto &[k, v] : map) {

        // Connection already exists, in either direction
        if ((connection[k].find(v) != connection[k].end()) || (connection[v].find(k) != connection[v].end())) {
            continue;
        }
        connection[k].insert(v);
    }

    std::cout << "After: " << std::endl;

    // Iteration order of unordered containers is unspecified
    for (auto &[k, v] : connection) {

        for (auto& item : v) {
            std::cout << k << " -> " << item << std::endl;
        }
    }
}
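
If the input order has to be preserved, a variation on this idea is a single std::unordered_set whose hash and equality both ignore direction. This is a sketch under stated assumptions, not part of the answer: Connection, UndirectedHash, UndirectedEqual and uniqueConnections are hypothetical names, and adding the two string hashes is just one simple commutative way to combine them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

using Connection = std::pair<std::string, std::string>;

// Hash that ignores direction: addition is commutative,
// so (A, C) and (C, A) get the same hash value.
struct UndirectedHash {
    std::size_t operator()(const Connection& c) const {
        std::hash<std::string> h;
        return h(c.first) + h(c.second);
    }
};

// Equality that ignores direction.
struct UndirectedEqual {
    bool operator()(const Connection& a, const Connection& b) const {
        return (a.first == b.first && a.second == b.second) ||
               (a.first == b.second && a.second == b.first);
    }
};

// Keeps the first occurrence of each undirected connection, in input order.
std::vector<Connection> uniqueConnections(const std::vector<Connection>& input) {
    std::unordered_set<Connection, UndirectedHash, UndirectedEqual> seen;
    std::vector<Connection> result;
    for (const Connection& c : input) {
        if (seen.insert(c).second) {  // .second is true only for a new key
            result.push_back(c);
        }
    }
    return result;
}
```

The hash and the equality have to agree: any two pairs that compare equal must also hash to the same value, which is why both treat the pair symmetrically.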

【Comments】:
