Hash map and vector is slow

Posted 2023-02-16


[Title]: Hash map and vector is slow [Posted]: 2012-09-24 02:30:52 [Question]:

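I have a large data set to process (120 million records). My program currently uses Google dense hash, but it still takes 29 hours to finish and uses 8.5 GiB of the memory on my 64 GiB server.

Do you have any suggestions? I am new to C++. If I wanted to replace the vector with something faster, what would that be?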

#include <string>
#include <algorithm>
#include <tr1/unordered_map>
#include <iterator>
#include <sstream>
#include <cstring>
#include <iomanip>
#include <fstream>
#include <vector>
#include <iterator>
#include <time.h>
#include <iostream>
#include <iostream>
#include <sparsehash/dense_hash_map>
#include <stdio.h>
#include <string.h>
using google::dense_hash_map;  
using std::tr1::hash; 
using namespace std;
using std::string;

bool ProcessInput(const string& inChar, vector<string> *invector);
void Processmax( dense_hash_map < string, int>* ins, vector<int> *inc, vector<string>      *outs, vector<int> *outc);

int main()
{
time_t start, stop;
time(&start);
ofstream finall;
vector<int> usrsc,artc,tmusrc,tmart2c,atrsc,tmartc;
vector<string> tmart,tmusr,tmart2;
vector< vector<string> > usrlist,artlist;
string x1,x2;
ifstream ifTraceFile;
bool f,f2;
dense_hash_map < string, int > a;
dense_hash_map < string, int > u;
a.set_empty_key(string());
u.set_empty_key(string());

int kl=0;
ifTraceFile.open ("data2.tr", std::ifstream::in);
while (ifTraceFile.good ())
{
    ifTraceFile>>x1>> x2;


    if (kl==0)
    {
        a.insert(make_pair(x1,0));
        u.insert(make_pair(x2,0));
        usrlist.push_back((vector<string>()));
        usrlist[0].push_back(x1);
        artlist.push_back((vector<string>()));
        artlist[0].push_back(x2);
        usrsc.push_back(1);
        artc.push_back(1);
        atrsc.push_back(1);

    }
    else
    {

        dense_hash_map < string, int>::iterator itn;
        itn=a.find(x1);
        if (itn == a.end())
        {
            a.insert(make_pair(x1,(artlist.size())));
            artlist.push_back((vector<string>()));
            artlist[(artlist.size()-1)].push_back(x2);
            artc.push_back(1);
            atrsc.push_back(1);
        }
        else
        {
            f=ProcessInput(x2, &artlist[itn->second]);
            if(f)
            {
                artlist[itn->second].push_back(x2);
                atrsc[itn->second]+=1;
                artc[itn->second]+=1;
            }
            else
                atrsc[itn->second]+=1;

        }


        dense_hash_map < string, int>::iterator its;
        its=u.find(x2);
        if (its == u.end())
        {
            u.insert(make_pair(x2,(usrlist.size())));
            usrlist.push_back((vector<string>()));
            usrlist[(usrlist.size()-1)].push_back(x1);
            usrsc.push_back(1);

        }
        else
        {
            f2=ProcessInput(x1, &usrlist[its->second]);

            if(f2)
            {
                usrlist[its->second].push_back(x1);
                usrsc[its->second]+=1;

            }

        }

    }

    kl++;
}
ifTraceFile.close();
Processmax(&a, &artc, &tmart, &tmartc);
Processmax(&a, &atrsc, &tmart2 ,&tmart2c);
Processmax(&u, &usrsc ,&tmusr, &tmusrc);
int width=15;
cout <<"article has Max. review by users Top 1: "<<tmart.at(0)<<'\t'<<tmartc.at(0)<<endl;
cout <<"article has Max. review by users Top 2: "<<tmart.at(1)<<'\t'<<tmartc.at(1)<<endl;
cout <<"article has Max. review by users Top 3: "<<tmart.at(2)<<'\t'<<tmartc.at(2)<<endl;
cout <<endl;
cout <<"article has Max. review Top 1: "<<tmart2.at(0)<<'\t'<<tmart2c.at(0)<<endl;
cout <<"article has Max. review Top 2: "<<tmart2.at(1)<<'\t'<<tmart2c.at(1)<<endl;
cout <<"article has Max. review Top 3: "<<tmart2.at(2)<<'\t'<<tmart2c.at(2)<<endl;
cout <<endl;
cout <<"user who edited most articles Top 1: "<<tmusr.at(0)<<'\t'<<tmusrc.at(0)<<endl;
cout <<"user who edited most articles Top 2: "<<tmusr.at(1)<<'\t'<<tmusrc.at(1)<<endl;
cout <<"user who edited most articles Top 3: "<<tmusr.at(2)<<'\t'<<tmusrc.at(2)<<endl;

finall.open ("results");
finall << "Q1 results:"<<endl;
finall <<"article has Max. review Top 1: "<<setw(width)<<tmart2.at(0)<<setw(width)<<tmart2c.at(0)<<endl;
finall <<"article has Max. review Top 2: "<<setw(width)<<tmart2.at(1)<<setw(width)<<tmart2c.at(1)<<endl;
finall <<"article has Max. review Top 3: "<<setw(width)<<tmart2.at(2)<<setw(width)<<tmart2c.at(2)<<endl;
finall<<endl;

finall<<"article has Max. review by users Top 1: "<<setw(width)<<tmart.at(0)<<setw(width)<<tmartc.at(0)<<endl;
finall <<"article has Max. review by users Top 2: "<<setw(width)<<tmart.at(1)<<setw(width)<<tmartc.at(1)<<endl;
finall <<"article has Max. review by users Top 3: "<<setw(width)<<tmart.at(2)<<setw(width)<<tmartc.at(2)<<endl;
finall<<endl;
finall <<"user edited most articles Top 1: "<<setw(width)<<tmusr.at(0)<<setw(width-5)<<tmusrc.at(0)<<endl;
finall <<"user edited most articles Top 2: "<<setw(width)<<tmusr.at(1)<<setw(width-5)<<tmusrc.at(1)<<endl;
finall <<"user edited most articles Top 3: "<<setw(width)<<tmusr.at(2)<<setw(width-5)<<tmusrc.at(2)<<endl;
finall.close ();
time(&stop);
cout<<"Finished in about "<< difftime(stop, start)<< " seconds"<<endl;

return 0;
}

void Processmax(  dense_hash_map< string,int >* ins, vector<int> *inc, vector<string> *outs, vector<int> *outc)
{
int index=0;
int l=0;
dense_hash_map < string, int>:: iterator iti;
string value;
while(l!=4)
{
    vector<int>::iterator it=max_element(inc->begin(), inc->end());
    index = distance(inc->begin(), it);

    for (iti = ins->begin(); iti != ins->end(); ++iti)
    {
        if (iti->second == index)
        {
            value = iti->first;
            break;
        }
    }
    outs->push_back(value);
    outc->push_back(inc->at(index));
    inc->at(index)=0;
    l++;
}
}

bool ProcessInput(const string& inChar, vector<string> *invector)
{
 bool index=true;
 vector<string>::iterator it=find(invector->begin(), invector->end(), inChar);
 if (it!=invector->end())
    index=false;

 return index;
}

[Question comments]:

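The reliable way to speed up an application is to find out which part of it is the bottleneck. I suggest you profile this code and find the part where most of the time is spent.

Seconded. With a sample set that takes 29 hours, you will have more than enough data to study (expect the profiled run to take longer than 29 hours).

How big is each record? (What is the average number of bytes per record in the input file?) 8.5 GiB / 120 M records suggests about 70 bytes per record, ignoring overhead; allowing for overhead, the data per record is considerably less.

You #include several headers twice: <iterator>, <iostream>; including both <cstring> and <string.h> seems odd. If this is C++, you should probably avoid <stdio.h>. Why not use <ctime> instead of <time.h>? And so on.

The point of the program is: first, to find the user who made the most edits to articles (the count is needed), and second, the article edited by the most users (the count is needed). So far I only read two attributes. If I could find something else about this code

[Answer 1]: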

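Judging from the data you print, you are only trying to list the top three or so users in each of several categories. Rather than storing all the data, you only need to store the current top three in each category. When a new record arrives, you determine whether it displaces any of the top three in any category and, if so, arrange for the new data to replace the appropriate old data. If the new record is "uninteresting", you ignore it. Make the number of users of interest a parameter of the calculation; solve the general case for the top N, then set N to 3.

This limits your storage to a few KiB at most. You will also be working with much smaller data structures, so they will be much faster. Your processing time should drop to roughly the time it takes to read a file of that size, rather than 29 hours.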
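The "solve for the top N, then set N to 3" part can be sketched as follows. This is only an illustration under an assumption the answer does not state: that per-key counts have already been accumulated in a `std::unordered_map<std::string, int>`. The helper name `top_n` is hypothetical. It uses `std::partial_sort`, so only the first N positions get ordered instead of the whole container.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns the n entries with the largest counts,
// ordered from largest to smallest (ties broken by key, ascending).
std::vector<std::pair<std::string, int>>
top_n(const std::unordered_map<std::string, int>& counts, std::size_t n)
{
    // Copy the (key, count) pairs so they can be reordered.
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());
    n = std::min(n, items.size());

    // Order only the first n positions; the rest stay unsorted.
    std::partial_sort(items.begin(), items.begin() + n, items.end(),
        [](const std::pair<std::string, int>& lhs,
           const std::pair<std::string, int>& rhs) {
            if (lhs.second != rhs.second) return lhs.second > rhs.second;
            return lhs.first < rhs.first;
        });

    items.resize(n);
    return items;
}
```

Calling `top_n(counts, 3)` once after the input has been read would replace the repeated `max_element` scan plus the reverse lookup through the whole hash map that `Processmax` performs for each rank.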

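[Comments]:

Thanks for the suggestions. I know I have included extra headers; I will fix that. I cannot tell which user contributed the most until I have read all the data. My sample is only 1000 records, but my real data is 120 million records. I got the result after 29 hours. I am only collecting two attributes for now, but in the future I will have to collect five of them. I think the bottleneck is in the vector search and also in the integration. I want to replace the vector with something that is faster to search, like a map of unordered_set, and for the integration something like a map of lists. Thanks.

If you are accumulating data as you go, that makes it harder. Have you considered using a DBMS? Given a reasonable schema, it should make mincemeat of a mere 120 million rows. Basically, I think that if you want the answer faster, you need to change your algorithm drastically. Think creatively; do something very different.

I also need to record who edited which articles and vice versa, so I have to check whether the user name and the article already exist, and that takes a long time? 120 million rows is the database size, but most of them are duplicate rows that I do not need to keep. If they are duplicates I ignore them, so what I have to keep may be only 10 million or fewer, since it only takes 16 Gig, and I am using 2 tables instead of 1.

I checked the map sizes for an input of 10 million records. The first table has 20000 records and the second has 45000. So it looks like many records are duplicates (I ignore any duplicate record). If you could help me find a replacement for the 2D vector that has fast lookup.

[Answer 2]: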

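You may want to follow a few simple steps:

1. Take a subset of your data (for example 1/100 or 1/1000).
2. Run your program on the sample data under a profiler.
3. Find the bottleneck and optimize it.

Some links to read: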

http://www.sgi.com/tech/stl/complexity.html

Quick and dirty way to profile your code

vector vs. list in STL

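[Comments]:

Do not make the sample data too small, or it will not reflect cache misses.

Thanks for the links. I tested my program on 50000 records. I found that Google dense hash is fast and is not affected by size the way the vector is. I am new to programming, but what I did was run the program in my IDE (NetBeans), pause it and check the call stack 10 times, and the "find_vector algo" was always in the call stack. I hope I understand the call stack correctly. I tested it on 10 million records.

What I am doing is using the hash map value as an index into a 2D vector. For example: "username1 (map): (vector) article1, article2, article2". The vector takes the index. I use the user name as the key and its value (int) as a pointer to the vector row: vector[hash_map value]. Can I do something faster, such as a map with a 2D map instead of a vector? Thanks.

I used timestamps to measure the maximum wait time. The results: the maximum vector search time was 0.0018, with my largest vector holding 5000 elements, while the maximum hash map lookup time was only 0.00017 for 20000 elements. So it seems the vector search takes a long time. Can I replace it with a map or something that uses hashing? I only use the insert and find operations.

[Answer 3]:

Thanks for your help. I can now get the results in 10 minutes. Only!!!!!!!!!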

  unordered_map < string, unordered_set <string> > a;
  unordered_map < string, unordered_set <string> > u;
  unordered_map < string, int > artc,usrc,artac;
    .....
    ....
   if (true)
      {
        a[x1].insert(x2);
        u[x2].insert(x1);
        artc[x1]=a[x1].size();
        usrc[x2]=u[x2].size();
        artac[x1]++;
      }

unordered_map is 100% faster than Google dense hash, and it uses 30% less RAM than Google dense hash.

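[Comments]:

If your keys are fairly large, you can still get more speed by changing the hash functor. For the reason, see this question.

+1 Glad you found an answer. So std::unordered_set suits your needs better than Google dense hash. Noted =)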
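Changing the hash functor means passing a custom type as the third template argument of `std::unordered_map`. The sketch below only shows the mechanics; it uses 64-bit FNV-1a as a stand-in, which is an assumption of this example and not the functor the linked question discusses. Whether any replacement is actually faster than `std::hash<std::string>` has to be measured on the real keys. `Fnv1aHash` and `CountMap` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative functor: 64-bit FNV-1a over the bytes of the key.
struct Fnv1aHash {
    std::size_t operator()(const std::string& key) const
    {
        std::uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
        for (unsigned char c : key) {
            h ^= c;                                   // mix in one byte
            h *= 1099511628211ULL;                    // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

// Same interface as std::unordered_map<std::string, int>, different hasher.
using CountMap = std::unordered_map<std::string, int, Fnv1aHash>;
```

A `CountMap` is used exactly like the default map, for example `++counts[user];`, so swapping hashers for a benchmark only requires changing the type alias.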
