Pairwise matching millions of records

Posted 2023-04-18


【Title】: PairWise matching millions of records 【Posted】: 2013-11-21 09:14:01 【Question】:

I have an algorithmic problem at hand. To explain it easily, I will use a simple analogy. I have an input file

Country,Exports
Austrailia,Sheep
US, Apple
Austrailia,Beef

End goal: I have to find the products that each pair of countries has in common, so

"Austrailia,New Zealand":"apple","sheep"
"Austrailia,US":"apple"
"New Zealand,US":"apple","milk"

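Process:

I read in the input and store it in a TreeMap<String, List<String>>, where the strings are interned because there are many duplicates. Essentially, I am aggregating by country: the key is the country and the values are its exports.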

"austrailia":"apple","sheep","koalas"
"new zealand":"apple","sheep","milk"
"US":"apple","beef","milk"

I have about 1,200 keys (countries), and the total number of values (exports) is 80 million. I sort all the values of each key:

"austrailia":"apple","sheep","koalas" -- > "austrailia":"apple","koalas","sheep"

This is fast, since there are only 1,200 lists to sort.
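The question does not show the loading code, so the following is only a minimal sketch of the grouping and sorting described above. The class name `ExportLoader`, the method `group`, the lowercasing of names and the skipping of a header line are all assumptions made for illustration, not the asker's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ExportLoader {
    // Groups "Country,Exports" lines by country and sorts each country's exports.
    // Assumption: the first line is a header and is skipped.
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> byCountry = new TreeMap<String, List<String>>();
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).split(",", 2);
            if (parts.length < 2) {
                continue; // ignore malformed lines
            }
            String country = parts[0].trim().toLowerCase(Locale.ROOT);
            // intern() lets repeated product names share a single String instance.
            String export = parts[1].trim().toLowerCase(Locale.ROOT).intern();
            List<String> exports = byCountry.get(country);
            if (exports == null) {
                exports = new ArrayList<String>();
                byCountry.put(country, exports);
            }
            exports.add(export);
        }
        // Sort each list once so that two lists can later be intersected by a linear merge.
        for (List<String> exports : byCountry.values()) {
            Collections.sort(exports);
        }
        return byCountry;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Country,Exports", "Austrailia,Sheep", "US, Apple", "Austrailia,Beef");
        System.out.println(group(lines)); // prints {austrailia=[beef, sheep], us=[apple]}
    }
}
```

With 80 million records the lines would be streamed from the file rather than held in a list; the list parameter only keeps the sketch self-contained.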

for (k1 : keys)
    for (k2 : keys)
        if (k1.compareTo(k2) < 0) { // don't compare the same pair twice
            List<String> intersectList = intersectList_func(k1's exports, k2's exports);
            countriespair.put(k1, k2, intersectList);
        }

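This block of code takes a very long time. I realize it is O(n^2), with about 1200*1200 comparisons. It has been running for almost 3 hours so far. Is there any way I can speed it up or optimize it? A smarter algorithm would be the best option, but are there other techniques worth considering?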

Edit: Since both lists are pre-sorted, intersectList is O(n), not O(n^2) as stated below, where n is at most listOne.size() + listTwo.size(), because every iteration advances at least one of the two indices.

private static List<String> intersectList(List<String> listOne, List<String> listTwo) {
    int i = 0, j = 0;
    List<String> listResult = new LinkedList<String>();
    // Merge-style walk over two sorted lists.
    while (i != listOne.size() && j != listTwo.size()) {
        int compareVal = listOne.get(i).compareTo(listTwo.get(j));
        if (compareVal == 0) {
            listResult.add(listOne.get(i));
            i++; j++;
        } else if (compareVal < 0) {
            i++;
        } else {
            j++;
        }
    }
    return listResult;
}

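Update, Nov 22: My current implementation is still running after almost 18 hours. :|

Update, Nov 25: I ran the new implementation as suggested by Vikram and a few others. It has been running since this Friday. My question is how grouping by exports rather than by country reduces the computational complexity. I find the complexity is the same. As Groo mentioned, I find the complexity of the second part is O(E*C^2), where E is the number of exports and C is the number of countries.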

【Comments】:

Using a SQL database and a query would be one possible solution.

@prog_guy Please provide your input file so that I can test my code.

【Answer 1】:

This can be done in a single statement, using SQL as a self-join:

Test data. First create a test data set:

Lines <- "Country,Exports
Austrailia,Sheep
Austrailia,Apple
New Zealand,Apple
New Zealand,Sheep
New Zealand,Milk
US,Apple
US,Milk
"
DF <- read.csv(text = Lines, as.is = TRUE)

sqldf. Now that we have DF, issue this command:

library(sqldf)
sqldf("select a.Country, b.Country, group_concat(Exports) Exports
   from DF a, DF b using (Exports) 
   where a.Country < b.Country
   group by a.Country, b.Country
")

which gives this output:

      Country     Country     Exports
1  Austrailia New Zealand Sheep,Apple
2  Austrailia          US       Apple
3 New Zealand          US  Apple,Milk

With an index. If this is too slow, add an index on the Country column (and be sure not to forget the main. prefixes):

sqldf(c("create index idx on DF(Country)",
   "select a.Country, b.Country, group_concat(Exports) Exports
   from main.DF a, main.DF b using (Exports) 
   where a.Country < b.Country
   group by a.Country, b.Country
"))

If you run out of memory, add the dbname = tempfile() argument to sqldf so that it uses disk.

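【Comments】:

Nice solution. I am not really a database person, so a couple of questions. (1) As I recall, a join with using (col_name) only matches when the two columns are equal, so how does it work as an intersection? (2) What does the final group by line do? Removing it returns only one row. Until now I have only used group by to aggregate results, not the other way around. I have not benchmarked it on the whole data set yet.

(1) Yes, we join on Exports, which must be equal in the surviving pairs of countries. (2) We are performing the group_concat aggregation over the groups defined by group by.

【Answer 2】: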

Store something like the following data structure (the following is pseudocode):

ValuesSet =
apple = "Austrailia","New Zealand"..
sheep = "Austrailia","New Zealand"..  



for k in ValuesSet 
    for k1 in k.values() 
       for k2 in k.values()   
           if(k1<k2)
              Set(k1,k2).add(k)

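Time complexity: O(number of distinct country pairs with a product in common, counted once per shared product)

Note: I may be wrong, but I don't think you can reduce this time complexity.

Here is a Java implementation for your problem: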

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PairMatching {

    HashMap Country;        // country name -> country index
    ArrayList CountNames;   // country index -> country name
    HashMap ProdtoIndex;    // product name -> product index
    ArrayList ProdtoCount;  // product index -> list of country indices
    ArrayList ProdNames;    // product index -> product name
    ArrayList[][] Pairs;    // Pairs[u][v]: indices of the products shared by countries u and v

    int products=0;
    int countries=0;


    public void readfile(String filename) {
        try {
            BufferedReader br = new BufferedReader(new FileReader(new File(filename)));
            String line;
            CountNames = new ArrayList();
            Country = new HashMap<String,Integer>();
            ProdtoIndex = new HashMap<String,Integer>();
            ProdtoCount = new ArrayList<ArrayList>();
            ProdNames = new ArrayList();
            products = countries = 0;
            while((line=br.readLine())!=null) {
                String[] s = line.split(",");
                s[0] = s[0].trim();
                s[1] = s[1].trim();
                int k;
                // Assign a new index to a country seen for the first time.
                if(!Country.containsKey(s[0])) {
                    CountNames.add(s[0]);
                    Country.put(s[0],countries);
                    k = countries;
                    countries++;
                }
                else {
                    k =(Integer) Country.get(s[0]);
                }
                // Append the country index to the product's list, creating the list if needed.
                if(!ProdtoIndex.containsKey(s[1])) {
                    ProdNames.add(s[1]);
                    ArrayList n = new ArrayList();
                    ProdtoIndex.put(s[1],products);
                    n.add(k);
                    ProdtoCount.add(n);
                    products++;
                }
                else {
                    int ind =(Integer)ProdtoIndex.get(s[1]);
                    ArrayList c =(ArrayList) ProdtoCount.get(ind);
                    c.add(k);
                }
            }
            System.out.println(CountNames);
            System.out.println(ProdtoCount);
            System.out.println(ProdNames);

        } catch (FileNotFoundException ex) {
            Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(PairMatching.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    void FindPairs() {
        Pairs = new ArrayList[countries][countries];
        // For every product, add it to each pair of countries that exports it.
        for(int i=0;i<ProdNames.size();i++) {
            ArrayList curr = (ArrayList)ProdtoCount.get(i);
            for(int j=0;j<curr.size();j++) {
                for(int k=j+1;k<curr.size();k++) {
                    int u =(Integer)curr.get(j);
                    int v = (Integer)curr.get(k);
                    //System.out.println(u+","+v);
                    if(Pairs[u][v]==null) {
                        if(Pairs[v][u]!=null)
                            Pairs[v][u].add(i);
                        else {
                            Pairs[u][v] = new ArrayList();
                            Pairs[u][v].add(i);
                        }
                    }
                    else Pairs[u][v].add(i);
                }
            }
        }
        // Print every pair that has at least one product in common.
        for(int i=0;i<countries;i++) {
            for(int j=0;j<countries;j++) {
                if(Pairs[i][j]==null)
                    continue;
                ArrayList a = Pairs[i][j];
                System.out.print("\n"+CountNames.get(i)+","+CountNames.get(j)+" : ");
                for(int k=0;k<a.size();k++) {
                    System.out.print(ProdNames.get((Integer)a.get(k))+" ");
                }
            }
        }
    }

    public static void main(String[] args) {
       PairMatching pm = new PairMatching();
       pm.readfile("Input data/BigData.txt");
       pm.FindPairs();
    }
}

【Comments】:

The for loops are quite misleading: "for k in k1" should be "for k1 in k.values()", and the same applies to k2.

【Answer 3】:

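[Update] The algorithm presented here should not improve on the time complexity of the OP's original algorithm. Both algorithms have the same asymptotic complexity, and iterating through sorted lists (as the OP does) should generally perform better than using a hash table.

You need to group the items by product rather than by country, so that you can quickly fetch all the countries that belong to a given product.

This would be the pseudocode: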

inputList contains a list of pairs country, product

// group by product 
prepare mapA (product) => (list_of_countries)
for each country, product in inputList
      
   if mapA does not contain (product)
      create a new empty (list_of_countries) 
      and add it to mapA with (product) as key

   add this (country) to the (list_of_countries)


// now group by country_pair  
prepare mapB (country_pair) => (list_of_products)       
for each product, list_of_countries in mapA
   
   for each pair countryA, countryB in list_of_countries
   
      if mapB does not contain country_pair countryA, countryB
         create a new empty (list_of_products) 
         and add it to mapB with country_pair countryA, countryB as key

      add this (product) to the (list_of_products)

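If your input list has length N, and you have C distinct countries and P distinct products, then the running time of this algorithm should be O(N) for the first part and O(P*C^2) for the second part. Since your final list needs to map pairs of countries to lists of products, I don't think you can get rid of the P*C^2 complexity anyway.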
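A rough operation count makes the comparison with the question's approach concrete. The derivation below is a sketch under two assumptions that the thread does not state: each (country, product) record appears only once in the input, and the figures C ≈ 1200 and N ≈ 8×10^7 from the question apply. The symbols n_a (number of exports of country a) and c_p (number of countries exporting product p) are introduced here only for illustration.

```latex
\begin{align*}
% Pairwise merging of sorted lists (the approach in the question):
T_{\text{merge}} &\le \sum_{a<b} (n_a + n_b) = (C-1)\sum_a n_a = (C-1)\,N
  \approx 1199 \times 8\times10^{7} \approx 9.6\times10^{10} \\
% Grouping by product (this answer): one step per country pair per product
T_{\text{group}} &= \sum_p \binom{c_p}{2} \le P\binom{C}{2} \\
% Since c_p \le C and \sum_p c_p = N:
T_{\text{group}} &= \sum_p \frac{c_p\,(c_p-1)}{2} \le \frac{C-1}{2}\sum_p c_p
  = \frac{(C-1)\,N}{2} \approx 4.8\times10^{10}
\end{align*}
```

Under these assumptions both approaches are bounded by a constant times C*N, and the P*C^2 bound is loose whenever most products are exported by only a few countries.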

I don't write much Java, so I have added a C# example, which I believe you can port easily:

// mapA maps each product to a list of countries
var mapA = new Dictionary<string, List<string>>();
foreach (var t in inputList)
{
    List<string> countries = null;
    if (!mapA.TryGetValue(t.Product, out countries))
    {
        countries = new List<string>();
        mapA[t.Product] = countries;
    }
    countries.Add(t.Country);
}

// note (this is very important):
// CountryPair tuple must have value-type comparison semantics, 
// i.e. you need to ensure that two CountryPairs are compared
// by value to allow hashing (mapping) to work correctly, in O(1).

// In C# you can also simply use a Tuple<string,string> to 
// represent a pair of countries (which implements this correctly),
// but I used a custom class to emphasize the algorithm

// mapB maps each CountryPair to a list of products
var mapB = new Dictionary<CountryPair, List<string>>();
foreach (var kvp in mapA)
{
    var product = kvp.Key;
    var countries = kvp.Value;

    for (int i = 0; i < countries.Count; i++)
    {
        for (int j = i + 1; j < countries.Count; j++)
        {
            var pair = CountryPair.Create(countries[i], countries[j]);
            List<string> productsForCountryPair = null;
            if (!mapB.TryGetValue(pair, out productsForCountryPair))
            {
                productsForCountryPair = new List<string>();
                mapB[pair] = productsForCountryPair;
            }
            productsForCountryPair.Add(product);
        }
    }
}
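For a Java port, the key class needs value equality so that it can serve as a hash map key. The answer does not show CountryPair itself, so the class below is a hypothetical sketch: the factory name `of`, the field names and the decision to order the two names (so that (a, b) and (b, a) map to the same key) are assumptions, not the answerer's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical Java counterpart of the CountryPair used in the C# example.
final class CountryPair {
    private final String first;
    private final String second;

    private CountryPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Assumption: the pair is unordered, so the names are stored in sorted order.
    static CountryPair of(String a, String b) {
        return a.compareTo(b) <= 0 ? new CountryPair(a, b) : new CountryPair(b, a);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CountryPair)) {
            return false;
        }
        CountryPair other = (CountryPair) o;
        return first.equals(other.first) && second.equals(other.second);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: equal pairs produce equal hashes.
        return Objects.hash(first, second);
    }

    @Override
    public String toString() {
        return first + "," + second;
    }
}

class CountryPairDemo {
    public static void main(String[] args) {
        Map<CountryPair, Integer> counts = new HashMap<CountryPair, Integer>();
        counts.merge(CountryPair.of("US", "New Zealand"), 1, Integer::sum);
        counts.merge(CountryPair.of("New Zealand", "US"), 1, Integer::sum);
        System.out.println(counts); // prints {New Zealand,US=2}
    }
}
```

Both insertions land on the same key because equals and hashCode depend only on the two names, which is what keeps lookups in the pair map at expected O(1).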

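【Comments】:

I ran this new implementation, in which I group by exports rather than by country. It has been running since Friday evening. My question is how this new implementation reduces the computational complexity. I find the complexity is almost the same; the only difference is the intersection between lists. Correct me if I am wrong.

The most important part of this algorithm is implementing the CountryPair equality/hashing correctly. This means you must override equals/hashCode (Equals/GetHashCode in C#) and make sure your hash function does not produce too many collisions. You can also use any tuple class that implements equality this way (e.g. javatuples). If your hash function is poorly written, you will lose the O(1) lookup in the country pair map and the running time will get much worse.

@prog_guy: I have also edited my answer with some complexity analysis, so you can check whether my numbers are right.

"You have about 65,000 distinct products (80M / 1.2k = 66.6k)": I have about 3.7 million distinct products, not 66K. I still disagree that my intersect method is O(n^2); I still think it is O(n). As for the CountryPair class, it holds 2 strings and I am currently using the generic Objects.hash(country1, country2). I will take a look at javatuples as well.

@prog_guy: Yes, you are right, in the worst case intersect iterates both lists to the end, i.e. O(2n) = O(n). That does make the two complexities equal (unless you take the sorting part into account, O(n log n), which runs quickly); sorry I did not analyze your code better the first time. Your code may even exhibit better locality of reference, but you can reduce allocations by using an ArrayList instead of a LinkedList.

【Answer 4】: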

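This is a good use case for MapReduce.

In your map phase you simply collect all the exports that belong to each country. The reducer then sorts the products (the products belong to the same country, thanks to the mapper).

You would benefit from a distributed, parallel algorithm that can be spread across a cluster.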
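To make the two phases concrete, here is a single-machine analogue using Java parallel streams. It is not a Hadoop job and does not use any MapReduce framework API; the class `GroupAndSort` and its method are hypothetical names, and each record is assumed to be a two-element array holding the country and the export.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

class GroupAndSort {
    static Map<String, List<String>> groupAndSort(List<String[]> records) {
        final Map<String, List<String>> grouped = new ConcurrentHashMap<String, List<String>>();
        // "Map" phase: route each record to the bucket of its country, in parallel.
        records.parallelStream().forEach(r ->
                grouped.computeIfAbsent(r[0],
                        k -> Collections.synchronizedList(new ArrayList<String>()))
                       .add(r[1]));
        // "Reduce" phase: sort each country's exports independently, in parallel.
        grouped.values().parallelStream().forEach(exports -> Collections.sort(exports));
        // Copy into a TreeMap only to get a deterministic key order for printing.
        return new TreeMap<String, List<String>>(grouped);
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"US", "milk"}, new String[] {"NZ", "sheep"},
                new String[] {"US", "apple"}, new String[] {"NZ", "apple"});
        System.out.println(groupAndSort(records)); // prints {NZ=[apple, sheep], US=[apple, milk]}
    }
}
```

Note that this covers only the grouping and sorting that the answer describes; the pairwise intersection still has to be done afterwards.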

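【Comments】:

If he has only 1,200 keys for 80 million records, then MapReduce is the worst choice for this problem (in fact, it depends on the grouping algorithm and implementation used). Also, 80M is not so large that you need a distributed system. A large number of records relative to the number of keys is actually a problem for MapReduce. A proper grouping design will improve the program's performance. In addition, the hardware (i.e. the number of cores) is a key factor. But in general this is better than the initial solution given by prog_guy.

【Answer 5】: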

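You are actually spending O(n^2 * time required for one intersection).

Let's see whether we can improve the intersection time. We can maintain, for each country, a map that stores its products, so you have n hash maps for n countries. Initializing them only requires a single pass over all the products. If you want fast lookups, maintain the maps as: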

    HashMap<String,HashMap<String,Boolean>> countryMap = new HashMap<String, HashMap<String,Boolean>>();

Now, if you want to find the common products of countries str1 and str2, do the following:

    HashMap<String,Boolean> map1 = countryMap.get("str1");
    HashMap<String,Boolean> map2 = countryMap.get("str2");

    ArrayList<String> common = new ArrayList<String>();
    Iterator<Map.Entry<String,Boolean>> it = map1.entrySet().iterator();
    while (it.hasNext()) {
        Map.Entry<String,Boolean> pairs = it.next();

        // Add to common if it is also present in the other map
        if (map2.containsKey(pairs.getKey())) {
            common.add(pairs.getKey());
        }
    }

So, assuming the hash map lookup is O(1) (I guess it is log k in Java), the total would be O(n^2 * k) if there are k entries in a map.
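The same lookup-based intersection can be written more compactly with sets. This sketch swaps the answer's HashMap<String,Boolean> for a HashSet<String>, which is an assumption made for brevity; the class and method names are hypothetical. It copies the smaller set before calling retainAll, because retainAll modifies the collection it is called on.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetIntersection {
    // Returns the products present in both sets and leaves the inputs untouched.
    static Set<String> common(Set<String> a, Set<String> b) {
        // Copy the smaller set and probe the larger one, so the number of
        // hash lookups is the size of the smaller set.
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        Set<String> result = new HashSet<String>(smaller);
        result.retainAll(larger);
        return result;
    }

    public static void main(String[] args) {
        Set<String> austrailia = new HashSet<String>(Arrays.asList("apple", "sheep", "koalas"));
        Set<String> newZealand = new HashSet<String>(Arrays.asList("apple", "sheep", "milk"));
        // The result contains apple and sheep (HashSet iteration order is unspecified).
        System.out.println(common(austrailia, newZealand));
    }
}
```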

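【Comments】:

The complexity is O(n), and yes, lookups in Java are O(1). Your loop construct can be simplified to map1.keySet().retainAll(map2.keySet()).

@ThomasJungblut There are O(n^2) combinations of n countries (nC2). If each country has O(k) products, then the overall complexity will be O(n^2 * k), since, as you mentioned, lookup in Java is O(1).

【Answer 6】: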

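Use hash maps where necessary to speed things up:

1) Go through the data and create a map from each item to the list of countries associated with that item. So, for example, sheep: Australia, US, UK, New Zealand, ...

2) Create a hash map with a key for each pair of countries and (initially) an empty list as the value.

3) For each item, retrieve the list of countries associated with it and, for each pair of countries in that list, add the item to the list created for that pair in step (2).

4) Now output the updated list for each pair of countries.

The biggest costs are in steps (3) and (4), and both of these costs are linear in the size of the output produced, so I think this is not far from optimal.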
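The four steps above can be sketched in Java as follows. This is an illustration under assumptions rather than the answerer's code: the pair key is a plain "A,B" string (which assumes country names contain no commas), the pair lists are created lazily on first use instead of up front as in step (2), and TreeMap/TreeSet are used only to make the output order deterministic; hash-based collections would be the faster choice for 80 million records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class CommonExports {
    // Each record is a two-element array: {country, product}.
    static Map<String, List<String>> commonByPair(List<String[]> records) {
        // Step 1: product -> sorted set of countries that export it.
        Map<String, TreeSet<String>> countriesByProduct = new TreeMap<String, TreeSet<String>>();
        for (String[] r : records) {
            TreeSet<String> set = countriesByProduct.get(r[1]);
            if (set == null) {
                set = new TreeSet<String>();
                countriesByProduct.put(r[1], set);
            }
            set.add(r[0]);
        }
        // Steps 2 and 3: for every product, add it to each pair of its countries.
        Map<String, List<String>> productsByPair = new TreeMap<String, List<String>>();
        for (Map.Entry<String, TreeSet<String>> e : countriesByProduct.entrySet()) {
            List<String> countries = new ArrayList<String>(e.getValue());
            for (int i = 0; i < countries.size(); i++) {
                for (int j = i + 1; j < countries.size(); j++) {
                    String pair = countries.get(i) + "," + countries.get(j);
                    List<String> products = productsByPair.get(pair);
                    if (products == null) {
                        products = new ArrayList<String>();
                        productsByPair.put(pair, products);
                    }
                    products.add(e.getKey());
                }
            }
        }
        return productsByPair;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"Austrailia", "Sheep"}, new String[] {"Austrailia", "Apple"},
                new String[] {"New Zealand", "Apple"}, new String[] {"New Zealand", "Sheep"},
                new String[] {"New Zealand", "Milk"}, new String[] {"US", "Apple"},
                new String[] {"US", "Milk"});
        // Step 4: output the list for each pair of countries.
        for (Map.Entry<String, List<String>> e : commonByPair(records).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

On the test data from Answer 1 this yields the same three pairs as the SQL query: Apple and Sheep for Austrailia and New Zealand, Apple for Austrailia and US, and Apple and Milk for New Zealand and US.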

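【Comments】:

This is the same as my answer, in words.

So it is. When I first looked at it I saw a bunch of nested for loops and did not read any further. I think I provide a little more detail in claiming that it is optimal.

No problem :), it is always good for an answer to be backed by more than one person.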
