Search multiple HashMaps at the same time

Posted 2023-02-25

【Title】: Search multiple HashMaps at the same time 【Posted】: 2015-10-21 17:30:12 【Question】:

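tl;dr: How can I search for an entry in several (read-only) Java HashMaps at the same time?

Long version:

I have several dictionaries of different sizes stored as HashMap<String, String>. Once they have been read in, they are never changed (strictly read-only). I want to check whether, and in which, dictionary an entry is stored under my key.

My code originally looked up the key like this: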

public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        if (map.containsKey(key)) {
            return new DictionaryEntry(map.get(key), i);
        }
    }
    return null;
}

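Then it got more complicated: my search string may contain typos, or may be a variant of the stored entry. For instance, if the stored key is "banana", I might look up a misspelled or variant form of it (for example "a banana") and still want the entry for "banana" returned. Using the Levenshtein distance, I now iterate over all dictionaries and every entry in them: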

public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // Calculate Levenshtein distance, store closest match etc.
        }
    }
    // return closest match or null.
}

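So far everything works and I get the entries I want. Unfortunately, I have to look up about 7,000 strings in five dictionaries of different sizes (roughly 30k to 70k entries each), and that takes a while. Judging from my processing output, I have the strong impression that the lookups dominate the overall runtime.

My first idea for improving the runtime was to search all dictionaries in parallel. Since none of the dictionaries is ever modified, and no more than one thread would access a given dictionary at the same time, I don't see any safety concerns.

The only question is: how do I do this? I have never used multithreading before. My search only turned up ConcurrentHashMap (which, as far as I understand, I don't need) and the Runnable interface, where I would have to put my processing into the run() method. I think I could rewrite my current class to fit Runnable, but I wonder whether there is an easier way (or how I could do it simply with Runnable; with my limited understanding, it currently looks like I would have to restructure a lot).

Since I was asked to share the Levenshtein logic: it's really nothing fancy, but here you go: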

private int _maxLSDistance = 10;

public Map.Entry<String, String> getClosestMatch(String key) {
    Map.Entry<String, String> _closestMatch = null;
    int lsDist = Integer.MAX_VALUE; // distance of the closest match stored so far

    if (key == null) {
        return null;
    }

    for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
        // Perfect match
        if (entry.getKey().equals(key)) {
            return entry;
        }

        // Similar match
        int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);

        // Keep the entry if "dist" is smaller than the threshold
        // and smaller than the distance of the already stored entry
        if (dist < _maxLSDistance && dist < lsDist) {
            _closestMatch = entry;
            lsDist = dist;
        }
    }
    return _closestMatch;
}

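【Question comments】:

I suggest looking into better data partitioning. This sounds like a good fit for a Trie structure.

Thinking about trees, I guess you mean that when looking for "banana" I would only consider entries starting with "B", right? But what if my key is "a banana"? How would I get any hits?

Would you be willing to share your Levenshtein distance logic? It might help reduce the runtime.

@Babel: I edited my text and added the Levenshtein distance. I didn't write the calculation myself, I just used StringUtils.

【Answer 1】:

To use multithreading in your case, it could look something like this:

A "monitor" class, mainly to store the results and coordinate the threads: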

import java.util.ArrayList;

public class Results {

    private int nrOfDictionaries = 4; // dictionaries still being searched

    private ArrayList<String> results = new ArrayList<String>();

    public synchronized void prepare() {
        nrOfDictionaries = 4;
        results = new ArrayList<String>();
    }

    public synchronized void oneDictionaryFinished() {
        nrOfDictionaries--;
        System.out.println("one dictionary finished");
        notifyAll();
    }

    // Blocks until every dictionary has reported that it is finished.
    public synchronized boolean isReady() throws InterruptedException {
        while (nrOfDictionaries != 0) {
            wait();
        }
        return true;
    }

    public synchronized void addResult(String result) {
        results.add(result);
    }

    public synchronized ArrayList<String> getAllResults() {
        return results;
    }
}

Your own Thread subclass, which can be set up to search one specific dictionary:

public class ThreadDictionarySearch extends Thread {

    // the actual dictionary
    private String dictionary;
    private Results results;

    public ThreadDictionarySearch(Results results, String dictionary) {
        this.dictionary = dictionary;
        this.results = results;
    }

    @Override
    public void run() {

        for (int i = 0; i < 4; i++) {
            // search dictionary;
            results.addResult("result of " + dictionary);
            System.out.println("adding result from " + dictionary);
        }

        results.oneDictionaryFinished();
    }
}

And a main method for demonstration:

public static void main(String[] args) throws Exception {

    Results results = new Results();

    ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
    ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
    ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
    ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");

    threadA.start();
    threadB.start();
    threadC.start();
    threadD.start();

    // isReady() blocks here until all dictionaries are searched,
    // because "Results" is told to wait() while not finished.
    if (results.isReady()) {
        for (String string : results.getAllResults()) {
            System.out.println("RESULT: " + string);
        }
    }
}

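【Comments】:

That won't work if he wants to cover typos. "Zbanana" is more similar to "banana" than "basfdfsdfsdf" is, but it will be farther away in a sorted map...

The dictionaries (in the text files) should already be sorted. A TreeMap won't iterate over every entry, and SortedMap is also thread-safe;

From what I remember about trees and TreeMaps, if I look for "banana" I would only (or first) consider entries starting with "B", right? But what if I am looking for "a banana", or "zbanana" as @cichystefan suggested, and the entries under "A" or "Z" yield no results? Would I have to iterate over all remaining entries again?

Yes. If you could somehow use any kind of sorted structure, you could sort the strings by length and focus on searching only the entries of similar length in that structure...

【Answer 2】:

I think the simplest approach is to use a stream over the entry set: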

public DictionaryEntry getEntry(String key) {
  for (int i = 0; i < _numDictionaries; i++) {
    HashMap<String, String> map = getDictionary(i);

    map.entrySet().parallelStream().forEach((entry) -> {
      // Calculate Levenshtein distance, store closest match etc.
      // (any shared state updated here must be thread-safe)
    });
  }
  // return closest match or null.
}

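Of course, this assumes you are using Java 8. You could also wrap the outer loop into an IntStream, or use Stream.reduce directly to get the entry with the smallest distance.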
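As a rough sketch of the "reduce to the smallest distance" idea, the following uses `Stream.min` (a special case of reduction) on a parallel stream per dictionary, which avoids updating shared state from inside `forEach`. The `Match` holder, the class name and the hand-written `levenshtein` helper are assumptions made for this illustration only; the helper merely stands in for `StringUtils.getLevenshteinDistance` so that the example has no external dependency.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ParallelClosestMatch {

    /** Hypothetical result holder: matched key, its value, dictionary index, distance. */
    static final class Match {
        final String key;
        final String value;
        final int dictionary;
        final int distance;

        Match(String key, String value, int dictionary, int distance) {
            this.key = key;
            this.value = value;
            this.dictionary = dictionary;
            this.distance = distance;
        }
    }

    /** Plain two-row dynamic-programming Levenshtein distance (stand-in for StringUtils). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the entry closest to key over all dictionaries, if any is below maxDistance. */
    static Optional<Match> closest(List<Map<String, String>> dictionaries, String key, int maxDistance) {
        Match best = null;
        for (int i = 0; i < dictionaries.size(); i++) {
            final int index = i;
            // The entries of one dictionary are scored in parallel, then reduced to the minimum.
            Optional<Match> candidate = dictionaries.get(i).entrySet().parallelStream()
                    .map(e -> new Match(e.getKey(), e.getValue(), index, levenshtein(e.getKey(), key)))
                    .filter(m -> m.distance < maxDistance)
                    .min(Comparator.comparingInt((Match m) -> m.distance));
            if (candidate.isPresent() && (best == null || candidate.get().distance < best.distance)) {
                best = candidate.get();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

When several entries share the same smallest distance, which of them is returned is not specified here, because a HashMap's entry stream is unordered.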

【Comments】:

【Answer 3】:

Maybe try a thread pool:

ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
for (int i = 0; i < _numDictionaries; i++) {
    // prepare a Runnable implementation that contains the logic of your search
    es.submit(prepared_runnable);
}

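I believe you could also try to quickly rule out strings that cannot match at all (i.e. whose lengths differ significantly), and use that to finish your logic as early as possible and move on to the next candidate.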
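One way both suggestions might fit together is sketched below: one `Callable` per dictionary is handed to a fixed thread pool via `invokeAll`, and each task skips candidates whose length differs from the key by at least the threshold, since the Levenshtein distance can never be smaller than the length difference. The class name, the `Match` holder and the `levenshtein` helper (a stand-in for `StringUtils.getLevenshteinDistance`) are assumptions for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledDictionarySearch {

    /** Hypothetical result holder. */
    static final class Match {
        final String key;
        final String value;
        final int dictionary;
        final int distance;

        Match(String key, String value, int dictionary, int distance) {
            this.key = key;
            this.value = value;
            this.dictionary = dictionary;
            this.distance = distance;
        }
    }

    /** Plain two-row dynamic-programming Levenshtein distance (stand-in for StringUtils). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Sequentially scans one dictionary, skipping candidates that cannot be close enough. */
    static Match searchOne(Map<String, String> dict, int index, String key, int maxDistance) {
        Match best = null;
        for (Map.Entry<String, String> e : dict.entrySet()) {
            String candidate = e.getKey();
            // Cheap pre-filter: the distance is at least the difference in length.
            if (Math.abs(candidate.length() - key.length()) >= maxDistance) {
                continue;
            }
            int d = levenshtein(candidate, key);
            if (d < maxDistance && (best == null || d < best.distance)) {
                best = new Match(candidate, e.getValue(), index, d);
            }
        }
        return best;
    }

    /** Searches all dictionaries in parallel; returns the closest match or null. */
    static Match search(List<Map<String, String>> dicts, String key, int maxDistance) {
        if (dicts.isEmpty()) {
            return null;
        }
        ExecutorService pool = Executors.newFixedThreadPool(dicts.size());
        try {
            List<Callable<Match>> tasks = new ArrayList<>();
            for (int i = 0; i < dicts.size(); i++) {
                final int index = i;
                tasks.add(() -> searchOne(dicts.get(index), index, key, maxDistance));
            }
            Match best = null;
            for (Future<Match> f : pool.invokeAll(tasks)) { // blocks until all tasks are done
                Match m = f.get();
                if (m != null && (best == null || m.distance < best.distance)) {
                    best = m;
                }
            }
            return best;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("search failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool is created per call only to keep the sketch self-contained. For thousands of lookups it would probably be better to create the pool once and reuse it.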

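【Comments】:

【Answer 4】:

I very much doubt that HashMaps are a suitable solution here, especially if you want fuzziness and stop words. You should use a proper full-text search solution such as Elasticsearch or Apache Solr, or at least an available engine like Apache Lucene.

That said, you can use a poor man's version: create an array of maps and a SortedMap, iterate over the array, take the keys of the current HashMap and store them in the SortedMap together with the index of their HashMap. To retrieve a key, first search for it in the SortedMap, use the stored index to get the corresponding HashMap from the array, and then look the key up in just that one HashMap. This should be fast enough without needing multiple threads to dig through the HashMaps. However, you could turn the code below into a Runnable and do multiple lookups in parallel.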

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class Search {

    public static void main(String[] arg) {

        if (arg.length == 0) {
            System.out.println("Must give a search word!");
            System.exit(1);
        }

        String searchString = arg[0].toLowerCase();

        /*
         * Populating our HashMaps. Keys are stored in lowercase so that the
         * lowercased search string can be used for the final lookup.
         */
        HashMap<String, String> english = new HashMap<String, String>();
        english.put("banana", "fruit");
        english.put("tomato", "vegetable");

        HashMap<String, String> german = new HashMap<String, String>();
        german.put("banane", "Frucht");
        german.put("tomate", "Gemüse");

        /*
         * Now we create our ArrayList of HashMaps for fast retrieval
         */

        List<HashMap<String, String>> maps = new ArrayList<HashMap<String, String>>();
        maps.add(english);
        maps.add(german);


        /*
         * This is our index
         */
        SortedMap<String, Integer> index = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);


        /*
         * Populating the index:
         */
        for (int i = 0; i < maps.size(); i++) {
            // We iterate through our HashMaps...
            HashMap<String, String> currentMap = maps.get(i);

            for (String key : currentMap.keySet()) {
                /* ...and populate our index with lowercase versions of the keys,
                 * referencing the map from which the key originates.
                 */
                index.put(key.toLowerCase(), i);
            }

        }


        // In case our index contains our search string...
        if (index.containsKey(searchString)) {

            /*
             * ... we find out in which map of the ones stored in maps
             * the word in the index originated from.
             */
            Integer mapIndex = index.get(searchString);

            /*
             * Next, we look up said map.
             */
            HashMap<String, String> origin = maps.get(mapIndex);

            /*
             * Last, we retrieve the value from the origin map
             */

            String result = origin.get(searchString);

            /*
             * The above steps can be shortened to
             *  String result = maps.get(index.get(searchString).intValue()).get(searchString);
             */

            System.out.println(result);
        } else {
            System.out.println("\"" + searchString + "\" is not in the index!");
        }
    }
}

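Note that this is a rather naive implementation, meant for illustration only. It leaves several problems unaddressed (for example, you cannot have duplicate index entries).

With this solution you are basically trading startup speed for query speed.

【Comments】:

Since I am still experimenting with the dictionaries, adding Elasticsearch or Solr feels like overkill to me. What I am really interested in right now is just how to do independent things in parallel.

@fukiburi Forgive me, but as I understand your question, you are looking for an efficient way to find key/value pairs originating from multiple read-only HashMaps. To me, reinventing the wheel seems like overkill ;)

Haha, yes, depending on the point of view, one or the other may be overkill. My dictionaries and queries are actually quite simple. There may be some errors in the search strings, but the Levenshtein distance is enough to cover them (here). The point right now is not perfect entry matching; I just want to improve the runtime to speed up my experiments.

【Answer 5】:

Alright!

Since your concern is getting a faster response,

I suggest dividing the work between threads.

Say you have 5 dictionaries: you might give three dictionaries to one thread and have the remaining two handled by another thread. Then whichever thread finds a match stops or terminates the other thread.

You may need some extra logic to make this work... but it should not hurt your execution time.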
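A minimal sketch of that split, under stated assumptions: two tasks share an `AtomicBoolean` flag, and as soon as one of them sees a perfect match (distance 0) it sets the flag so the other stops scanning at its next entry. The class name, the `Match` holder and the `levenshtein` helper (standing in for `StringUtils.getLevenshteinDistance`) are hypothetical names introduced for this illustration, and a cooperative flag is just one possible way to "stop" the other thread.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

final class SplitDictionarySearch {

    /** Hypothetical result holder. */
    static final class Match {
        final String key;
        final String value;
        final int dictionary;
        final int distance;

        Match(String key, String value, int dictionary, int distance) {
            this.key = key;
            this.value = value;
            this.dictionary = dictionary;
            this.distance = distance;
        }
    }

    /** Plain two-row dynamic-programming Levenshtein distance (stand-in for StringUtils). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scans dictionaries [from, to) and stops early once any thread has an exact match. */
    static Match scan(List<Map<String, String>> dicts, int from, int to,
                      String key, int maxDistance, AtomicBoolean exactFound) {
        Match best = null;
        for (int i = from; i < to; i++) {
            for (Map.Entry<String, String> e : dicts.get(i).entrySet()) {
                if (exactFound.get()) {
                    return best; // another thread already holds a perfect match
                }
                int d = levenshtein(e.getKey(), key);
                if (d == 0) {
                    exactFound.set(true); // signal the other thread to stop
                    return new Match(e.getKey(), e.getValue(), i, 0);
                }
                if (d < maxDistance && (best == null || d < best.distance)) {
                    best = new Match(e.getKey(), e.getValue(), i, d);
                }
            }
        }
        return best;
    }

    /** Splits the dictionaries over two threads; returns the closest match or null. */
    static Match search(List<Map<String, String>> dicts, String key, int maxDistance) {
        int split = (dicts.size() + 1) / 2; // e.g. 5 dictionaries -> 3 + 2
        AtomicBoolean exactFound = new AtomicBoolean(false);
        Callable<Match> firstPart = () -> scan(dicts, 0, split, key, maxDistance, exactFound);
        Callable<Match> secondPart = () -> scan(dicts, split, dicts.size(), key, maxDistance, exactFound);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Match> a = pool.submit(firstPart);
            Future<Match> b = pool.submit(secondPart);
            Match x = a.get();
            Match y = b.get();
            if (x == null) return y;
            if (y == null) return x;
            return x.distance <= y.distance ? x : y;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("search failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Early termination only helps when a perfect match exists. For misspelled keys both threads still have to scan all of their entries.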

You may also need to make a few more changes to your code for finding the closest match:

for (Map.Entry entry : _dictionary.entrySet())

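You are using the entrySet, but you are not using the values anyway, and getting the entry set seems somewhat expensive. I suggest using just the keySet, since you are not really interested in the values of that map:

for (String key : _dictionary.keySet())

For details on maps, read this link: Map performances

Iteration over the collection views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.

【Comments】:

Thanks for the information about map performance. I'll keep that in mind and maybe reconsider my algorithm.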
