How to find the longest string in a TRIE

Posted 2023-02-22


【Title】How to find the longest string in a TRIE 【Posted】2015-05-30 09:26:33 【Question】:

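I have implemented the Trie data structure with two classes: Trie and TrieNode. I need to write a function that returns the longest string in the Trie in O(h). My TrieNode has a LinkedList field that stores each node's children. We have not covered BFS or DFS yet, so I am trying to find a creative way to solve this.

I already have a (separate) function that inserts/creates a new node for a given character. While building the Trie, I create each node with a field "maxDepth=0" that indicates my current depth. For every new node I create, I walk up to its parent (each node already has a pointer to its parent), and so on until I reach the root, increasing the depth of each ancestor by 1.

I then plan to write the function that returns the longest string like this: for each node, iterate over its children, look for the largest "maxDepth", and go down into that child. Repeat until you reach 'maxDepth==0'. For example, my algorithm works for this string: "aacgace"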

       root      
       / \
   (2)a   g(0)     
     / 
 (1)c        
   / 
(0)e

=> 'ace' is indeed the longest. But it does not work for this string: "aacgae"

      root      
      /  \
   (2)a   g(0)     
    /  \
 (0)c  (0)e

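=> It looks as if node 'a' has a child whose child also has a child, but that is not the case.

In general, I am trying to make use of the first function, the one that builds the Trie (running time: O(h*c)), so that the running time of the second function (the one that returns the longest string) is as small as I can make it: O(h).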

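【Comments】:

This is a rather strange way of using a trie... First of all, what are the words in your trie?

Does it matter? The whole project is about encoding a text file with a Trie, so the words can be made of any English letters.

Then I don't understand at all how you build your trie. I do have a trie implementation; have a look at how I do it. Note that you could also try a radix tree.

Can't you do a breadth-first traversal to reach the deepest node, then walk back up to the root from there to get the string?

【Answer 1】: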

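Not sure what you really want to do, but you can find a trie example here.

Basically, I create the trie through a builder; let's take a quick look at how a word is added to the trie: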

// In TrieBuilder
final TrieNodeBuilder nodeBuilder = new TrieNodeBuilder();

// ...

/**
 * Add one word to the trie
 *
 * @param word the word to add
 * @return this
 * @throws IllegalArgumentException word is empty
 */
public TrieBuilder addWord(@Nonnull final String word)
{
    Objects.requireNonNull(word);

    final int length = word.length();

    if (length == 0)
        throw new IllegalArgumentException("a trie cannot have empty "
            + "strings (use EMPTY instead)");
    nrWords++;
    maxLength = Math.max(maxLength, length);
    nodeBuilder.addWord(word);
    return this;
}

This delegates adding the word to the TrieNodeBuilder, which does the following:

private boolean fullWord = false;

private final Map<Character, TrieNodeBuilder> subnodes
    = new TreeMap<>();

TrieNodeBuilder addWord(final String word)
{
    doAddWord(CharBuffer.wrap(word));
    return this;
}

/**
 * Add a word
 *
 * <p>Here also, a {@link CharBuffer} is used, which changes position as we
 * progress into building the tree, character by character, node by node.
 * </p>
 *
 * <p>If the buffer is "empty" when entering this method, it means a match
 * must be recorded (see {@link #fullWord}).</p>
 *
 * @param buffer the buffer (never null)
 */
private void doAddWord(final CharBuffer buffer)
{
    if (!buffer.hasRemaining()) {
        fullWord = true;
        return;
    }

    final char c = buffer.get();
    TrieNodeBuilder builder = subnodes.get(c);
    if (builder == null) {
        builder = new TrieNodeBuilder();
        subnodes.put(c, builder);
    }
    builder.doAddWord(buffer);
}

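Suppose we add "trouble" and "troubling" to the trie; here is what happens:

The first time, a node is created for each character of "trouble"; the second time, all nodes up to 'l' already exist, so only the nodes for "ing" are created.

Now, if we add "troubles", another node is created for 's' after the 'e'.

The fullWord variable tells us whether there is a potential full match at this node; here is the search function: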

public final class Trie
{
    private final int nrWords;
    private final int maxLength;
    private final TrieNode node;

    // ...

    /**
     * Search for a string into this trie
     *
     * @param needle the string to search
     * @return the length of the match (ie, the string) or -1 if not found
     */
    public int search(final String needle)
    {
        return node.search(needle);
    }
    // ...
}

And in TrieNode we have:

public final class TrieNode
{
    private final boolean fullWord;

    private final char[] nextChars;
    private final TrieNode[] nextNodes;

    // ...

    public int search(final String needle)
    {
        return doSearch(CharBuffer.wrap(needle), fullWord ? 0 : -1, 0);
    }

    /**
     * Core search method
     *
     * <p>This method uses a {@link CharBuffer} to perform searches, and changes
     * this buffer's position as the match progresses. The two other arguments
     * are the depth of the current search (ie the number of nodes visited
     * since root) and the index of the last node where a match was found (ie
     * the last node where {@link #fullWord} was true).</p>
     *
     * @param buffer the charbuffer
     * @param matchedLength the last matched length (-1 if no match yet)
     * @param currentLength the current length walked by the trie
     * @return the length of the match found, -1 otherwise
     */
    private int doSearch(final CharBuffer buffer, final int matchedLength,
        final int currentLength)
    {
        /*
         * Try and see if there is a possible match here; there is if "fullword"
         * is true, in this case the next "matchedLength" argument to a possible
         * child call will be the current length.
         */
        final int nextLength = fullWord ? currentLength : matchedLength;


        /*
         * If there is nothing left in the buffer, we have a match.
         */
        if (!buffer.hasRemaining())
            return nextLength;

        /*
         * OK, there is at least one character remaining, so pick it up and see
         * whether it is in the list of our children...
         */
        final int index = Arrays.binarySearch(nextChars, buffer.get());

        /*
         * If not, we return the last good match; if yes, we call this same
         * method on the matching child node with the (possibly new) matched
         * length as an argument and a depth increased by 1.
         */
        return index < 0
            ? nextLength
            : nextNodes[index].doSearch(buffer, nextLength, currentLength + 1);
    }
}

Note how -1 is passed as the "matchedLength" argument on the first call to doSearch() (the root is not a full word here).

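Suppose we have a trie containing the three words above; here is the sequence of calls for a search of "tr", which fails:

doSearch("tr", -1, 0) (node is the root);
doSearch("tr", -1, 1) (node is 't');
doSearch("tr", -1, 2) (node is 'r');
no next character: return nextLength; nextLength is -1, so there is no match.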

Now, if we search for "troubles":

doSearch("troubles", -1, 0) (node is the root);
doSearch("troubles", -1, 1) (node is 't');
doSearch("troubles", -1, 2) (node is 'r');
doSearch("troubles", -1, 3) (node is 'o');
doSearch("troubles", -1, 4) (node is 'u');
doSearch("troubles", -1, 5) (node is 'b');
doSearch("troubles", -1, 6) (node is 'l');
doSearch("troubles", -1, 7) (node is 'e');
doSearch("troubles", 7, 8) (fullWord is true! node is 's');
no next character: return nextLength, which is 8; we have a match.
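The two traces above can be reproduced with a much smaller, self-contained sketch. This is not the answer's implementation: the class name PrefixMatchTrie is hypothetical, nodes are mutable, children are kept in a TreeMap instead of the sorted nextChars/nextNodes arrays, and the recursion is replaced by a loop. It only aims to show the same matching rule, namely "remember the depth of the last node where fullWord was true".

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified trie used only to illustrate the search rule above. */
class PrefixMatchTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean fullWord = false; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, creating only the nodes that do not exist yet. */
    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.fullWord = true;
    }

    /**
     * Returns the length of the longest stored word that is a prefix of
     * needle (possibly needle itself), or -1 if there is none.
     */
    int search(String needle) {
        Node node = root;
        int matchedLength = root.fullWord ? 0 : -1;
        for (int depth = 0; depth < needle.length(); depth++) {
            node = node.children.get(needle.charAt(depth));
            if (node == null) {
                break; // no child for this character: keep the last good match
            }
            if (node.fullWord) {
                matchedLength = depth + 1;
            }
        }
        return matchedLength;
    }

    public static void main(String[] args) {
        PrefixMatchTrie trie = new PrefixMatchTrie();
        trie.addWord("trouble");
        trie.addWord("troubling");
        trie.addWord("troubles");
        System.out.println(trie.search("tr"));       // prints -1
        System.out.println(trie.search("troubles")); // prints 8
    }
}
```

Because the last good match is remembered, a needle such as "troublesome" would also return 8 in this sketch: the walk stops after 's', where fullWord was last true.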

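【Comments】:

【Answer 2】:

Well, you have the right idea: if you want to find the longest string without traversing the whole tree, you have to store some information while building it. Suppose that for node i we store the maximum length in max_depth[i], and the child that leads to that maximum length in max_child[i]. Then, for every new word you insert into the trie, remember the last node you inserted (which is also a new leaf, representing the last character of the string) and do the following: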

current = last_inserted_leaf
while (current != root):
    if max_depth[parent[current]] < max_depth[current] + 1:
        max_depth[parent[current]] = max_depth[current] + 1
        max_child[parent[current]] = current
    current = parent[current]

Now, to output the longest string, just do the following:

current = root
while is_not_leaf(current):
    answer += char_of_child[max_child[current]]
    current = max_child[current]
return answer

So insertion takes 2*n = O(n) operations, and finding the longest string takes O(h), where h is the length of the longest string.
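The two pseudocode fragments above can be put together in a minimal Java sketch. The class DepthTrackingTrie and its fields are hypothetical names chosen for illustration; the arrays max_depth, max_child and parent from the pseudocode are assumed to become fields of each node, and children are stored in a HashMap rather than the LinkedList mentioned in the question.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical trie that keeps, per node, the longest path below it. */
class DepthTrackingTrie {
    private static final class Node {
        final Node parent;   // null for the root
        final char ch;       // character leading to this node (unused for the root)
        final Map<Character, Node> children = new HashMap<>();
        int maxDepth = 0;    // length of the longest path below this node
        Node maxChild = null; // child where that longest path starts

        Node(Node parent, char ch) {
            this.parent = parent;
            this.ch = ch;
        }
    }

    private final Node root = new Node(null, '\0');

    /** Inserts a word, then updates maxDepth/maxChild from its last node up to the root. */
    void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            Node parent = current;
            current = parent.children.computeIfAbsent(c, k -> new Node(parent, k));
        }
        // Same loop as the pseudocode: walk up and raise each ancestor if needed.
        while (current != root) {
            Node parent = current.parent;
            if (parent.maxDepth < current.maxDepth + 1) {
                parent.maxDepth = current.maxDepth + 1;
                parent.maxChild = current;
            }
            current = parent;
        }
    }

    /** Follows maxChild pointers from the root; takes O(h) steps. */
    String longest() {
        StringBuilder answer = new StringBuilder();
        Node current = root;
        while (current.maxChild != null) {
            current = current.maxChild;
            answer.append(current.ch);
        }
        return answer.toString();
    }

    public static void main(String[] args) {
        DepthTrackingTrie trie = new DepthTrackingTrie();
        trie.insert("ace");
        trie.insert("g");
        System.out.println(trie.longest()); // prints ace
    }
}
```

Unlike the scheme in the question, an ancestor's depth is only raised when the new path is actually longer, so inserting "ac", "g" and "ae" leaves node 'a' with a depth of 1 rather than 2. Ties are resolved in favour of the word inserted first.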

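However, the algorithm described above uses O(n) extra memory, which is too much. The simplest approach is to store max_string somewhere and, every time a string is added to the trie, compare the length of new_string with the length of max_string; if the new length is greater, assign max_string = new_string. This uses less memory, and the longest string is found in O(1).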
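As a rough illustration of this simpler idea, the sketch below keeps a single maxString field next to the trie and updates it on every insertion. The class name LongestTrackingTrie and its node layout are assumptions made for the example, not code from the question or from either answer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical trie that remembers the longest word seen so far. */
class LongestTrackingTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean fullWord = false; // true if a stored word ends at this node
    }

    private final Node root = new Node();
    private String maxString = ""; // longest word inserted so far

    /** Inserts a word and updates maxString if the new word is strictly longer. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.fullWord = true;
        if (word.length() > maxString.length()) {
            maxString = word;
        }
    }

    /** Returns the longest inserted word in O(1); empty string if the trie is empty. */
    String longest() {
        return maxString;
    }

    public static void main(String[] args) {
        LongestTrackingTrie trie = new LongestTrackingTrie();
        trie.insert("trouble");
        trie.insert("troubling");
        trie.insert("troubles");
        System.out.println(trie.longest()); // prints troubling
    }
}
```

This only works as long as words are never removed from the trie; supporting deletion would require recomputing the maximum or falling back to per-node bookkeeping.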

【讨论】：
