Finding shortest repeating cycle in word?

Posted 2023-02-19

【Title】: Finding shortest repeating cycle in word? 【Posted】: 2011-08-26 15:35:05 【Question】:

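I'm about to write a function that returns the shortest group of letters which, when repeated, creates the given word.

For example, the word abkebabkebabkeb is created by repeating the word abkeb. I would like to know how to analyze the input word efficiently to get the shortest period of characters that creates it.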

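【Question comments】:

@Tony The Tiger, the result (the shortest period) doesn't have to be a real word.

【Answer 1】:

Here is a correct O(n) algorithm. The first for loop is the table-building part of KMP. There are various proofs that it always runs in linear time.

Since this question already had 4 answers, none of which was both O(n) and correct, I tested this solution heavily for both correctness and running time.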

def pattern(inputv):
    if not inputv:
        return inputv

    nxt = [0]*len(inputv)
    for i in range(1, len(nxt)):
        k = nxt[i - 1]
        while True:
            if inputv[i] == inputv[k]:
                nxt[i] = k + 1
                break
            elif k == 0:
                nxt[i] = 0
                break
            else:
                k = nxt[k - 1]

    smallPieceLen = len(inputv) - nxt[-1]
    if len(inputv) % smallPieceLen != 0:
        return inputv

    return inputv[0:smallPieceLen]
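To see why `len(inputv) - nxt[-1]` gives the period, it can help to print the table for the example word. The sketch below is an illustration only: `prefix_table` is a hypothetical helper name, not part of the answer, and it builds the same table in a slightly more compact form. The last table entry is the length b of the longest proper prefix of the word that is also a suffix. A word of length n matches itself when shifted by n - b, so n - b is the smallest shift that works.

```python
def prefix_table(word):
    """KMP failure table: table[i] is the length of the longest proper
    prefix of word[:i + 1] that is also a suffix of word[:i + 1]."""
    table = [0] * len(word)
    for i in range(1, len(word)):
        k = table[i - 1]
        # Fall back to shorter borders until one can be extended.
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table


word = "abkebabkebabkeb"
table = prefix_table(word)
period = len(word) - table[-1]

print(table)                  # the last entry is the longest border, 10
print(period, word[:period])  # 5 abkeb
```

For "HELLOHELL" the same computation gives a border of 4 and a shift of 5, which does not divide 9, so the divisibility check in the answer returns the whole string.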

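【Comments】:

So is this a solution you came up with, or is it a known algorithm?

Well, KMP is a known algorithm. This question was very similar to a homework problem I had, and this is the answer I came up with for the homework. The instructor's solution was a bit different, but also used KMP.

Hi Buge, I love your solution and upvoted it. But I am confused by the line smallPieceLen = len(inputv) - nxt[-1]. nxt[-1] means: if the whole string does not match, which index we will compare next. smallPieceLen is the difference between the total length of the string and nxt[-1], so how can it represent the shortest repeating string?

@LinMa: (Buge hasn't been active lately.) "nxt[-1] means if the whole string does not match, what index we will be used to compare next": no (contorted grammar, by the way). It is the index to compare next when all of the pattern matched and you want to find its next occurrence in a longer text. nxt[i] = p means pattern[i+1-p:i+1] equals pattern[0:p] (and this does not hold for p+1). nxt[-1] is the index to compare next if the "first" mismatch was "at len+1". (In many presentations and implementations of KMP there is a special value of -1 at index 0, with the values above shifted to an index higher by one.)

@LinMa: (both of you get notified anyway) Let me call len(inputv) len and nxt[-1] matchLen. If matchLen < smallPieceLen, the only chance for smallPieceLen to divide len is to be equal to it. If smallPieceLen ≤ matchLen, inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen], and k never got reset (again): inputv is made up of repetitions of inputv[0:smallPieceLen]. The divisibility check just ensures that it ends with a full repetition.

【Answer 2】: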

Here is an example in PHP:

<?php
function getrepeatedstring($string) {
    if (strlen($string)<2) return $string;
    for($i = 1; $i<strlen($string); $i++) {
        // intdiv keeps the repeat count an integer; the extra copy covers a trailing partial repetition
        if (substr(str_repeat(substr($string, 0, $i),intdiv(strlen($string),$i)+1), 0, strlen($string))==$string)
            return substr($string, 0, $i);
    }
    return $string;
}
?>
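For readers who don't use PHP, the same brute-force idea can be sketched in Python. This is a rough port under the assumption that it should behave like the PHP function, including accepting a trailing partial repetition; the name `get_repeated_string` is made up for this sketch. Each candidate length costs a string build and a comparison of up to n characters, so the worst case is quadratic.

```python
def get_repeated_string(s):
    n = len(s)
    for i in range(1, n):
        candidate = s[:i]
        # Repeat the candidate past the end, then cut back to n characters,
        # so a trailing partial repetition is accepted as well.
        if (candidate * (n // i + 1))[:n] == s:
            return candidate
    return s


print(get_repeated_string("abkebabkebabkeb"))  # abkeb
print(get_repeated_string("abcab"))            # abc (partial last repetition)
```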

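【Comments】:

This returns 'abkeb', which should be correct, but I'm not sure in what way the OP is asking for 'kebab' rather than 'abkeb'.

This is what I'm looking for. But it runs in O(n). Any ideas on whether this can be sped up?

@jack44: You can't know whether you have the shortest cycle until you have examined the entire string, unless you have other knowledge, such as the largest possible cycle length. It may be that the last character in the string throws the whole cycle off; you don't know.

I don't know PHP, but this looks like it is O(n^2).

@Richard86 - String comparison is going to be O(n), though, isn't it?

【Answer 3】:

An O(n) solution. It assumes that the entire string must be covered. The key observation is that we generate a pattern and test it, but if we find something that doesn't match along the way, the pattern must include the entire string we have already tested, so we don't have to re-examine those characters.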

def pattern(inputv):
    pattern_end = 0
    for j in range(pattern_end + 1, len(inputv)):
        pattern_dex = j % (pattern_end + 1)
        if inputv[pattern_dex] != inputv[j]:
            # Mismatch: the candidate pattern grows to include index j.
            pattern_end = j
            continue

        if j == len(inputv) - 1:
            print(pattern_end)
            return inputv[0:pattern_end + 1]
    return inputv
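The function above can be checked against a few inputs with a small harness. This is only a sketch: `pattern_under_test` repeats the same logic without the debug print, and `is_whole_repetition` is a hypothetical helper that reports whether the returned piece really rebuilds the word by whole repetitions.

```python
def pattern_under_test(inputv):
    # Same logic as the function above, without the debug print.
    pattern_end = 0
    for j in range(1, len(inputv)):
        if inputv[j % (pattern_end + 1)] != inputv[j]:
            pattern_end = j
            continue
        if j == len(inputv) - 1:
            return inputv[:pattern_end + 1]
    return inputv


def is_whole_repetition(piece, word):
    """True if word is piece repeated a whole number of times."""
    if not piece or len(word) % len(piece) != 0:
        return False
    return piece * (len(word) // len(piece)) == word


for word in ["abkebabkebabkeb", "HELLOHELL", "BCBDBCBCBDBC"]:
    result = pattern_under_test(word)
    print(word, "->", result, is_whole_repetition(result, word))
```

On these inputs the first word yields "abkeb", while the other two yield pieces that do not rebuild the word by whole repetitions.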

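【Comments】:

Is for pattern_end in range(len(inputv)/2) necessary? I don't think it is.

@Ishtar - Sorry, I'm not following. Do you mean the look of the len()/2 part?

I mean, replace that line with pattern_end = 0.

I'm afraid the algorithm is incorrect. Please consider the input "BCBDBCBCBDBC". The smallest repeating pattern is "BCBDBC", but the algorithm above will miss it. Also, I think it doesn't handle the case "HELLOHELL" correctly (it returns "HELLO" instead of the complete string).

@Boris: The problem is to find the smallest sub-sequence of S such that K>=1 repetitions of it result in S itself. The input "HELLOHELL" has no repeating subsequence with K>1, so "HELLOHELL" should be returned.

【Answer 4】:

The simplest one in Python: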

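def pattern(s):
    # Look for s inside s+s, skipping the trivial matches at offset 0 and offset len(s).
    ans = (s + s).find(s, 1, -1)
    # Returns the length of the shortest repeating unit.
    return len(s) if ans == -1 else ans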
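A short note on why the doubling trick works, with a sketch that returns the unit itself rather than its length. The string s occurs in s + s at an offset p with 0 < p < len(s) exactly when rotating s left by p characters gives s back, and the smallest such p is the length of the shortest block whose whole-number repetition equals s. The name `shortest_cycle` is made up for this illustration.

```python
def shortest_cycle(s):
    # Search s + s for s, excluding offset 0 (start=1) and offset len(s)
    # (end=-1 cuts the last character, so the second full copy cannot match).
    offset = (s + s).find(s, 1, -1)
    return s if offset == -1 else s[:offset]


print(shortest_cycle("abkebabkebabkeb"))  # abkeb
print(shortest_cycle("HELLOHELL"))        # HELLOHELL
```

Unlike some of the other answers, this variant only accepts whole repetitions, so a word ending in a partial copy is returned unchanged.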

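【Comments】:

It would be helpful if you explained what you did.

【Answer 5】:

I believe there is a very elegant recursive solution. Many of the proposed solutions deal with the extra complexity of a string that ends in a partial pattern, such as abcabca. But I don't think that is what was asked for.

My solution for the simple version of the problem, in Clojure: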

 (defn find-shortest-repeating [pattern string]
  (if (empty? (str/replace string pattern ""))
   pattern
   (find-shortest-repeating (str pattern (nth string (count pattern))) string)))

(find-shortest-repeating "" "abcabcabc") ;; "abc"

But be aware that this will not find patterns that are incomplete at the end.
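For readers who don't use Clojure, the same idea can be sketched in Python: grow a candidate prefix one character at a time and stop as soon as deleting every occurrence of it leaves nothing. This is an iterative rendering rather than a recursive one, and `find_shortest_repeating` here is an illustrative port, not the author's code.

```python
def find_shortest_repeating(string):
    for end in range(1, len(string) + 1):
        candidate = string[:end]
        # If removing all copies of the candidate leaves nothing,
        # the string consists of whole repetitions of it.
        if string.replace(candidate, "") == "":
            return candidate
    return string  # only reached for the empty string


print(find_shortest_repeating("abcabcabc"))  # abc
print(find_shortest_repeating("abcabca"))    # abcabca (partial ending is not detected)
```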

【Comments】:

【Answer 6】:

Based on your post, I found a solution that can handle an incomplete pattern:

(defn find-shortest-repeating [pattern string]
   (if (or (empty? (clojure.string/split string (re-pattern pattern)))
          (empty? (second (clojure.string/split string (re-pattern pattern)))))
    pattern
    (find-shortest-repeating (str pattern (nth string (count pattern))) string)))

【Comments】:

@ward

(defn find-pattern-string [string]
  (let [pattern ""
        working-str string]
    (reduce
      #(if (not (or (empty? (clojure.string/split string (re-pattern %1)))
                    (empty? (second (clojure.string/split string (re-pattern %1))))))
         (str %1 %2)
         %1)
      pattern working-str)))

【Answer 7】:

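My solution: the idea is to find a substring starting at position zero that is equal to the adjacent substring of the same length, and to return that substring once it is found. Note that if no repeating substring is found, I print the entire input string.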

public static void repeatingSubstring(String input) {
    for(int i=0;i<input.length();i++) {
        if(i==input.length()-1) {
            System.out.println("There is no repetition "+input);
        }
        else if(input.length()%(i+1)==0) {
            int size = i+1;
            if(input.substring(0, i+1).equals(input.substring(i+1, i+1+size))) {
                System.out.println("The subString which repeats itself is "+input.substring(0, i+1));
                break;
            }
        }
    }
}

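【Comments】:

I think this fails for the string "ababcababc".

【Answer 8】:

This is the solution I came up with using a queue; it passed all the test cases of a similar problem on Codeforces. The problem number is 745A.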

#include<bits/stdc++.h>
using namespace std;
typedef long long ll;

int main()
{
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    string s, s1, s2; cin >> s; queue<char> qu; qu.push(s[0]); bool flag = true; int ind = -1;
    s1 = s.substr(0, s.size() / 2);
    s2 = s.substr(s.size() / 2);
    if(s1 == s2)
    {
        for(int i=0; i<s1.size(); i++)
        {
            s += s1[i];
        }
    }
    //cout << s1 << " " << s2 << " " << s << "\n";
    for(int i=1; i<s.size(); i++)
    {
        if(qu.front() == s[i]) qu.pop();
        qu.push(s[i]);
    }
    int cycle = qu.size();

    /*queue<char> qu2 = qu; string str = "";
    while(!qu2.empty())
    {
        cout << qu2.front() << " ";
        str += qu2.front();
        qu2.pop();
    }*/


    while(!qu.empty())
    {
        if(s[++ind] != qu.front()) {flag = false; break;}
        qu.pop();
    }
    flag == true ? cout << cycle : cout << s.size();
    return 0;
}

【Comments】:

【Answer 9】:

A simpler answer that I could come up with in an interview is just an O(n^2) solution, which tries all substrings starting from index 0.

// Returns a string, so the return type is string rather than int.
string findSmallestUnit(string str){
    for(int i=1;i<str.length();i++){
        int j=0;
        for(;j<str.length();j++){
            if(str[j%i] != str[j]){
                break;
            }
        }
        if(j==str.length()) return str.substr(0,i);
    }
    return str;
}

Now, if someone is interested in an O(n) solution to this problem in C++:

string findSmallestUnit(string str){
    if(str.empty()) return str;   // guard: lps[n-1] below needs n >= 1
    vector<int> lps(str.length(),0);
    int i=1;
    int len=0;

    // Build the KMP longest-prefix-suffix table.
    while(i<str.length()){
        if(str[i] == str[len]){
            len++;
            lps[i] = len;
            i++;
        }
        else{
            if(len == 0) i++;
            else{
                len = lps[len-1];
            }
        }
    }
    int n=str.length();
    int x = lps[n-1];
    if(n%(n-x) == 0){
        return str.substr(0,n-x);
    }
    return str;
}

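The above is just @Buge's answer in C++, since someone asked for it in the comments.

【Comments】:

【Answer 10】:

Regex solution:

Use the following regex replacement to find the shortest repeating substring and keep only that substring: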

^(.+?)\1*$
$1

Explanation:

^(.+?)\1*$
^        $   # Start and end, to match the entire input-string
 (   )       # Capture group 1:
  .+         #  One or more characters,
    ?        #  with a reluctant instead of greedy match†
      \1*    # Followed by the first capture group repeated zero or more times

$1           # Replace the entire input-string with the first capture group match,
             # removing all other duplicated substrings

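† Greedy vs reluctant in this case means: greedy = consume as many characters as possible; reluctant = consume as few characters as possible. Since we want the shortest repeating substring, we want a reluctant match in our regex.

Example input: "abkebabkebabkeb" Example output: "abkeb"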
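As an illustration, the same pattern can be tried with Python's `re` module. This is a sketch rather than the implementation linked from the answer: `shortest_repeating_substring` is a made-up name, `re.fullmatch` stands in for the `^` and `$` anchors, and `re.DOTALL` is an added assumption so that newlines in the input are treated like any other character.

```python
import re


def shortest_repeating_substring(text):
    # (.+?) grows one character at a time; \1* must then cover the rest.
    match = re.fullmatch(r"(.+?)\1*", text, flags=re.DOTALL)
    # An empty string cannot match .+, so it is returned unchanged.
    return match.group(1) if match else text


print(shortest_repeating_substring("abkebabkebabkeb"))  # abkeb
```

Because the engine retries the back-reference for every candidate length, this approach is not linear in general, but it is compact and fine for short inputs.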

Try it online in Retina.

Here is an example implementation in Java.

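【Comments】:

【Answer 11】:

A super late answer, but I got this question in an interview, and here is my answer (probably not the most optimal, but it also works for strange test cases).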

private void run(String[] args) throws IOException {
    File file = new File(args[0]);
    BufferedReader buffer = new BufferedReader(new FileReader(file));
    String line;
    while ((line = buffer.readLine()) != null) {
        ArrayList<String> subs = new ArrayList<>();
        String t = line.trim();
        String out = null;
        // Collect every prefix of t that is also a suffix, longest first.
        for (int i = 0; i < t.length(); i++) {
            if (t.substring(0, t.length() - (i + 1)).equals(t.substring(i + 1, t.length()))) {
                subs.add(t.substring(0, t.length() - (i + 1)));
            }
        }
        subs.add(0, t);
        for (int j = subs.size() - 2; j >= 0; j--) {
            String match = subs.get(j);
            int mLength = match.length();
            if (j != 0 && mLength <= t.length() / 2) {
                if (t.substring(mLength, mLength * 2).equals(match)) {
                    out = match;
                    break;
                }
            } else {
                out = match;
            }
        }
        System.out.println(out);
    }
}

Test cases:

abcabcabcabc
bcbcbcbcbcbcbcbcbcbcbcbcbcbc
dddddddddddddddddddddd
adcdefg
bcbdbcbcbdbc
hellohell

The code returns:

abc
bc
d
adcdefg
bcbdbc
hellohell

【Comments】:

Looking at just the first for loop, this is O(n^2), because each .equals() can take n time.

【Answer 12】:

Works for cases such as bcbdbcbcbdbc.

function smallestRepeatingString(sequence){
  var currentRepeat = '';
  var currentRepeatPos = 0;

  for(var i=0, ii=sequence.length; i<ii; i++){
    if(currentRepeat[currentRepeatPos] !== sequence[i]){
      currentRepeatPos = 0;
      // Add next character available to the repeat and reset i so we don't miss any matches inbetween
      currentRepeat = currentRepeat + sequence.slice(currentRepeat.length, currentRepeat.length+1);
      i = currentRepeat.length-1;
    }else{
      currentRepeatPos++;
    }
    if(currentRepeatPos === currentRepeat.length){
      currentRepeatPos = 0;
    }
  }

  // If repeat wasn't reset then we didn't find a full repeat at the end.
  if(currentRepeatPos !== 0) return sequence;

  return currentRepeat;
}
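To get a feel for how often the loop above revisits characters, here is a small Python port that also counts loop iterations. It is a sketch: the function name and the returned tuple are made up for this illustration, and slicing (`current_repeat[pos:pos + 1]`) stands in for JavaScript's out-of-range indexing, which yields `undefined`.

```python
def smallest_repeating_string_counted(sequence):
    current_repeat = ""
    pos = 0
    iterations = 0
    i = 0
    while i < len(sequence):
        iterations += 1
        if current_repeat[pos:pos + 1] != sequence[i]:
            pos = 0
            # Extend the candidate by one character and rewind i.
            current_repeat += sequence[len(current_repeat):len(current_repeat) + 1]
            i = len(current_repeat) - 1
        else:
            pos += 1
        if pos == len(current_repeat):
            pos = 0
        i += 1
    result = sequence if pos != 0 else current_repeat
    return result, iterations


print(smallest_repeating_string_counted("bcbdbcbcbdbc"))  # ('bcbdbc', 16)
print(smallest_repeating_string_counted("a" * 9 + "b"))   # 46 iterations
print(smallest_repeating_string_counted("a" * 19 + "b"))  # 191 iterations
```

Doubling the input length from 10 to 20 roughly quadruples the iteration count, which is the quadratic growth caused by rewinding `i`.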

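【Comments】:

This is actually O(n^2). That is because you reset i to a smaller value with i = currentRepeat.length-1;. So a 10-character string like 'aaaaaaaaab' takes 46 iterations, and a 20-character string takes 191 iterations.

【Answer 13】:

I came up with a simple solution that works flawlessly even with very large strings. PHP implementation: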

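function get_srs($s){
    $hash = md5( $s );
    $i = 0; $p = '';

    do {
        $p .= $s[$i++];
        // preg_quote keeps regex metacharacters in the input from being interpreted
        preg_match_all( '/' . preg_quote( $p, '/' ) . '/', $s, $m );
    } while ( ! hash_equals( $hash, md5( implode( '', $m[0] ) ) ) );

    return $p;
}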

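【Comments】:

It would be good if you could elaborate on why this works. Providing more detail helps the whole community and helps to get more upvotes.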
