Most efficient way to search for unknown patterns in a string?
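【Title】: Most efficient way to search for unknown patterns in a string? 【Posted】: 2017-10-22 15:59:04 【Question】: I am trying to find patterns that:
- occur more than once
- are more than 1 character long
- are not substrings of any other known pattern
without knowing any of the patterns that might occur.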
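For example:
The string "the boy fell by the bell" would return 'ell', 'the b', 'y '.
The string "the boy fell by the bell, the boy fell by the bell" would return 'the boy fell by the bell'.
Using double for-loops, this can be brute-forced, though very inefficiently: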
    // Requires: import java.util.ArrayList;
    // "string" is the input text to search.
    ArrayList<String> patternsList = new ArrayList<>();
    int length = string.length();
    for (int i = 0; i < length; i++) {
        // A candidate starting at i can be at most half of the remaining text
        int limit = (length - i) / 2;
        for (int j = limit; j >= 1; j--) {
            int candidateEndIndex = i + j;
            String candidate = string.substring(i, candidateEndIndex);
            if (candidate.length() <= 1) {
                continue;
            }
            // The candidate repeats if it occurs again after its own end
            if (string.substring(candidateEndIndex).contains(candidate)) {
                boolean notASubpattern = true;
                for (String pattern : patternsList) {
                    if (pattern.contains(candidate)) {
                        notASubpattern = false;
                        break;
                    }
                }
                if (notASubpattern) {
                    patternsList.add(candidate);
                }
            }
        }
    }
However, this is very slow when searching large strings that contain lots of patterns.
【Comments on the question】:
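In a sense, this is a form of compression. You might do some research into the various compression algorithms.
Why is a single space not an element in your first result example?
@Björn Because it is only one character long.
Of course /me wipes glasses
Why is ", " (a comma followed by a space) not part of your second result example?
【Answer 1】: You can build a suffix tree for your string in linear time: https://en.wikipedia.org/wiki/Suffix_tree
The patterns you are looking for are the strings corresponding to internal nodes that have only leaf children.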
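To make that criterion concrete, here is a minimal Java sketch (Java being the language of the question). It is an illustration only, not the linear-time construction the answer refers to: it builds an uncompressed suffix trie in O(n²) time and space, where a trie node with two or more children plays the role of an internal suffix-tree node. The names `SuffixTrieRepeats` and `find`, the `'\0'` terminator (assumed not to occur in the input) and the minimum length of 2 (taken from the question) are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SuffixTrieRepeats {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
    }

    /** Repeats whose suffix-tree node would have only leaf children. */
    static List<String> find(String s) {
        String text = s + '\0'; // assumed-unique terminator so every suffix ends at a leaf
        Node root = new Node();
        // Insert every suffix character by character (quadratic, illustration only)
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), c -> new Node());
            }
        }
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    /** Returns true if the subtree rooted at node contains a branching node. */
    private static boolean collect(Node node, StringBuilder path, List<String> out) {
        boolean branchBelow = false;
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            branchBelow |= collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
        boolean branching = node.children.size() >= 2;
        // Branching with no branching descendant: all children would be leaves
        if (branching && !branchBelow && path.length() >= 2) {
            out.add(path.toString());
        }
        return branching || branchBelow;
    }
}
```

For "abcXabc" this reports "abc" and also "bc": the criterion only looks at what follows a repeat, so a repeat that is a suffix of a longer repeat still qualifies and would have to be filtered out separately.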
【Discussion】:
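【Answer 2】: You can use n-grams to find patterns in a string. Scanning the string for n-grams takes O(n) time. When you find a substring using an n-gram, put it into a hash table together with a count of how many times that substring was found in the string. When you have finished scanning the string for n-grams, search the hash table for counts greater than 1 to find the recurring patterns.
For example, in the string "the boy fell by the bell, the boy fell by the bell", using a 6-gram will find the substring "the boy fell by the bell". The hash table entry for that substring will have a count of 2 because it occurs twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.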
    // Requires: using System.Collections.Generic; using System.Text;
    // "str" is the input string.
    Dictionary<string, int> dict = new Dictionary<string, int>();
    int count = 0;
    int ngramcount = 6;

    // Add entries to the hash table
    while (count < str.Length) {
        // Copy the next ngramcount words into the substring
        StringBuilder sb = new StringBuilder();
        while (ngramcount > 0 && count < str.Length) {
            sb.Append(str[count]);
            if (str[count] == ' ')
                ngramcount--;
            count++;
        }
        ngramcount = 6;
        string substring = sb.ToString().Trim(); // get rid of the last blank in the substring

        // Update the dictionary (hash table) with the substring
        if (dict.ContainsKey(substring)) {
            // substring is already in the hash table, so increment the count
            dict[substring] = dict[substring] + 1;
        } else {
            dict[substring] = 1;
        }
    }

    // Find the most commonly occurring pattern in the string
    // by searching the hash table for the greatest count.
    int maxCount = 0;
    string mostCommonPattern = "";
    foreach (KeyValuePair<string, int> pair in dict) {
        if (pair.Value > maxCount) {
            maxCount = pair.Value;
            mostCommonPattern = pair.Key;
        }
    }
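The counting idea can also be sketched in Java, the language of the question. This is a hedged variant rather than a translation of the code above: it slides the window one word at a time instead of cutting the text into consecutive chunks, and it splits on non-word characters so that punctuation such as the comma in the example does not end up inside an n-gram. Both choices, and the names `WordNGrams` and `repeated`, are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordNGrams {
    /** Counts every window of n consecutive words and keeps those seen more than once. */
    static Map<String, Integer> repeated(String text, int n) {
        // Split on runs of non-word characters; drop the empty token a leading separator produces
        List<String> words = new ArrayList<>();
        for (String w : text.split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        // Hash table from n-gram to occurrence count
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.size(); i++) {
            String gram = String.join(" ", words.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        // Only counts greater than 1 indicate a recurring pattern
        counts.values().removeIf(c -> c < 2);
        return counts;
    }
}
```

Calling `WordNGrams.repeated(text, 6)` on the example string leaves a single entry, "the boy fell by the bell" with a count of 2. As the answer notes, other values of n have to be tried to discover patterns of other lengths.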
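【Discussion】:
【Answer 3】: I wrote this just for fun. I hope I have understood the problem correctly and that this is valid and fast enough; if not, please go easy on me :) If anyone finds it useful, I think I could optimize it a bit more.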
    // Requires: using System; using System.Collections; using System.Collections.Generic; using System.Linq;
    private static IEnumerable<string> getPatterns(string txt) {
        char[] arr = txt.ToArray();
        BitArray ba = new BitArray(arr.Length);
        for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--) {
            char[] arr1 = new char[shingle];
            int[] indexes = new int[shingle];
            HashSet<int> hs = new HashSet<int>();
            Dictionary<int, int[]> dic = new Dictionary<int, int[]>();
            for (int i = 0, count = arr.Length - shingle; i <= count; i++) {
                for (int j = 0; j < shingle; j++) {
                    int index = i + j;
                    arr1[j] = arr[index];
                    indexes[j] = index;
                }
                int h = getHashCode(arr1);
                if (hs.Add(h)) {
                    int[] indexes1 = new int[indexes.Length];
                    Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int));
                    dic.Add(h, indexes1);
                } else {
                    bool exists = false;
                    foreach (int index in indexes) {
                        if (ba.Get(index)) {
                            exists = true;
                            break;
                        }
                    }
                    if (!exists) {
                        int[] indexes1 = dic[h];
                        if (indexes1 != null) {
                            foreach (int index in indexes1) {
                                if (ba.Get(index)) {
                                    exists = true;
                                    break;
                                }
                            }
                        }
                    }
                    if (!exists) {
                        foreach (int index in indexes)
                            ba.Set(index, true);
                        int[] indexes1 = dic[h];
                        if (indexes1 != null) {
                            foreach (int index in indexes1)
                                ba.Set(index, true);
                        }
                        dic[h] = null;
                        yield return new string(arr1);
                    }
                }
            }
        }
    }

    private static int getMaxShingleSize(char[] arr) {
        for (int shingle = 2; shingle <= arr.Length / 2 + 1; shingle++) {
            char[] arr1 = new char[shingle];
            HashSet<int> hs = new HashSet<int>();
            bool noPattern = true;
            for (int i = 0, count = arr.Length - shingle; i <= count; i++) {
                for (int j = 0; j < shingle; j++)
                    arr1[j] = arr[i + j];
                int h = getHashCode(arr1);
                if (!hs.Add(h)) {
                    noPattern = false;
                    break;
                }
            }
            if (noPattern)
                return shingle - 1;
        }
        return -1;
    }

    private static int getHashCode(char[] arr) {
        unchecked {
            int hash = (int)2166136261;
            foreach (char c in arr)
                hash = (hash * 16777619) ^ c.GetHashCode();
            return hash;
        }
    }
EDIT: My previous code has serious problems. This one is better:
    private static IEnumerable<string> getPatterns(string txt) {
        Dictionary<int, int> dicIndexSize = new Dictionary<int, int>();
        for (int shingle = 2, count0 = txt.Length / 2 + 1; shingle <= count0; shingle++) {
            Dictionary<string, int> dic = new Dictionary<string, int>();
            bool patternExists = false;
            for (int i = 0, count = txt.Length - shingle; i <= count; i++) {
                string sub = txt.Substring(i, shingle);
                if (!dic.ContainsKey(sub)) {
                    dic.Add(sub, i);
                } else {
                    patternExists = true;
                    int index0 = dic[sub];
                    if (index0 >= 0) {
                        dicIndexSize[index0] = shingle;
                        dic[sub] = -1;
                    }
                }
            }
            if (!patternExists)
                break;
        }
        List<int> lst = dicIndexSize.Keys.ToList();
        lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a]));
        BitArray ba = new BitArray(txt.Length);
        foreach (int i in lst) {
            bool ok = true;
            int len = dicIndexSize[i];
            for (int j = i, max = i + len; j < max; j++) {
                if (ok) ok = !ba.Get(j);
                ba.Set(j, true);
            }
            if (ok)
                yield return txt.Substring(i, len);
        }
    }
The text of this book took 3.4 seconds on my computer.
【Discussion】:
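Hi @AlexQuilliam. I wonder whether you have found a good solution. If so, it would be great if you could add the code. I am curious how my code compares with the best solution in terms of performance and validity.
【Answer 4】: Suffix arrays are the right idea, but an important piece is missing, namely identifying what the literature calls "supermaximal repeats". Here is a GitHub repository (https://github.com/eisenstatdavid/commonsub) with working code. Suffix array construction uses the SAIS library, vendored in as a submodule. The supermaximal repeats are found using a corrected version of the pseudocode for findsmaxr in Efficient repeat finding via suffix arrays (Becher–Deymonnaz–Heiber).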
    // Len, Buf, SufArr, LongCommPre and Set are defined outside this excerpt.
    static void FindRepeatedStrings(void) {
      // findsmaxr from https://arxiv.org/pdf/1304.0528.pdf
      printf("[");
      bool needComma = false;
      int up = -1;
      for (int i = 1; i < Len; i++) {
        if (LongCommPre[i - 1] < LongCommPre[i]) {
          up = i;
          continue;
        }
        if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue;
        // Reject the repeat if two occurrences share the same preceding character
        for (int k = up - 1; k < i; k++) {
          if (SufArr[k] == 0) continue;
          unsigned char c = Buf[SufArr[k] - 1];
          if (Set[c] == i) goto skip;
          Set[c] = i;
        }
        if (needComma) {
          printf("\n,");
        }
        // Print the repeat as an escaped, quoted string
        printf("\"");
        for (int j = 0; j < LongCommPre[up]; j++) {
          unsigned char c = Buf[SufArr[up] + j];
          if (iscntrl(c)) {
            printf("\\u%.4x", c);
          } else if (c == '\"' || c == '\\') {
            printf("\\%c", c);
          } else {
            printf("%c", c);
          }
        }
        printf("\"");
        needComma = true;
      skip:
        up = -1;
      }
      printf("\n]\n");
    }
Here is sample output on the text of the first paragraph:
Davids-MBP:commonsub eisen$ ./repsub input
["\u000a"
," S"
," as "
," co"
," ide"
," in "
," li"
," n"
," p"
," the "
," us"
," ve"
," w"
,"\""
,"–"
,"("
,")"
,". "
,"0"
,"He"
,"Suffix array"
,"`"
,"a su"
,"at "
,"code"
,"com"
,"ct"
,"do"
,"e f"
,"ec"
,"ed "
,"ei"
,"ent"
,"ere's a "
,"find"
,"her"
,"https://"
,"ib"
,"ie"
,"ing "
,"ion "
,"is"
,"ith"
,"iv"
,"k"
,"mon"
,"na"
,"no"
,"nst"
,"ons"
,"or"
,"pdf"
,"ri"
,"s are "
,"se"
,"sing"
,"sub"
,"supermaximal repeats"
,"te"
,"ti"
,"tr"
,"ub "
,"uffix arrays"
,"via"
,"y, "
]
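For readers who want to experiment without the SAIS dependency, the same scan can be sketched in Java, the language of the question. This is an illustration under stated assumptions rather than a port of the repository: the suffix array is built by plain sorting (far slower than SAIS), the LCP array is computed by direct comparison, a trailing 0 is appended so that a peak at the very end is closed, and results shorter than 2 characters are dropped to match the question. `SupermaximalRepeats` and `find` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SupermaximalRepeats {
    static List<String> find(String s) {
        int n = s.length();
        // Naive suffix array: sort start positions by comparing the suffixes themselves
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));

        // lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]; lcp[0] = lcp[n] = 0
        int[] lcp = new int[n + 1];
        for (int i = 1; i < n; i++) {
            int a = sa[i - 1], b = sa[i], len = 0;
            while (a + len < n && b + len < n && s.charAt(a + len) == s.charAt(b + len)) len++;
            lcp[i] = len;
        }

        List<String> out = new ArrayList<>();
        int up = -1; // start of the current local maximum in lcp, or -1
        for (int i = 1; i <= n; i++) {
            if (lcp[i - 1] < lcp[i]) { up = i; continue; }
            if (lcp[i - 1] == lcp[i] || up < 0) continue;
            // Suffixes sa[up-1 .. i-1] share a prefix of length lcp[up] that cannot be
            // extended to the right; accept it only if all preceding characters differ
            Set<Character> seen = new HashSet<>();
            boolean leftDistinct = true;
            for (int k = up - 1; k < i && leftDistinct; k++) {
                if (sa[k] == 0) continue; // no preceding character at the start of the text
                leftDistinct = seen.add(s.charAt(sa[k] - 1));
            }
            if (leftDistinct && lcp[up] >= 2) {
                out.add(s.substring(sa[up], sa[up] + lcp[up]));
            }
            up = -1;
        }
        return out;
    }
}
```

On the question's first example, "the boy fell by the bell", this returns "ell", "the b" and "y ".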
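【Discussion】:
【Answer 5】: I would use the Knuth–Morris–Pratt algorithm (linear time complexity, O(n)) to look for substrings. I would try to find the largest substring pattern, remove it from the input string, then try to find the second largest, and so on. I would do something like this: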
    // Requires: import java.util.ArrayList; import java.util.List;
    String pattern = input.substring(0, input.length() / 2);
    String toMatchString = input.substring(pattern.length());
    List<String> matches = new ArrayList<>();
    while (pattern.length() > 0) {
        int index = KMP(pattern, toMatchString);
        if (index >= 0) { // assumes KMP returns -1 when there is no match
            matches.add(pattern);
            // remove the matched pattern occurrences from the input string
            // I would do something like this:
            // 0 to pattern.length() gets removed
            // check for all occurrences of pattern in toMatchString and remove them
            // get the remaining shrunken input, reassign values for pattern & toMatchString
            // keep looking for the next largest substring
        } else {
            pattern = input.substring(0, pattern.length() - 1);
            toMatchString = input.substring(pattern.length());
        }
    }
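KMP implements the Knuth–Morris–Pratt algorithm. You can find Java implementations of it on Github or Princeton, or write it yourself.
PS: I don't code in Java, and my first bounty is about to close, so this was a quick attempt. Please don't give me the stick if I have missed something trivial or made a +/-1 error.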
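The `KMP` helper called in the loop is not shown in the answer. A minimal Java sketch of such a helper follows; the class name `Kmp`, the method name `indexOf`, the argument order (pattern first, as in the call above) and the convention of returning -1 when there is no match are assumptions made here, not part of the answer.

```java
final class Kmp {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String pattern, String text) {
        if (pattern.isEmpty()) {
            return 0;
        }
        // fail[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix
        int[] fail = new int[pattern.length()];
        for (int i = 1, k = 0; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        // Scan the text once; on a mismatch fall back through fail[] instead of re-reading text
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1;
            }
        }
        return -1;
    }
}
```

Because a match at index 0 is valid, a caller should test the result with `>= 0` rather than `> 0`.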
【Discussion】: