是否有一个Boyer-Moore字符串搜索和快速搜索和替换功能以及Delphi 2010 String(UnicodeString)的快速字符串计数?
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了是否有一个Boyer-Moore字符串搜索和快速搜索和替换功能以及Delphi 2010 String(UnicodeString)的快速字符串计数?相关的知识,希望对你有一定的参考价值。
我需要三个快速大字符串函数:快速搜索,快速搜索和替换,以及字符串中子字符串的快速计数。
我在C ++和Python中遇到过Boyer-Moore字符串搜索,但是用于实现快速搜索和替换的唯一Delphi Boyer-Moore算法是由Peter Morris(前身为DroopyEyes软件)及其网站的FastStrings的一部分。和电子邮件不再有效。
我已经将FastStrings移植到Delphi 2009/2010的AnsiStrings中,其中一个字节等于一个AnsiChar,但是它们也可以在Delphi 2010中使用String(UnicodeString),这看起来并不重要。
使用这个Boyer-Moore算法,应该可以轻松地进行不区分大小写的搜索,以及不区分大小写的搜索和替换,没有任何临时字符串(使用StrUpper等),并且不调用比Boyer更慢的Pos()需要在重复搜索同一文本时进行摩尔搜索。
(编辑:我有一个部分解决方案,写作这个问题的答案,几乎100%完成,它甚至有一个快速的字符串替换功能。我相信它必须有bug,特别是认为,因为它假装是Unicode由于未实现的Unicode承诺,它必须存在故障。)
(编辑2:有趣和意外的结果;堆栈上的unicode代码点表的大堆栈大小 - 下面的代码中的SkipTable严重阻碍了你可以在unicode字符串boyer中完成的双赢优化量 - 摩尔字符串搜索。感谢Florent Ouchet指出我应该立即注意到的内容。)
这个答案现在已经完成并且适用于区分大小写的模式,但不适用于不区分大小写的模式,并且可能还有其他错误,因为它没有经过单元测试,并且可能会进一步优化,例如我重复了本地函数__SameChar而不是使用本来比较快的比较函数回调,实际上,允许用户传递所有这些的比较函数对于想要提供一些额外逻辑的Unicode用户来说是很好的(某些语言的等效Unicode字形集合) )。
根据Dorin Dominica的代码,我构建了以下内容。
{ _FindStringBoyer:
Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if bCaseSensitive then
Result := (sString[StringIndex] = sPattern[PatternIndex])
else
Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
var
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
begin
if fromPos < 1 then
raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
LengthPattern := Length(sPattern);
LengthString := Length(sString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[sPattern[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[sPattern[LengthPattern]];
SkipTable[sPattern[LengthPattern]] := Large;
Index := fromPos + LengthPattern -1;
Result := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[sString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
Result := Index - kIndex + 1;
Index := Index + LengthPattern;
exit;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[sString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
a substring inside the main string, at a much faster rate than we
could have done otherwise. Another thing that would be great is
to have a function that returns an array of find-locations,
which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
foundPos:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := 0;
foundPos := 1;
fromPos := 1;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
//Result := Index - kIndex + 1;
Index := Index + LengthPattern;
fromPos := Index;
Inc(Result);
break;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[aSourceString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
end;
这真是一个很好的算法,因为:
- 以这种方式计算字符串Y中子串X的实例的速度更快,非常好。
- 仅仅替换Pos(),_FindStringBoyer()比FastCode项目人员为Delphi贡献的纯asm版本更快,目前用于Pos,如果你需要不区分大小写,你可以想象性能当我们不必在100兆字节的字符串上调用UpperCase时提升。 (好吧,你的琴弦不会那么大。但是,高效算法仍然是一种美丽的东西。)
好吧,我用Boyer-Moore风格写了一个字符串替换:
function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
errors:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
CaseSensitive:Boolean;
foundAt:Integer;
lastFoundAt:Integer;
copyStartsAt:Integer;
copyLen:Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := '';
lastFoundAt := 0;
fromPos := 1;
errors := 0;
CaseSensitive := rfIgnoreCase in Flags;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
foundAt := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
foundAt := Index - kIndex + 1;
Index := Index + LengthPattern;
//fromPos := Index;
fromPos := (foundAt+LengthPattern);
if lastFoundAt=0 then begin
copyStartsAt := 1;
copyLen := foundAt-copyStartsAt;
end else begin
copyStartsAt := lastFoundAt+LengthPattern;
copyLen := foundAt-copyStartsAt;
end;
if (copyLen<=0)or(copySt以上是关于是否有一个Boyer-Moore字符串搜索和快速搜索和替换功能以及Delphi 2010 String(UnicodeString)的快速字符串计数?的主要内容,如果未能解决你的问题,请参考以下文章
python Boyer-Moore字符串搜索算法:Python实现
哪个是更好的字符串搜索算法? Boyer-Moore 还是 Boyer Moore Horspool? [关闭]