byte[] array pattern search
[Posted]: 2010-09-21 23:00:33
[Question]: Does anyone know a good way to search for/match a byte pattern in a byte[] array and then return the positions?
For example:
byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
[Comments on the question]:
[Answer 1]: You can put the byte array into a String and run the match with IndexOf. Or you can at least reuse existing algorithms for string matching.
[STAThread]
static void Main(string[] args)
{
    byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
    byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
    string needle, haystack;

    unsafe
    {
        fixed (byte* p = pattern)
        {
            needle = new string((SByte*) p, 0, pattern.Length);
        } // fixed

        fixed (byte* p2 = toBeSearched)
        {
            haystack = new string((SByte*) p2, 0, toBeSearched.Length);
        } // fixed

        int i = haystack.IndexOf(needle, 0);
        System.Console.Out.WriteLine(i);
    }
}
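[Comments]:
Your code only finds the first occurrence, but the question implies all matches...
I'm just glad it works. If ASCII covered the whole 8 bits, your code would be cleaner.
No, ASCII does not cover the whole 8 bits; it is 7-bit.
Using UTF-8 is a bad idea: 1. Assert.AreNotEqual(new byte[] { 0xc2, 0x00 }, Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(new byte[] { 0xc2, 0x00 }))); 2. You print the index in the string, not the index in the byte array (multi-byte characters).
[Answer 2]: toBeSearched.Except(pattern) will return the differences, and toBeSearched.Intersect(pattern) will produce the set of intersections. In general, you should look at the extension methods in the Linq extensions.
[Comments]: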
[Answer 3]: My solution:
class Program
{
    public static void Main()
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
        byte[] toBeSearched = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
        List<int> positions = SearchBytePattern(pattern, toBeSearched);
        foreach (var item in positions)
        {
            Console.WriteLine("Pattern matched at pos {0}", item);
        }
    }

    // Requires "using System.Linq;" for SequenceEqual.
    static public List<int> SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        List<int> positions = new List<int>();
        int patternLength = pattern.Length;
        int totalLength = bytes.Length;
        byte firstMatchByte = pattern[0];
        for (int i = 0; i < totalLength; i++)
        {
            if (firstMatchByte == bytes[i] && totalLength - i >= patternLength)
            {
                byte[] match = new byte[patternLength];
                Array.Copy(bytes, i, match, 0, patternLength);
                if (match.SequenceEqual<byte>(pattern))
                {
                    positions.Add(i);
                    i += patternLength - 1;
                }
            }
        }
        return positions;
    }
}
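On the overlap question raised in the comments: because SearchBytePattern jumps ahead by patternLength - 1 after a hit, it reports only non-overlapping occurrences. A minimal sketch of a variant that makes this an explicit choice follows; the class name OverlapDemo, the method FindAll and the allowOverlap parameter are illustrative names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

static class OverlapDemo
{
    // Naive scan; allowOverlap decides how far to advance after a hit.
    public static List<int> FindAll(byte[] haystack, byte[] needle, bool allowOverlap)
    {
        var positions = new List<int>();
        if (haystack == null || needle == null || needle.Length == 0)
            return positions;

        int i = 0;
        while (i <= haystack.Length - needle.Length)
        {
            // Count how many leading bytes of the needle match at position i.
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
                j++;

            if (j == needle.Length)
            {
                positions.Add(i);
                // Step by one to allow overlaps, or past the whole match otherwise.
                i += allowOverlap ? 1 : needle.Length;
            }
            else
            {
                i++;
            }
        }
        return positions;
    }
}
```

Searching for the bytes of "BOB" in "BOBOB" gives position 0 only with allowOverlap set to false, and positions 0 and 2 with it set to true.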
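[Comments]:
Why the Array.Copy? It just gets slower this way. I guess it's only because you want to use SequenceEqual, but that may be a lot of work just because you want to use an extension method. The "i += patternLength - 1;" part is nice!
You shouldn't give everyone -1 just because the solution isn't perfect... In this situation you should vote up only the solution you think is best.
Doesn't this miss overlapping patterns? (e.g. BOB will only be found once in BOBOB)
You could speed it up a bit if you moved the byte[] allocation before the for loop, since the pattern length always stays the same inside the whole loop.
[Answer 4]: Can I suggest something that doesn't involve creating strings, copying arrays or unsafe code: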
using System;
using System.Collections.Generic;

static class ByteArrayRocks
{
    static readonly int[] Empty = new int[0];

    public static int[] Locate (this byte[] self, byte[] candidate)
    {
        if (IsEmptyLocate(self, candidate))
            return Empty;

        var list = new List<int>();

        for (int i = 0; i < self.Length; i++)
        {
            if (!IsMatch(self, i, candidate))
                continue;

            list.Add(i);
        }

        return list.Count == 0 ? Empty : list.ToArray();
    }

    static bool IsMatch (byte[] array, int position, byte[] candidate)
    {
        if (candidate.Length > (array.Length - position))
            return false;

        for (int i = 0; i < candidate.Length; i++)
            if (array[position + i] != candidate[i])
                return false;

        return true;
    }

    static bool IsEmptyLocate (byte[] array, byte[] candidate)
    {
        return array == null
            || candidate == null
            || array.Length == 0
            || candidate.Length == 0
            || candidate.Length > array.Length;
    }

    static void Main()
    {
        var data = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
        var pattern = new byte[] { 12, 3, 5, 76, 8, 0, 6, 125 };

        foreach (var position in data.Locate(pattern))
            Console.WriteLine(position);
    }
}
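Edit (by IAbstract) - moving the contents of a post here since it is not an answer
Out of curiosity, I created a small benchmark with the different answers.
Here are the results for a million iterations: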
solution [Locate]: 00:00:00.7714027
solution [FindAll]: 00:00:03.5404399
solution [SearchBytePattern]: 00:00:01.1105190
solution [MatchBytePattern]: 00:00:03.0658212
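[Comments]:
Your solution is slow on big byte arrays.
Looks good - I changed the Locate method to return IEnumerable<int> ...
[Answer 5]: Here is my (not the most performant) solution. It relies on the fact that bytes/latin-1 conversion is lossless, which is not true for bytes/ASCII or bytes/UTF8 conversions.
Its advantages are that it works for any byte values (some other solutions work incorrectly with bytes 0x80-0xff) and that it can be extended to perform more advanced regex matching.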
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class C
{
    public static void Main()
    {
        byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
        byte[] pattern = {0, 255};
        foreach (int i in FindAll(data, pattern))
        {
            Console.WriteLine(i);
        }
    }

    public static IEnumerable<int> FindAll(
        byte[] haystack,
        byte[] needle
    )
    {
        // bytes <-> latin-1 conversion is lossless
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string sHaystack = latin1.GetString(haystack);
        string sNeedle = latin1.GetString(needle);
        for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
             m.Success; m = m.NextMatch())
        {
            yield return m.Index;
        }
    }
}
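The claim that only the latin-1 mapping is lossless can be checked directly by round-tripping all 256 byte values through an encoding. This is a small sketch under the assumption of the default .NET replacement fallbacks; the names EncodingRoundTrip, IsLossless and AllByteValues are illustrative.

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingRoundTrip
{
    // True when decoding and then re-encoding returns the original bytes.
    public static bool IsLossless(Encoding encoding, byte[] data)
    {
        byte[] roundTripped = encoding.GetBytes(encoding.GetString(data));
        return roundTripped.SequenceEqual(data);
    }

    // All 256 possible byte values, 0x00 through 0xFF.
    public static byte[] AllByteValues()
    {
        return Enumerable.Range(0, 256).Select(v => (byte)v).ToArray();
    }
}
```

With the array from AllByteValues, IsLossless should hold for Encoding.GetEncoding("iso-8859-1") but not for Encoding.ASCII or Encoding.UTF8, since both replace byte sequences they cannot decode.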
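[Comments]:
You shouldn't use strings and regular expressions for things like this; it's just abusing them.
Davy, your remark is highly subjective. Regex is the tool for pattern matching, and it's not my fault that the .NET implementation doesn't accept byte arrays directly. By the way, some regex libraries don't have this limitation.
[Answer 6]: I would use a solution which does matching by converting to a string...
You should write a simple function implementing the Knuth-Morris-Pratt searching algorithm. This will be the fastest simple algorithm you can use to find the correct indexes. (You could use Boyer-Moore, but it will require more setup.)
After you have optimized the algorithm, you can try to look for other kinds of optimizations. But you should start with the basics.
For example, the current "fastest" is the Locate solution by Jb Evain.
If you look at the core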
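for (int i = 0; i < self.Length; i++) {
    if (!IsMatch (self, i, candidate))
        continue;

    list.Add (i);
}
After a match of the sub-algorithm, it will start looking for a match at i + 1, but you already know that the first possible match would be i + candidate.Length. So if you add,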
i += candidate.Length -2; // -2 instead of -1 because the i++ will add the last index
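it will be a lot faster when you expect many occurrences of the subset in the superset. (Bruno Conde already does this in his solution.)
But this is just half of the KMP algorithm; you should also add an extra parameter to the IsMatch method called numberOfValidMatches, which would be an out parameter.
This would resolve to the following:
int validMatches = 0;
if (!IsMatch (self, i, candidate, out validMatches))
{
    // Guard against validMatches == 0, which would otherwise undo the i++ and loop forever
    if (validMatches > 0)
        i += validMatches - 1; // -1 because the i++ will do the last one
    continue;
}
and
static bool IsMatch (byte [] array, int position, byte [] candidate, out int numberOfValidMatches)
{
    numberOfValidMatches = 0;
    if (candidate.Length > (array.Length - position))
        return false;

    for (int i = 0; i < candidate.Length; i++)
    {
        if (array [position + i] != candidate [i])
            return false;
        numberOfValidMatches++;
    }

    return true;
}
With a little bit of refactoring you could use numberOfValidMatches as the loop variable, and rewrite the Locate loop using a while to avoid the -2 and -1. But I just wanted to make clear how you could add the KMP algorithm.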
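For comparison, here is a minimal sketch of the complete Knuth-Morris-Pratt algorithm applied to byte arrays. It is not the code from this answer: instead of skipping ahead by the number of matched bytes, it uses the standard failure table, so overlapping occurrences are not lost. The names KmpSearch, BuildFailureTable and FindAll are illustrative.

```csharp
using System;
using System.Collections.Generic;

static class KmpSearch
{
    // failure[k] = length of the longest proper prefix of pattern[0..k]
    // that is also a suffix of pattern[0..k].
    static int[] BuildFailureTable(byte[] pattern)
    {
        var failure = new int[pattern.Length];
        int matched = 0;
        for (int k = 1; k < pattern.Length; k++)
        {
            while (matched > 0 && pattern[k] != pattern[matched])
                matched = failure[matched - 1];
            if (pattern[k] == pattern[matched])
                matched++;
            failure[k] = matched;
        }
        return failure;
    }

    // Returns every start index of pattern in data, overlapping matches included.
    public static List<int> FindAll(byte[] data, byte[] pattern)
    {
        var positions = new List<int>();
        if (data == null || pattern == null || pattern.Length == 0 || pattern.Length > data.Length)
            return positions;

        int[] failure = BuildFailureTable(pattern);
        int matched = 0; // number of pattern bytes currently matched
        for (int i = 0; i < data.Length; i++)
        {
            // On a mismatch, fall back to the longest prefix that can still match.
            while (matched > 0 && data[i] != pattern[matched])
                matched = failure[matched - 1];
            if (data[i] == pattern[matched])
                matched++;
            if (matched == pattern.Length)
            {
                positions.Add(i - pattern.Length + 1);
                matched = failure[matched - 1]; // continue, allowing overlaps
            }
        }
        return positions;
    }
}
```

Each byte of data is examined once and the fallback steps are bounded by the number of advances, which is what gives the O(needle.Length + haystack.Length) running time.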
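[Comments]:
"but you already know that the first possible match would be i + candidate.Length" - that's not true - the candidate pattern could have repeats or loops that could cause overlapping matches.
That's the question; in my opinion you would only want complete non-overlapping matches. This situation is only possible if one or more bytes at the end of the candidate array match the first bytes of the candidate array.
[Answer 7]: Use the efficient Boyer-Moore algorithm.
It's designed to find strings within strings, but you need little imagination to project this onto byte arrays.
In general the best answer is: use any string searching algorithm that you like :).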
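As a rough illustration of how little changes when moving from strings to bytes, here is a sketch of the Boyer-Moore-Horspool variant, a simplification of Boyer-Moore that keeps only the bad-character table. With bytes the alphabet size is fixed at 256, so the table is a plain array. The names HorspoolSearch and IndexOf are illustrative.

```csharp
using System;

static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (data == null || pattern == null || pattern.Length == 0 || pattern.Length > data.Length)
            return -1;

        int m = pattern.Length;

        // shift[b] = how far the window may move when the byte aligned with
        // the last pattern position is b.
        var shift = new int[256];
        for (int b = 0; b < 256; b++)
            shift[b] = m;
        for (int k = 0; k < m - 1; k++)
            shift[pattern[k]] = m - 1 - k;

        int pos = 0;
        while (pos <= data.Length - m)
        {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos;
            pos += shift[data[pos + m - 1]];
        }
        return -1;
    }
}
```

The full Boyer-Moore algorithm adds a second (good-suffix) table; Horspool trades that for simpler setup and usually performs well on data with a reasonably varied byte distribution.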
[Comments]:
[Answer 8]: Jb Evain's answer has:
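for (int i = 0; i < self.Length; i++) {
    if (!IsMatch (self, i, candidate))
        continue;

    list.Add (i);
}
and then the IsMatch function first checks whether candidate goes beyond the length of the array being searched.
This would be more efficient if the for loop were coded:
for (int i = 0, n = self.Length - candidate.Length + 1; i < n; ++i) {
    if (!IsMatch (self, i, candidate))
        continue;

    list.Add (i);
}
At this point one could also eliminate the test from the start of IsMatch, so long as you contract via pre-conditions never to call it with "illegal" parameters.
Note: Fixed an off-by-one bug in 2019.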
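[Comments]:
The only problem with *** is when things are wrong, but what are you going to do about it? I don't know. This has been here for over 10 years, but it has a bug. This is a good optimization, but there is a problem with it: off-by-one. Yep. Imagine self.Length=1 and candidate.Length=1; then even if they are the same, no match will be found. I will try to change it.
@Cameron well spotted - the edit was approved with minor changes.
[Answer 9]: I've created a new function using the tips from my answer and the answer by Alnitak.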
public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
    if ((superSet == null) || (subSet == null))
    {
        throw new ArgumentNullException();
    }
    if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
    {
        return new List<Int32>();
    }
    var result = new List<Int32>();
    Int32 currentIndex = 0;
    Int32 maxIndex = superSet.Length - subSet.Length;
    // <= (not <) so that a match ending at the last byte is still found
    while (currentIndex <= maxIndex)
    {
        Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
        if (matchCount == subSet.Length)
        {
            result.Add(currentIndex);
        }
        currentIndex++;
        if (matchCount > 0)
        {
            currentIndex += matchCount - 1;
        }
    }
    return result;
}

private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
    Int32 currentOffset = 0;
    while (currentOffset < subSet.Length)
    {
        if (superSet[startIndex + currentOffset] != subSet[currentOffset])
        {
            break;
        }
        currentOffset++;
    }
    return currentOffset;
}
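The only part I'm not so happy about is the
currentIndex++;
if (matchCount > 0)
{
    currentIndex += matchCount - 1;
}
part... I would have liked to use an if else to avoid the -1, but this results in better branch prediction (although I'm not sure it will matter that much).
[Comments]:
[Answer 10]: Why make the simple difficult? This can be done in any language using for loops. Here is one in C#: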
using System;
using System.Collections.Generic;

namespace BinarySearch
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
            byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
            List ...
[Comments]:
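Your naive algorithm has a running time of O(needle.Length * haystack.Length); an optimized algorithm has a running time of O(needle.Length + haystack.Length).
[Answer 11]:
Thanks for taking the time...
This is the code I was using/testing with before I asked my question... The reason I asked is that I'm certain I'm not using the optimal code to do this... so thanks again for taking the time!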
private static int CountPatternMatches(byte[] pattern, byte[] bytes)
{
    int counter = 0;

    for (int i = 0; i < bytes.Length; i++)
    {
        if (bytes[i] == pattern[0] && (i + pattern.Length) < bytes.Length)
        {
            for (int x = 1; x < pattern.Length; x++)
            {
                if (pattern[x] != bytes[x+i])
                {
                    break;
                }

                if (x == pattern.Length -1)
                {
                    counter++;
                    i = i + pattern.Length;
                }
            }
        }
    }

    return counter;
}
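Does anyone see any errors in my code? Is this considered a hackish approach? I have tried almost every sample you guys have posted and I seem to get some variation in the match results. I have been running my tests with a ~10 MB byte array as my toBeSearched array.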
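On the question of errors, three things stand out in CountPatternMatches as posted. The bound check (i + pattern.Length) < bytes.Length rejects a match that ends exactly at the last byte; i = i + pattern.Length followed by the loop's own i++ skips one byte too many after a hit; and a single-byte pattern is never counted because the inner loop starts at x = 1. The following is a hedged sketch of a corrected version that keeps the same non-overlapping counting; the names PatternCounter and CountPatternMatchesFixed are illustrative.

```csharp
using System;

static class PatternCounter
{
    // Counts non-overlapping occurrences of pattern in bytes.
    public static int CountPatternMatchesFixed(byte[] pattern, byte[] bytes)
    {
        if (pattern == null || bytes == null || pattern.Length == 0)
            return 0;

        int counter = 0;
        int i = 0;
        while (i <= bytes.Length - pattern.Length) // <= keeps a match at the very end
        {
            // Count how many leading pattern bytes match at position i.
            int x = 0;
            while (x < pattern.Length && bytes[i + x] == pattern[x])
                x++;

            if (x == pattern.Length)
            {
                counter++;
                i += pattern.Length; // resume right after the match
            }
            else
            {
                i++;
            }
        }
        return counter;
    }
}
```

With the arrays from the question, the posted version counts one occurrence because it skips the one that ends the array, while this sketch counts both.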
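[Comments]:
[Answer 12]: Originally I posted some old code I used, but I was curious about Jb Evain's benchmarks. I found that my solution was stupid slow. It appears that bruno conde's SearchBytePattern is the fastest. I could not figure out why, especially since he uses an Array.Copy and an extension method. But there is proof in Jb's tests, so kudos to bruno.
I simplified the bits even further, so hopefully this will be the clearest and simplest solution. (All the hard work was done by bruno conde.) The enhancements are:
- Buffer.BlockCopy
- Array.IndexOf<byte>
- conversion to an extension method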
public static List<int> IndexOfSequence(this byte[] buffer, byte[] pattern, int startIndex)
{
    List<int> positions = new List<int>();
    int i = Array.IndexOf<byte>(buffer, pattern[0], startIndex);
    while (i >= 0 && i <= buffer.Length - pattern.Length)
    {
        byte[] segment = new byte[pattern.Length];
        Buffer.BlockCopy(buffer, i, segment, 0, pattern.Length);
        if (segment.SequenceEqual<byte>(pattern))
            positions.Add(i);
        i = Array.IndexOf<byte>(buffer, pattern[0], i + 1);
    }
    return positions;
}
Note that the last statement in the while block should be i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); instead of i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);. Look at the comment by John. A simple test can prove that:
byte[] pattern = new byte[] { 1, 2 };
byte[] toBeSearched = new byte[] { 1, 1, 2, 1, 12 };
With i = Array.IndexOf<byte>(buffer, pattern[0], i + pattern.Length);, nothing is returned. i = Array.IndexOf<byte>(buffer, pattern[0], i + 1); returns the correct result.
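[Comments]:
The line "i = Array.IndexOf ...
[Answer 13]: Speed isn't everything. Did you check them for consistency?
I didn't test all the code listed here. I tested my own code (which wasn't totally consistent, I admit) and IndexOfSequence. I found that for many tests IndexOfSequence was quite a bit faster than my code, but with repeated testing I found that it was less consistent. In particular it seems to have the most trouble finding patterns at the end of the array, but it will sometimes miss them in the middle of the array too.
My test code isn't designed for efficiency; I just wanted a bunch of random data with some known strings inside. That test pattern is roughly like a boundary marker in an http form upload stream. That's what I was looking for when I ran across this code, so I figured I'd test it with the kind of data I'll be searching. It appears that the longer the pattern is, the more often IndexOfSequence will miss a value.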
private static void TestMethod()
{
    Random rnd = new Random(DateTime.Now.Millisecond);
    string Pattern = "-------------------------------65498495198498";
    byte[] pattern = Encoding.ASCII.GetBytes(Pattern);

    byte[] testBytes;
    int count = 3;
    for (int i = 0; i < 100; i++)
    {
        StringBuilder TestString = new StringBuilder(2500);
        TestString.Append(Pattern);
        byte[] buf = new byte[1000];
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        rnd.NextBytes(buf);
        TestString.Append(Encoding.ASCII.GetString(buf));
        TestString.Append(Pattern);
        testBytes = Encoding.ASCII.GetBytes(TestString.ToString());

        List<int> idx = IndexOfSequence(ref testBytes, pattern, 0);
        if (idx.Count != count)
        {
            Console.Write("change from {0} to {1} on iteration {2}: ", count, idx.Count, i);
            foreach (int ix in idx)
            {
                Console.Write("{0}, ", ix);
            }
            Console.WriteLine();
            count = idx.Count;
        }
    }

    Console.WriteLine("Press ENTER to exit");
    Console.ReadLine();
}
(Obviously I converted IndexOfSequence from an extension back into a normal method for this test.)
Here's a sample run of my output:
change from 3 to 2 on iteration 1: 0, 2090,
change from 2 to 3 on iteration 2: 0, 1045, 2090,
change from 3 to 2 on iteration 3: 0, 1045,
change from 2 to 3 on iteration 4: 0, 1045, 2090,
change from 3 to 2 on iteration 6: 0, 2090,
change from 2 to 3 on iteration 7: 0, 1045, 2090,
change from 3 to 2 on iteration 11: 0, 2090,
change from 2 to 3 on iteration 12: 0, 1045, 2090,
change from 3 to 2 on iteration 14: 0, 2090,
change from 2 to 3 on iteration 16: 0, 1045, 2090,
change from 3 to 2 on iteration 17: 0, 1045,
change from 2 to 3 on iteration 18: 0, 1045, 2090,
change from 3 to 1 on iteration 20: 0,
change from 1 to 3 on iteration 21: 0, 1045, 2090,
change from 3 to 2 on iteration 22: 0, 2090,
change from 2 to 3 on iteration 23: 0, 1045, 2090,
change from 3 to 2 on iteration 24: 0, 2090,
change from 2 to 3 on iteration 25: 0, 1045, 2090,
change from 3 to 2 on iteration 26: 0, 2090,
change from 2 to 3 on iteration 27: 0, 1045, 2090,
change from 3 to 2 on iteration 43: 0, 1045,
change from 2 to 3 on iteration 44: 0, 1045, 2090,
change from 3 to 2 on iteration 48: 0, 1045,
change from 2 to 3 on iteration 49: 0, 1045, 2090,
change from 3 to 2 on iteration 50: 0, 2090,
change from 2 to 3 on iteration 52: 0, 1045, 2090,
change from 3 to 2 on iteration 54: 0, 1045,
change from 2 to 3 on iteration 57: 0, 1045, 2090,
change from 3 to 2 on iteration 62: 0, 1045,
change from 2 to 3 on iteration 63: 0, 1045, 2090,
change from 3 to 2 on iteration 72: 0, 2090,
change from 2 to 3 on iteration 73: 0, 1045, 2090,
change from 3 to 2 on iteration 75: 0, 2090,
change from 2 to 3 on iteration 76: 0, 1045, 2090,
change from 3 to 2 on iteration 78: 0, 1045,
change from 2 to 3 on iteration 79: 0, 1045, 2090,
change from 3 to 2 on iteration 81: 0, 2090,
change from 2 to 3 on iteration 82: 0, 1045, 2090,
change from 3 to 2 on iteration 85: 0, 2090,
change from 2 to 3 on iteration 86: 0, 1045, 2090,
change from 3 to 2 on iteration 89: 0, 2090,
change from 2 to 3 on iteration 90: 0, 1045, 2090,
change from 3 to 2 on iteration 91: 0, 2090,
change from 2 to 1 on iteration 92: 0,
change from 1 to 3 on iteration 93: 0, 1045, 2090,
change from 3 to 1 on iteration 99: 0,
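I don't mean to pick on IndexOfSequence; it just happened to be the one I started working with today. I noticed at the end of the day that it seemed to be missing patterns in the data, so I wrote my own pattern matcher tonight. It's not as fast, though. I'm going to tweak it a bit more to see if I can get it 100% consistent before I post it.
I just wanted to remind everyone that they should test things like this to make sure they give good, repeatable results before trusting them in production code.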
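One way to act on this advice is to compare a candidate search routine against a deliberately simple reference on many random inputs. The sketch below assumes the candidate returns all start positions, overlapping ones included; the names SearchCrossCheck, NaiveFindAll and FirstDisagreement are illustrative. A three-symbol alphabet is used so that matches and near-misses occur often, and a fixed seed makes any failure repeatable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SearchCrossCheck
{
    // Reference implementation: try every start position, overlaps included.
    public static List<int> NaiveFindAll(byte[] data, byte[] pattern)
    {
        var positions = new List<int>();
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j])
                j++;
            if (j == pattern.Length)
                positions.Add(i);
        }
        return positions;
    }

    // Runs `rounds` random trials; returns the index of the first trial where
    // candidate and reference disagree, or -1 if they always agree.
    public static int FirstDisagreement(Func<byte[], byte[], List<int>> candidate, int rounds, int seed)
    {
        var rnd = new Random(seed);
        for (int round = 0; round < rounds; round++)
        {
            // Bytes drawn from {0, 1, 2} so that partial matches are common.
            byte[] data = Enumerable.Range(0, 200).Select(_ => (byte)rnd.Next(3)).ToArray();
            byte[] pattern = Enumerable.Range(0, 1 + rnd.Next(4)).Select(_ => (byte)rnd.Next(3)).ToArray();

            if (!candidate(data, pattern).SequenceEqual(NaiveFindAll(data, pattern)))
                return round;
        }
        return -1;
    }
}
```

A routine that only reports non-overlapping matches will disagree with this reference by design, so the reference should be adapted to whichever semantics are intended before drawing conclusions.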
[Comments]:
[Answer 14]: I tried various solutions and ended up modifying the SearchBytePattern one. I tested it on a 30k sequence and it is fast :)
static public int SearchBytePattern(byte[] pattern, byte[] bytes)
{
    int matches = 0;
    for (int i = 0; i < bytes.Length; i++)
    {
        if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
        {
            bool ismatch = true;
            for (int j = 1; j < pattern.Length && ismatch == true; j++)
            {
                if (bytes[i + j] != pattern[j])
                    ismatch = false;
            }
            if (ismatch)
            {
                matches++;
                i += pattern.Length - 1;
            }
        }
    }
    return matches;
}
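Let me know your thoughts.
[Comments]:
[Answer 15]: These are the simplest and fastest methods you can use, and there wouldn't be anything faster than these. It is unsafe, but that is what we use pointers for: speed. So here I offer you the extension methods I use to search for a single occurrence and for a list of the indexes of all occurrences. I would like to say that this is the cleanest code here.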
public static unsafe long IndexOf(this byte[] Haystack, byte[] Needle)
{
    fixed (byte* H = Haystack) fixed (byte* N = Needle)
    {
        long i = 0;
        for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
        {
            bool Found = true;
            for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
            if (Found) return i;
        }
        return -1;
    }
}

public static unsafe List<long> IndexesOf(this byte[] Haystack, byte[] Needle)
{
    List<long> Indexes = new List<long>();
    fixed (byte* H = Haystack) fixed (byte* N = Needle)
    {
        long i = 0;
        for (byte* hNext = H, hEnd = H + Haystack.LongLength; hNext < hEnd; i++, hNext++)
        {
            bool Found = true;
            for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
            if (Found) Indexes.Add(i);
        }
        return Indexes;
    }
}
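Benchmarked against Locate, it is 1.2-1.4x faster.
[Comments]:
It literally is unsafe, though, as it reads past the end of the haystack while looking for the needle. See the version below.
[Answer 16]: Here's a solution I came up with. I included notes I made along the way during the implementation. It can match forward, backward and with different (in/dec)rement amounts, e.g. direction, starting at any offset in the haystack.
Any input would be awesome!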
/// <summary>
/// Matches a byte array to another byte array
/// forwards or reverse
/// </summary>
/// <param name="a">byte array</param>
/// <param name="offset">start offset</param>
/// <param name="len">max length</param>
/// <param name="b">byte array</param>
/// <param name="direction">to move each iteration</param>
/// <returns>true if all bytes match, otherwise false</returns>
internal static bool Matches(ref byte[] a, int offset, int len, ref byte[] b, int direction = 1)
{
    #region Only matched from offset within a and b, could not differ, e.g. if you wanted to match in reverse for only part of a in some of b that would not work
    //if (direction == 0) throw new ArgumentException("direction");
    //for (; offset < len; offset += direction) if (a[offset] != b[offset]) return false;
    //return true;
    #endregion
    //Will match if b contains len of a and return an index of positive value
    //(direction is passed through; the posted code passed len in its place)
    return IndexOfBytes(ref a, ref offset, len, ref b, direction) != -1;
}
///Here is the Implementation code
/// <summary>
/// Swaps two integers without using a temporary variable
/// </summary>
/// <param name="a"></param>
/// <param name="b"></param>
internal static void Swap(ref int a, ref int b)
{
    a ^= b;
    b ^= a;
    a ^= b;
}
/// <summary>
/// Swaps two bytes without using a temporary variable
/// </summary>
/// <param name="a"></param>
/// <param name="b"></param>
internal static void Swap(ref byte a, ref byte b)
{
    a ^= b;
    b ^= a;
    a ^= b;
}
/// <summary>
/// Can be used to find whether an array starts with, ends with, matches at a given spot, or completely contains a sub byte array
/// Set checkLength to the number of bytes from the needle you want to match; start at 0 for forward searches, start at hayStack.Length - 1 for reverse matches
/// </summary>
/// <param name="a">Needle</param>
/// <param name="offset">Start in Haystack</param>
/// <param name="len">Length of required match</param>
/// <param name="b">Haystack</param>
/// <param name="direction">Which way to move the iterator</param>
/// <returns>Index if found, otherwise -1</returns>
internal static int IndexOfBytes(ref byte[] needle, ref int offset, int checkLength, ref byte[] haystack, int direction = 1)
{
    //If the direction is == 0 we would spin forever making no progress
    if (direction == 0) throw new ArgumentException("direction");
    //Cache the length of the needle and the haystack, setup the endIndex for a reverse search
    int needleLength = needle.Length, haystackLength = haystack.Length, endIndex = 0, workingOffset = offset;
    //Allocate a value for the endIndex and workingOffset
    //If we are going forward then the bound is the haystackLength
    if (direction >= 1) endIndex = haystackLength;
    #region [Optimization - Not Required]
    //
    //I thought this was required for partial matching but it seems it is not needed in this form
    //workingOffset = needleLength - checkLength;
    //
    #endregion
    else Swap(ref workingOffset, ref endIndex);
    #region [Optimization - Not Required]
    //
    //Otherwise we are going in reverse and the endIndex is the needleLength - checkLength
    //I thought the length had to be adjusted but it seems it is not needed in this form
    //endIndex = needleLength - checkLength;
    //
    #endregion
    #region [Optimized to above]
    //Allocate a value for the endIndex
    //endIndex = direction >= 1 ? haystackLength : needleLength - checkLength,
    //Determine the workingOffset
    //workingOffset = offset > needleLength ? offset : needleLength;
    //If we are doing in reverse swap the two
    //if (workingOffset > endIndex) Swap(ref workingOffset, ref endIndex);
    //Else we are going in forward direction do the offset is adjusted by the length of the check
    //else workingOffset -= checkLength;
    //Start at the checkIndex (workingOffset) every search attempt
    #endregion
    //Save the checkIndex (used after the for loop is done with it to determine if the match was checkLength long)
    int checkIndex = workingOffset;
    #region [For Loop Version]
    ///Optimized with while (single op)
    ///for (int checkIndex = workingOffset; checkIndex < endIndex; offset += direction, checkIndex = workingOffset)
    ///
    ///Start at the checkIndex
    /// While the checkIndex < checkLength move forward
    /// If NOT (the needle at the checkIndex matched the haystack at the offset + checkIndex) BREAK ELSE we have a match continue the search
    /// for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
    /// If the match was the length of the check
    /// if (checkIndex == checkLength) return offset; //We are done matching
    ///
    #endregion
    //While the checkIndex < endIndex
    while (checkIndex < endIndex)
    {
        for (; checkIndex < checkLength; ++checkIndex) if (needle[checkIndex] != haystack[offset + checkIndex]) break; else continue;
        //If the match was the length of the check
        if (checkIndex == checkLength) return offset; //We are done matching
        //Move the offset by the direction, reset the checkIndex to the workingOffset
        offset += direction; checkIndex = workingOffset;
    }
    //We did not have a match with the given options
    return -1;
}
[Comments]:
[Answer 17]: I'm a bit late to the party. How about using the Boyer-Moore algorithm, but searching for bytes instead of strings? C# code below.
EyeCode Inc.
class Program
{
    static void Main(string[] args)
    {
        byte[] text = new byte[] {12,3,5,76,8,0,6,125,23,36,43,76,125,56,34,234,12,4,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,123};
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

        BoyerMoore tmpSearch = new BoyerMoore(pattern,text);

        Console.WriteLine(tmpSearch.Match());
        Console.ReadKey();
    }

    public class BoyerMoore
    {
        private static int ALPHABET_SIZE = 256;

        private byte[] text;
        private byte[] pattern;

        private int[] last;
        private int[] match;
        private int[] suffix;

        public BoyerMoore(byte[] pattern, byte[] text)
        {
            this.text = text;
            this.pattern = pattern;
            last = new int[ALPHABET_SIZE];
            match = new int[pattern.Length];
            suffix = new int[pattern.Length];
        }

        /**
        * Searches the pattern in the text.
        * returns the position of the first occurrence, if found and -1 otherwise.
        */
        public int Match()
        {
            // Preprocessing
            ComputeLast();
            ComputeMatch();

            // Searching
            int i = pattern.Length - 1;
            int j = pattern.Length - 1;
            while (i < text.Length)
            {
                if (pattern[j] == text[i])
                {
                    if (j == 0)
                    {
                        return i;
                    }
                    j--;
                    i--;
                }
                else
                {
                    i += pattern.Length - j - 1 + Math.Max(j - last[text[i]], match[j]);
                    j = pattern.Length - 1;
                }
            }
            return -1;
        }

        /**
        * Computes the function last and stores its values in the array last.
        * last(Char ch) = the index of the right-most occurrence of the character ch
        *                 in the pattern;
        *                 -1 if ch does not occur in the pattern.
        */
        private void ComputeLast()
        {
            for (int k = 0; k < last.Length; k++)
            {
                last[k] = -1;
            }
            for (int j = pattern.Length-1; j >= 0; j--)
            {
                if (last[pattern[j]] < 0)
                {
                    last[pattern[j]] = j;
                }
            }
        }

        /**
        * Computes the function match and stores its values in the array match.
        * match(j) = min{ s | 0 < s <= j && p[j-s]!=p[j]
        *                     && p[j-s+1]..p[m-s-1] is suffix of p[j+1]..p[m-1] },
        *            if such s exists, else
        *            min{ s | j+1 <= s <= m
        *                     && p[0]..p[m-s-1] is suffix of p[j+1]..p[m-1] },
        *            if such s exists,
        *            m, otherwise,
        * where p is the pattern and m is its length.
        */
        private void ComputeMatch()
        {
            /* Phase 1 */
            for (int j = 0; j < match.Length; j++)
            {
                match[j] = match.Length;
            } //O(m)

            ComputeSuffix(); //O(m)

            /* Phase 2 */
            //Uses an auxiliary array, backwards version of the KMP failure function.
            //suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
            //if there is no such j, suffix[i] = m

            //Compute the smallest shift s, such that 0 < s <= j and
            //p[j-s]!=p[j] and p[j-s+1..m-s-1] is suffix of p[j+1..m-1] or j == m-1,
            //                                                         if such s exists,
            for (int i = 0; i < match.Length - 1; i++)
            {
                int j = suffix[i + 1] - 1; // suffix[i+1] <= suffix[i] + 1
                if (suffix[i] > j)
                { // therefore pattern[i] != pattern[j]
                    match[j] = j - i;
                }
                else
                { // j == suffix[i]
                    match[j] = Math.Min(j - i + match[i], match[j]);
                }
            }

            /* Phase 3 */
            //Uses the suffix array to compute each shift s such that
            //p[0..m-s-1] is a suffix of p[j+1..m-1] with j < s < m
            //and stores the minimum of this shift and the previously computed one.
            if (suffix[0] < pattern.Length)
            {
                for (int j = suffix[0] - 1; j >= 0; j--)
                {
                    if (suffix[0] < match[j]) { match[j] = suffix[0]; }
                }
                {
                    int j = suffix[0];
                    for (int k = suffix[j]; k < pattern.Length; k = suffix[k])
                    {
                        while (j < k)
                        {
                            if (match[j] > k)
                            {
                                match[j] = k;
                            }
                            j++;
                        }
                    }
                }
            }
        }

        /**
        * Computes the values of suffix, which is an auxiliary array,
        * backwards version of the KMP failure function.
        *
        * suffix[i] = the smallest j > i s.t. p[j..m-1] is a prefix of p[i..m-1],
        * if there is no such j, suffix[i] = m, i.e.
        * p[suffix[i]..m-1] is the longest prefix of p[i..m-1], if suffix[i] < m.
        */
        private void ComputeSuffix()
        {
            suffix[suffix.Length-1] = suffix.Length;
            int j = suffix.Length - 1;
            for (int i = suffix.Length - 2; i >= 0; i--)
            {
                while (j < suffix.Length - 1 && !pattern[j].Equals(pattern[i]))
                {
                    j = suffix[j + 1] - 1;
                }
                if (pattern[j] == pattern[i])
                {
                    j--;
                }
                suffix[i] = j + 1;
            }
        }
    }
}
[Comments]:
[Answer 18]: Here's a little code I wrote using only basic data types (it returns the index of the first occurrence):
private static int findMatch(byte[] data, byte[] pattern)
{
    if(pattern.length > data.length)
    {
        return -1;
    }
    for(int i = 0; i<data.length ;)
    {
        int j;
        for(j=0;j<pattern.length;j++)
        {
            if(pattern[j]!=data[i])
                break;
            i++;
        }
        if(j==pattern.length)
        {
            System.out.println("Pattern found at : "+(i - pattern.length ));
            return i - pattern.length ;
        }
        if(j!=0)continue;
        i++;
    }
    return -1;
}
[Comments]:
The beginning of your answer reminded me of a song: Here's a little code I wrote, you might want to see it node for node, don't worry, be happy
[Answer 19]:
I was missing a LINQ method/answer :-)
/// <summary>
/// Searches in the haystack array for the given needle using the default equality operator and returns the index at which the needle starts.
/// </summary>
/// <typeparam name="T">Type of the arrays.</typeparam>
/// <param name="haystack">Sequence to operate on.</param>
/// <param name="needle">Sequence to search for.</param>
/// <returns>Index of the needle within the haystack or -1 if the needle isn't contained.</returns>
public static IEnumerable<int> IndexOf<T>(this T[] haystack, T[] needle)
{
    if ((needle != null) && (haystack.Length >= needle.Length))
    {
        for (int l = 0; l < haystack.Length - needle.Length + 1; l++)
        {
            if (!needle.Where((data, index) => !haystack[l + index].Equals(data)).Any())
            {
                yield return l;
            }
        }
    }
}
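[Comments]:
[Answer 20]: Using a LINQ approach.
public static IEnumerable<int> PatternAt(byte[] source, byte[] pattern)
{
    for (int i = 0; i < source.Length; i++)
    {
        if (source.Skip(i).Take(pattern.Length).SequenceEqual(pattern))
        {
            yield return i;
        }
    }
}
Very simple!
[Comments]:
But not particularly efficient, so it is suitable for most contexts but not all.
[Answer 21]: Here is an O(n)-type operation that does not use unsafe code or copy parts of the source array.
Be sure to test. Some of the suggestions found on this topic are susceptible to gotcha situations.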
static void Main(string[] args)
{
    // 1 1 1 1 1 1 1 1 1 1 2 2 2
    // 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
    byte[] buffer = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 5, 5, 0, 5, 5, 1, 2 };
    byte[] beginPattern = new byte[] { 1, 0, 2 };
    byte[] middlePattern = new byte[] { 8, 9, 10 };
    byte[] endPattern = new byte[] { 9, 10, 11 };
    byte[] wholePattern = new byte[] { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
    byte[] noMatchPattern = new byte[] { 7, 7, 7 };

    int beginIndex = ByteArrayPatternIndex(buffer, beginPattern);
    int middleIndex = ByteArrayPatternIndex(buffer, middlePattern);
    int endIndex = ByteArrayPatternIndex(buffer, endPattern);
    int wholeIndex = ByteArrayPatternIndex(buffer, wholePattern);
    int noMatchIndex = ByteArrayPatternIndex(buffer, noMatchPattern);
}

/// <summary>
/// Returns the index of the first occurrence of a byte array within another byte array
/// </summary>
/// <param name="buffer">The byte array to be searched</param>
/// <param name="pattern">The byte array that contains the pattern to be found</param>
/// <returns>If buffer contains pattern then the index of the first occurrence of pattern within buffer; otherwise, -1</returns>
public static int ByteArrayPatternIndex(byte[] buffer, byte[] pattern)
{
    if (buffer != null && pattern != null && pattern.Length <= buffer.Length)
    {
        int resumeIndex;
        for (int i = 0; i <= buffer.Length - pattern.Length; i++)
        {
            if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
            {
                resumeIndex = 0;
                for (int x = 1; x < pattern.Length; x++)
                {
                    if (buffer[i + x] == pattern[x])
                    {
                        if (x == pattern.Length - 1) // Matched the entire pattern
                            return i;
                        else if (resumeIndex == 0 && buffer[i + x] == pattern[0]) // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
                            resumeIndex = i + x;
                    }
                    else
                    {
                        if (resumeIndex > 0)
                            i = resumeIndex - 1; // The outer loop iterator will increment so subtract one
                        else if (x > 1)
                            i += (x - 1); // Advance the outer loop variable since we already checked these bytes
                        break;
                    }
                }
            }
        }
    }
    return -1;
}

/// <summary>
/// Returns the indexes of each occurrence of a byte array within another byte array
/// </summary>
/// <param name="buffer">The byte array to be searched</param>
/// <param name="pattern">The byte array that contains the pattern to be found</param>
/// <returns>If buffer contains pattern then the indexes of the occurrences of pattern within buffer; otherwise, null</returns>
/// <remarks>A single byte in the buffer array can only be part of one match. For example, if searching for 1,2,1 in 1,2,1,2,1 only zero would be returned.</remarks>
// Named ByteArrayPatternIndexes here because C# cannot overload on return type alone.
public static int[] ByteArrayPatternIndexes(byte[] buffer, byte[] pattern)
{
    if (buffer != null && pattern != null && pattern.Length <= buffer.Length)
    {
        List<int> indexes = new List<int>();
        int resumeIndex;
        for (int i = 0; i <= buffer.Length - pattern.Length; i++)
        {
            if (buffer[i] == pattern[0]) // Current byte equals first byte of pattern
            {
                resumeIndex = 0;
                for (int x = 1; x < pattern.Length; x++)
                {
                    if (buffer[i + x] == pattern[x])
                    {
                        if (x == pattern.Length - 1) // Matched the entire pattern
                            indexes.Add(i);
                        else if (resumeIndex == 0 && buffer[i + x] == pattern[0]) // The current byte equals the first byte of the pattern so start here on the next outer loop iteration
                            resumeIndex = i + x;
                    }
                    else
                    {
                        if (resumeIndex > 0)
                            i = resumeIndex - 1; // The outer loop iterator will increment so subtract one
                        else if (x > 1)
                            i += (x - 1); // Advance the outer loop variable since we already checked these bytes
                        break;
                    }
                }
            }
        }
        if (indexes.Count > 0)
            return indexes.ToArray();
    }
    return null;
}
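[Comments]:
Your solution is not O(n), because you have nested for loops!
[Answer 22]: My version of Foubar's answer above, which avoids searching past the end of the haystack and allows specifying a starting offset. It assumes the needle is not empty and not longer than the haystack.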
public static unsafe long IndexOf(this byte[] haystack, byte[] needle, long startOffset = 0)
{
    fixed (byte* h = haystack) fixed (byte* n = needle)
    {
        for (byte* hNext = h + startOffset, hEnd = h + haystack.LongLength + 1 - needle.LongLength, nEnd = n + needle.LongLength; hNext < hEnd; hNext++)
            for (byte* hInc = hNext, nInc = n; *nInc == *hInc; hInc++)
                if (++nInc == nEnd)
                    return hNext - h;
        return -1;
    }
}
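[Comments]:
I used your IndexOf code in another answer (and gave you credit for that piece). Just thought you might want to know - you can find it here: ***.com/questions/31364114/…
[Answer 23]: You can use ORegex:
var oregex = new ORegex<byte>("{0}{1}{2}", x=> x==12, x=> x==3, x=> x==5);
var toSearch = new byte[]{1,1,12,3,5,1,12,3,5,5,5,5};
var found = oregex.Matches(toSearch);
Two matches will be found:
i:2;l:3
i:6;l:3
Complexity: O(n*m) in the worst case; in real life it is O(n) because of the internal state machine. In some cases it is faster than .NET Regex. It is compact, fast and designed especially for array pattern matching.
[Comments]:
[Answer 24]: Here's my proposal, simpler and faster: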
int Search(byte[] src, byte[] pattern)
{
    int maxFirstCharSlot = src.Length - pattern.Length + 1;
    for (int i = 0; i < maxFirstCharSlot; i++)
    {
        if (src[i] != pattern[0]) // compare only first byte
            continue;

        // found a match on first byte, now try to match rest of the pattern
        for (int j = pattern.Length - 1; j >= 1; j--)
        {
            if (src[i + j] != pattern[j]) break;
            if (j == 1) return i;
        }
    }
    return -1;
}
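The logic behind this code is this: first it searches ONLY for the first byte (this is the key improvement), and when that first byte is found, it tries to match the rest of the pattern.
[Comments]:
Actually, I don't understand the logic. But it's faster than some of the above methods I tried.
I only check for the first byte, and then, when a match is found, check the rest of the pattern. It could be faster to check integers instead of bytes.
One necro comment: you should probably rename 'c' to something a bit better - like 'maxFirstCharSlot' or something. But this gets a +1 from me - very useful.
While this is being updated because of the necro: this is an absolutely amazing code answer. Could you explain how it works or comment the logic so that less advanced members can understand it? I only know what this is doing because my programming degree covered building sort and search systems :D
@Barkermn01 thanks for the comment, I have edited my answer to explain the logic in it; please check it and let me know if it's enough.
[Answer 25]: I tried to understand Sanchez's proposal and make the search faster. The code below has nearly the same performance, but it is easier to understand.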
public int Search3(byte[] src, byte[] pattern)
{
    int index = -1;

    for (int i = 0; i < src.Length; i++)
    {
        if (src[i] != pattern[0])
        {
            continue;
        }
        else
        {
            bool isContinoue = true;
            for (int j = 1; j < pattern.Length; j++)
            {
                if (src[++i] != pattern[j])
                {
                    isContinoue = true;
                    break;
                }
                if(j == pattern.Length - 1)
                {
                    isContinoue = false;
                }
            }
            if ( ! isContinoue)
            {
                index = i-( pattern.Length-1) ;
                break;
            }
        }
    }
    return index;
}
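[Comments]:
[Answer 26]: Here is my own approach to the topic. I used pointers to ensure that it is faster on larger arrays. This function returns the first occurrence of the sequence (which is what I needed in my own case).
I am sure you can modify it a little in order to return a list with all occurrences.
What I do is fairly simple. I loop through the source array (haystack) until I find the first byte of the pattern (needle). When the first byte is found, I continue checking separately whether the next bytes match the next bytes of the pattern. If not, I continue searching as normal, from the index (in the haystack) I was on before trying to match the needle.
So here is the code: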
public unsafe int IndexOfPattern(byte[] src, byte[] pattern)
{
    fixed (byte* srcPtr = &src[0])
    fixed (byte* patternPtr = &pattern[0])
    {
        // Stop where the pattern no longer fits, so srcPtr + x + y stays inside src
        for (int x = 0; x <= src.Length - pattern.Length; x++)
        {
            byte currentValue = *(srcPtr + x);

            if (currentValue != *patternPtr) continue;

            bool match = false;

            for (int y = 0; y < pattern.Length; y++)
            {
                byte tempValue = *(srcPtr + x + y);
                if (tempValue != *(patternPtr + y))
                {
                    match = false;
                    break;
                }

                match = true;
            }

            if (match)
                return x;
        }
    }
    return -1;
}
Safe code below:
public int IndexOfPatternSafe(byte[] src, byte[] pattern)
{
    // Stop where the pattern no longer fits, so src[x + y] stays in range
    for (int x = 0; x <= src.Length - pattern.Length; x++)
    {
        byte currentValue = src[x];
        if (currentValue != pattern[0]) continue;

        bool match = false;

        for (int y = 0; y < pattern.Length; y++)
        {
            byte tempValue = src[x + y];
            if (tempValue != pattern[y])
            {
                match = false;
                break;
            }

            match = true;
        }

        if (match)
            return x;
    }
    return -1;
}
[Comments]:
[Answer 27]: I ran into this problem the other day; try this:
public static long FindBinaryPattern(byte[] data, byte[] pattern)
{
    using (MemoryStream stream = new MemoryStream(data))
    {
        return FindBinaryPattern(stream, pattern);
    }
}

public static long FindBinaryPattern(string filename, byte[] pattern)
{
    using (FileStream stream = new FileStream(filename, FileMode.Open))
    {
        return FindBinaryPattern(stream, pattern);
    }
}

public static long FindBinaryPattern(Stream stream, byte[] pattern)
{
    byte[] buffer = new byte[1024 * 1024];
    int patternIndex = 0;
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int bufferIndex = 0; bufferIndex < read; ++bufferIndex)
        {
            if (buffer[bufferIndex] == pattern[patternIndex])
            {
                ++patternIndex;
                if (patternIndex == pattern.Length)
                    return stream.Position - (read - bufferIndex) - pattern.Length + 1;
            }
            else
            {
                patternIndex = 0;
            }
        }
    }
    return -1;
}
It does nothing clever; it keeps things simple.
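One caveat with the stream version: on a mismatch it resets patternIndex to 0 and moves on without re-examining the current byte, so a pattern such as 1,1,2 is missed in 1,1,1,2. The following hedged sketch keeps the same streaming shape but uses a Knuth-Morris-Pratt failure table to fall back correctly, including across buffer boundaries; the names StreamSearch and FindFirst are illustrative.

```csharp
using System;
using System.IO;

static class StreamSearch
{
    // Returns the offset (counted from where reading started) of the first
    // occurrence of pattern in the stream, or -1 if there is none.
    public static long FindFirst(Stream stream, byte[] pattern)
    {
        if (stream == null || pattern == null || pattern.Length == 0)
            return -1;

        // KMP failure table for the pattern.
        var failure = new int[pattern.Length];
        for (int k = 1, m = 0; k < pattern.Length; k++)
        {
            while (m > 0 && pattern[k] != pattern[m])
                m = failure[m - 1];
            if (pattern[k] == pattern[m])
                m++;
            failure[k] = m;
        }

        var buffer = new byte[64 * 1024];
        long consumed = 0; // bytes read before the current buffer
        int matched = 0;   // pattern bytes matched so far, carried across buffers
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                // Fall back instead of restarting from zero on a mismatch.
                while (matched > 0 && buffer[i] != pattern[matched])
                    matched = failure[matched - 1];
                if (buffer[i] == pattern[matched])
                    matched++;
                if (matched == pattern.Length)
                    return consumed + i - pattern.Length + 1;
            }
            consumed += read;
        }
        return -1;
    }
}
```

Because the match state is a single integer, nothing needs to be copied between buffers, and the offset is computed from a running byte count rather than from stream.Position, so it also works on non-seekable streams.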
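[Comments]:
[Answer 28]: If you are using .NET Core 2.1 or later (or a .NET Standard 2.1 or later platform), you can use the MemoryExtensions.IndexOf extension method with the new Span<T> type:
int matchIndex = toBeSearched.AsSpan().IndexOf(pattern);
To find all occurrences, you could use something like: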
public static IEnumerable<int> IndexesOf(this byte[] haystack, byte[] needle,
    int startIndex = 0, bool includeOverlapping = false)
{
    int matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
    while (matchIndex >= 0)
    {
        yield return startIndex + matchIndex;
        startIndex += matchIndex + (includeOverlapping ? 1 : needle.Length);
        matchIndex = haystack.AsSpan(startIndex).IndexOf(needle);
    }
}
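Unfortunately, the implementation in .NET Core 2.1 - 3.0 uses an iterated "optimized single-byte search on the first byte, then check the remainder" approach rather than a fast string search algorithm, but that could change in a future release. (See dotnet/runtime#60866.)
[Comments]:
[Answer 29]: I use a simple generic method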
void Main()
{
    Console.WriteLine(new[]{255,1,3,4,8,99,92,9,0,5,128}.Position(new[]{9,0}));
    Console.WriteLine("Philipp".ToArray().Position("il".ToArray()));
    Console.WriteLine(new[] { "Mo", "Di", "Mi", "Do", "Fr", "Sa", "So","Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"}.Position(new[] { "Fr", "Sa" }, 7));
}

static class Extensions
{
    public static int Position<T>(this T[] source, T[] pattern, int start = 0)
    {
        var matchLenght = 0;
        foreach (var indexSource in Enumerable.Range(start, source.Length - pattern.Length))
            foreach (var indexPattern in Enumerable.Range(0, pattern.Length))
                if (source[indexSource + indexPattern].Equals(pattern[indexPattern]))
                    if (++matchLenght == pattern.Length)
                        return indexSource;
        return -1;
    }
}
Output:
7
2
11
[Comments]: