How can I search for a byte in a byte[] faster?

Posted 2023-02-16


【Title】: How can I search for a byte in a byte[] faster? 【Posted】: 2019-10-04 15:36:20 【Question】:

I am doing a simple line count on an InputStream (counting the number of newline bytes, value 10):

for (int i = 0; i < readBytes ; i++) 
    if ( b[ i + off ] == 10 )                      // New Line (10)
        rowCount++;

Can I do this faster, without iterating one byte at a time? Perhaps I am looking for some class that can use CPU-specific instructions (SIMD/SSE).

The full code:

@Override
public int read(byte[] b, int off, int len) throws IOException {

    int readBytes = in.read(b, off, len);

    for (int i = 0; i < readBytes; i++) {
        hadBytes = true;                                // at least once we read something
        lastByteIsNewLine = false;
        if (b[i + off] == 10) {                         // New Line (10)
            rowCount++;
            lastByteIsNewLine = (i == readBytes - 1);   // last byte in buffer was the newline
        }
    }

    if (hadBytes && readBytes == -1 && !lastByteIsNewLine) {   // file is not empty + EOF + last byte was not NewLine
        rowCount++;
    }

    return readBytes;
}
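
On the question of avoiding one-byte-at-a-time iteration: the following is a minimal sketch of a "SIMD within a register" (SWAR) approach in plain Java, which is a different technique from the SSE-style instructions the question asks about and is not taken from the thread. It reads eight bytes at a time as a long and counts the bytes equal to 10 with bit tricks. The class and method names are hypothetical, and whether it beats the simple loop on a given JVM would have to be measured.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of the original code.
final class SwarNewlineCount {
    private static final long LF_MASK = 0x0A0A0A0A0A0A0A0AL; // byte 10 repeated eight times
    private static final long LOW7 = 0x7F7F7F7F7F7F7F7FL;    // low seven bits of every byte

    static int countNewlines(byte[] b, int off, int len) {
        ByteBuffer buf = ByteBuffer.wrap(b);
        int count = 0;
        int i = off;
        int end = off + len;
        // Process eight bytes per iteration; byte order is irrelevant for counting.
        for (; i + 8 <= end; i += 8) {
            long x = buf.getLong(i) ^ LF_MASK; // bytes equal to 10 become 0x00
            long t = (x & LOW7) + LOW7;        // bit 7 set where the low seven bits are non-zero
            t = ~(t | x | LOW7);               // bit 7 set only where the whole byte is zero
            count += Long.bitCount(t);
        }
        // Handle the remaining (fewer than eight) bytes one at a time.
        for (; i < end; i++) {
            if (b[i] == 10) count++;
        }
        return count;
    }
}
```

Inside the read override above, the counting loop could then be replaced by a call such as rowCount += SwarNewlineCount.countNewlines(b, off, readBytes) whenever readBytes > 0.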

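【Comments】:

What is readBytes? Please post its contents.
readBytes == size of the byte[]
Honestly, no. There is no faster way to do this. It is already almost the same as assembly language.
@VGR Are you saying the JIT is able to produce a vectorized version that compares 16+ bytes at a time, like you would in assembly?
@thatotherguy: A good JIT may well be able to vectorize this loop.

【Answer 1】: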

On my system, simply moving the lastByteIsNewLine and hadBytes parts out of the loop gives an improvement of about 10%*:

  public int read(byte[] b, int off, int len) throws IOException {

    int readBytes = in.read(b, off, len);

    for (int i = 0; i < readBytes; i++) {
      if (b[i + off] == 10) {                          // New Line (10)
        rowCount++;
      }
    }
    if (readBytes > 0) {
      // update the flags only when data was read, so the EOF call (-1) keeps the previous state
      hadBytes = true;
      lastByteIsNewLine = b[readBytes + off - 1] == 10;
    }

    if (hadBytes && readBytes == -1 && !lastByteIsNewLine) {  // file is not empty + EOF + last byte was not NewLine
      rowCount++;
    }

    return readBytes;
  }

* 6000 ms vs. 6700 ms for 1000 iterations over a 10 MB buffer read from a ByteArrayInputStream filled with arbitrary text.

【Comments】:

Once set to true, hadBytes should never be set back to false (on subsequent calls). Use hadBytes |= readBytes > 0.

【Answer 2】:

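I started from that other guy's improvement and hoisted the array index computation and the field accesses out of the for loop.

According to my JMH benchmark, this saves another 25%: "that other guy's" implementation clocks in at 3.6 ms/op, and this version at 2.7 ms/op. (Here, one operation is reading a ~10 MB ByteArrayInputStream containing about 5000 lines of random length.)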

public int read(byte[] buffer, int off, int len) throws IOException {
  int n = in.read(buffer, off, len);
  notEmpty |= n > 0;
  int count = notEmpty && n < 0 && !trailingLineFeed ? 1 : 0;
  trailingLineFeed = (n > 0) && buffer[n + off - 1] == '\n';
  for (int max = off + n, idx = off; idx < max;) {
    if (buffer[idx++] == '\n') ++count;
  }
  rowCount += count;
  return n;
}

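Something that really did affect performance: indexing backward over the array.

Something that did not matter: comparing the value with the more readable '\n' instead of 10.

Surprisingly (to me, anyway), using only one of these tricks on its own did not seem to improve performance. They only made a difference when used together.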
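
For readers who want to run the method above, here is a minimal sketch of how it might sit inside a complete stream wrapper. Only the body of read(byte[], int, int) follows the answer. The class name LineCountingInputStream, the use of FilterInputStream, the long type of rowCount, the getRowCount() accessor and the single-byte read() override are assumptions added for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper class; the field names follow the answer above.
class LineCountingInputStream extends FilterInputStream {
    private long rowCount;
    private boolean notEmpty;
    private boolean trailingLineFeed;

    LineCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        // Route single-byte reads through the array version so they are counted too.
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n < 0 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buffer, int off, int len) throws IOException {
        int n = in.read(buffer, off, len);
        notEmpty |= n > 0;
        // At EOF, count the last line if it was not terminated by a line feed.
        int count = notEmpty && n < 0 && !trailingLineFeed ? 1 : 0;
        trailingLineFeed = (n > 0) && buffer[n + off - 1] == '\n';
        for (int max = off + n, idx = off; idx < max;) {
            if (buffer[idx++] == '\n') ++count;
        }
        rowCount += count;
        return n;
    }

    long getRowCount() {
        return rowCount;
    }
}
```

A caller would wrap the source stream, read until the first -1, and then query getRowCount(). Reading again after EOF adds another row in this sketch, as it would with the original code.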

【Comments】:

The second line is actually slightly wrong: with trailingLineFeed = placed before int count =, the last call (n = -1) clobbers trailingLineFeed and makes it false. Thanks.
@DenisZhuravlev I think I fixed it.

【Answer 3】:

Once you convert readBytes to a String, you can search it easily:

String stringBytes = new String(readBytes);

To get the number of occurrences:

int rowCount = StringUtils.countMatches(stringBytes, "\n");

To find out only whether readBytes contains a \n:

boolean newLineFound = stringBytes.contains("\n");
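
Note that in the question readBytes is an int (the number of bytes read), so the snippets above presumably mean the byte array itself. The following is a minimal sketch of the same idea using only the standard library, applied to the part of the buffer that was actually filled. The class and method names are hypothetical, and ISO-8859-1 is an assumed charset, chosen because it maps every byte to exactly one char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the String-based approach.
final class StringNewlineCount {
    static int countNewlines(byte[] b, int off, int len) {
        // Decode only the filled region of the buffer; byte 10 always becomes '\n'.
        String s = new String(b, off, len, StandardCharsets.ISO_8859_1);
        int count = 0;
        // Jump from one '\n' to the next with indexOf.
        for (int i = s.indexOf('\n'); i >= 0; i = s.indexOf('\n', i + 1)) {
            count++;
        }
        return count;
    }
}
```

This copies the buffer into a new String on every call, so it does extra work compared with scanning the byte array directly.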

【Comments】:

Just tried this. It is 3 times slower. String stringBytes = new String(b); int i = -1; while ( i ...
Still 3 times slower: String stringBytes = new String(b); rowCount += StringUtils.countOccurrencesOf(stringBytes, "\n");
I looked at StringUtils.countOccurrencesOf. It uses String.indexOf, which performs the same byte-by-byte iteration.

【Answer 4】:

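Well, rather than trying to speed up that particular part (which I don't think you can), you could try a different approach. Here is a class you can use to keep track of the line count while reading from an InputStream.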

public class RowCounter {
    private static final int LF = 10;
    private int rowCount = 0;
    private int lastByte = 0;

    public int getRowCount() {
        return rowCount;
    }

    // A line feed is counted only once the byte after it arrives.
    public void addByte(int b) {
        if (lastByte == LF) {
            rowCount++;
        }
        lastByte = b;
    }

    public void addBytes(byte[] b, int offset, int length) {
        if (length <= 0) return;
        if (lastByte == LF) rowCount++;

        int lastIndex = offset + length - 1;
        for (int i = offset; i < lastIndex; i++) {
            if (b[i] == LF) rowCount++;
        }
        lastByte = b[lastIndex];   // examined on the next call
    }
}

Then, when reading an InputStream, you can use it like this.

InputStream is = ...;
byte[] b = new byte[...];

int bytesRead;
RowCounter counter = new RowCounter();
while ((bytesRead = is.read(b)) != -1) {
    counter.addBytes(b, 0, bytesRead);
}
int rowCount = counter.getRowCount();

Or you can easily adapt it to whatever situation you need.
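
As one illustration of such an adaptation, here is a minimal sketch of a variant that follows the counting convention from the question: every line feed counts as a row, and a non-empty input whose last byte is not a line feed gets one extra row. The names LineTally, getLineCount and countLines are hypothetical, as is the 8192-byte buffer size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical variant of the counter above.
final class LineTally {
    private static final byte LF = 10;
    private long lineFeeds;
    private boolean sawBytes;
    private boolean endsWithLineFeed;

    void addBytes(byte[] b, int offset, int length) {
        if (length <= 0) return;
        sawBytes = true;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            if (b[i] == LF) lineFeeds++;
        }
        // Remember whether the data seen so far ends with a line feed.
        endsWithLineFeed = b[end - 1] == LF;
    }

    long getLineCount() {
        // An unterminated final line still counts as a row.
        return lineFeeds + (sawBytes && !endsWithLineFeed ? 1 : 0);
    }

    static long countLines(InputStream is) throws IOException {
        LineTally tally = new LineTally();
        byte[] buf = new byte[8192];
        int n;
        while ((n = is.read(buf)) != -1) {
            tally.addBytes(buf, 0, n);
        }
        return tally.getLineCount();
    }
}
```

For example, LineTally.countLines(new ByteArrayInputStream(data)) can be called on any in-memory byte array data to try it out.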
