StreamReader and seeking

Posted 2023-03-31

【Title】c# - StreamReader and seeking 【Posted】2011-07-21 05:29:30 【Question】:

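Can you use a StreamReader to read a plain text file, save the current position partway through, close the StreamReader, and then later open a new StreamReader and resume reading from that position?

If not, what else can I use to achieve the same thing without locking the file?

Something like this: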

        var fs = File.Open(@"C:\testfile.txt", FileMode.Open, FileAccess.Read);
        var sr = new StreamReader(fs);
        Debug.WriteLine(sr.ReadLine()); // Prints: firstline
        var pos = fs.Position;
        while (!sr.EndOfStream)
        {
            Debug.WriteLine(sr.ReadLine());
        }
        fs.Seek(pos, SeekOrigin.Begin);
        Debug.WriteLine(sr.ReadLine()); // Prints nothing; I expected it to print SecondLine.

@lasseespeholt

Here is the code I tried:

        long position = -1;
        StreamReaderSE sr = new StreamReaderSE(@"c:\testfile.txt");
        Debug.WriteLine(sr.ReadLine());
        position = sr.BytesRead;
        Debug.WriteLine(sr.ReadLine());
        Debug.WriteLine(sr.ReadLine());
        Debug.WriteLine(sr.ReadLine());
        Debug.WriteLine(sr.ReadLine());
        Debug.WriteLine("Wait");
        sr.BaseStream.Seek(position, SeekOrigin.Begin);
        Debug.WriteLine(sr.ReadLine());

【Discussion】:

【Answer 1】:

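I realize this is really late, but I stumbled onto this incredible flaw in StreamReader myself: you cannot reliably seek when using a StreamReader. Personally, my specific need is the ability to read characters and then "back up" if a certain condition is met; it is a side effect of one of the file formats I am parsing.

Using ReadLine() is not an option, because it is only useful for really trivial parsing jobs. I have to support configurable record/line delimiter sequences as well as escaped delimiter sequences. Also, I do not want to implement my own buffer just to support "backing up" and escape sequences; that should be the StreamReader's job.

This method calculates the actual position in the underlying stream of bytes on demand. It works for UTF8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and any single-byte encoding (e.g. code pages 1252, 437, 28591, etc.), regardless of the presence of a preamble/BOM. This version will not work for UTF-7, Shift-JIS, or other variable-byte encodings.

When I need to seek to an arbitrary position in the underlying stream, I set BaseStream.Position directly and then call DiscardBufferedData() so that the StreamReader resyncs on the next Read()/Peek() call.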
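The seek-then-resync step described above can be sketched roughly as follows. This is only an illustration under simple assumptions: the file is pure ASCII, the byte offset of the second line is known in advance, and the names `SeekResyncDemo` and `RereadFrom` are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekResyncDemo
{
    // Consumes the whole file, then jumps back to a known byte offset
    // and returns the line that starts there.
    public static string RereadFrom(string path, long byteOffset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(fs, Encoding.ASCII))
        {
            reader.ReadToEnd();                       // the reader is now at end of file
            reader.BaseStream.Position = byteOffset;  // move the underlying stream
            reader.DiscardBufferedData();             // drop stale buffered data so the next read hits the stream
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "firstline\nsecondline\n", Encoding.ASCII);
        // "firstline\n" is 10 ASCII bytes, so the second line starts at offset 10.
        Console.WriteLine(RereadFrom(path, 10)); // prints "secondline"
        File.Delete(path);
    }
}
```

Without the DiscardBufferedData() call, the reader would keep serving from its stale internal buffer, which is the symptom shown in the question.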

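And a friendly reminder: do not set BaseStream.Position arbitrarily. If you bisect a character, you will invalidate the next Read(), and for UTF-16/-32 you will also invalidate the result of this method.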

public static long GetActualPosition(StreamReader reader)
{
    System.Reflection.BindingFlags flags = System.Reflection.BindingFlags.DeclaredOnly | System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.GetField;

    // The current buffer of decoded characters
    char[] charBuffer = (char[])reader.GetType().InvokeMember("charBuffer", flags, null, reader, null);

    // The index of the next char to be read from charBuffer
    int charPos = (int)reader.GetType().InvokeMember("charPos", flags, null, reader, null);

    // The number of decoded chars presently used in charBuffer
    int charLen = (int)reader.GetType().InvokeMember("charLen", flags, null, reader, null);

    // The current buffer of read bytes (byteBuffer.Length = 1024; this is critical).
    byte[] byteBuffer = (byte[])reader.GetType().InvokeMember("byteBuffer", flags, null, reader, null);

    // The number of bytes read while advancing reader.BaseStream.Position to (re)fill charBuffer
    int byteLen = (int)reader.GetType().InvokeMember("byteLen", flags, null, reader, null);

    // The number of bytes the remaining chars use in the original encoding.
    int numBytesLeft = reader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);

    // For variable-byte encodings, deal with partial chars at the end of the buffer
    int numFragments = 0;
    if (byteLen > 0 && !reader.CurrentEncoding.IsSingleByte)
    {
        if (reader.CurrentEncoding.CodePage == 65001) // UTF-8
        {
            byte byteCountMask = 0;
            while ((byteBuffer[byteLen - numFragments - 1] >> 6) == 2) // if the byte is "10xx xxxx", it's a continuation-byte
                byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
            if ((byteBuffer[byteLen - numFragments - 1] >> 6) == 3) // if the byte is "11xx xxxx", it starts a multi-byte char.
                byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
            // see if we found as many bytes as the leading-byte says to expect
            if (numFragments > 1 && ((byteBuffer[byteLen - numFragments] >> 7 - numFragments) == byteCountMask))
                numFragments = 0; // no partial-char in the byte-buffer to account for
        }
        else if (reader.CurrentEncoding.CodePage == 1200) // UTF-16LE
        {
            // Only 0xD8..0xDB is a high surrogate; the original ">= 0xd8" test
            // also matched low surrogates and characters from U+E000 upward.
            byte highByte = byteBuffer[byteLen - 1];
            if (highByte >= 0xd8 && highByte <= 0xdb) // high-surrogate
                numFragments = 2; // account for the partial character
        }
        else if (reader.CurrentEncoding.CodePage == 1201) // UTF-16BE
        {
            byte highByte = byteBuffer[byteLen - 2];
            if (highByte >= 0xd8 && highByte <= 0xdb) // high-surrogate
                numFragments = 2; // account for the partial character
        }
    }
    return reader.BaseStream.Position - numBytesLeft - numFragments;
}

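Of course, this uses reflection to get at private variables, so there is risk involved. However, this method works with .NET 2.0, 3.0, 3.5, 4.0, 4.0.3, 4.5, 4.5.1, 4.5.2, 4.6, and 4.6.1. Beyond that risk, the only other critical assumption is that the underlying byte buffer is a byte[1024]; if Microsoft changes it the wrong way, the method breaks for UTF-16/-32.

This was tested against a UTF-8 file filled with Ažテ𣘺 (10 bytes: 0x41 C5 BE E3 83 86 F0 A3 98 BA) and a UTF-16 file filled with A𐐷 (6 bytes: 0x41 00 01 D8 37 DC). The point was to force characters to be split along the byte[1024] boundaries in all the different ways they could be.

UPDATE (2013-07-03): I fixed the method, which originally used the broken code from another answer. This version has been tested against data containing characters that require surrogate pairs. The data was put into 3 files, each with a different encoding: one UTF-8, one UTF-16LE, and one UTF-16BE.

UPDATE (2016-02): The only correct way to handle bisected characters is to interpret the underlying bytes directly. UTF-8 is handled properly, and UTF-16/-32 work (given the length of byteBuffer).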
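To make the idea of "interpreting the underlying bytes directly" more concrete, here is a small standalone sketch of how one might count the bytes of an incomplete UTF-8 sequence at the end of a buffer. It is not the answer's code, only a simplified illustration of the same idea; the names `Utf8Tail` and `TrailingIncompleteBytes` are hypothetical.

```csharp
using System;

public static class Utf8Tail
{
    // Returns how many bytes at the end of buffer[0..length) belong to a
    // UTF-8 sequence that is cut off, or 0 if the buffer ends on a complete character.
    public static int TrailingIncompleteBytes(byte[] buffer, int length)
    {
        int i = length - 1;
        int continuation = 0;
        // Walk back over at most 3 continuation bytes (10xxxxxx).
        while (i >= 0 && continuation < 3 && (buffer[i] & 0xC0) == 0x80)
        {
            i--;
            continuation++;
        }
        if (i < 0) return 0; // empty, or nothing but continuation bytes

        byte lead = buffer[i];
        int expected;
        if ((lead & 0x80) == 0x00) expected = 1;      // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) expected = 2; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) expected = 3; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) expected = 4; // 11110xxx
        else return 0;                                // malformed input: not treated as partial

        int present = continuation + 1;
        return present < expected ? present : 0;
    }

    public static void Main()
    {
        // 0x41 C5 BE is "A" followed by a complete two-byte character.
        Console.WriteLine(TrailingIncompleteBytes(new byte[] { 0x41, 0xC5, 0xBE }, 3)); // 0
        // 0xF0 A3 98 is a four-byte sequence missing its last byte.
        Console.WriteLine(TrailingIncompleteBytes(new byte[] { 0xF0, 0xA3, 0x98 }, 3)); // 3
    }
}
```

A position calculation would subtract this count from BaseStream.Position, since those bytes have been read from the stream but not yet decoded into characters.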

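【Discussion】:

I love you for this. I had been hitting all kinds of weird problems all day trying to rewind the stream reader's position, and this solved it!

I'll join in the love too. This is a lifesaver.

Unfortunately, I have run into cases where this implementation does not work (UTF-8 with a mix of English and Japanese characters). I had to fall back to tracking my own position using CurrentEncoding.GetByteCount() on every ReadLine().

@MattHouser Can you elaborate? I tested with mixed characters like that (including surrogates).

I forgot: thanks to Matt Houser for helping me track down the problem; his sample data and time were very helpful!

【Answer 2】: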

Yes, you can; see this:

var sr = new StreamReader("test.txt");
sr.BaseStream.Seek(2, SeekOrigin.Begin); // Check sr.BaseStream.CanSeek first

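UPDATE: Be aware that you cannot necessarily use sr.BaseStream.Position for anything useful, because StreamReader uses buffers, so it will not reflect what you have actually read. I suspect you will have trouble finding the true position, because you cannot just count characters (different encodings mean different character lengths). I think the best approach is to work with the FileStream itself.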
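To see the buffering effect described above, a small experiment along these lines may help. It is a sketch under simple assumptions (a tiny ASCII file, names `BufferAheadDemo` and `PositionAfterFirstLine` invented for the example). Opening the FileStream with FileShare.ReadWrite is one way to avoid locking other processes out of the file.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferAheadDemo
{
    // Reads one line and reports where the underlying stream ended up.
    public static long PositionAfterFirstLine(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(fs, Encoding.ASCII))
        {
            reader.ReadLine();  // logically consumes "1234\n" (5 bytes)
            return fs.Position; // but the reader has already buffered ahead
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "1234\n56789", Encoding.ASCII); // 10 bytes in total
        // The whole file fits into the reader's buffer, so the position is 10, not 5.
        Console.WriteLine(PositionAfterFirstLine(path));
        File.Delete(path);
    }
}
```

For larger files the position after a read is typically some multiple of the reader's buffer size, which is why it cannot be saved and reused as a resume point.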

UPDATE: Use the TGREER.myStreamReader from here: http://www.daniweb.com/software-development/csharp/threads/35078 This class adds BytesRead etc. (it works with ReadLine(), but apparently not with the other read methods), and then you can do this:

File.WriteAllText("test.txt", "1234\n56789");

long position = -1;

using (var sr = new myStreamReader("test.txt"))
{
    Console.WriteLine(sr.ReadLine());

    position = sr.BytesRead;
}

Console.WriteLine("Wait");

using (var sr = new myStreamReader("test.txt"))
{
    sr.BaseStream.Seek(position, SeekOrigin.Begin);
    Console.WriteLine(sr.ReadToEnd());
}

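【Discussion】:

Looks OK, but will it lock the file?

You can choose :) See the accepted answer here: ***.com/questions/1606349/…

That will not help me save the position; please check the update in my question.

@Stacker Can I see your code? It works perfectly here and outputs "1234 wait 56789".

@marisks: the solution proposed on daniweb does not handle multi-byte UTF8 characters, because it counts the number of characters read to update the position.

【Answer 3】: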

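If you just want to search for a start position within a text stream, I added this extension to StreamReader so that I could determine where in the stream an edit should occur. Granted, the logic increments by characters, but for my purposes it works well for getting the position within a text/ASCII-based file based on a string pattern. You can then use that position as the starting point for reading, in order to write a new file that excludes the data before the start point.

The position returned can be passed to Seek to start reading from that position in a text-based stream. It works; I have tested it. However, there may be issues when the matching algorithm encounters non-ASCII Unicode characters. This was based on American English and the associated character page.

Basics: it scans through a text stream character by character, looking only forward through the stream for the sequential string pattern (the one matching the string parameter). Once the characters read so far no longer match the pattern, it starts over (from the current position) and tries to match again, character by character. If no match can be found in the stream, it eventually quits. If a match is found, it returns the current "character" position within the stream, not StreamReader.BaseStream.Position, since that position is ahead because of the buffering StreamReader does.

As indicated in the comments, this method WILL affect the position of the StreamReader, which is set back to the beginning (0) at the end of the method. StreamReader.BaseStream.Seek should then be used to move to the position returned by this extension.

Note: the position returned by this extension can also be used with BinaryReader.BaseStream.Seek as a start position when working with text files. I actually used this logic for that purpose: rewriting a PostScript file back to disk after discarding the PJL header information, so that the file became a "proper" PostScript file that GhostScript could consume. :)

The string to search for within the PostScript (after the PJL header) is "%!PS-", followed by "Adobe" and the version.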

using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamReaderExtension
{
    /// <summary>
    /// Searches from the beginning of the stream for the indicated
    /// <paramref name="pattern"/>. Once found, returns the position within the stream
    /// that the pattern begins at.
    /// </summary>
    /// <param name="pattern">The <c>string</c> pattern to search for in the stream.</param>
    /// <returns>If <paramref name="pattern"/> is found in the stream, then the start position
    /// within the stream of the pattern; otherwise, -1.</returns>
    /// <remarks>Please note: this method will change the current stream position of this instance of
    /// <see cref="System.IO.StreamReader"/>. When it completes, the position of the reader will
    /// be set to 0.</remarks>
    public static long FindSeekPosition(this StreamReader reader, string pattern)
    {
        if (!string.IsNullOrEmpty(pattern) && reader.BaseStream.CanSeek)
        {
            try
            {
                reader.BaseStream.Position = 0;
                reader.DiscardBufferedData();
                StringBuilder buff = new StringBuilder();
                long start = 0;
                long charCount = 0;
                List<char> matches = new List<char>(pattern.ToCharArray());
                bool startFound = false;

                while (!reader.EndOfStream)
                {
                    char chr = (char)reader.Read();

                    // Remember where a candidate match begins (a character index, not a byte offset).
                    if (chr == matches[0] && !startFound)
                    {
                        startFound = true;
                        start = charCount;
                    }

                    if (startFound && matches.Contains(chr))
                    {
                        buff.Append(chr);

                        if (buff.Length == pattern.Length
                            && buff.ToString() == pattern)
                        {
                            return start;
                        }

                        bool reset = false;

                        if (buff.Length > pattern.Length)
                        {
                            reset = true;
                        }
                        else
                        {
                            string subStr = pattern.Substring(0, buff.Length);

                            if (buff.ToString() != subStr)
                            {
                                reset = true;
                            }
                        }

                        if (reset)
                        {
                            buff.Length = 0;
                            startFound = false;
                            start = 0;
                        }
                    }

                    charCount++;
                }
            }
            finally
            {
                // Always rewind the reader, whether or not a match was found.
                reader.BaseStream.Position = 0;
                reader.DiscardBufferedData();
            }
        }

        return -1;
    }
}

【Discussion】:

【Answer 4】:

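From MSDN:

StreamReader is designed for character input in a particular encoding, whereas the Stream class is designed for byte input and output. Use StreamReader for reading lines of information from a standard text file.

In most examples involving StreamReader, you will see lines being read one at a time with ReadLine(). The Seek method comes from the Stream class, which is mainly used for reading or handling byte data.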

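【Discussion】:

Downvoted because the OP is asking about seeking while using a StreamReader. This answer does not address seeking and just regurgitates the MSDN definition, which is not useful.

【Answer 5】: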

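FileStream.Position (or equivalently, StreamReader.BaseStream.Position) will usually be ahead of the TextReader position, possibly far ahead, because of the underlying buffering.

If you can determine how newlines are handled in your text files, you can add up the number of bytes read based on line lengths and end-of-line characters.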

File.WriteAllText("test.txt", "1234" + System.Environment.NewLine + "56789");

long position = -1;
long bytesRead = 0;
// Character count of the newline; equals its byte count only in single-byte encodings.
int newLineBytes = System.Environment.NewLine.Length;

using (var sr = new StreamReader("test.txt"))
{
    string line = sr.ReadLine();
    // line.Length counts characters, so this is only correct for single-byte text.
    bytesRead += line.Length + newLineBytes;

    Console.WriteLine(line);

    position = bytesRead;
}

Console.WriteLine("Wait");

using (var sr = new StreamReader("test.txt"))
{
    sr.BaseStream.Seek(position, SeekOrigin.Begin);
    Console.WriteLine(sr.ReadToEnd());
}

For more complex text file encodings you might need to get fancier than this, but it worked for me.
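For multi-byte encodings, one possible refinement is to count bytes with Encoding.GetByteCount instead of String.Length. The sketch below is only an illustration and assumes that the file has no BOM, that every line ends with exactly the newline string you pass in, and that the encoding is known; the names `LineOffsetTracker`, `OffsetAfterFirstLine`, and `ReadFrom` are made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineOffsetTracker
{
    // Returns the byte offset at which the second line starts.
    // Assumes no BOM and that the first line ends with exactly `newLine`.
    public static long OffsetAfterFirstLine(string path, Encoding encoding, string newLine)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            string line = reader.ReadLine();
            if (line == null) return 0; // empty file
            return encoding.GetByteCount(line) + encoding.GetByteCount(newLine);
        }
    }

    // Opens the file again, seeks to the saved offset, and reads the rest.
    public static string ReadFrom(string path, Encoding encoding, long offset)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }

    public static void Main()
    {
        var utf8NoBom = new UTF8Encoding(false);
        string path = Path.GetTempFileName();
        // U+017E takes 2 bytes and U+30C6 takes 3 bytes in UTF-8.
        File.WriteAllText(path, "\u017E\u30C6\n56789", utf8NoBom);

        long offset = OffsetAfterFirstLine(path, utf8NoBom, "\n");
        Console.WriteLine(offset);                            // 6 (line.Length + 1 would give 3)
        Console.WriteLine(ReadFrom(path, utf8NoBom, offset)); // 56789
        File.Delete(path);
    }
}
```

Because ReadLine() strips the line terminator, this only works when the terminator is known in advance; a file that mixes "\n" and "\r\n" would need a different approach.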

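【Discussion】:

Is there a particular reason for initializing position with -1?

String.Length returns the number of characters, not the number of bytes. Multi-byte characters are therefore not accounted for, which makes this code highly specific at best and dangerous at worst. See the MSDN documentation for this method.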
