How to detect the character encoding of a text file?

【Title】: How to detect the character encoding of a text file? 【Posted】: 2011-05-30 01:02:26 【Question】:

I am trying to detect which character encoding is used in my file.

I tried this code to get the standard encoding:

public static Encoding GetFileEncoding(string srcFile)
{
    // *** Use Default of Encoding.Default (Ansi CodePage)
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    FileStream file = new FileStream(srcFile, FileMode.Open);
    file.Read(buffer, 0, 5);
    file.Close();

    if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
    else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
    else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
    else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);
    else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);

    return enc;
}

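The first five bytes of my file are 60, 118, 56, 46 and 49.

Is there a chart that shows which encoding matches those first five bytes?

【Comments】:

The byte order mark should not be used to detect encodings. There are cases where it is ambiguous which encoding is used: UTF-16 LE and UTF-32 LE both start with the same two bytes. The BOM should only be used to detect the byte order (hence the name). Also, strictly speaking UTF-8 shouldn't even have a byte order mark, and adding one can interfere with software that does not expect it.

@Mark Byers, so is there a way to find out which encoding is used in my file?

@Mark Byers: UTF-32 LE starts with the same 2 bytes as UTF-16 LE. However, they are followed by the bytes 00 00, which is (I think very) unlikely in UTF-16 LE. Also, in theory the BOM should only indicate what you say, but in practice it also serves as a signature showing which encoding is used. See: unicode.org/faq/utf_bom.html#bom4

Mark Byers: Your comment is completely wrong. The BOM is a bulletproof way to detect the encoding. UTF16 BE and UTF32 BE are not ambiguous. You should study the topic before writing wrong comments. If software cannot handle the UTF8 BOM, then that software is either from the 1980s or badly programmed. Today every piece of software should handle and recognize the BOM.

Elmue has obviously never used batch filtering, concatenation and pipe redirection on plain text file streams. Handling/supporting the BOM is unrealistic in those scenarios.

【Answer 1】: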

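You can't depend on the file having a BOM. UTF-8 doesn't require it, and non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.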

UTF-32

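The BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, so UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that is a multiple of 4 and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible because 00 bytes are rare in byte-oriented encodings.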
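As a rough illustration of the pattern check described above, here is a minimal sketch. The class and method names (Utf32Sniffer, GuessUtf32) and the string labels it returns are made up for this example and are not part of the answer.

```csharp
using System;

public static class Utf32Sniffer
{
    // Returns "UTF-32BE", "UTF-32LE", or null when the data fits neither pattern.
    public static string GuessUtf32(byte[] data)
    {
        // UTF-32 data always has a length that is a multiple of 4.
        if (data == null || data.Length == 0 || data.Length % 4 != 0)
            return null;

        bool couldBeBE = true, couldBeLE = true;
        for (int i = 0; i < data.Length && (couldBeBE || couldBeLE); i += 4)
        {
            // Big-endian unit: 00 {00-10} xx xx
            if (data[i] != 0x00 || data[i + 1] > 0x10) couldBeBE = false;
            // Little-endian unit: xx xx {00-10} 00
            if (data[i + 3] != 0x00 || data[i + 2] > 0x10) couldBeLE = false;
        }

        if (couldBeBE && !couldBeLE) return "UTF-32BE";
        if (couldBeLE && !couldBeBE) return "UTF-32LE";
        return null; // neither pattern, or ambiguous (for example all-zero data)
    }
}
```

For example, GuessUtf32(new byte[] { 0x41, 0, 0, 0 }) reports "UTF-32LE", while four plain ASCII bytes such as "abcd" fit neither pattern and yield null.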

US-ASCII

There is no BOM, but you don't need one. ASCII can be easily identified by the absence of bytes in the 80-FF range.
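That check is a one-liner. A small sketch, with a helper name (AsciiCheck.IsAscii) chosen only for illustration:

```csharp
using System;
using System.Linq;

public static class AsciiCheck
{
    // True when no byte falls in the 80-FF range.
    public static bool IsAscii(byte[] data)
    {
        return data.All(b => b < 0x80);
    }
}
```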

UTF-8

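The BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.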
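One way to "validate as UTF-8" in .NET is to decode with a UTF8Encoding that is configured to throw on invalid bytes. This is only a sketch of that idea; the wrapper name (Utf8Check.IsValidUtf8) is invented here. Note that if you pass in a truncated sample rather than the whole file, a multi-byte sequence cut off at the end will also be reported as invalid.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder raise an exception on malformed input
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Pure ASCII data also passes this check, so it is best combined with the ASCII test above if you need to tell the two apart.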

UTF-16

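The BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.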
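Putting the BOM values from the sections above together, with the UTF-32 checks placed before the UTF-16 ones as advised, might look like the following sketch. BomSniffer.DetectBom and its string labels are hypothetical names for this example.

```csharp
using System;

public static class BomSniffer
{
    // Returns a label for the BOM found at the start of the data, or null if there is none.
    public static string DetectBom(byte[] d)
    {
        // UTF-32 first: FF FE 00 00 would otherwise be mistaken for the UTF-16LE BOM FF FE.
        if (d.Length >= 4 && d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF) return "UTF-32BE";
        if (d.Length >= 4 && d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00) return "UTF-32LE";
        if (d.Length >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF) return "UTF-8";
        if (d.Length >= 2 && d[0] == 0xFE && d[1] == 0xFF) return "UTF-16BE";
        if (d.Length >= 2 && d[0] == 0xFF && d[1] == 0xFE) return "UTF-16LE";
        return null;
    }
}
```

One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is rare in text files but worth knowing.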

XML

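If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, use that encoding. If absent, assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, look for that declaration rather than trying to guess the encoding.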
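A minimal sketch of the XML check, assuming the file is in an ASCII-compatible encoding (so it does not cover UTF-16 or EBCDIC XML). The names (XmlEncodingSniffer.SniffXmlEncoding) and the 200-byte look-ahead are arbitrary choices for this example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlEncodingSniffer
{
    // Returns the declared encoding name, "UTF-8" when the declaration has no encoding=,
    // or null when the data does not start with "<?xml".
    public static string SniffXmlEncoding(byte[] data)
    {
        byte[] prefix = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C }; // "<?xml"
        if (data.Length < prefix.Length) return null;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return null;

        // Decode a short head as ISO-8859-1, which maps every byte to exactly one char.
        int headLength = Math.Min(data.Length, 200);
        string head = Encoding.GetEncoding("iso-8859-1").GetString(data, 0, headLength);

        // Only look inside the declaration itself.
        int end = head.IndexOf("?>", StringComparison.Ordinal);
        if (end >= 0) head = head.Substring(0, end);

        Match m = Regex.Match(head, "encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");
        return m.Success ? m.Groups[1].Value : "UTF-8";
    }
}
```

The returned name can then be handed to Encoding.GetEncoding, keeping in mind that not every declared name is known to .NET.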

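None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an "ISO-8859-1" declaration to be interpreted as Windows-1252.) Being Windows's default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

【Discussion】:

OK, as I expected. Can you solve the problem of distinguishing UTF-8 from UTF-16? PS: Thanks for the very useful answer. +1

For UTF-16BE text files, if a certain percentage of the even bytes are zero (or the odd bytes for UTF-16LE), then there is a good chance that the encoding is UTF-16. What do you think?

UTF-8 validity can be detected nicely by checking bit patterns; the bit pattern of the first byte tells you exactly how many bytes follow, and the following bytes also have control bits to check. The patterns are all shown here: ianthehenry.com/2015/1/17/decoding-utf-8

@marsze This isn't my answer... and it isn't mentioned because this is about detection, and, as I mentioned, you can't really detect simple one-byte-per-symbol encodings. I did personally post an answer here that (vaguely) identifies it, though.

@marsze: There, I added a section for Latin-1.

【Answer 2】: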

If you want to pursue a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

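It does the BOM detection automatically first, and then tries to differentiate between Unicode encodings without a BOM and some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .Net).

As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, the best approach is to provide some form of interactivity with the user if at all possible, because a 100% detection rate simply isn't possible!

Snippet in case the site is offline: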

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech 
         *      world).
         * 
         *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *      detection library originally developed for Internet Explorer).
         * 
         *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *      aims to differentiate between some of the most common variants of Unicode 
         *      encoding, and a "default" (western / ascii-based) encoding alternative provided
         *      by the caller.
         *      
         *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and 
         *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a 
         *      heuristic - so the more of the file we can sample the better the guess. If you 
         *      are going to read the whole file into memory at some point, then best to pass 
         *      in the whole byte byte array directly. Otherwise, decide how to trade off 
         *      reliability against performance / memory usage.
         *      
         *  - The UTF-8 detection heuristic only works for western text, as it relies on 
         *      the presence of UTF-8 encoded accented and other characters found in the upper 
         *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *  
         *  - For more general detection routines, see existing projects / resources:
         *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *    - CharDet - Mozilla browser's detection routines
         *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *  
         * Copyright Tao Klerks, 2010-2012, tao@klerks.biz
         * Licensed under the modified BSD license:
         * 
Redistribution and use in source and binary forms, with or without modification, are 
permitted provided that the following conditions are met:
 - Redistributions of source code must retain the above copyright notice, this list of 
conditions and the following disclaimer.
 - Redistributions in binary form must reproduce the above copyright notice, this list 
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
 - The name of the author may not be used to endorse or promote products derived from 
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, 
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF SUCH DAMAGE.
         * 
         * CHANGELOG:
         *  - 2012-02-03: 
         *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
         *    - More complete methods
         *      - Optionally return indication of whether BOM was found in "Detect" methods
         *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
         */

        const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

        public static Encoding DetectTextFileEncoding(string InputFilename)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
        {
            bool uselessBool = false;
            //pass the caller's sample size through (not the default)
            return DetectTextFileEncoding(InputFileStream, HeuristicSampleSize, out uselessBool);
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
        {
            if (InputFileStream == null)
                throw new ArgumentNullException("InputFileStream", "Must provide a valid Filestream!");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;


            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                HasBOM = true;
                return encodingFound;
            }


            //BOM Detection failed, going for heuristics now.
            //  create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            HasBOM = false;
            return encodingFound;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
        {
            bool uselessBool = false;
            return DetectTextByteArrayEncoding(TextData, out uselessBool);
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
        {
            if (TextData == null)
                throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                HasBOM = true;
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                HasBOM = false;
                return encodingFound;
            }
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
        {
            return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
        {
            if (TextData == null)
                throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                //For some reason, the default encodings don't detect/swallow their own preambles!!
                return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
            }
            else
            {
                byte[] heuristicSample = null;
                if (TextData.Length > MaxHeuristicSampleSize)
                {
                    heuristicSample = new byte[MaxHeuristicSampleSize];
                    Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                }
                else
                {
                    heuristicSample = TextData;
                }

                //run the heuristics on the (possibly truncated) sample, then decode the full data
                encodingFound = DetectUnicodeInByteSampleByHeuristics(heuristicSample) ?? DefaultEncoding;
                return encodingFound.GetString(TextData);
            }
        }


        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("BOMBytes", "Must provide a valid BOM byte array!");

            if (BOMBytes.Length < 2)
                return null;

            if (BOMBytes[0] == 0xff
                && BOMBytes[1] == 0xfe
                && (BOMBytes.Length < 4
                    || BOMBytes[2] != 0
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            if (BOMBytes[0] == 0xfe
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8 
            //  sequences from upper ranges of Windows-1252, and probable US-ASCII 
            //  character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a 
            //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of even binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;


            //2: UTF-16 BE - in english / european environments, this is usually characterized by a 
            //  high proportion of even binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of odd binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;


            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content 
            //  using regexp, in his w3c.org unicode FAQ entry: 
            //  http://www.w3.org/International/questions/qa-forms-utf-8
            //  adapted here for C#.
            //  ISO-8859-1 maps every byte 1:1 to a char; Encoding.ASCII would turn all bytes
            //  >= 0x80 into '?' and the validator below could never see them.
            string potentiallyMangledString = Encoding.GetEncoding("iso-8859-1").GetString(SampleBytes);
            Regex UTF8Validator = new Regex(@"\A("
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one 
                //   of these "suspicious" character sequences is:
                //     128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127 
                //   character range, so we assume that more than 1 in 500,000 of these character 
                //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is 
                //   not already suspicious sequences) should be plain US-ASCII bytes. This, I 
                //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield 
                //   approx 40%, so the chances of hitting this threshold by accident in random data are 
                //   VERY low). 

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                           //all suspicious, so cannot evaluate proportion of US-Ascii
                           SampleBytes.Length - suspiciousUTF8BytesTotal == 0
                           ||
                           likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                       )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            //Note: the length checks use ">" so that reading currentPos + 1 (or + 2)
            //  can never run past the end of the sample.
            if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81
                    || SampleBytes[currentPos + 1] == 0x8D
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8
                    || SampleBytes[currentPos + 1] == 0xBD
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 2
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98
                        || SampleBytes[currentPos + 2] == 0x99
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C
                        || SampleBytes[currentPos + 2] == 0x9D
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0
                        || SampleBytes[currentPos + 2] == 0xA1
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }

    }
}

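【Discussion】:

Actually, Encoding.GetEncoding("Windows-1252") gives you an object of a different class than Encoding.ASCII. While debugging, Windows-1252 shows up as a System.Text.SBCSCodePageEncoding object, whereas ASCII is a System.Text.ASCIIEncoding object. I never use ASCII when I need Windows-1252.

To match a regular expression against binary data (bytes), the correct way is: string data = Encoding.GetEncoding("iso-8859-1").GetString(bytes); because it is the only single-byte encoding that maps bytes 1-to-1 to the chars of a string.

【Answer 3】:

Use a StreamReader and direct it to detect the encoding for you: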

using (var reader = new System.IO.StreamReader(path, true))
{
    var currentEncoding = reader.CurrentEncoding;
}

And use the Code Page Identifiers https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx to switch logic depending on it.
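Since StreamReader performs its detection only when it first reads from the stream, a variant that forces a read before looking at CurrentEncoding could look like the sketch below. The wrapper name (StreamReaderSniffer.DetectByBom) is made up for this example, and the detection is still BOM-based only: without a BOM you simply get back the fallback encoding passed to the constructor (UTF-8 here).

```csharp
using System.IO;
using System.Text;

public static class StreamReaderSniffer
{
    public static Encoding DetectByBom(string path)
    {
        // Third argument: detectEncodingFromByteOrderMarks.
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            reader.Peek(); // makes the reader fill its buffer and inspect the BOM
            return reader.CurrentEncoding;
        }
    }
}
```

The CodePage property of the returned Encoding is what you would compare against the code page identifiers linked above (for example 1200 for UTF-16LE, 65001 for UTF-8).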

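【Discussion】:

Doesn't work, StreamReader assumes your file is UTF-8.

@Cedric: Check MSDN for this constructor. Do you have evidence that the constructor doesn't behave as documented? Of course, that is possible with Microsoft's documentation :-)

This version also only checks for a BOM.

Hmm, don't you have to call Read() before reading CurrentEncoding? The MSDN page for CurrentEncoding says: "The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method."

My tests show that this cannot be used reliably, and therefore it should not be used at all.

【Answer 4】:

There are several answers here, but nobody has posted useful code.

Here is my code that detects all the encodings that Microsoft detects in the StreamReader class in Framework 4.

Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is made up of the first bytes in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek, you must write more complicated code that returns a byte buffer containing the bytes that have already been read but are not part of the BOM.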

public static Encoding DetectEncoding(String s_Path)
{
    using (FileStream i_Stream = new FileStream(s_Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        return DetectEncoding(i_Stream);
    }
}

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}

【Discussion】:

How do I use this function if I only have a file name? I need a function like this: public static Encoding DetectEncoding(string sFilename)

@MichaelHutter Use File.Open(sFilename) to get the file stream. Then continue.

File.Open(sFilename) opens a file and determines the Encoding based on the BOM inside the file. If the BOM is missing, it may make a mistake by assuming the wrong encoding. This answer makes the same "mistake". It only works when there is a BOM. If there is no BOM inside the file, you need to analyze the entire file content, like here: ***.com/a/69312696/9134997

This answer answers the question that Cedrik asked. The answer has no mistake. Your mistake is that you did not read the question. Any detection of a file's text content without a BOM will never be reliable.

【Answer 5】:

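I use Ude, which is a C# port of Mozilla Universal Charset Detector. It is easy to use and gives really good results.

【Discussion】:

【Answer 6】:

Yes, there is one here: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding.

【Discussion】:

【Answer 7】:

You should read this: How can I detect the encoding/codepage of a text file

【Discussion】:

【Answer 8】:

If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without BOM) or any of the single-byte encodings like ASCII, ANSI, ISO-8859-1, etc.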
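To see why these five bytes are ambiguous, you can decode them with several encodings and compare the results. A tiny sketch (the class name FirstBytesDemo is just for illustration):

```csharp
using System;
using System.Text;

public static class FirstBytesDemo
{
    // Decodes the five bytes from the question with the given encoding.
    public static string Decode(Encoding encoding)
    {
        byte[] firstBytes = { 60, 118, 56, 46, 49 };
        return encoding.GetString(firstBytes);
    }
}
```

Encoding.ASCII, Encoding.UTF8 and Encoding.GetEncoding("iso-8859-1") all turn these bytes into the same text, "<v8.1", because every byte is below 0x80. The first five bytes alone therefore cannot tell these encodings apart.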

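【Discussion】:

Hmm... so I need to test them all?

That's just plain ASCII. UTF-8 without special characters is simply equal to ASCII, and if there are special characters, they use specific, detectable bit patterns.

@Nyerguds Probably not. I have a UTF-8 text file (with no "specific detectable bit patterns", and mostly English characters). If I read it as ASCII, it fails to read one specific "-" symbol.

Not possible. If a character is not ASCII, it will be encoded using those specific detectable bit patterns; that's how UTF-8 works. More likely, your text is neither ASCII nor UTF-8, but just an 8-bit encoding like Windows-1252.

【Answer 9】:

A solution for all Germans => ÄÖÜäöüß

This function opens the file and determines the encoding via the BOM. If the BOM is missing, the file is interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it is detected as UTF-8.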

private static Encoding GetEncoding(string sFileName)
{
    using (var reader = new StreamReader(sFileName, Encoding.Default, true))
    {
        string sContent = "";
        if (reader.Peek() >= 0) // you need this!
            sContent = reader.ReadToEnd();
        Encoding MyEncoding = reader.CurrentEncoding;
        if (MyEncoding == Encoding.Default) // Ansi detected (this happens if BOM is missing)
        { // Look, if there are typical UTF8 chars in this file...
            string sUmlaute = "ÄÖÜäöüß";
            bool bUTF8CharDetected = false;
            for (int z=0; z<sUmlaute.Length; z++)
            {
                string sUTF8Letter = sUmlaute.Substring(z, 1);
                string sUTF8LetterInAnsi = Encoding.Default.GetString(Encoding.UTF8.GetBytes(sUTF8Letter));
                if (sContent.Contains(sUTF8LetterInAnsi))
                {
                    bUTF8CharDetected = true;
                    break;
                }
            }
            if (bUTF8CharDetected) MyEncoding = Encoding.UTF8;
        }
        return MyEncoding;
    }
}

【Discussion】:
