Fixing garbled crawler output with juniversalchardet

Posted by 袜子破了


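Crawlers often run into garbled text (mojibake). The simplest approach is to read the encoding from the HTTP response headers. But if the target site's response carries no encoding information, or carries the wrong one, the crawled content will very likely come out garbled.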
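
As a rough illustration of the header-based approach, the sketch below extracts the `charset` parameter from a `Content-Type` header value using only the JDK. The class and method names are hypothetical and are not part of httpclient; how the header string is obtained from the response is left out. A `null` result is the case where some other way of finding the encoding is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ContentTypeCharset {
    // Matches charset=value, case-insensitively, with optional double quotes.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    /** Returns the charset label found in a Content-Type value, or null if there is none. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Header that declares an encoding.
        System.out.println(fromContentType("text/html; charset=GBK"));
        // Header without one: the crawler has to find the encoding some other way.
        System.out.println(fromContentType("text/html"));
    }
}
```

Even when a label is present it can be wrong, which is why the header alone is not enough for a crawler.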

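A better solution is to determine the page's encoding automatically from its content, for example with universalchardet, the automatic encoding detector used in Mozilla's Firefox.

juniversalchardet is the Java port of universalchardet. Its source is available at https://github.com/thkoch2001/juniversalchardet

Automatic detection relies mainly on statistical methods. For the underlying principles, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

The following shows how to use it with httpclient, a library commonly used in Java crawlers. The key code is: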

 
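// import org.mozilla.universalchardet.UniversalDetector;
// import org.apache.http.util.ByteArrayBuffer;  (assumed type of buffer)
// myStream is the InputStream of the HTTP response body obtained via httpclient.
UniversalDetector encDetector = new UniversalDetector(null);
ByteArrayBuffer buffer = new ByteArrayBuffer(4096);
byte[] tmp = new byte[4096];
int l;
while ((l = myStream.read(tmp)) != -1) {
    buffer.append(tmp, 0, l);                 // keep every byte for decoding later
    if (!encDetector.isDone()) {
        encDetector.handleData(tmp, 0, l);    // feed the detector until it has decided
    }
}
encDetector.dataEnd();                        // tell the detector the input has ended
String encoding = encDetector.getDetectedCharset();  // null if nothing was detected
encDetector.reset();                          // leave the detector ready for reuse
if (encoding != null) {
    return new String(buffer.toByteArray(), encoding);
}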

  

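myStream.read(tmp) reads the stream obtained through httpclient. While reading the stream, we feed the same bytes to juniversalchardet. If the data matches the characteristics of a known encoding, encDetector.getDetectedCharset() returns its name once reading is finished; otherwise it returns null. Detection is then complete, and the String constructor builds the string using the detected encoding.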
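
The snippet above does not say what happens when detection returns null, and `new String(bytes, name)` throws if the JVM does not know the name. Below is a minimal sketch of one way to handle both cases using only the JDK. The class and method names are hypothetical, and falling back to a caller-supplied default such as UTF-8 is an assumption, not something the library prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class DetectedDecoder {
    /** Maps a detected charset name to a Charset, or returns fallback if the name is unusable. */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback; // the detector found nothing
        }
        try {
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : fallback; // legal name, but this JVM has no such charset
        } catch (IllegalArgumentException e) {
            return fallback; // syntactically illegal charset name
        }
    }

    /** Decodes the body with the detected charset, or with fallback when detection failed. */
    static String decode(byte[] body, String detectedName, Charset fallback) {
        return new String(body, resolve(detectedName, fallback));
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}; // one CJK character in UTF-8
        System.out.println(decode(body, "UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(decode(body, null, StandardCharsets.UTF_8));
    }
}
```

In the crawler code, the value returned by getDetectedCharset() would be passed as detectedName, so the final decoding step never throws on an unknown name.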



Maven dependency (version 1.0.3): http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

  

 

https://code.google.com/archive/p/juniversalchardet/

 

Java port of universalchardet

1. What is it?

juniversalchardet is a Java port of 'universalchardet', the encoding detector library of Mozilla.

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

2. Encodings that can be detected

  • Chinese

    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-2312 [1]
  • Cyrillic

    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MACCYRILLIC
    • IBM866
    • IBM855
  • Greek

    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew

    • ISO-8859-8
    • WINDOWS-1255
  • Japanese

    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean

    • ISO-2022-KR
    • EUC-KR
  • Unicode

    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 [1] / X-ISO-10646-UCS-4-2143 [1]
  • Others

    • WINDOWS-1252

[1] Currently not supported by Java

3. How to use it

  1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
  2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
  3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
  4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
  5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

Sample Code

```
import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];

    // (1) Construct the detector.
    UniversalDetector detector = new UniversalDetector(null);

    // (2) Feed file data until the detector is done or the file ends.
    try (FileInputStream fis = new FileInputStream(fileName)) {
      int nread;
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }

    // (3) Signal the end of the data.
    detector.dataEnd();

    // (4) Read the result; null means no encoding was detected.
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset before reusing the detector.
    detector.reset();
  }
}
```

4. Related Works

jchardet

  • http://jchardet.sourceforge.net/ jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the module each is based on: jchardet is based on the long-standing 'chardet' module, while juniversalchardet is based on the newer 'universalchardet' module, which generally gives more accurate detection results.

5. License

The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

 
