Java CAPTCHA Recognition: Training Samples with jTessBoxEditorFX and Tesseract-OCR

Posted _TBhonker


Tools:

jTessBoxEditorFX download: https://github.com/nguyenq/jTessBoxEditorFX

Tesseract-OCR download: https://sourceforge.net/projects/tesseract-ocr/

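Main steps:

  1. Download jTessBoxEditorFX and Tesseract-OCR (and set the Tesseract environment variables), and add the required jars through Maven (see the pom file below).
  2. Download the CAPTCHA images to a local folder (code).
  3. Convert the CAPTCHA images to a format Tesseract can read.
  4. Denoise and threshold the converted images, then crop the edges (code).
  5. Proofread the .box files in jTessBoxEditorFX (correct the misrecognized characters): https://www.cnblogs.com/zhongtang/p/5555950.html
  6. Generate the .traineddata file with the tesseract command line, then use it from Java: https://www.cnblogs.com/zhongtang/p/5555950.html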
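
Step 6 leaves the exact commands to the linked article. As a rough sketch only, assuming the legacy box-training flow of Tesseract 3.x, the command sequence can be assembled from Java as shown here. The language name `fontyp` is the one used in the code of this post; the font name `captcha`, the class `TrainingCommands`, and the file names are hypothetical placeholders. The commands are only built and printed, not executed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrainingCommands {

    // Builds the legacy (Tesseract 3.x style) training commands for one
    // merged image named <lang>.<font>.exp0.tif. Assumption: a
    // font_properties file already exists in the working directory.
    static List<List<String>> build(String lang, String font) {
        String base = lang + "." + font + ".exp0";
        List<List<String>> cmds = new ArrayList<>();
        // 1. Generate the initial .box file (then correct it in jTessBoxEditorFX)
        cmds.add(Arrays.asList("tesseract", base + ".tif", base, "batch.nochop", "makebox"));
        // 2. Generate the .tr feature file from the corrected box file
        cmds.add(Arrays.asList("tesseract", base + ".tif", base, "nobatch", "box.train"));
        // 3. Extract the character set
        cmds.add(Arrays.asList("unicharset_extractor", base + ".box"));
        // 4. Shape/feature training (some guides run shapeclustering first)
        cmds.add(Arrays.asList("mftraining", "-F", "font_properties", "-U", "unicharset",
                "-O", lang + ".unicharset", base + ".tr"));
        // 5. Character normalization training
        cmds.add(Arrays.asList("cntraining", base + ".tr"));
        // 6. After renaming inttemp, pffmtable, normproto and shapetable
        //    with the "<lang>." prefix, pack everything into <lang>.traineddata
        cmds.add(Arrays.asList("combine_tessdata", lang + "."));
        return cmds;
    }

    public static void main(String[] args) {
        for (List<String> cmd : build("fontyp", "captcha")) {
            System.out.println(String.join(" ", cmd));
        }
    }
}
```

To actually run a command, each list could be handed to `new ProcessBuilder(cmd).inheritIO().start()` from the folder that holds the training files. The resulting `fontyp.traineddata` then goes into the `tessdata` folder that `setDatapath` points to.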

The code:

 

package yanZhengMaTest.pikachu;

import java.awt.image.BufferedImage;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import javax.imageio.ImageIO;
import javax.net.ssl.HttpsURLConnection;

import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.Rect;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Test {

    // Load the OpenCV native library; this must run before any OpenCV call
    static {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }

    // Note: all folders used below must already exist
    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {

        // Folder that holds the downloaded CAPTCHA images
        File imgFile = new File("C:\\Users\\pc\\Desktop\\formPic\\unFormPic");
        // Path prefix for the saved CAPTCHA images
        String downAddress = "C:\\Users\\pc\\Desktop\\formPic\\unFormPic\\";
        // CAPTCHA download URL
        String downURL = "https://www.qichamao.com/usercenter/varifyimage?t=0.6488481170232967";
        if (imgFile.listFiles().length < 400) {
            for (int i = 1; i <= 400; i++) {
                downloadPic(downURL, downAddress + i + ".gif");
                Thread.sleep(10 + (i % 100));
            }
        }

        // Convert the saved CAPTCHA images to TIF (Tesseract cannot read GIF images)
        File imgFile0 = new File("C:\\Users\\pc\\Desktop\\formPic\\unFormPic");
        for (File image : imgFile0.listFiles()) {
            changePicFormat("tif", image, "C:\\Users\\pc\\Desktop\\formPic\\formedPic\\");
        }
        System.out.println("Image format conversion finished");

        // Process the TIF images (denoise, threshold, crop) to make them easier to recognize
        int picNum = 1;
        File imageFile1 = new File("C:\\Users\\pc\\Desktop\\formPic\\formedPic");
        for (File image : imageFile1.listFiles()) {
            filterPic(image.getName(), picNum + ".tif");
            picNum++;
        }

        // Run OCR on the processed images
        File resultImgs = new File("C:\\Users\\pc\\Desktop\\result_cut");
        for (File link : resultImgs.listFiles()) {
            String result = getResult(link);
            System.out.println(link.getName() + " recognized as: " + result);
        }

    }

    // Process one image and save the intermediate and final results
    public static void filterPic(String imgName, String fileName) throws FileNotFoundException, IOException {
        // Denoise
        Mat src = Imgcodecs.imread("C:\\Users\\pc\\Desktop\\formPic\\formedPic\\" + imgName, Imgcodecs.IMREAD_UNCHANGED);
        if (src.empty()) {
            System.out.println("Image not found: " + imgName);
            return;
        }
        // The Mat constructor takes (rows, cols, type)
        Mat dst = new Mat(src.rows(), src.cols(), CvType.CV_8UC1);

        Imgproc.boxFilter(src, dst, src.depth(), new Size(3, 3));
        Imgcodecs.imwrite("C:\\Users\\pc\\Desktop\\filter\\" + fileName, dst);

        // Threshold step. THRESH_TRUNC caps every pixel above 165 at 165 and ignores
        // the max value (200), so the output is not strictly two-level;
        // Imgproc.THRESH_BINARY would produce a true black-and-white image.
        Mat src1 = Imgcodecs.imread("C:\\Users\\pc\\Desktop\\filter\\" + fileName, Imgcodecs.IMREAD_UNCHANGED);
        Mat dst1 = new Mat(src1.rows(), src1.cols(), CvType.CV_8UC1);

        Imgproc.threshold(src1, dst1, 165, 200, Imgproc.THRESH_TRUNC);
        Imgcodecs.imwrite("C:\\Users\\pc\\Desktop\\process\\" + fileName, dst1);

        // Crop the edges
        Mat src2 = Imgcodecs.imread("C:\\Users\\pc\\Desktop\\process\\" + fileName, Imgcodecs.IMREAD_UNCHANGED);
        Rect roi = new Rect(4, 2, src2.cols() - 7, src2.rows() - 4); // arguments: x, y, width, height
        Mat dst2 = new Mat(src2, roi);

        Imgcodecs.imwrite("C:\\Users\\pc\\Desktop\\result_cut\\" + fileName, dst2);
        System.out.println("Image processed: " + fileName);

    }

    // Run OCR on one CAPTCHA image; returns null when recognition is not possible
    public static String getResult(File imageFile) {
        if (!imageFile.exists()) {
            System.out.println("Image does not exist: " + imageFile);
            return null;
        }
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("F:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
        tesseract.setLanguage("fontyp");    // use the custom-trained language data instead of the default

        try {
            return tesseract.doOCR(imageFile);
        } catch (TesseractException e) {
            e.printStackTrace();
            return null;
        }
    }

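    /**
     * Convert an image to another format.
     * Source: http://www.open-open.com/code/view/1453300186683
     *
     * @param outputFormat
     *            target format
     * @param image
     *            image to convert
     * @param downAddress
     *            path prefix of the folder where the converted image is saved
     */
    public static void changePicFormat(String outputFormat, File image, String downAddress) {

        try {
            BufferedImage bim = ImageIO.read(image);
            if (bim == null) {
                // ImageIO.read returns null when no reader can decode the file
                System.out.println("Unreadable image, skipped: " + image.getName());
                return;
            }
            File output = new File(
                    downAddress + image.getName().substring(0, image.getName().lastIndexOf(".") + 1) + outputFormat);
            // ImageIO.write returns false when no writer supports the requested format
            if (!ImageIO.write(bim, outputFormat, output)) {
                System.out.println("No ImageIO writer available for format: " + outputFormat);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }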

    /**
     * Download a CAPTCHA image.
     * 
     * @param picUrl
     *            URL that serves the CAPTCHA
     * @param imgAddress
     *            file path where the image is saved
     */
    public static void downloadPic(String picUrl, String imgAddress) {
        try {
            URL url = new URL(picUrl);
            HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
            // Send a browser User-Agent; otherwise the request is treated as a bot and no image is returned
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36");
            conn.connect();

            byte[] buf = new byte[1024];
            // try-with-resources closes both streams even if the copy fails
            try (BufferedInputStream bis = new BufferedInputStream(conn.getInputStream());
                    FileOutputStream fos = new FileOutputStream(imgAddress)) {
                int len;
                while ((len = bis.read(buf)) != -1) {
                    // Write only the bytes actually read; writing the whole buffer corrupts the file
                    fos.write(buf, 0, len);
                }
            }
            System.out.println("Image downloaded");
        } catch (MalformedURLException e) {
            System.out.println("Invalid image URL");
            e.printStackTrace();
        } catch (IOException e) {
            System.out.println("Image download failed");
            e.printStackTrace();
        }
    }

}
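
The threshold step in `filterPic` uses `THRESH_TRUNC`, which is easy to mistake for binarization. The plain-Java sketch below mimics the per-pixel rules of the two OpenCV modes on 8-bit grey values, so the difference can be seen without OpenCV installed. The class and method names are illustrative only and are not part of OpenCV.

```java
import java.util.Arrays;

public class ThresholdModes {

    // THRESH_TRUNC rule: values above thresh become thresh, the rest is unchanged
    // (the max value argument plays no role in this mode)
    static int trunc(int v, int thresh) {
        return v > thresh ? thresh : v;
    }

    // THRESH_BINARY rule: values above thresh become maxval, the rest becomes 0
    static int binary(int v, int thresh, int maxval) {
        return v > thresh ? maxval : 0;
    }

    // Applies one of the two rules to every pixel of a grey-value array
    static int[] apply(int[] pixels, int thresh, int maxval, boolean useBinary) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = useBinary ? binary(pixels[i], thresh, maxval) : trunc(pixels[i], thresh);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pixels = {30, 165, 166, 240};
        // Same threshold (165) and max value (200) as in filterPic
        System.out.println(Arrays.toString(apply(pixels, 165, 200, false))); // [30, 165, 165, 165]
        System.out.println(Arrays.toString(apply(pixels, 165, 200, true)));  // [0, 0, 200, 200]
    }
}
```

With `THRESH_TRUNC` the dark strokes keep their original grey levels and only the bright background is flattened, whereas `THRESH_BINARY` leaves exactly two levels. Which one recognizes better depends on the CAPTCHA style, so it is worth trying both on a few samples.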

 

pom file:

        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>4.1.1</version>
            <exclusions>
                <exclusion>
                    <groupId>com.sun.jna</groupId>
                    <artifactId>jna</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.openpnp</groupId>
            <artifactId>opencv</artifactId>
            <version>3.2.0-0</version>
        </dependency>

 

References:

Using OpenCV: https://blog.csdn.net/u012706811/article/details/52779271
OpenCV tutorial: https://www.w3cschool.cn/opencv/opencv-me9i28vh.html
OpenCV thresholding: https://blog.csdn.net/liyuqian199695/article/details/53925046
OpenCV Maven coordinates: https://mvnrepository.com/artifact/org.openpnp/opencv/3.4.2-0
OpenCV image filtering: https://blog.csdn.net/u012393192/article/details/78528550
OpenCV image cropping: https://blog.csdn.net/sileixinhua/article/details/72811093
OpenCV example, including the tesseract commands: https://www.cnblogs.com/zhongtang/p/5555950.html

Also worth reading: https://blog.csdn.net/lmj623565791/article/details/23960391

 

Troubleshooting:

1. Native library fails to load

Exception in thread "main" java.lang.UnsatisfiedLinkError: no opencv_java320 in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at yanZhengMaTest.pikachu.Test.<clinit>(Test.java:38)

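Fix:

Point the native library path (java.library.path) at the folder that contains the OpenCV native library, for example: G:\Program Files (x86)\apache-maven\repo\org\openpnp\opencv\3.2.0-0\opencv-3.2.0-0\nu\pattern\opencv\windows\x86_64 (adjust this to where the opencv package sits in your own Maven repository).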
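
Before changing any setting, it can help to see what the JVM actually searches. The sketch below walks the entries of `java.library.path` and reports which of them contain the platform-specific file for a library name. The class name `NativeLibCheck` is hypothetical; `opencv_java320` is the name taken from the error message above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NativeLibCheck {

    // Returns every file on the given library path that matches the
    // platform-specific name of libName (for example opencv_java320.dll on Windows)
    static List<File> findLibrary(String libName, String libraryPath) {
        String fileName = System.mapLibraryName(libName);
        List<File> hits = new ArrayList<>();
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path", "");
        List<File> hits = findLibrary("opencv_java320", path);
        if (hits.isEmpty()) {
            System.out.println("Not found in java.library.path: " + path);
        } else {
            System.out.println("Found: " + hits);
        }
    }
}
```

If nothing is found, the folder can also be passed when starting the JVM with `-Djava.library.path=<folder>`, which has the same effect as the IDE setting.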

 

2. JDK version and OpenCV version do not match (Exception in thread "main" java.lang.UnsatisfiedLinkError: no jniopencv_highgui in java.library.path)

Fix: switch to a different OpenCV version.

3. Error when generating the .tr file from the command line

Page 406
Warning. Invalid resolution 1 dpi. Using 70 instead.
Estimating resolution as 269
Error during processing.

Fix: the image was probably corrupted during download or format conversion; replace that image.
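
Hunting for the one bad image by hand is tedious with 400 samples. As a minimal sketch, assuming the individual images are still available in a folder, the check below lists every file that Java's ImageIO cannot decode. The class name `ImageCheck` is hypothetical. Decoding TIF files this way requires a TIFF ImageIO reader on the classpath, so running it on the downloaded GIFs is the safer choice.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

public class ImageCheck {

    // Returns the files in dir that ImageIO cannot decode into an image
    static List<File> findUnreadable(File dir) {
        List<File> bad = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return bad; // not a directory, or not accessible
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            try {
                // ImageIO.read returns null when no registered reader accepts the file
                BufferedImage img = ImageIO.read(f);
                if (img == null) {
                    bad.add(f);
                }
            } catch (IOException | RuntimeException e) {
                // Truncated or damaged files may fail while decoding
                bad.add(f);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : findUnreadable(dir)) {
            System.out.println("Unreadable image: " + f.getName());
        }
    }
}
```

Any file it reports can be downloaded again before the images are merged for training.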

 
