A quick trial of Tesseract

Posted 2023-04-05 fishegg


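Goal: extract multilingual text from screenshots and compare it with the multilingual document to check whether the text is correct.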

ocr.py

The screenshot source can be: 1. a single file; 2. an adb screenshot; 3. all image files in a directory.

import os

import image_process
import tesseract_process
import book_process

option = int(input("1=file,2=adb,3=directory:"))
strings = []
book_file = ""  # path to the xlsx document (left blank in the original post)
book = book_process.book(book_file)
if book.read() is None:
    # read() returns None when column C already holds counts from an earlier run
    raise SystemExit("workbook already contains results")
if option == 1:
    image_file = ""  # path to the screenshot (left blank in the original post)
    # the file name up to the first dot is the tesseract language code, e.g. eng.png
    language = os.path.basename(image_file).split(".")[0]
    image = image_process.image_picker.get_image_by_path(image_file)
    strings = tesseract_process.text_recognition.get_text(image, language)
    book.record(strings, image_file)
elif option == 2:
    language = input("language:")
    while True:
        # take one screenshot per loop iteration until the user enters n
        image, image_file = image_process.image_picker.get_image_by_adb(language)
        strings = tesseract_process.text_recognition.get_text(image, language)
        book.record(strings, image_file)
        flag = input("input n to stop or enter to continue")
        if flag == "n":
            break
elif option == 3:
    image_dir = ""  # directory containing the screenshots (left blank in the original post)
    image_set = image_process.image_picker.get_image_from_dir(image_dir)
    for image_file in image_set:
        language = image_file.split(".")[0]
        image_file = os.path.join(image_dir, image_file)
        print(image_file)
        image = image_process.image_picker.get_image_by_path(image_file)
        strings = tesseract_process.text_recognition.get_text(image, language)
        book.record(strings, image_file)
book.save()
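
The file and directory modes both take the tesseract language code from the part of the file name before the first dot, and the adb mode writes its screenshots as language.timestamp.png. A minimal sketch of that naming convention follows; the helper name `language_from_path` and the sample paths are hypothetical and not part of the original scripts.

```python
import ntpath


def language_from_path(image_file):
    """Return the part of the file name before the first dot (the language code)."""
    # ntpath.basename accepts both "\\" and "/" as separators on any host OS.
    name = ntpath.basename(image_file)
    return name.split(".")[0]


print(language_from_path("C:\\shots\\chi_sim.1680000000.png"))
print(language_from_path("./eng.png"))
```

With this convention, a screenshot must be named after an installed tesseract language (for example eng or chi_sim) for recognition to use the right model.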

image_process.py

Handles image acquisition and returns Pillow Image objects for tesseract.

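import os
import subprocess
from datetime import datetime

from PIL import Image

class image_picker(object):
    @staticmethod
    def get_image_by_path(file):
        # open an existing image file as a Pillow Image
        return Image.open(file)

    @staticmethod
    def get_image_by_adb(language):
        # name the screenshot language.timestamp.png so the language can be read back from the name
        timestamp = str(int(datetime.now().timestamp()))
        image_file = language + "." + timestamp + ".png"
        remote = "/sdcard/" + image_file
        # take the screenshot on the device, pull it to the current directory, then delete it from the device
        subprocess.run(["adb", "shell", "screencap", "-p", remote], check=True)
        subprocess.run(["adb", "pull", remote, "./"], check=True)
        subprocess.run(["adb", "shell", "rm", remote], check=True)
        image = Image.open("./" + image_file)
        return image, image_file

    @staticmethod
    def get_image_from_dir(dir):
        # collect image file names from the top level of dir only,
        # because the caller joins each name back onto dir
        types = ("png", "jpg", "jpeg")
        image_set = set()
        for file in os.listdir(dir):
            if file.split(".")[-1].lower() in types:
                image_set.add(file)
        return image_set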
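
The directory scan boils down to filtering file names by extension. The sketch below shows one way to do that with `os.path.splitext`, which avoids matching a file that is literally named "png" with no extension; `filter_images` and `IMAGE_TYPES` are illustrative names, not part of the original module.

```python
import os

IMAGE_TYPES = (".png", ".jpg", ".jpeg")


def filter_images(file_names):
    """Keep only names whose extension (case-insensitive) is an image type."""
    return sorted(
        name for name in file_names
        if os.path.splitext(name)[1].lower() in IMAGE_TYPES
    )


print(filter_images(["eng.png", "notes.txt", "fra.JPG", "png"]))
```

Sorting is optional; it only makes the processing order predictable, which a set does not guarantee.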

tesseract_process.py

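Uses tesseract to extract the text in an image, splits it on newlines and on runs of two or more spaces, and returns the resulting list of strings.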

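import re

import pytesseract

class text_recognition(object):
    @staticmethod
    def get_text(image, lang):
        # --psm 3: fully automatic page segmentation; preserve_interword_spaces=1 keeps
        # wide gaps between words so they can serve as separators below
        text = pytesseract.image_to_string(image, lang=lang, config="--psm 3 -c preserve_interword_spaces=1")
        # split on newlines or on runs of two or more whitespace characters
        result = re.split(r"\n|\s{2,}", text)
        return result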
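
To see what the separator rule does without running OCR, the sketch below applies the same pattern (newline, or two or more whitespace characters) to a made-up string. The sample text and the `split_ocr_text` helper are assumptions for illustration; unlike `get_text`, the helper also drops the empty strings that consecutive newlines produce.

```python
import re


def split_ocr_text(text):
    """Split OCR output on newlines or on runs of two or more whitespace characters."""
    parts = re.split(r"\n|\s{2,}", text)
    # Consecutive newlines yield empty strings; drop them.
    return [part for part in parts if part]


sample = "Settings  Wi-Fi\nAirplane mode\n\nBluetooth"
print(split_ocr_text(sample))
```

A single space does not split, so a phrase such as "Airplane mode" stays together and can be matched against a whole cell of the document.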

book_process.py

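The multilingual document is stored in an xlsx file: column A holds the expected text, column B the comparison result, column C the number of times the text was found, and column D the image files in which it was found.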

import openpyxl

class book(object):
    def __init__(self, file):
        self.__file = file

    def read(self):
        self.__book = openpyxl.load_workbook(self.__file)
        self.__sheet = self.__book["Sheet1"]
        # column C must not hold counts from a previous run
        for row in self.__sheet.iter_rows(min_col=3, max_col=3, values_only=True):
            for count in row:
                # empty cells come back as None and are treated as zero
                if count is not None and count > 0:
                    print("init error")
                    self.__book = None
                    return None
        return self.__book

    def write(self, row, column, value):
        self.__sheet.cell(row, column, value)

    def record(self, strings, image_file):
        # map each expected text in column A to its row number
        words = {}
        rowidx = 1
        for row in self.__sheet.iter_rows(max_col=1, values_only=True):
            for word in row:
                words[word] = rowidx
            rowidx += 1
        for word in strings:
            if word in words:
                rowidx = words[word]
                print("found %s at %d" % (word[:15], rowidx))
                self.__sheet.cell(rowidx, 2, "found")
                # treat an empty count cell as zero
                count = self.__sheet.cell(rowidx, 3).value or 0
                self.__sheet.cell(rowidx, 3, count + 1)
                # append the image file to column D, one file per line
                paths = self.__sheet.cell(rowidx, 4).value
                self.__sheet.cell(rowidx, 4, image_file if paths is None else str(paths) + "\r" + image_file)

    def save(self):
        self.__book.save(self.__file)
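
The matching in `record` is an exact string lookup: each recognized string either equals a cell in column A or is ignored. The sketch below reproduces that bookkeeping with plain dictionaries instead of a workbook, so the logic can be tried without openpyxl. The function name `record_matches`, the result layout, and the sample data are assumptions made for illustration.

```python
def record_matches(reference, strings, image_file, results=None):
    """Update {text: {"count": int, "files": [str]}} for OCR strings found in reference."""
    results = {} if results is None else results
    known = set(reference)
    for word in strings:
        if word in known:
            entry = results.setdefault(word, {"count": 0, "files": []})
            entry["count"] += 1
            entry["files"].append(image_file)
    return results


reference = ["Settings", "Bluetooth", "Airplane mode"]
results = record_matches(reference, ["Settings", "Wi-Fi", "Settings"], "eng.1.png")
results = record_matches(reference, ["Bluetooth"], "eng.2.png", results)
# Texts that never showed up in any screenshot still need checking by hand.
missing = [text for text in reference if text not in results]
print(results)
print(missing)
```

Because the comparison is exact, a single misrecognized character leaves the text unmatched, so rows without "found" may be either genuinely wrong text or an OCR error.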

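CAPTCHA recognition: basic usage and summary of Tesseract

What is Tesseract

OCR, or optical character recognition, is the process of scanning printed characters on paper with an electronic device and converting them into computer text. In other words, an image goes in and a recognition engine identifies the text on it. Tesseract is an OCR engine that runs on various operating systems. It was originally HP software, was open-sourced in 2005, and its development and maintenance have been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.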

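Types of CAPTCHA covered

This discussion covers ordinary CAPTCHAs made of English letters, digits, or a mix of the two. It does not cover slider or click-the-text CAPTCHAs.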

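Installing Tesseract

Tesseract on GitHub: https://github.com/tesseract-ocr/tesseract
Installation instructions are on GitHub. Tesseract is currently available as version 3.05 and as a 4.0 beta. In my own use I found little difference between the two, and switching did not noticeably improve the recognition rate, so either one will do. Tesseract supports Windows and Linux. On Windows, the installation creates a Tesseract-ocr directory containing tesseract.exe, which you can invoke from the command line to run OCR.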

Using Tesseract

Basic command-line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

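Parameters:

imagename  the image file
outputbase  the output file base name; use stdout to print the result to the console instead

Optional parameters

-l lang  the recognition language data; defaults to eng and can also be a model you trained yourself
--psm pagesegmode  the page segmentation mode (written -psm in 3.x releases)

The pagesegmode values are as follows: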

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
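
The command line above can also be driven from a script. The sketch below builds the argument list and runs the executable with `subprocess`; it assumes tesseract is on PATH, uses the `--psm` spelling from the usage line (3.x releases spell it `-psm`), and picks mode 7 (single text line) as a plausible starting point for a one-line CAPTCHA. The function names and the file name captcha.png are hypothetical.

```python
import subprocess


def build_tesseract_command(image, lang="eng", psm=7):
    """Build an argument list that prints the recognized text to stdout."""
    # "stdout" as outputbase sends the result to standard output instead of a file.
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image, lang="eng", psm=7):
    """Run the tesseract executable and return its text output."""
    completed = subprocess.run(
        build_tesseract_command(image, lang, psm),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


# Only the command is printed here; call run_tesseract("captcha.png") to execute it.
print(build_tesseract_command("captcha.png"))
```

If mode 7 reads the image poorly, modes 8 (single word) and 6 (single uniform block) are the other values from the list worth trying.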

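Training Tesseract

Tesseract can be trained with jTessBoxEditor, and the more training samples you have, the better the accuracy. In practice I trained on 500 images, which did improve the recognition rate but still fell short of what I was hoping for, probably because the sample set was still too small. Correcting the samples one by one is also tedious, especially for CAPTCHAs, which are usually distorted in various ways to keep programs from reading them easily. Overall, though, with enough samples the target recognition rate should be achievable. For the detailed jTessBoxEditor training steps, search for Tesseract training material.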
