Using Tesseract for OCR Text Recognition on Windows

Posted 2021-05-25 Data+Science+Insight

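Tesseract was originally developed at Hewlett-Packard Laboratories as an engine for recognizing printed text. It was ported to Windows in 1996, migrated to C++ in 1998, and released as open source by HP in 2005. From 2006 onward its development was sponsored by Google.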

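Tesseract can handle many natural languages, including English, Portuguese, and Yiddish. As of 2015 it supported more than 100 written languages, and it can be trained to recognize additional ones.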

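Tesseract was originally written in C before the 1998 move to C++. It has no graphical interface of its own and is run from the command line, although several third-party programs wrap it in a GUI.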
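As a rough illustration of the command-line usage described above, the sketch below builds and runs a tesseract command from Python. It assumes the tesseract executable is on PATH. The helper names build_tesseract_command and run_tesseract and the file name sample.png are made up for this example. Using stdout as the output base and -l to select languages are standard Tesseract command-line options.

```python
import subprocess


def build_tesseract_command(image_path, languages=("eng",), exe="tesseract"):
    """Build the argument list for a Tesseract CLI call that prints to stdout."""
    # Tesseract joins several language codes with "+", e.g. "chi_sim+eng".
    lang = "+".join(languages)
    return [exe, image_path, "stdout", "-l", lang]


def run_tesseract(image_path, languages=("eng",), exe="tesseract"):
    """Run Tesseract and return the recognized text (needs Tesseract installed)."""
    cmd = build_tesseract_command(image_path, languages, exe)
    completed = subprocess.run(
        cmd, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical file name; only the command is printed, nothing is executed.
    print(build_tesseract_command("sample.png", ("chi_sim", "eng")))
```

Keeping the command construction separate from the subprocess call makes it easy to inspect the exact command before running it. Each language passed with -l needs its trained data file installed alongside Tesseract.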

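Install the Python packages:

pip install pytesseract
pip install opencv-python

The script below imports only pytesseract and cv2 (provided by opencv-python). The PyPI packages named tesseract and tesseract-ocr are not imported by the script and are not needed for it. None of these pip packages installs the Tesseract engine itself, which is set up separately in the next step.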

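Installing the Tesseract program on Windows:

Following advice found on Stack Overflow, download the Windows installer from https://github.com/UB-Mannheim/tesseract/wiki.

Choose the exe that matches your machine (32-bit or 64-bit). Here we used

tesseract-ocr-w64-setup-v5.0.0-alpha.20210506.exe (64 bit)

and installed it to C:\Program Files\Tesseract-OCR\.

Then point pytesseract at the executable by adding the following line to the script, after import pytesseract. The r prefix keeps the backslashes in the path from being read as escape sequences:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'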
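A wrong path here is the usual cause of the TesseractNotFoundError mentioned in the references. The following is a minimal sketch, not part of the original script, that checks the install path used in this article and falls back to whatever tesseract is found on PATH. The function name find_tesseract is hypothetical.

```python
import os
import shutil

# Install location used in this article; adjust it if you installed elsewhere.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return the first existing candidate, else the one on PATH, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("Tesseract executable not found; install it or fix the path.")
    else:
        try:
            import pytesseract
            pytesseract.pytesseract.tesseract_cmd = exe
            print("Using Tesseract at:", exe)
        except ImportError:
            print("pytesseract is not installed; run: pip install pytesseract")
```

Checking the path up front gives a clearer message than waiting for the first OCR call to fail.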

Run from a terminal:

python tesseract.py --image apple_support.png --min-conf 0

Run inside Jupyter:

%run tesseract.py --image apple_support.png --min-conf 0

Code:


# import the necessary packages
from pytesseract import Output
import pytesseract
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
                help="path to input image to be OCR'd")
ap.add_argument("-c", "--min-conf", type=int, default=0,
                help="minimum confidence value to filter weak text detection")
args = vars(ap.parse_args())


# load the input image, convert it from BGR to RGB channel ordering,
# and use Tesseract to localize each area of text in the input image
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

# loop over each of the individual text localizations
for i in range(0, len(results["text"])):
    # extract the bounding box coordinates of the text region from
    # the current result
    x = results["left"][i]
    y = results["top"][i]
    w = results["width"][i]
    h = results["height"][i]
    # extract the OCR text itself along with the confidence of the
    # text localization
    text = results["text"][i]
    conf = int(float(results["conf"][i]))
    
    # filter out weak confidence text localizations
    # (this block must sit inside the for loop so every result is checked)
    if conf > args["min_conf"]:
        # display the confidence and text to our terminal
        print("Confidence: {}".format(conf))
        print("Text: {}".format(text))
        print("")
        # strip out non-ASCII text so we can draw the text on the image
        # using OpenCV, then draw a bounding box around the text along
        # with the text itself
        text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX,
                    1.2, (0, 0, 255), 3)

# show the output image
cv2.imshow("Image", image)
cv2.waitKey(0)
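The filtering step in the loop can also be pulled out into a small function that works directly on the dictionary returned by image_to_data. This is a sketch under assumptions: the function name filter_words is hypothetical, and the sample dictionary is hand-written to mimic only the keys the script reads, not real Tesseract output. Depending on the pytesseract and Tesseract versions, the confidence may arrive as an int, a float, or a numeric string, which is why the script converts it with int(float(...)). The sketch additionally skips entries whose text is empty.

```python
def filter_words(results, min_conf=0):
    """Return (text, conf, (x, y, w, h)) for non-empty entries above min_conf."""
    kept = []
    for i in range(len(results["text"])):
        text = results["text"][i].strip()
        # conf may be an int, a float, or a numeric string, so go through float().
        conf = int(float(results["conf"][i]))
        if conf > min_conf and text:
            box = (results["left"][i], results["top"][i],
                   results["width"][i], results["height"][i])
            kept.append((text, conf, box))
    return kept


# Hand-written sample that mimics the keys used by the script (not real OCR output).
sample = {
    "text": ["", "Hello", "wor1d"],
    "conf": ["-1", "96.5", 12],
    "left": [0, 10, 80],
    "top": [0, 20, 20],
    "width": [200, 60, 60],
    "height": [50, 18, 18],
}

print(filter_words(sample, min_conf=50))
# prints [('Hello', 96, (10, 20, 60, 18))]
```

Separating the filter from the drawing code makes it possible to check the threshold logic without an image or a Tesseract installation.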

Reference: UB-Mannheim/tesseract

Reference: Tesseract

Reference: Pytesseract : “TesseractNotFound Error: tesseract is not installed or it's not in your path”, how do I fix this?

Reference: ValueError: invalid literal for int() with base 10: ''

Reference: Tesseract OCR: Text localization and detection

Reference: OCR: Text Recognition with the Open-Source Tesseract Framework (Installation) (original title: OCR：使用开源框架Tesseract做文字识别（安装）)

Reference: Installing Tesseract for OCR
