A First Look at Image Text Extraction with Python EasyOCR

Posted 2022-12-28 Rolei_zl

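After a day at the museum I came home with countless photos. To sort them, I wanted to recognize the text in the pictures and attach the descriptions of each hall and each exhibit.
Going through them by hand, one picture at a time, is slow, tiring and frankly exhausting.
Fortunately this is the age of modular, low-code tools. Efficiency, performance, interface and usability can wait; the point for now is to solve the problem and save some effort and time.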

OCR

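OCR (optical character recognition) is the process in which an electronic device (such as a scanner or digital camera) examines characters that are printed or displayed in the real world and then uses character recognition methods to translate their shapes into computer text. In other words, text material is scanned or photographed into an image file, and OCR then analyzes that image file to obtain the text and layout information. (Source: Baidu Baike)

EasyOCR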

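EasyOCR is a Python module for extracting text from images. It is a general-purpose OCR that can read both natural scene text and dense text in documents. It currently supports more than 80 languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari and Cyrillic.

Installing the EasyOCR package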

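Running pip install also installs the packages EasyOCR depends on. The output below lists them after installation: OCR, deep learning (torch), image processing (Pillow), numerical computing (numpy) and several others.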

> pip install easyocr
Requirement already satisfied: easyocr in c:\python39\lib\site-packages (1.6.2)
Requirement already satisfied: scikit-image in c:\python39\lib\site-packages (from easyocr) (0.19.3)
Requirement already satisfied: Pillow in c:\python39\lib\site-packages (from easyocr) (9.2.0)
Requirement already satisfied: PyYAML in c:\python39\lib\site-packages (from easyocr) (6.0)
Requirement already satisfied: torch in c:\python39\lib\site-packages (from easyocr) (1.13.0)
Requirement already satisfied: pyclipper in c:\python39\lib\site-packages (from easyocr) (1.3.0.post3)
Requirement already satisfied: python-bidi in c:\python39\lib\site-packages (from easyocr) (0.4.2)
Requirement already satisfied: Shapely in c:\python39\lib\site-packages (from easyocr) (1.8.5.post1)
Requirement already satisfied: numpy in c:\python39\lib\site-packages (from easyocr) (1.23.4)
Requirement already satisfied: scipy in c:\python39\lib\site-packages (from easyocr) (1.9.2)
Requirement already satisfied: opencv-python-headless<=4.5.4.60 in c:\python39\lib\site-packages (from easyocr) (4.5.4.60)
Requirement already satisfied: ninja in c:\python39\lib\site-packages (from easyocr) (1.10.2.4)
Requirement already satisfied: torchvision>=0.5 in c:\python39\lib\site-packages (from easyocr) (0.14.0)
Requirement already satisfied: typing-extensions in c:\python39\lib\site-packages (from torchvision>=0.5->easyocr) (4.4.0)
Requirement already satisfied: requests in c:\python39\lib\site-packages (from torchvision>=0.5->easyocr) (2.25.1)
Requirement already satisfied: six in c:\python39\lib\site-packages (from python-bidi->easyocr) (1.16.0)
Requirement already satisfied: networkx>=2.2 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2.8.7)
Requirement already satisfied: PyWavelets>=1.1.1 in c:\python39\lib\site-packages (from scikit-image->easyocr) (1.4.1)
Requirement already satisfied: packaging>=20.0 in c:\python39\lib\site-packages (from scikit-image->easyocr) (21.3)
Requirement already satisfied: imageio>=2.4.1 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2.22.1)
Requirement already satisfied: tifffile>=2019.7.26 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2022.10.10)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\python39\lib\site-packages (from packaging>=20.0->scikit-image->easyocr) (2.4.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (1.26.3)
Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (4.0.0)

Recognizing text in an image

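* Compared with the original text, the recognition rate is acceptable; no more copying everything out by hand, picture by picture and character by character.
* The recognition rate should be improvable by first adjusting the picture's contrast, grayscale, font and display angle (rotation). Not yet tried.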
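
As a rough sketch of the second point, the snippet below uses Pillow (one of the packages in the pip output above) to convert a picture to grayscale, raise its contrast and rotate it. The function name `preprocess` and its default values are illustrative assumptions, not part of EasyOCR, and whether these steps actually raise the recognition rate has not been measured here.

```python
from PIL import Image, ImageEnhance, ImageOps


def preprocess(img, contrast=1.5, angle=0):
    """Return a grayscale, contrast-adjusted, rotated copy of a PIL image.

    contrast and angle are illustrative defaults, not tuned values.
    """
    gray = ImageOps.grayscale(img)  # drop the color information
    # A factor above 1.0 raises the contrast, below 1.0 lowers it
    boosted = ImageEnhance.Contrast(gray).enhance(contrast)
    if angle:
        # expand=True enlarges the canvas so the corners are not cut off
        boosted = boosted.rotate(angle, expand=True)
    return boosted


# Small in-memory picture so the sketch runs without any file on disk
demo = Image.new("RGB", (40, 20), "white")
cleaned = preprocess(demo, contrast=1.5, angle=90)
print(cleaned.mode, cleaned.size)
```

For a real photo, one would open it with Image.open('pic_file.jpg'), save the processed copy under a new name, and pass that file to reader.readtext().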

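import easyocr

# Load the Simplified Chinese and English models; gpu=True falls back to the CPU when CUDA is unavailable
reader = easyocr.Reader(['ch_sim','en'], gpu=True)
result = reader.readtext('pic_file.jpg')
# Print one (box, text, confidence) tuple per line
for item in result:
    print(item)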

>>>
CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.

([[12, 0], [292, 0], [292, 24], [12, 24]], '博物馆一日游。拒照片无数。分类整理', 0.5019760698786572)
([[298, 0], [500, 0], [500, 24], [298, 24]], '希望图片牛的文字进行识别', 0.2667440711212794)
([[506, 0], [711, 0], [711, 24], [506, 24]], '加上各展馆。各展品的说明。', 0.48956195253399476)
([[12, 26], [280, 26], [280, 50], [12, 50]], '手工一张张的整理。慢。累。要老命。', 0.443645141397)
([[12, 52], [260, 52], [260, 76], [12, 76]], '还好。模块化。低代码时代。效率', 0.48323813949440303)
([[268, 52], [358, 52], [358, 76], [268, 76]], '性能。界面', 0.7953857046933088)
([[364, 52], [516, 52], [516, 76], [364, 76]], '易用性暂下过多考虑', 0.6913828229274245)
([[522, 52], [612, 52], [612, 76], [522, 76]], '解决问题先', 0.8767933218561421)
([[620, 52], [776, 52], [776, 76], [620, 76]], '省点力气。省点时间。', 0.563630720606001)
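
The third element of each tuple is a confidence score; the lowest one above (about 0.27) belongs to a line that contains a misread character. A minimal sketch of setting aside the results that probably need a manual check follows. The helper name `split_by_confidence` and the 0.5 threshold are assumptions chosen for illustration, and the sample data is copied from the output above.

```python
def split_by_confidence(results, threshold=0.5):
    """Split (box, text, confidence) tuples into accepted and to-review lists."""
    accepted, review = [], []
    for _box, text, confidence in results:
        target = accepted if confidence >= threshold else review
        target.append((text, confidence))
    return accepted, review


# Three tuples copied from the output above
sample = [
    ([[298, 0], [500, 0], [500, 24], [298, 24]], '希望图片牛的文字进行识别', 0.2667440711212794),
    ([[268, 52], [358, 52], [358, 76], [268, 76]], '性能。界面', 0.7953857046933088),
    ([[522, 52], [612, 52], [612, 76], [522, 76]], '解决问题先', 0.8767933218561421),
]

accepted, review = split_by_confidence(sample)
print("accepted:", [text for text, _ in accepted])
print("review:", [text for text, _ in review])
```

A high score is no guarantee: '易用性暂下过多考虑' scores about 0.69 and still contains a wrong character, so the threshold only narrows down what to proofread.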

Notes

* easyocr.Reader

Reader(lang_list, gpu=True, model_storage_directory=None, user_network_directory=None, detect_network='craft', recog_network='standard', download_enabled=True, detector=True, recognizer=True, verbose=True, quantize=True, cudnn_benchmark=False)

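- lang_list: list of the language codes to recognize (for example 'ch_sim', 'en'); the matching model files are loaded for each
- gpu: whether to compute on the GPU; if not, the CPU is used. This seems resource-hungry: in a quick test on a large batch of pictures, my own machine simply rebooted
- model_storage_directory: where the model files are stored. Default on Windows 10: C:\Users\Administrator\.EasyOCR\model
- detect_network: the text detection model; it has to be downloaded from Jaided AI: EasyOCR model hub
- download_enabled: whether a missing model may be downloaded automatically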

* reader.readtext()

readtext(self, image, decoder='greedy', beamWidth=5, batch_size=1, workers=0, allowlist=None, blocklist=None, detail=1, rotation_info=None, paragraph=False, min_size=20, contrast_ths=0.1, adjust_contrast=0.5, filter_ths=0.003, text_threshold=0.7, low_text=0.4, link_threshold=0.4, canvas_size=2560, mag_ratio=1.0, slope_ths=0.1, ycenter_ths=0.5, height_ths=0.5, width_ths=0.5, y_ths=0.5, x_ths=1.0, add_margin=0.1, threshold=0.2, bbox_min_score=0.2, bbox_min_size=3, max_candidates=0, output_format='standard')

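- Parameters: not studied yet, to be covered later
- Returns a list of recognition results, each in the order bounding-box coordinates -> text -> confidence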
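
Because every result carries its box coordinates, fragments of the same printed line can be stitched back together. The sketch below groups results whose top edges are close and joins them from left to right. The helper `group_into_lines` and the 10-pixel tolerance are assumptions for illustration; the sample tuples come from the output shown earlier, where the first corner of each box is its top-left [x, y].

```python
def group_into_lines(results, y_tolerance=10):
    """Join (box, text, confidence) tuples into text lines using the top-left corner of each box."""
    lines = []  # each entry is [top_y, [(left_x, text), ...]]
    # Visit the boxes from top to bottom, then left to right
    for box, text, _confidence in sorted(results, key=lambda r: (r[0][0][1], r[0][0][0])):
        left_x, top_y = box[0]
        if lines and abs(lines[-1][0] - top_y) <= y_tolerance:
            lines[-1][1].append((left_x, text))    # same printed line
        else:
            lines.append([top_y, [(left_x, text)]])  # start a new line
    return ["".join(text for _x, text in sorted(items)) for _y, items in lines]


sample = [
    ([[298, 0], [500, 0], [500, 24], [298, 24]], '希望图片牛的文字进行识别', 0.2667440711212794),
    ([[12, 0], [292, 0], [292, 24], [12, 24]], '博物馆一日游。拒照片无数。分类整理', 0.5019760698786572),
    ([[12, 26], [280, 26], [280, 50], [12, 50]], '手工一张张的整理。慢。累。要老命。', 0.443645141397),
    ([[522, 52], [612, 52], [612, 76], [522, 76]], '解决问题先', 0.8767933218561421),
]

for line in group_into_lines(sample):
    print(line)
```

This only suits level text such as the sample; tilted or multi-column pictures would need a different grouping rule.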

Problems encountered along the way

* pip help install shows the options and usage of pip install

* When pip install responds slowly, try downloading from a mirror inside China
- pip install -i https://pypi.tuna.tsinghua.edu.cn/simple easyocr

* If access is blocked by HTTP/HTTPS SSL restrictions, use the --trusted-host option
- --trusted-host <hostname> Mark this host or host:port pair as trusted, even though it does not have valid or any HTTPS.
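
The two options can be combined. As a small sketch, the function below builds such a pip command for the current interpreter and takes the trusted host from the mirror URL; the helper name `pip_install_command` is hypothetical, and the command is only printed here, not executed.

```python
import sys
from urllib.parse import urlparse


def pip_install_command(package, index_url):
    """Build a pip command that installs from a mirror and trusts its host."""
    host = urlparse(index_url).netloc  # host part of the mirror URL
    return [
        sys.executable, "-m", "pip", "install",
        "-i", index_url,
        "--trusted-host", host,
        package,
    ]


cmd = pip_install_command("easyocr", "https://pypi.tuna.tsinghua.edu.cn/simple")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Calling pip through sys.executable with -m pip installs into the same Python that runs the script, which matters when several Python versions are installed.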

* CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.
Downloading detection model, please wait. This may take several minutes depending upon your network connection.
- This downloads the models EasyOCR needs: the recognition models for the languages in lang_list of easyocr.Reader, and the CRAFT detection model
- Download location: Jaided AI: EasyOCR model hub
- If it is unclear which model is missing, set download_enabled=False and read the error message. The message below reports Missing ./model\craft_mlt_25k.pth
- Installation: each download is a *.zip file, such as craft_mlt_25k.zip; unzip it and put craft_mlt_25k.pth into the configured model_storage_directory folder
```
..........
raise FileNotFoundError("Missing %s and downloads disabled" % detector_path)
FileNotFoundError: Missing ./model\craft_mlt_25k.pth and downloads disabled
```
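
The manual installation step can also be scripted. A minimal sketch with the standard zipfile module follows; the helper name `install_model` is hypothetical, and it assumes the downloaded archive contains the .pth file, as described above for craft_mlt_25k.zip.

```python
import zipfile
from pathlib import Path


def install_model(zip_path, model_dir):
    """Extract every .pth file from a downloaded model zip into model_dir."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith(".pth"):
                # Write the file straight into model_dir, ignoring any folder inside the zip
                target = model_dir / Path(member).name
                target.write_bytes(archive.read(member))
                installed.append(target.name)
    return installed
```

For example, install_model("craft_mlt_25k.zip", "./model") would place craft_mlt_25k.pth in ./model, the folder passed as model_storage_directory.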
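
* About CUDA
- CUDA (Compute Unified Device Architecture) is the computing platform and parallel computing architecture from the GPU vendor NVIDIA; it lets the GPU (graphics processing unit) solve complex computing problems.
- Download location: CUDA Toolkit Archive | NVIDIA Developer
- Check the CUDA version: in CMD, run nvidia-smi (the installed toolkit version must not be higher than the version it reports for the hardware)
- Check the CUDA installation: in CMD, run nvcc -V
- Usage details: not studied yet, to be covered later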
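
From Python, PyTorch can report whether a CUDA device is usable, which is presumably what sits behind the "CUDA not available" message. The sketch below wraps that check so it also works when torch is not installed; the helper name `cuda_available` is an assumption, while torch.cuda.is_available() is PyTorch's own function.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_gpu = cuda_available()
print("gpu flag for easyocr.Reader:", use_gpu)
# e.g. reader = easyocr.Reader(['ch_sim', 'en'], gpu=use_gpu)
```

Passing the result as gpu= makes the CPU fallback explicit.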

* AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (most likely due to a circular import)
- Cause: the opencv-python-headless version does not match
- Fix: remove it with pip uninstall, then reinstall it with pip install. The pip output shows the version EasyOCR expects:
opencv-python-headless<=4.5.4.60 in c:\python39\lib\site-packages (from easyocr) (4.5.4.60)

* WARNING: Ignoring invalid distribution -pencv-python-headless (python_install_path\lib\site-packages)
- Cause: a temporary file left behind by a failed opencv-python-headless installation, located in python_install_path\lib\site-packages
- Fix: find that file under the Python installation's lib folder, delete it, and reinstall
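
To locate such leftovers, the sketch below lists the entries of a folder whose names start with '~'. That naming is an assumption: pip prints the leftover with a leading '-', and in my understanding the corresponding entry on disk begins with '~'. The helper name `find_leftovers` is hypothetical, and nothing is deleted; the function only reports names.

```python
from pathlib import Path


def find_leftovers(site_packages):
    """List entries whose names start with '~' (assumed leftovers of a failed install)."""
    folder = Path(site_packages)
    return sorted(p.name for p in folder.iterdir() if p.name.startswith("~"))


# Example call; replace the argument with the real site-packages path:
# print(find_leftovers(r"python_install_path\lib\site-packages"))
```

Check each reported entry by eye before deleting it.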

* ERROR: Could not install packages due to an OSError: [WinError 5] 拒绝访问。: '%APPDATA%\Python\..........' (the Chinese text means "Access is denied")
Consider using the `--user` option or check the permissions.
- Use the --user option, for example pip install --user *********************
- Option description: --user Install to the Python user install directory for your platform. Typically ~/.local/, or %APPDATA%\Python on Windows. (See the Python documentation for site.USER_BASE for full details.)
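
To see where --user installs on a particular machine, the standard site module can be queried. A short sketch; the printed paths depend on the platform and user account.

```python
import site

# Base directory that `pip install --user` installs into on this machine
user_base = site.getuserbase()
# The site-packages folder below it, where the modules themselves go
user_site = site.getusersitepackages()

print("user base:", user_base)
print("user site-packages:", user_site)
```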

* cv.gapi.wip.GStreamerPipeline = cv.gapi_wip_gst_GStreamerPipeline
AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (most likely due to a circular import)
- Cause: the installed opencv-python and opencv-python-headless versions differ
- Fix: check the package versions, uninstall them, and reinstall the specified version
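
One way to check the package versions is the standard importlib.metadata module (Python 3.8+). The sketch below reports the installed version of both OpenCV distributions, or None when one is absent; the helper name `installed_versions` is an assumption.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


found = installed_versions(["opencv-python", "opencv-python-headless"])
for name, version in found.items():
    print(name, version or "not installed")
```

If both are installed with different versions, that matches the cause described above.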

Appendix, sample code: recognize the text in every image with the jpg suffix in the current folder and write it to the file GetText.txt

import easyocr
from pathlib import Path

# Load the models from ./model only; with download_enabled=False a missing model raises an error
reader = easyocr.Reader(['ch_sim','en'], gpu=True, model_storage_directory='./model', verbose=True, download_enabled=False)

# Start each run with a fresh output file
out_file = Path("./GetText.txt")
if out_file.exists():
    out_file.unlink()

# Process every .jpg in the current folder
for f in sorted(Path('.').glob('*.jpg')):
    result = reader.readtext(str(f))

    header = "################ " + f.name + " ################"
    print(header)
    text = ""
    for item in result:
        text = text + item[1]    # item is (box, text, confidence)
        print(item)

    with open(out_file, "a", encoding='utf-8') as fp:
        fp.write(header + "\n")
        fp.write(text)
        fp.write("\n\n\n")
