Pytesseract: Improving OCR Accuracy


【Title】:Pytesseract Improve OCR Accuracy 【Posted】:2021-01-13 20:42:53 【Question】:

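I want to extract text from an image in Python, and I chose pytesseract for this. When I tried it on my image, the results were unsatisfactory. I also went through this and implemented all the techniques it lists, but it still performs poorly.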

Image:

Code:

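import pytesseract
import cv2
import numpy as np

# Load the image (OpenCV reads it in BGR channel order)
img = cv2.imread('D:\\wordsimg.png')

# Upscale by a factor of 1.2
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

# Convert to grayscale
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Dilate, then erode; with a 1x1 kernel both calls leave the image unchanged
kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

# Median blur, then binarize with Otsu's threshold
img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

txt = pytesseract.image_to_string(img ,lang = 'eng')

# Drop the last character of the output
txt = txt[:-1]

# Join the lines into a single line
txt = txt.replace('\n',' ')

print(txt)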

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was 

Even a single unwanted space costs me a lot. I want the result to be 100% accurate. Any help would be appreciated. Thanks!


【Answer 1】:

I changed the resize factor from 1.2 to 2 and removed all of the preprocessing. I got good results with psm 11 and psm 12.

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

# Upscale by 2 instead of 1.2 (default interpolation)
#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# All further preprocessing from the question is disabled
kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Save the image that is passed to Tesseract (grayscale only; no threshold is applied)
cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

# Try page segmentation modes 6 through 13 and print each result
for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

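The line config = '--oem 3 --psm %d' % psm uses the string interpolation (%) operator to replace %d with the integer psm. I'm not quite sure what oem does, but I've gotten into the habit of using it. More on psm at the end of this answer.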
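As a minimal sketch of how that config string is built: the helper name `build_config` is hypothetical and not part of pytesseract. As far as I know, `--oem` selects Tesseract's OCR engine mode and 3 is the default; running `tesseract --help-oem` should list the available values.

```python
def build_config(psm: int, oem: int = 3) -> str:
    """Build a Tesseract config string (hypothetical helper, for illustration only)."""
    # %d is replaced by the integers in the tuple, left to right
    return "--oem %d --psm %d" % (oem, psm)


# The same string written as an f-string
psm = 11
assert build_config(psm) == f"--oem 3 --psm {psm}"

print(build_config(11))
```

The resulting string is what gets passed as the `config` argument of `pytesseract.image_to_string`.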

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was
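Since the question joins lines with `txt.replace('\n',' ')` and is sensitive to stray spaces, here is a small sketch that collapses the psm 11 output into one line and compares it word by word with the question's original output. The names `normalize` and `word_diff` are hypothetical; the two strings are copied from the outputs shown in this thread.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse any run of whitespace (spaces, newlines) into a single space."""
    return " ".join(text.split())


def word_diff(a: str, b: str):
    """Return the word-level differences between a and b as (tag, a_words, b_words)."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Output from the question (preprocessing, default psm)
question_output = (
    "t hose he large form might light another us should took mountai house n "
    "story important went own own thought girl over family look some much ask "
    "the under why miss point make mile grow do own school was "
)

# Output from psm 11 in this answer; blank lines separate the text lines
answer_output = (
    "those he large form might light another us should name\n\n"
    "took mountain story important went own own thought girl\n\n"
    "over family look some much ask the under why miss point\n\n"
    "make mile grow do own school was\n"
)

clean = normalize(answer_output)
print(clean)

# Show where the two outputs disagree
for tag, old, new in word_diff(question_output, clean):
    print(tag, old, "->", new)
```

Using `split()` followed by `join` avoids the double spaces that `replace('\n', ' ')` would leave where the output contains blank lines.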

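psm is short for page segmentation mode. I'm not entirely sure how the different modes behave, but the descriptions give an idea of what each number does. You can get the list from tesseract --help-psm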

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
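Because the best mode depends on the image, one option is to collect the result for each mode and compare them afterwards. The sketch below is an assumption-laden illustration: `sweep_psm` is a hypothetical helper, and the OCR function is passed in as a parameter so the loop can be tried without Tesseract installed.

```python
from typing import Callable, Dict, Iterable


def sweep_psm(
    image,
    ocr: Callable[..., str],
    modes: Iterable[int] = range(6, 14),
    lang: str = "eng",
) -> Dict[int, str]:
    """Run `ocr` once per page segmentation mode and collect the text by mode.

    `ocr` is expected to accept (image, lang=..., config=...), which matches
    how pytesseract.image_to_string is called in the answer above.
    """
    results = {}
    for psm in modes:
        config = f"--oem 3 --psm {psm}"
        results[psm] = ocr(image, lang=lang, config=config)
    return results


# Stand-in OCR function for a dry run; it only echoes the config it received
def fake_ocr(image, lang="eng", config=""):
    return f"[{lang}] {config}"


for psm, text in sweep_psm(None, fake_ocr, modes=[11, 12]).items():
    print("psm", psm, ":", text)
```

With pytesseract installed, the call would presumably look like sweep_psm(img, pytesseract.image_to_string), where img is the grayscale image from the answer's code.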

【Comments】:

Cool! One small favor: could you explain what psm is, and what config = '--oem 3 --psm %d' % psm means? If you think my question is good and well framed, please consider upvoting it. Thanks!
