How to improve pytesseract arguments to work properly
Posted: 2021-12-15 14:08:07
Question: I want to read this captcha using pytesseract:
I followed the advice here: Use pytesseract OCR to recognize text from an image
My code is:
import pytesseract
import cv2

def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3, 3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening
    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)
    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text
But my code returns 8\n\x0c, which is nonsense.
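A note on that return value: the trailing \n\x0c is a newline followed by a form feed, which Tesseract appends to the end of its output, so the recognized text itself is just 8. A minimal sketch for stripping that tail is shown here; the helper name clean_ocr_text is hypothetical and not part of pytesseract.

```python
def clean_ocr_text(raw: str) -> str:
    """Remove leading/trailing whitespace from raw OCR output.

    str.strip() with no arguments also removes newlines and the
    form feed character (U+000C), not only spaces.
    clean_ocr_text is a hypothetical helper name used for illustration.
    """
    return raw.strip()


# The value reported in the question
print(repr(clean_ocr_text("8\n\x0c")))  # prints '8'
```

This only tidies the string; it does not change what Tesseract recognized.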
This is what thresh looks like:
This is what opening looks like:
This is what invert looks like:
Can you help me improve the captcha_to_string function so that it reads the captcha correctly? Thank you very much.
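Answer 1: You are on the right track. Removing the noise (the small black dots in the inverted image) looks like the way to extract the text successfully.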
FYI, the pytesseract config only made the result worse, so I removed it.
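One plausible reason the original config hurt: --psm 10 tells Tesseract to treat the image as a single character, and the whitelist 0123456789 excludes letters, while this captcha contains several characters including letters. The sketch below builds an alternative config string using --psm 7 (treat the image as a single text line). The helper name build_tesseract_config and the assumed character set (digits plus uppercase letters) are illustrative assumptions, and this config has not been tested against this captcha.

```python
import string


def build_tesseract_config(psm: int = 7, oem: int = 3, whitelist: str = "") -> str:
    """Assemble a Tesseract config string (hypothetical helper).

    psm 7  = treat the image as a single text line
    psm 10 = treat the image as a single character
    """
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Assumption: the captcha uses only digits and uppercase letters
CAPTCHA_CHARS = string.digits + string.ascii_uppercase

config = build_tesseract_config(psm=7, whitelist=CAPTCHA_CHARS)
print(config)

# Usage (requires Tesseract and pytesseract to be installed):
# text = pytesseract.image_to_string(img, config=config)
```

Whether any config beats the default depends on the image, so it is worth comparing both outputs on the preprocessed captcha.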
My approach is as follows:
import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np

def remove_noise(img, threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels, stats = cv2.connectedComponentsWithStats(img.astype(np.uint8), connectivity=8)[1:3]
    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i, label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels == i + 1] = 1
    return filtered_img

def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path, 0)
    blur = cv2.GaussianBlur(img, (3, 3), 0)
    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]
    filtered_img = 255 - remove_noise(thresh, 20) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    erosion = cv2.erode(filtered_img, kernel, iterations=1)
    return erosion

def extract_letters(img):
    text = pytesseract.image_to_string(img)  # , config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text

img = preprocess('captcha.jpg')
text = extract_letters(img)
print(text)
plt.imshow(img, 'gray')
plt.show()
This is the processed image.
And the script returns 18L9R.
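To make the noise-removal idea in remove_noise easier to follow, here is a dependency-free sketch of the same technique: label the 8-connected foreground components and keep only those whose pixel count exceeds a threshold. It uses a breadth-first search on a small list-of-lists grid instead of cv2.connectedComponentsWithStats, so it is an illustration of the principle and not a replacement for the OpenCV call. The function name remove_small_components and the sample grid are made up for this example.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Keep foreground pixels (1) whose 8-connected component has area > min_area."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of this component
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as noise and left out
            if len(component) > min_area:
                for y, x in component:
                    out[y][x] = 1
    return out


# Made-up sample: one 2x2 blob plus two isolated noise pixels
noisy = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
cleaned = remove_small_components(noisy, min_area=2)
for row in cleaned:
    print(row)
```

The two isolated pixels are dropped and the 2x2 blob survives, which mirrors how the area threshold of 20 in the answer removes the small dots while keeping the character strokes.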