pytesseract: digits-only recognition not working with Tesseract 4.0

Posted 2023-02-23

【Title】: pytesseract using tesseract 4.0 numbers only not working
【Posted】: 2018-03-16 09:28:52
【Question】:

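Has anyone managed to get digits-only output from the latest Tesseract 4.0 when calling it from Python?

The following works in 3.05 but still returns letters in 4.0. I tried removing all the config files and it still does not work. Any help would be great:

im is an image of a date, black text on a white background: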

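import pytesseract
from PIL import Image

# Load the date image; 'date.png' is a placeholder file name.
im = Image.open('date.png')
# This config string restricts output to digits in 3.05 but not in 4.0.
text = pytesseract.image_to_string(im, config='outputbase digits')
print(text)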

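【Comments】:

Add the image to the question so that answerers can see your problem. I chose ***.com/questions/9413216/….

@CuriousGeorge: Did you find a solution to the upgrade issue?

Upgrading to v4.1.1 did not help me. I also had to download the tessdata_fast version of the traineddata file. I am attaching a detailed shell script for installing 4.1.1 from source.

【Answer 1】: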

You can restrict recognition to digits by passing tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng',config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
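
The config string in the call above can also be assembled by a small helper, which makes it easier to try other character sets or page segmentation modes. This is only a sketch: build_digits_config is a hypothetical name, not part of pytesseract, and it only formats the string. Whether Tesseract honors the whitelist depends on the installed version, as discussed elsewhere in this thread.

```python
def build_digits_config(whitelist="0123456789", psm=10, oem=3):
    """Return a Tesseract config string limiting output to `whitelist`.

    Hypothetical helper: it only formats the flags used in the answer
    (--psm, --oem, -c tessedit_char_whitelist=...); it does not call Tesseract.
    """
    if not whitelist:
        raise ValueError("whitelist must not be empty")
    if any(ch.isspace() for ch in whitelist):
        # Whitespace would split the value into separate command-line arguments.
        raise ValueError("whitelist must not contain whitespace")
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


config = build_digits_config()
print(config)

# Usage (requires Tesseract and an image object named `image`):
# import pytesseract
# ocr_result = pytesseract.image_to_string(image, lang='eng', config=config)
```

Note that --psm 10 tells Tesseract to treat the image as a single character. For an image holding a whole date, a single-line mode such as build_digits_config(psm=7) may be a better fit.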

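【Comments】:

【Answer 2】:

As you can see in this GitHub issue, blacklists and whitelists do not work in Tesseract 4.0.

There are 3 possible solutions to this problem, as I describe in this blog article:

Create a Python function that uses a simple regular expression to extract all the digits: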

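import re
import pytesseract

def replace_chars(text):
    # Collect every run of digits and join the runs into one string.
    list_of_numbers = re.findall(r'\d+', text)
    result_number = ''.join(list_of_numbers)
    return result_number

# im is the image from the question; reduce the raw OCR text to digits only.
result_number = replace_chars(pytesseract.image_to_string(im))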
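
Because replace_chars drops everything except digits, separators such as the slashes in a date or the colon in a time are lost as well. A possible variant, sketched below, keeps a caller-chosen set of extra characters. keep_allowed is a hypothetical name, and the sample strings stand in for OCR output so the snippet runs without Tesseract.

```python
import re


def keep_allowed(text, extra_chars=""):
    """Keep digits plus any characters listed in `extra_chars`; drop the rest."""
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    pattern = "[^0-9" + re.escape(extra_chars) + "]"
    return re.sub(pattern, "", text)


# Sample strings standing in for pytesseract.image_to_string(...) output.
print(keep_allowed("Date: 16/03/2018\n", extra_chars="/"))  # 16/03/2018
print(keep_allowed("Time 11:25 pm", extra_chars=":"))       # 11:25
```

This is a post-filter, so it cannot repair misread characters. It also keeps every occurrence of an allowed separator, including stray ones in surrounding text.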

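【Comments】:

Thanks! Updating to version 4.1.1 from source solved the problem. github.com/tesseract-ocr/tesseract/releases

【Answer 3】:

Using the tessedit_char_whitelist flag in pytesseract did not work for me. However, one workaround is to use a flag that does work, namely config='digits':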

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

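Here pixels is a numpy array of the image (a PIL image should also work). This should force pytesseract to return only digits. To customize what it returns, find your digits config file, which on Windows is located here: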

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want. After you save it and run pytesseract, it should return only those custom characters.
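
Rather than editing the stock digits file by hand, the same idea can be sketched as a script that writes a separate config file. Several things here are assumptions: write_whitelist_config and the file name digits_time are hypothetical, and the single "tessedit_char_whitelist <characters>" line mirrors the format of the digits file that normally ships with Tesseract, so compare it against your own copy. The example writes into a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name, allowed_chars):
    """Write a one-line Tesseract config file and return its path.

    Assumed format: a variable name, a space, then its value, as in the
    stock `digits` config file.
    """
    path = Path(configs_dir) / name
    path.write_text(f"tessedit_char_whitelist {allowed_chars}\n", encoding="utf-8")
    return path


# Demo in a temporary directory; for real use, point configs_dir at the
# tessdata\configs folder mentioned above (this may need administrator rights).
with tempfile.TemporaryDirectory() as tmp:
    cfg = write_whitelist_config(tmp, "digits_time", "0123456789:")
    print(cfg.read_text(encoding="utf-8"), end="")
```

If the file is placed in the configs folder, it should be selectable by name in the same way as the stock file, for example config='digits_time', though that behavior is worth verifying on your installation.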

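【Comments】:

What if I need both letters and digits?

You can put both letters and digits in the digits config file. For example, you could enter "1234567890abcdefg..." and it will return only those alphanumeric characters.

Which version are you using?

I use pytesseract==0.3.0, which works with the latest Tesseract as of 2020.

config=digits only whitelists the digits in alphanumeric input. Any idea how to make it treat the image as digits only rather than alphanumeric? For example, reading l as the digit one rather than as the letter L.

【Answer 4】:

You can restrict recognition to digits by passing tessedit_char_whitelist as a config option, as shown below.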

ocr_result = pytesseract.image_to_string(image, lang='eng',
           config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

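Hope this helps.

【Comments】:

"oem" was mistyped as "eom" in the config parameter.

This solution does not work with Tesseract 4.0+. There is an open issue about it on GitHub: github.com/tesseract-ocr/tesseract/issues/751.

I tried to fix the typo back in May, but somehow it still showed --eom. Anyway, I have fixed it again. As Jakub said, it does not work with 4.0. Instead, there is a separate tessdata file for digits.

I am looking for OCR that can recognize times, e.g. 11:25. Adding the colon (:) to the whitelist does not work. Any ideas?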
