Make tesseract recognise numbers only

Posted 2023-04-17

【Title】: Make tesseract recognise numbers only 【Posted】: 2012-07-03 11:51:06 【Question】:

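I am trying to improve an OCR program I wrote so that it can read the layout of a particular image I am working with. For now, I want the program to recognise only the digits 0-9.

I tried to follow the solution to this question: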

Limit characters tesseract is looking for

But I am stuck at the step where tesseract has to be called as:

tesseract input.tif output nobatch letters

Where does this go?

【Answer 1】:

I ran into the same problem from Python with tesseract 3, and I assume more readers are in the same position.

Based on this: https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-do-i-recognize-only-digits

and this: https://github.com/madmaze/pytesseract/blob/27fed535bf1eb665ec991313841b177336b50f61/src/pytesseract.py#L91

I successfully used:

pytesseract.image_to_string(someimage, config='outputbase digits')
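A minimal sketch of how the call above might be wrapped, assuming pytesseract is installed. The helper names `keep_digits` and `read_digits` are hypothetical, and the extra filtering step is an addition, not part of the original answer: it removes any stray non-digit characters that still show up in the OCR output.

```python
import re


def keep_digits(text):
    """Drop every character that is not 0-9 from OCR output."""
    return re.sub(r"[^0-9]", "", text)


def read_digits(image, config="outputbase digits"):
    """OCR an image with pytesseract and return only the digits found.

    `config` defaults to the string used in this answer and is passed
    through unchanged to pytesseract.image_to_string.
    """
    # Imported here so keep_digits stays usable without pytesseract installed.
    import pytesseract

    raw = pytesseract.image_to_string(image, config=config)
    return keep_digits(raw)
```

It could then be called as `read_digits(someimage)`, where `someimage` is whatever you would pass to `image_to_string` directly.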

【Answer 2】:

I posted something about tesseract on SO a while ago: see Tesseract OCR Library - Learning Font. Note in particular the link to tesseract training, which explains how to restrict your character set and describe your ambiguities.

【Answer 3】:

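This question is answered in the Tesseract FAQ.

Here is how to make tesseract recognise only digits:

Tesseract 2 - set this before calling an Init function, or put it in a text file called tessdata/configs/digits: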

tessedit_char_whitelist 0123456789

Then your command line becomes:

tesseract image.tif outputbase nobatch digits

Tesseract 3 - a digits config file already exists, so just run a tesseract command like this:

tesseract imagename outputbase digits
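For readers who drive the command line from a script, here is a rough Python sketch of the two variants above. The function names and the `major_version` parameter are hypothetical. The sketch assumes the `tesseract` executable is on the PATH and that the recognised text lands in `<outputbase>.txt`. For Tesseract 2, `tessdata_dir` must point at your installation's tessdata directory.

```python
import os
import subprocess


def write_digits_config(tessdata_dir):
    """Create <tessdata_dir>/configs/digits with the whitelist line (Tesseract 2 setup)."""
    configs_dir = os.path.join(tessdata_dir, "configs")
    os.makedirs(configs_dir, exist_ok=True)
    path = os.path.join(configs_dir, "digits")
    with open(path, "w") as f:
        f.write("tessedit_char_whitelist 0123456789\n")
    return path


def digits_command(image, outputbase, major_version=3):
    """Build the argument list for the command lines quoted in this answer."""
    if major_version == 2:
        return ["tesseract", image, outputbase, "nobatch", "digits"]
    return ["tesseract", image, outputbase, "digits"]


def run_digits(image, outputbase, major_version=3):
    """Run tesseract and return the text it wrote to <outputbase>.txt."""
    subprocess.run(digits_command(image, outputbase, major_version), check=True)
    with open(outputbase + ".txt") as f:
        return f.read()
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.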

【Answer 4】:

That is the command you run on the command line.

For a better answer, we need to know whether you are running tesseract from the command line or as a library.
