tesseract didn't get the little labels

Posted 2023-04-17


【Title】: tesseract didn't get the little labels 【Posted】: 2017-02-06 07:08:07 【Question】:

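I have installed tesseract in my Linux environment.

When I run something like

# tesseract myPic.jpg /output

I get some output, but my picture has some small labels and tesseract doesn't see them.

Is there an option to set a pitch or something similar?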

Example of the text labels:

With this picture, tesseract doesn't recognize any value...

But with this picture:

I get the following output:

J8

J7A-J7B P7 \

2
40 50 0 180 190

200

P1 P2 7

110 110
\ l

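For example, in this case the 90 (top left) isn't seen by tesseract...

I think it's just a matter of setting an option or something like that, isn't it?

Thanks

【Comments】:

【Answer 1】: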

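To get accurate results from Tesseract (or from any OCR engine), you need to follow some guidelines, as described in my answer to this post: Junk results when using Tesseract OCR and tess-two

Here is the gist of it:

Use a high-resolution image, at least 300 DPI (upscale it if needed)

Make sure there are no shadows or bends in the image

If the image is skewed, correct it in code before running OCR

Use a dictionary to help get good results

Adjust the text size (a 12 pt font is ideal)

Binarize the image and use image processing algorithms to remove noise

It's also recommended to spend some time training the OCR engine to get better results, as described at this link: Training Tesseract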
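To make the binarization guideline a bit more concrete, here is a minimal sketch of Otsu's method, one common way to pick a global black/white threshold. It is an illustration in plain Python rather than what any particular SDK does; the names `otsu_threshold` and `binarize` and the toy pixel values are made up for the example, and a real pipeline would normally rely on an imaging library.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance.

    `pixels` is a flat list of 8-bit grayscale values (hypothetical input).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (light) around the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    sample = [10, 12, 11, 200, 210, 205]  # dark text on a light background
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

The threshold ends up between the dark cluster and the light cluster, so small, low-contrast labels only survive if they are clearly darker than their surroundings. That is one reason upscaling and denoising usually come before this step.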

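I took the 2 images you shared and ran some image processing on them using the LEADTOOLS SDK (disclaimer: I am an employee of this company). I was able to get better results with the processed images, but since the original images are not the best, it still was not 100%. Here is the code I used to try to fix the images: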

// filename and outputFile are supplied by the caller.
// Namespaces as documented for LEADTOOLS .NET; adjust to your SDK version.
using Leadtools;
using Leadtools.Codecs;
using Leadtools.ImageProcessing;
using Leadtools.ImageProcessing.Core;

//initialize the codecs class
using (RasterCodecs codecs = new RasterCodecs())
{
   //load the file
   using (RasterImage img = codecs.Load(filename))
   {
      //Run the image processing sequence, starting by resizing the image to 300 DPI
      double newWidth = (img.Width / (double)img.XResolution) * 300;
      double newHeight = (img.Height / (double)img.YResolution) * 300;
      SizeCommand sizeCommand = new SizeCommand((int)newWidth, (int)newHeight, RasterSizeFlags.Resample);
      sizeCommand.Run(img);

      //binarize the image
      AutoBinarizeCommand autoBinarize = new AutoBinarizeCommand();
      autoBinarize.Run(img);

      //change it to 1BPP
      ColorResolutionCommand colorResolution = new ColorResolutionCommand();
      colorResolution.BitsPerPixel = 1;
      colorResolution.Run(img);

      //save the image as PNG
      codecs.Save(img, outputFile, RasterImageFormat.Png, 0);
   }
}
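The resize step in the code above converts the pixel size to physical inches using the stored resolution, then scales to 300 DPI. The same arithmetic can be sketched in Python as follows; `size_for_dpi` is a hypothetical helper name, and unlike the C# cast it rounds instead of truncating and rejects a missing (zero) resolution, which is an assumption about how you would want to handle such files.

```python
def size_for_dpi(width_px, height_px, x_dpi, y_dpi, target_dpi=300):
    """Return the (width, height) in pixels that the image needs at target_dpi.

    width_px / x_dpi is the physical width in inches; multiplying by
    target_dpi gives the pixel width at the target resolution.
    """
    if x_dpi <= 0 or y_dpi <= 0:
        raise ValueError("image resolution is missing or invalid")
    new_width = round(width_px * target_dpi / x_dpi)
    new_height = round(height_px * target_dpi / y_dpi)
    return new_width, new_height


if __name__ == "__main__":
    # An 800x600 screenshot stored at 96 DPI
    print(size_for_dpi(800, 600, 96, 96))
```

For a 96 DPI source this is an upscale by a factor of a little over 3, which means most of the pixels in the result are interpolated. Small labels get larger, but they do not gain real detail.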

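Here is the output image from this process:

【Comments】:

Thanks for the reply, but why doesn't it recognize all the labels? For example, the 90 at the top left of the second image seems easy to read.

You may need to train the engine to get better results, or use a better starting image so that you don't have to interpolate pixels and resize them.

What is the best segmentation method in my case?

I think the default is the best in your scenario: github.com/tesseract-ocr/tesseract/wiki/…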
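For anyone who wants to experiment with the segmentation method discussed in the comments, the sketch below builds a tesseract command line with an explicit page segmentation mode. It assumes the `--psm` option of Tesseract 4 and later (3.x releases spell it `-psm`), and the available mode numbers depend on the installed version. Mode 11 is used only as an example value, since recent versions describe it as a sparse-text mode; whether it recovers the missing labels in these images is untested. The helper names are hypothetical.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, psm=11, legacy_flag=False):
    """Build the argument list for a tesseract run with a chosen segmentation mode.

    legacy_flag=True uses the single-dash `-psm` spelling of Tesseract 3.x.
    """
    if not isinstance(psm, int) or psm < 0:
        raise ValueError("psm must be a non-negative integer")
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_ocr(image_path, output_base, psm=11):
    """Run tesseract (must be installed and on PATH); writes output_base.txt."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, psm), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr() to actually execute it.
    print(" ".join(build_tesseract_cmd("myPic.jpg", "output", psm=11)))
```

Trying two or three modes on the same preprocessed image and comparing the text files is usually quicker than guessing which mode suits a diagram with scattered labels.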
