How to tell tesseract to not ignore blank spaces between words?

Posted 2023-04-17

【Asked】: 2018-07-24 01:55:49
【Question】:

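I am trying to build a business card scanning app using the tesseract library.

I have read about improving Tesseract's accuracy, and I preprocess the image in a few ways before passing it to Tesseract.

I found that Tesseract works best with grayscale or black-and-white images.

I am having trouble choosing the right page segmentation mode.

So far,

G8PageSegmentationModeSingleBlock (assume a single uniform block of text)

has given me the best results for the business card layout.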

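Here are the results with this segmentation mode:

Grayscale:

With a grayscale image, Tesseract recognizes the words (see the red rectangles), but sometimes it fails to recognize the space between words.

Here is the output: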

o
f l ,t!ti,iy,,,tyii,i,,!),i),,m,i,st,,,i,t,)) ',
REAL E:ESrry"irfEf
SOLUTIONS WC, n
TimTsai        ----> (space missing here)
Investor & Consultant
p 780.803.9935
f 888.803.1485
e tim@lnnoventionGroup.ca
w www.lnnoventionGroup.ca

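Black and white:

This recognizes the spaces between words a little better than grayscale, but it also reads the image border as letters and appends them to the actual text. (Notice how the red rectangles extend to the edge of the image, because the segmentation mode is set to treat the image as one uniform block of text.)

Here is the output: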

o,
f I t,!h,tig/i,i,,ip,,ip,iy (,
REAL ESTATE i,
SOLUTIONS INC. (i,
Tim Tsai i;,      ------> (yay, got the space)
Investor & Consultant ii,
p 780.803.9935 :i,
f 888.803.1485 i:,
e tim@lnnoventionGroup.ca (i,
,
-ee_--e_-----e----------ir-eeeereree-e-re---------------, u p

I also tried removing the border; this time the space between the words was not read.

Output:

 o
I I !,,!ih,tle/IiEhp,tt,l,l),!
REAL ESTATE
SOLUTIONS INC.
TimTsai
Investor & Consultant
p 780.803.9935
f 888.803.1485
e tim@lnnoventionGroup.ca

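Questions:

What causes this behavior (ignoring the spaces between words)?

How can I improve this so that tesseract does not keep ignoring the spaces?

I could also look into rotation/deskewing, but I am not sure how much that would help here, since the text looks mostly horizontal to me.

Code: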

G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];
tesseract.delegate = self;
tesseract.engineMode = G8OCREngineModeTesseractCubeCombined;

// Optional: Limit the character set Tesseract should try to recognize from
tesseract.charWhitelist = @"@.,&():ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ";

tesseract.charBlacklist = @"$%^*=;<>\\~`";

// Specify the image Tesseract should recognize on
tesseract.image = [img g8_blackAndWhite];

tesseract.sourceResolution = kG8MaxCredibleResolution;

// Optional: Limit the area of the image Tesseract should recognize on to a rectangle
CGRect tessRect = CGRectMake(0, 0, tesseract.image.size.width, tesseract.image.size.height);
tesseract.rect = tessRect;

// Optional: Limit recognition time with a few seconds
tesseract.maximumRecognitionTime = 60.0;

// Start the recognition
[tesseract recognize];

// Retrieve the recognized text
NSLog(@"text %@", [tesseract recognizedText]);

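【Comments】:

Check the parameters in this question. Since your problem is the opposite, try lowering tosp_min_sane_kn_sp.

Try setting the variable preserve_interword_spaces.

@nguyenq How do I set it, and where? Something like: tesseract.setVariable("preserve_interword_spaces", "1");

【Answer 1】: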

Set preserve_interword_spaces to true (1) to preserve multiple spaces between words.

Your code might look like this:

tesseract.setVariable("preserve_interword_spaces", "1");

For the command-line interface, use the -c switch like this:

tesseract image.jpg output -c preserve_interword_spaces=1

(Answer compiled from the helpful comments; credit to user nguyenq.)
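The setVariable call above is written in a generic, Java-like style and would not compile as Objective-C. With the G8Tesseract wrapper used in the question, the equivalent might look like the sketch below. It assumes the wrapper's setVariableValue:forKey: method and pageSegmentationMode property; the helper name ConfigureForBusinessCard is hypothetical and not part of the library or the original post. Whether the variable takes effect depends on the Tesseract version bundled with the wrapper, so treat this as something to try, not a guaranteed fix.

```objc
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical helper (not from the original post): applies the settings
// discussed in this thread to an already-created G8Tesseract instance.
static void ConfigureForBusinessCard(G8Tesseract *tesseract) {
    // Segmentation mode the question found to work best for business cards.
    tesseract.pageSegmentationMode = G8PageSegmentationModeSingleBlock;

    // Ask Tesseract to keep the spacing it detects between words.
    // Assumption: the bundled Tesseract build recognizes this variable.
    [tesseract setVariableValue:@"1" forKey:@"preserve_interword_spaces"];
}
```

Call ConfigureForBusinessCard(tesseract) after creating the instance and before [tesseract recognize], then compare the recognized text with and without the variable set.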
