[tesseract-ocr][Original] Tesseract LSTM training error: LSTM: Training - Error msg - Encoding of string failed!

Posted FL1623863129


Cause of the error:

See TrainingTesseract 4.00 · tesseract-ocr/tesseract Wiki · GitHub

Encoding of string failed! results when the text string for a training image 
cannot be encoded using the given unicharset. 

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

- A stray unprintable character (like tab or a control character) in the text.

- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer. 

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
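
The first two causes listed above can be checked before training. The following is a minimal sketch, not part of the Tesseract tooling: the function names are hypothetical, and it assumes the usual unicharset layout in which the first line holds the entry count and every later line starts with the unichar followed by a space. It compares code points one at a time, so multi-code-point entries such as Indic graphemes may be reported even though the unicharset covers them.

```python
import unicodedata


def load_unichars(unicharset_lines):
    """Collect the unichars listed in a unicharset file.

    Assumption: line 1 is the number of entries, and each later line
    begins with the unichar, followed by a space and its properties.
    """
    unichars = set()
    for line in unicharset_lines[1:]:
        line = line.rstrip("\n")
        if line:
            unichars.add(line.split(" ", 1)[0])
    return unichars


def missing_characters(text, unichars):
    """Map every character of `text` that is not a unichar to a description.

    Spaces and newlines are skipped: they separate words and lines in the
    ground-truth text rather than appearing as ordinary unichars.
    """
    missing = {}
    for ch in text:
        if ch in (" ", "\n"):
            continue
        if ch not in unichars:
            # Control characters have no Unicode name, so fall back to
            # the code point and general category (e.g. "Cc").
            fallback = f"U+{ord(ch):04X} ({unicodedata.category(ch)})"
            missing[ch] = unicodedata.name(ch, fallback)
    return missing


# Possible usage (the file names are placeholders):
# with open("my.unicharset", encoding="utf-8") as f:
#     unichars = load_unichars(f.readlines())
# with open("ground_truth.txt", encoding="utf-8") as f:
#     print(missing_characters(f.read(), unichars))
```

An entry whose description starts with `U+` and ends in `(Cc)` points to a stray control character such as a tab; a named entry such as POUND SIGN points to a character that would need to be added to the unicharset.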

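Put simply, the message means that your training data contains characters that are not in the unicharset. When fine-tuning a model you normally do not build your own unicharset, so the one you inherit may happen to lack some of the characters in your custom dataset. The trainer skips the affected samples, so training itself is not disrupted, but the resulting model will not be able to recognize those characters. The fix is to specify a unicharset when training: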

mkdir -p ~/tesstutorial/tellayer_from_tel

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm

lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01
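
To judge whether the error is still frequent after a run like the one above, one option is to save the trainer's output to a file and count how often the message appears. The sketch below assumes the output was redirected to a log file by the user; the function name and the sample log lines are made up for illustration, and only the message text itself comes from the error being discussed.

```python
def count_encoding_failures(log_text):
    """Count the log lines that report 'Encoding of string failed!'."""
    return sum(
        1
        for line in log_text.splitlines()
        if "Encoding of string failed!" in line
    )


# Illustrative input only; real lstmtraining output will look different.
sample_log = "\n".join(
    [
        "loading training data",
        "Encoding of string failed!",
        "iteration 100",
        "Encoding of string failed!",
    ]
)
failures = count_encoding_failures(sample_log)
```

Keep in mind that the trainer revisits the same samples over many iterations, so the count is better read as a trend across runs than as the number of distinct bad training lines.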

How do you generate the unicharset?

Use the following commands:

unicharset_extractor --output_unicharset chi_sim.unicharset --norm_mode 1 FIRC.box

set_unicharset_properties -U chi_sim.unicharset -O chi_sim.unicharset --script_dir ./
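
Because the unicharset is extracted from the box file, it can only contain symbols that the box file actually lists. The sketch below tallies those symbols so that gaps can be spotted before running the commands above. It assumes the common box file layout of one entry per line, `<symbol> <left> <bottom> <right> <top> <page>`; the function name is hypothetical and not part of Tesseract.

```python
from collections import Counter


def box_symbol_counts(box_lines):
    """Count how often each symbol occurs in a box file.

    Assumption: every entry is "<symbol> <left> <bottom> <right> <top> <page>".
    Splitting from the right keeps symbols intact even if they are
    whitespace (for example a tab used as an end-of-line marker).
    """
    counts = Counter()
    for line in box_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        counts[parts[0]] += 1
    return counts


# Possible usage with the box file from the commands above:
# with open("FIRC.box", encoding="utf-8") as f:
#     for symbol, n in box_symbol_counts(f).most_common():
#         print(repr(symbol), n)
```

A character that appears in your training text but not in this tally will not end up in the generated unicharset either.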

References:

How to use the existing tools to train Tesseract 3.03–3.05 to recognize a new language (Wordsky's blog, CSDN)

https://github.com/tesseract-ocr/tesseract/issues/549
