[tesseract-ocr][Original] Tesseract LSTM training error: LSTM: Training - Error msg - Encoding of string failed!
Posted FL1623863129
Cause of the error:
See TrainingTesseract 4.00 · tesseract-ocr/tesseract Wiki · GitHub, which explains:
Encoding of string failed! results when the text string for a training image
cannot be encoded using the given unicharset.
Possible causes are:
- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.
In any case it will result in that training image being ignored by the trainer.
If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
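To find out which of these causes applies, one option is to scan the transcription text before training. The following is a minimal Python sketch, not part of Tesseract. The names `load_covered_chars` and `find_problem_chars` are hypothetical. It assumes the usual text layout of a unicharset file: a count on the first line, then one entry per line with the unichar first, `NULL` standing for the space character, and the special entries `Joined` and `|Broken|0|1`. It checks individual code points, so it is only an approximation for scripts where one entry spans several code points, and it ignores any normalization Tesseract applies.

```python
import unicodedata

# Special entries assumed to appear in unicharset files; they are not characters.
SPECIAL_ENTRIES = {"Joined", "|Broken|0|1"}


def load_covered_chars(unicharset_text):
    """Return the set of code points that occur in any unicharset entry."""
    covered = set()
    for line in unicharset_text.split("\n")[1:]:  # first line is the entry count
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]  # the unichar precedes its properties
        if unichar == "NULL":
            covered.add(" ")  # assumption: NULL represents the space character
        elif unichar not in SPECIAL_ENTRIES:
            covered.update(unichar)
    return covered


def find_problem_chars(text, covered):
    """Return sorted (char, reason) pairs for characters likely to fail encoding."""
    problems = set()
    for ch in text:
        if ch == "\n":
            continue  # line breaks separate samples and are not reported
        if unicodedata.category(ch) == "Cc":
            problems.add((ch, "control character"))  # e.g. tab or carriage return
        elif ch not in covered:
            problems.add((ch, "not in unicharset"))
    return sorted(problems)


if __name__ == "__main__":
    # Tiny inline unicharset with shortened properties, for demonstration only.
    demo_unicharset = "\n".join(["4", "NULL 0 Common 0", "Joined 7", "|Broken|0|1 15", "a 3"])
    covered = load_covered_chars(demo_unicharset)
    for ch, reason in find_problem_chars("a\t£", covered):
        print(repr(ch), "U+%04X" % ord(ch), reason)
```

With real data, the two arguments would be the contents of the unicharset file and of a transcription file, both read with `encoding="utf-8"`.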
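In short, the message means that your training data contains characters that are not in the unicharset. When fine-tuning a model you normally do not build your own unicharset, so the unicharset you end up using may simply not contain some of the characters in your custom dataset. The trainer ignores the affected training images, so training itself is not disrupted, but the resulting model will be unable to recognize those characters. The fix is to specify a unicharset when training: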
mkdir -p ~/tesstutorial/tellayer_from_tel
combine_tessdata -e ../tessdata/tel.traineddata \
~/tesstutorial/tellayer_from_tel/tel.lstm
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
--script_dir ../langdata --debug_interval 0 \
--continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
--append_index 5 --net_spec '[Lfx256 O1c105]' \
--model_output ~/tesstutorial/tellayer_from_tel/tellayer \
--train_listfile ~/tesstutorial/tel/tel.training_files.txt \
--target_error_rate 0.01
How do you generate the unicharset?
Use the following commands:
unicharset_extractor --output_unicharset chi_sim.unicharset --norm_mode 1 FIRC.box
set_unicharset_properties -U chi_sim.unicharset -O chi_sim.unicharset --script_dir ./
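After generating the unicharset, it can be worth confirming that every symbol in the box file actually ended up as an entry. The sketch below is a rough check under stated assumptions, not a Tesseract tool. The helper names are hypothetical. It assumes box lines of the form `symbol left bottom right top page`, that a line whose symbol is a tab marks the end of a text line, and that `NULL` in the unicharset stands for the space character. Because `--norm_mode` may normalize or combine characters, a reported symbol is a hint to investigate rather than proof of a problem.

```python
def unicharset_entries(unicharset_text):
    """Return the set of unichar strings listed in a unicharset file."""
    entries = set()
    for line in unicharset_text.split("\n")[1:]:  # first line is the entry count
        if line.strip():
            symbol = line.split(" ", 1)[0]
            entries.add(" " if symbol == "NULL" else symbol)
    return entries


def box_symbols(box_text):
    """Return the set of symbols found in box-file lines."""
    symbols = set()
    for line in box_text.split("\n"):
        # Split from the right so a symbol that is itself a space survives.
        parts = line.rsplit(" ", 5)
        if len(parts) == 6 and parts[0] != "\t":  # skip end-of-line marker lines
            symbols.add(parts[0])
    return symbols


def missing_from_unicharset(box_text, unicharset_text):
    """Return box symbols that have no matching unicharset entry, sorted."""
    return sorted(box_symbols(box_text) - unicharset_entries(unicharset_text))


if __name__ == "__main__":
    # Inline sample data with shortened properties, for demonstration only.
    demo_unicharset = "3\nNULL 0 Common 0\n中 1\n文 1\n"
    demo_box = "中 10 10 20 20 0\n文 25 10 35 20 0\n£ 35 10 45 20 0\n"
    print(missing_from_unicharset(demo_box, demo_unicharset))
```

For the commands above, the inputs would be the contents of `FIRC.box` and `chi_sim.unicharset`, read with `encoding="utf-8"`.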