Tesseract training - Finetuning Characters
Posted: 2020-01-27 11:34:43
Question:
I want to train my existing Tesseract model for a new character. I have already tried the tutorial at
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
("Fine Tuning for ± a few characters"; I am on a Mac),
but it did not work. When I evaluate the model, even on the training data, it does not recognize the ± character.
I have installed:
tesseract 5.0.0-alpha-447-g52cf
leptonica-1.78.0
libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6
I set this up by cloning the following GitHub repositories to my Desktop and installing tesseract:
https://github.com/tesseract-ocr/tesseract.git
https://github.com/tesseract-ocr/langdata_lstm
https://github.com/tesseract-ocr/tessdata_best
My installation steps were as follows.
Install the build tools:
brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
Run:
ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
Change into the cloned tesseract folder:
cd ~/Desktop/tesseract
Run autogen.sh:
./autogen.sh
Install the dependencies:
brew install cairo pango icu4c autoconf libffi libarchive libpng
export PKG_CONFIG_PATH=\
$(brew --prefix)/lib/pkgconfig:\
$(brew --prefix)/opt/libarchive/lib/pkgconfig:\
$(brew --prefix)/opt/icu4c/lib/pkgconfig:\
$(brew --prefix)/opt/libffi/lib/pkgconfig:\
$(brew --prefix)/opt/libpng/lib/pkgconfig
(If some of these are already installed, use brew reinstall instead of brew install.)
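One way to confirm that the exported PKG_CONFIG_PATH takes effect before running configure is to ask pkg-config for the libraries the training tools depend on. This is a minimal sketch, not part of the original post: `check_pc` is a hypothetical helper, and the module names in the usage line below are an assumption about what the configure script looks for.

```shell
# Hypothetical helper (not from the original post): report pkg-config
# modules that cannot be found with the current PKG_CONFIG_PATH.
# PKG_CONFIG can be overridden; it defaults to the pkg-config binary.
check_pc() {
  missing=0
  for module in "$@"; do
    if ! "${PKG_CONFIG:-pkg-config}" --exists "$module"; then
      echo "missing: $module"
      missing=1
    fi
  done
  return "$missing"
}
```

It could be called as `check_pc icu-uc icu-i18n pango cairo pangocairo libarchive`. If a module is reported missing, the training tools may not build even though `make` itself succeeds.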
Run configure:
./configure
Build and install tesseract:
make
sudo make install
Build and install the training tools:
make training
sudo make training-install
After that, I copied eng.traineddata from tessdata_best into tesseract/tessdata.
My training code is as follows:
# GENERATE TRAINING DATA
rm -rf ~/Desktop/tesstutorial/trainplusminus/*
PANGOCAIRO_BACKEND=fc \
~/Desktop/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/../../Library/Fonts \
--lang eng \
--linedata_only \
--langdata_dir ~/Desktop/langdata_lstm \
--tessdata_dir ~/Desktop/tesseract/tessdata \
--fontlist "Arial" \
--noextract_font_properties \
--exposures "0" \
--maxpages 1000 \
--save_box_tiff \
--output_dir ~/Desktop/tesstutorial/trainplusminus
# EXTRACT THE CURRENT MODEL OF THE BEST TRAINING DATA SET (PROVIDED BY OCR-GITHUB)
~/Desktop/tesseract/src/training/combine_tessdata \
-e ~/Desktop/tesseract/tessdata/eng.traineddata ~/Desktop/tesstutorial/trainplusminus/eng.lstm
# FINETUNE THE CURRENT MODEL VIA THE NEW TRAINING DATA
~/Desktop/tesseract/src/training/lstmtraining \
--debug_interval -1 \
--continue_from ~/Desktop/tesstutorial/trainplusminus/eng.lstm \
--model_output ~/Desktop/tesstutorial/trainplusminus/plusminus \
--traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/Desktop/tesseract/tessdata/eng.traineddata \
--train_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 5000
# COMBINE THE NEW BEST TRAINING DATA
lstmtraining --stop_training \
--continue_from ~/Desktop/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/Desktop/tesseract/tessdata/eng.traineddata \
--model_output ~/Desktop/tesstutorial/trainplusminus/eng.traineddata
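The evaluation step mentioned in the question is not shown. A minimal sketch of how the finetuned checkpoint could be evaluated on the training data with lstmeval, assuming the directory layout used above. `eval_on_training_data` is a hypothetical wrapper, not part of the original post.

```shell
# Hypothetical wrapper (not from the original post): run lstmeval on the
# finetuned checkpoint, using the training list as the evaluation list.
# $1 is the output directory used by the commands above.
eval_on_training_data() {
  out_dir="$1"
  lstmeval \
    --model "$out_dir/plusminus_checkpoint" \
    --traineddata "$out_dir/eng/eng.traineddata" \
    --eval_listfile "$out_dir/eng.training_files.txt"
}
```

For the setup above this would be called as `eval_on_training_data ~/Desktop/tesstutorial/trainplusminus`.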
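I do not know why this code does not produce the result I expect. I previously used the same code to train for a new font, and it worked. The only change I made for finetuning the new character was adding the following text to langdata_lstm/eng/eng.training_text: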
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
Thanks for your help!
Dustin
Comments:
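Try changing the unicharset file from the langdata_lstm directory to Latin.unicharset. The training process should then work properly.
Were you able to get the expected result?
Answer 1:
If the eng.traineddata file you get after training works for all other characters and digits, and the only problem is that it does not recognize the "±" symbol you just tried to add, try the following: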
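- Make sure "±" is present in both the eng.charset_size=xx file and the eng.unicharset file.
- In langdata_lstm/eng/eng.training_text, use around 2000 lines containing roughly 200 occurrences of "±".
- Set --max_iterations to at least 3000 when finetuning for a new character.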
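The first two checks can be scripted. A minimal sketch, assuming a UTF-8 text file. `count_char` is a hypothetical helper, not a Tesseract tool.

```shell
# Hypothetical helper (not a Tesseract tool): count the lines that contain
# a character and its total number of occurrences in a text file.
count_char() {
  char="$1"
  file="$2"
  # -a: treat the file as text; -F: match the character literally.
  lines=$(grep -a -c -F -e "$char" "$file")
  hits=$(grep -a -o -F -e "$char" "$file" | wc -l | tr -d ' ')
  echo "$lines lines, $hits occurrences"
}
```

For example, `count_char '±' ~/Desktop/langdata_lstm/eng/eng.training_text` reports how the training text compares with the rough targets above. The same helper can be pointed at a unicharset file, for instance one unpacked from a traineddata file with `combine_tessdata -u`.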
Hope this helps... Thanks, your question helped me. :)
Comments:
Thanks for these tips, this worked for me! I never had the character appear in …