How to detect language or script from an input image using Python or Tesseract OCR?

Posted 2023-04-17


【Posted】: 2022-01-08 21:59:35 【Question】:

Given an input image whose text may be in any language or writing system, how do I detect which script the text in the picture uses?

Any solution based on Python or Tesseract OCR would be appreciated.

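Note that "script" here means the writing system, such as Latin, Cyrillic, or Devanagari, used by languages such as English, Russian, and Hindi, respectively.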


【Answer 1】:

Prerequisites:

Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
Install pytesseract: pip install pytesseract

Script detection:

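import re

import pytesseract


def detect_image_lang(img_path):
    """Return (script_name, confidence) for the dominant script in the image."""
    try:
        # OSD (orientation and script detection) returns a plain-text report.
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # OSD failed (for example, too little text) or the report did not match.
        return None, 0.0


script_name, confidence = detect_image_lang("image.png")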
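
To make the regular expressions above easier to follow, here is a small sketch of what the OSD report looks like and how its fields can be pulled apart without Tesseract installed. The sample report is illustrative (the numbers are made up), and parse_osd is a hypothetical helper, not part of pytesseract.

```python
# Illustrative OSD report; the values are made up for this example.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.01
Script: Latin
Script confidence: 1.90
"""


def parse_osd(osd_text):
    """Turn the 'Key: value' lines of an OSD report into a dict (hypothetical helper)."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            fields[key.strip()] = value.strip()
    return fields


report = parse_osd(SAMPLE_OSD)
script = report.get("Script")
confidence = float(report.get("Script confidence", 0.0))
print(script, confidence)
```

Recent versions of pytesseract can also return the report as a dictionary by passing output_type=pytesseract.Output.DICT to image_to_osd, which avoids parsing the text yourself if your installed version supports it.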

Language detection:

After running OCR with Tesseract, pass the extracted text through the langdetect library (or any other language-detection library).
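
A minimal sketch of that two-step flow, assuming both pytesseract and langdetect (pip install langdetect) are installed. The function name ocr_then_detect_language, the ocr_lang default, and the min_chars threshold are illustrative choices, not part of the original answer.

```python
def ocr_then_detect_language(img_path, ocr_lang="eng", min_chars=20):
    """Run OCR on the image, then guess the language of the recognized text.

    Returns (language_code, probability), or (None, 0.0) if detection is not possible.
    """
    # Imported inside the function so a missing dependency only fails on use.
    import pytesseract
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is randomized; fixing the seed makes results repeatable.
    DetectorFactory.seed = 0

    # ocr_lang is the Tesseract model used for recognition (assumed "eng" here).
    text = pytesseract.image_to_string(img_path, lang=ocr_lang)
    text = " ".join(text.split())  # collapse whitespace and line breaks

    # Very short strings give unreliable guesses, so bail out early.
    if len(text) < min_chars:
        return None, 0.0

    try:
        best = detect_langs(text)[0]  # candidates are sorted by probability
    except LangDetectException:
        return None, 0.0
    return best.lang, best.prob
```

Calling ocr_then_detect_language("image.png") would return a pair such as a language code and its probability. OCR quality depends on using a model that matches the text, so in practice ocr_lang would probably be chosen from the script found in the script detection step rather than left at the default.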

【Comments】:

See the Tesseract OCR documentation for the list of all supported scripts and languages.
