I need some ideas on text processing for SRT subtitles

Posted: 2020-01-13 08:58:26

Question:

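The title says what I really need ATM.

Basically, I've built an OCR toolchain based on Tesseract and ImageMagick, and I've managed to get the output text very consistent. I'm using it to OCR some old hard-subbed videos and turn them into soft-sub SRT subs. To grab the screenshots used as image input, I use a modified version of an old shell script I found and rewrote years ago. These are fed into a second script, which processes them into a form Tesseract can read. At this point I could easily do the rest by hand, but if possible I'd like to automate everything except the final proofreading pass.

Sample text (from the current project)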

03:04.418  Their parents have always written    letters thanking us. =  
03:05.018  Their parents have always written    letters thanking us. =  
03:05.619  Their parents have always written    letters thanking us. =  
03:06.219  Their parents have always written    letters thanking us. =  
03:06.820  Their parents have always written    letters thanking us. =  
03:07.421  Their parents have always written    letters thanking us. =  
03:08.021  Their parents have always written    letters thanking us. =  
03:08.622  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:09.222  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:09.823  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:10.424  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:11.024  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:11.625  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:12.225  In additien te all the previeus requests se far..."  
03:12.826  In additien te all the previeus requests se far..."  
03:13.427  In additien te all the previeus requests se far..."  
03:14.027  In additien te all the previeus requests se far..."  
03:14.628  In additien te all the previeus requests se far..."

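Basically, I want to match the text, pull the timestamps from the first and last line of each run of identical text, and write them out in SRT format: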

1
00:03:04,418 --> 00:03:08,021
Their parents have always written
letters thanking us. =  

2
00:03:08,622 --> 00:03:11,625
This seminary was highly reeemmended.
| am relieved te leave her in your care. = 

3
00:03:12,225 --> 00:03:14,628
In additien te all the previeus requests se far..."

At this point I'm fine with this being a separate script.
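One small piece of that conversion is the timestamp itself: the sample lines carry `MM:SS.mmm`, while the desired SRT output uses `HH:MM:SS,mmm`. A minimal sketch of that step, assuming a hypothetical helper named `to_srt_time` (it is not part of any script in this thread) and assuming the input is always either `MM:SS.mmm` or `HH:MM:SS.mmm`:

```shell
#!/bin/bash
# Hypothetical helper, not taken from the scripts in this thread.
# Assumes the input looks like MM:SS.mmm or HH:MM:SS.mmm.
to_srt_time() {
    local t=$1
    # Keep only the colons so they can be counted.
    local colons=${t//[^:]/}
    # A single colon means the hour field is missing, so prepend it.
    if [[ ${#colons} -eq 1 ]]; then
        t="00:$t"
    fi
    # SRT separates seconds and milliseconds with a comma, not a period.
    printf '%s\n' "${t/./,}"
}

to_srt_time "03:04.418"   # prints 00:03:04,418
```

Doing this in one place keeps the grouping logic free of formatting concerns; whether the real toolchain needs the hour check at all depends on how the screenshots are named.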

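Basically sub.txt in, sub.srt out, followed by a proofreading pass. There is some variability in the detected text right now, but it's minimal: I is occasionally detected as | or [, and o and e sometimes get confused in a few odd corner cases.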
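One way to keep those variations from splitting a subtitle into several cues is to compare lines on a normalized key while still printing the text as detected. The sketch below assumes a hypothetical helper `match_key`; folding every `o` into `e` is a deliberately crude assumption that only makes sense for comparison, never for the output text.

```shell
#!/bin/bash
# Hypothetical helper: build a comparison key that hides the OCR variations
# described above. Use it only for matching, not for the printed subtitle.
match_key() {
    local s=$1
    local -a words
    s=${s//|/I}     # | detected instead of I
    s=${s//\[/I}    # [ detected instead of I
    s=${s//o/e}     # fold o into e, since the two get confused
    # Split on whitespace and rejoin with single spaces, which collapses
    # runs of spaces and trims both ends.
    read -r -a words <<< "$s"
    printf '%s\n' "${words[*]}"
}

a=$(match_key "| am relieved te   leave her")
b=$(match_key "I am relieved to leave her")
[[ $a == "$b" ]] && echo "same subtitle"
```

The cost of the `o`/`e` folding is that genuinely different words such as "not" and "net" collapse to the same key, so this is only reasonable when consecutive frames are being compared.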

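Edit, 2 February 2020:

I've made some changes and tweaks to get closer to what I want, both to MY shell script and to Ivan's. I've eliminated the blank sub lines that Ivan's script and mine were generating.

The updated processing and OCR script, by the way: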

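#!/bin/bash -x

cd "$1" || exit 1
mkdir -p ocr

for f in *.png; do
    # Keep the first two dot-separated fields of the file name
    # (for a name like 00:03:04.418.png this drops the .png extension).
    base="$(basename "$f" | cut -d "." -f 1,2)"
    echo "$base"
    if [[ -z "$2" ]]; then
        # No second argument: threshold at 11% and flood-fill with black.
        convert "$f" -separate -average -crop +0+720 -threshold 11% -fill black -draw 'color 700,10 floodfill' +repage ocr/"$base".png
    else
        # Any second argument: negate first, threshold at 15%, flood-fill with white.
        convert "$f" -separate -average -crop +0+720 -negate -threshold 15% -fill white -draw 'color 700,10 floodfill' +repage ocr/"$base".png
    fi
    # Draw a blue "L" so that every frame yields some text.
    magick mogrify -pointsize 50 -fill blue -draw 'text 1400,310 "L" ' +repage ocr/"$base".png
done

cd ocr || exit 1
for i in *.png; do
    base2="$(basename "$i" | cut -d "." -f 1,2 | cut -d ":" -f 2,3)"
    # OCR one frame, flatten it to a single line and squeeze runs of whitespace.
    # The two tr calls must be piped; run one after the other, the second gets no input.
    { tesseract "$i" stdout -c page_separator='' --psm 6 --oem 1 --dpi 300 | tr '\n' ' ' | tr -s '[:space:]' ' '; echo; } >> text.txt
    echo "$base2""  " >> time.txt
done

# Prefix each OCR line with its timestamp.
awk '{ printf("%s", $0); getline < "text.txt"; print $0 }' time.txt >> out.txt
sed -i 's/|/I/g' out.txt
sed -i 's/\[/I/g' out.txt
#sed -i 's/L//g' out.txt
#sed -i 's/=//g' out.txt
# Strip the last two characters of every line.
sed -i 's/.$//' out.txt
sed -i 's/.$//' out.txt

# Keep only lines that contain at least one letter. Calling sed on the file
# directly avoids the "while read" wrapper, which swallowed the first line.
sed '/[[:alpha:]]/!d' out.txt >> sub.txt
exit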

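The part that draws the blue L is there to make sure every line has something on it for the timestamp matching.

The updated version of Ivan's SRT script: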

#!/bin/bash -x

sub="$1"              # path to sub file
OLD=$IFS              # remember current delimiter
IFS=$'\n'             # set delimiter to the new line
raw=( $(cat "$sub") ) # load sub into raw array
IFS=$OLD              # set default delimiter back

reset () {
    unset 'raw[0]'        # remove 1-st item from array
    raw=( "${raw[@]}" )   # rearrange array
}

output () {
    # time3 is the last timestamp that still matched text1
    printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time3" "$text1"
}

speen () {
    time3=$time2            # keep the previous timing before time2 moves on
    reset
    test=( "${raw[@]::2}" ) # get two more items
    test2=( ${test[0]} )    # split 2-nd item
    time2=${test2[0]}       # get 2-nd timing
    text2=${test2[@]:1}     # get 2-nd text

    # if only one item in test then this is the end, return
    [[ "${test[1]}" ]] || { printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
    #   compare,     speen more if match,  print and go further if not
    [[ "$text1" == "$text2" ]] && speen || output
}

N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
    echo $((N++))       # print and inc counter
    test1=( $raw )      # get 1-st item
    time1=${test1[0]}   # get 1-st timing
    text1=${test1[@]:1} # get 1-st text
    speen
done

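I just added a third time variable, time3, to hold the old time2 value. Basically, eliminating the blank timestamp lines broke his matching. I realized that time2 is the first timestamp that does NOT match, so I need to keep the previous one from the last pass through the loop. Hence time3=$time2, after which time2 gets its new value. The sub is then printed using the old time2 (now time3).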
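The same idea (remember the last timestamp that still matched, and print it once the text changes) can also be written as a single loop without recursion. This is only a sketch, not Ivan's script: the function name `group_subs` is made up, each input line is assumed to be `<timestamp> <text>`, and the timestamps are passed through with the `00:` prefix and period separator exactly as the scripts above do.

```shell
#!/bin/bash
# Sketch of an iterative alternative; group_subs is a hypothetical name.
# Reads "<timestamp> <text...>" lines on stdin and prints one cue per run
# of consecutive lines with identical text.
group_subs() {
    local n=0 first="" last="" prev="" ts text
    # "|| [[ -n $ts ]]" also handles a final line without a trailing newline.
    while read -r ts text || [[ -n $ts ]]; do
        if [[ -n $first && $text == "$prev" ]]; then
            last=$ts        # same text: extend the current cue
            continue
        fi
        # Text changed: flush the finished cue, if there is one.
        if [[ -n $first ]]; then
            printf '%d\n00:%s --> 00:%s\n%s\n\n' "$((++n))" "$first" "$last" "$prev"
        fi
        first=$ts last=$ts prev=$text
    done
    # Flush the final cue, which no later line would trigger.
    if [[ -n $first ]]; then
        printf '%d\n00:%s --> 00:%s\n%s\n\n' "$((++n))" "$first" "$last" "$prev"
    fi
}

# Usage, assuming sub.txt exists in the current directory:
#   group_subs < sub.txt > sub.srt
```

Because `last` is only updated while the text matches, there is no need for a separate time3; the final flush after the loop plays the role of the "only one item left" branch.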

Comments:

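Use read in a loop to put the lines into an array. Compare the current text with the previous text to detect changes. On a change, output the first and last occurrence of the previous text to produce the new format. The main challenge is getting the last one right and handling frames with no text at all.

The main problem is the small variations the OCR software produces. I need some kind of rough matching.

Answer 1: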

I ended up with this:

#!/bin/bash

sub=file              # path to sub file
OLD=$IFS              # remember current delimiter
IFS=$'\n'             # set delimiter to the new line
raw=( $(cat "$sub") ) # load sub into raw array
IFS=$OLD              # set default delimiter back

reset () {
    unset 'raw[0]'        # remove 1-st item from array
    raw=( "${raw[@]}" )   # rearrange array
}

output () {
    text1=${text1//|/I}   # change | to I in text
    text1=${text1//\[/I}  # change [ to I in text
    printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1"
}

speen () {
    reset
    test=( "${raw[@]::2}" ) # get two more items
    test2=( ${test[0]} )    # split 2-nd item
    time2=${test2[0]}       # get 2-nd timing
    text2=${test2[@]:1}     # get 2-nd text
    # if only one item in test then this is the end, return
    [[ "${test[1]}" ]] || { printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
    #   compare,     speen more if match,  print and go further if not
    [[ "$text1" == "$text2" ]] && speen || output
}

N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
    echo $((N++))       # print and inc counter
    test1=( $raw )      # get 1-st item
    time1=${test1[0]}   # get 1-st timing
    text1=${test1[@]:1} # get 1-st text
    speen
done

Comments:

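This is really close to what I want. It's almost perfect. I really just need a way to strip the fluff characters I add in post-processing and to make | and [ match I, and that's basically all I need. I'll probably add some text processing to handle that last bit, plus a string test that still counts two lines as a match when they differ by a single character.

Done, with variable substitution.

Yeah, I just wrote a small script to take care of that and a few other issues. I'm just using sed.

On the plus side, I understand the code well enough now that I think I can add what I need and call out more edge cases.

After using it for a while, I noticed that the filtering of | and [ breaks the string matching. I'll have to look into it this weekend.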
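The "differs by a single character" test mentioned in these comments could be sketched as a position-by-position comparison (a Hamming distance with a threshold). The helper name `nearly_equal` and its optional third argument are assumptions for illustration; nothing like it appears in the scripts above, and it does not cope with inserted or dropped characters, only substitutions.

```shell
#!/bin/bash
# Hypothetical helper: succeed when two strings have the same length and
# differ in at most max_diff positions (default 1).
nearly_equal() {
    local a=$1 b=$2 max_diff=${3:-1} misses=0 i
    # Different lengths cannot be compared position by position.
    [[ ${#a} -eq ${#b} ]] || return 1
    for (( i = 0; i < ${#a}; i++ )); do
        if [[ ${a:i:1} != "${b:i:1}" ]]; then
            # Give up as soon as the allowed number of mismatches is exceeded.
            (( ++misses > max_diff )) && return 1
        fi
    done
    return 0
}

if nearly_equal "te leave her" "to leave her"; then
    echo "treated as the same line"
fi
```

In the matching step this would stand in for the exact `[[ "$text1" == "$text2" ]]` comparison; the threshold is a judgment call, since a value that is too generous can merge two short subtitles that really are different.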
