Processing a VTT subtitle file (vi + sed + awk)

Posted by dingdingfish

I received a subtitle file with a .vtt extension. Part of its content looks like this:

00:00:00.030 --> 00:00:01.670 align:start position:0%
 
in<00:00:00.359><c> this</c><00:00:00.539><c> episode</c><00:00:00.989><c> we're</c><00:00:01.319><c> going</c><00:00:01.410><c> to</c><00:00:01.469><c> take</c><00:00:01.620><c> a</c>

00:00:01.670 --> 00:00:01.680 align:start position:0%
in this episode we're going to take a
 

00:00:01.680 --> 00:00:05.599 align:start position:0%
in this episode we're going to take a
step<00:00:01.979><c> into</c><00:00:02.280><c> the</c><00:00:02.659><c> unknown</c><00:00:03.980><c> it's</c><00:00:04.980><c> time</c><00:00:05.250><c> to</c><00:00:05.400><c> talk</c>

00:00:05.599 --> 00:00:05.609 align:start position:0%
step into the unknown it's time to talk
 

00:00:05.609 --> 00:00:13.039 align:start position:0%
step into the unknown it's time to talk
about<00:00:10.099><c> the</c><00:00:11.099><c> non</c><00:00:11.309><c> value</c><00:00:11.730><c> is</c><00:00:11.910><c> a</c><00:00:11.940><c> placeholder</c><00:00:12.360><c> for</c>

00:00:13.039 --> 00:00:13.049 align:start position:0%
about the non value is a placeholder for
 

00:00:13.049 --> 00:00:14.690 align:start position:0%
about the non value is a placeholder for
missing<00:00:13.440><c> or</c><00:00:13.710><c> not</c><00:00:13.980><c> applicable</c><00:00:14.190><c> information</c>

00:00:14.690 --> 00:00:14.700 align:start position:0%
missing or not applicable information

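First, in vi, delete the lines that begin with 00: using the command :g/^00:/d. Then, still in vi, delete the lines that end with </c> using the command :g/<\/c>$/d. The result is: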

in this episode we're going to take a


in this episode we're going to take a

step into the unknown it's time to talk


step into the unknown it's time to talk

about the non value is a placeholder for


about the non value is a placeholder for

missing or not applicable information
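
The two vi deletions above can also be done without opening an editor. The following is a minimal sketch, assuming a sed that accepts several -e expressions (GNU and BSD sed both do). The temporary file and its shortened sample content are only for illustration; substitute the real .vtt file.

```shell
#!/bin/sh
# Build a small sample that mimics the structure of the excerpt above.
sample=$(mktemp)
printf '%s\n' \
  '00:00:00.030 --> 00:00:01.670 align:start position:0%' \
  ' ' \
  'in<00:00:00.359><c> this</c><00:00:00.539><c> episode</c>' \
  '' \
  '00:00:01.670 --> 00:00:01.680 align:start position:0%' \
  'in this episode' > "$sample"

# /^00:/d     deletes timestamp lines      (same pattern as :g/^00:/d)
# /<\/c>$/d   deletes lines ending in </c> (same pattern as :g/<\/c>$/d)
cleaned=$(sed -e '/^00:/d' -e '/<\/c>$/d' "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

As in the vi output, blank and whitespace-only lines are still present at this point.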

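Next, remove the blank lines with sed, using the command sed -r '/^\s*$/d'. The pattern also matches lines that contain only whitespace. Here it is run against /tmp/1, the file saved after the vi steps:

# sed -r '/^\s*$/d' /tmp/1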
in this episode we're going to take a
in this episode we're going to take a
step into the unknown it's time to talk
step into the unknown it's time to talk
about the non value is a placeholder for
about the non value is a placeholder for
missing or not applicable information
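
As far as I know, \s is a GNU extension to sed regular expressions, and the pattern does not actually need extended regex mode. A sketch of a more portable variant, using the POSIX character class [[:space:]] on made-up input:

```shell
#!/bin/sh
# Made-up input: two text lines separated by an empty line and a spaces-only line.
with_blanks=$(printf '%s\n' 'first line' '' '   ' 'second line')

# [[:space:]] is a POSIX character class, so no -r flag and no \s are required.
no_blanks=$(printf '%s\n' "$with_blanks" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$no_blanks"
```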

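Finally, remove the duplicate lines with awk, using the command awk '!x[$0]++'. It prints a line only the first time that line appears, so it drops every later repeat, not just consecutive ones. In this file the duplicates happen to be adjacent, but a subtitle line that legitimately recurs later in the transcript would be dropped as well. A neat one-liner: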

in this episode we're going to take a
step into the unknown it's time to talk
about the non value is a placeholder for
missing or not applicable information
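
The whole procedure can be chained into a single pipeline. This is a sketch under assumptions, not the method used in the post: vtt_to_text is a hypothetical helper name, the vi steps are replaced by equivalent sed expressions, and the timestamp pattern is kept as ^00: because that is all the excerpt shows. The last two commands contrast awk's first-occurrence filter with uniq, which removes only adjacent duplicates.

```shell
#!/bin/sh
# vtt_to_text is a hypothetical helper name, not part of the original post.
vtt_to_text() {
  sed -e '/^00:/d' -e '/<\/c>$/d' -e '/^[[:space:]]*$/d' "$1" |
    awk '!seen[$0]++'   # print a line only the first time it is seen
}

# Sample built from the first cues of the excerpt (tagged lines shortened).
sample=$(mktemp)
printf '%s\n' \
  '00:00:00.030 --> 00:00:01.670 align:start position:0%' \
  ' ' \
  'in<00:00:00.359><c> this</c><00:00:01.620><c> a</c>' \
  '' \
  '00:00:01.670 --> 00:00:01.680 align:start position:0%' \
  "in this episode we're going to take a" \
  ' ' \
  '' \
  '00:00:01.680 --> 00:00:05.599 align:start position:0%' \
  "in this episode we're going to take a" \
  'step<00:00:01.979><c> into</c><00:00:05.400><c> talk</c>' \
  '' \
  '00:00:05.599 --> 00:00:05.609 align:start position:0%' \
  "step into the unknown it's time to talk" > "$sample"

result=$(vtt_to_text "$sample")
printf '%s\n' "$result"
rm -f "$sample"

# Difference between the two de-duplication tools on a non-adjacent repeat:
first_seen=$(printf 'a\nb\na\n' | awk '!seen[$0]++')   # keeps: a b
adjacent_only=$(printf 'a\nb\na\n' | uniq)             # keeps: a b a
```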
