How to get SSML timestamps from Google Cloud text-to-speech API

Posted 2023-02-16

[Title]: How to get SSML timestamps from Google Cloud text-to-speech API
[Posted]: 2019-12-14 08:34:51
[Question]:

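I would like to use SSML markers through the Google Cloud text-to-speech API and request the timing of those markers in the audio stream. These timestamps are needed to give the user effect cues, word/section highlighting, and feedback.

I found this question, which is related, although it asks about timestamps for each word rather than for SSML <mark> elements.

The following API request returns OK but does not include the requested marker data. This is using Cloud Text-to-Speech API v1.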

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "mp3"
 }
}
Response:

{
 "audioContent":"//NExAAAAANIAAAAABcFAThYGJqMWA..."
}
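It only provides the synthesized audio, without any contextual information.

Is there an API request I have overlooked that exposes information about these markers, as is the case with IBM Watson and Amazon Polly?

[Question comments]:

Did you find a solution? It looks like Google's API does not support speech marks. Is that right?

[Answer 1]:

Cloud Text-to-Speech API v1beta1 appears to support this: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType

You can use https://texttospeech.googleapis.com/v1beta1/text:synthesize. Set TimepointType to SSML_MARK (in the request body this goes in the enableTimePointing field). If the field is not set, no timepoints are returned by default.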
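
As a rough illustration of what that request might look like, here is a minimal Python sketch that only builds the JSON body and decodes a response; it does not send anything or handle authentication. The field names `enableTimePointing`, `timepoints`, `markName`, and `timeSeconds` are taken from my reading of the v1beta1 REST reference linked above and should be checked against it. The helper names `build_request` and `parse_response` are hypothetical.

```python
import base64
import json

# Endpoint quoted in the answer above; this sketch never calls it.
ENDPOINT = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"


def build_request(ssml: str) -> dict:
    """Build a synthesize request body that asks for SSML mark timepoints."""
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": "en-US"},
        "audioConfig": {"audioEncoding": "MP3"},
        # Assumed field name; without it no timepoints come back.
        "enableTimePointing": ["SSML_MARK"],
    }


def parse_response(body: dict):
    """Split a response body into audio bytes and (mark name, seconds) pairs."""
    audio = base64.b64decode(body["audioContent"])
    marks = [(t["markName"], t["timeSeconds"])
             for t in body.get("timepoints", [])]
    return audio, marks


if __name__ == "__main__":
    ssml = ('<speak>First, <mark name="a"/> second, '
            '<mark name="b"/> third.</speak>')
    print(json.dumps(build_request(ssml), indent=1))
```

The printed body can then be POSTed to the endpoint with whatever HTTP client and credentials you already use for the v1 API.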

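[Comments]:

How do you write this? "TimepointType": "SSML_MARK"?

[Answer 2]:

At the time of writing, timepoint data is available in the v1beta1 version of Google Cloud text-to-speech.

I did not need to sign up for any extra developer program to access the beta, beyond the default access.

In Python, change the import (for example) from: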

from google.cloud import texttospeech as tts

to:

from google.cloud import texttospeech_v1beta1 as tts

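Nice and simple.

I needed to modify the default way of sending the synthesis request so that it includes the enable_time_pointing flag.

I found this by looking through the machine-readable API description and reading the Python library code I had already downloaded.

Thankfully, the source in the generally available release also includes the v1beta version - thanks, Google!

I have put a runnable sample below. Running it requires the same authentication and setup as a regular text-to-speech sample, which you can get by following the official documentation.

Here is what it does for me (slightly formatted for readability):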

$ python tools/try-marks.py
Marks content written to file: .../demo.json
Audio content written to file: .../demo.mp3

$ cat demo.json
[
  {"sec": 0.4300000071525574, "name": "here"},
  {"sec": 0.9234582781791687, "name": "there"}
]

The sample follows:

import json
from pathlib import Path
from google.cloud import texttospeech_v1beta1 as tts


def go_ssml(basename: Path, ssml):
    client = tts.TextToSpeechClient()
    voice = tts.VoiceSelectionParams(
        language_code="en-AU",
        name="en-AU-Wavenet-B",
        ssml_gender=tts.SsmlVoiceGender.MALE,
    )

    response = client.synthesize_speech(
        request=tts.SynthesizeSpeechRequest(
            input=tts.SynthesisInput(ssml=ssml),
            voice=voice,
            audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
            enable_time_pointing=[
                tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK]
        )
    )

    # cheesy conversion of array of Timepoint proto.Message objects into plain-old data
    marks = [dict(sec=t.time_seconds, name=t.mark_name)
             for t in response.timepoints]

    name = basename.with_suffix('.json')
    with name.open('w') as out:
        json.dump(marks, out)
        print(f'Marks content written to file: {name}')

    name = basename.with_suffix('.mp3')
    with name.open('wb') as out:
        out.write(response.audio_content)
        print(f'Audio content written to file: {name}')


go_ssml(Path.cwd() / 'demo', """
    <speak>
    Go from <mark name="here"/> here, to <mark name="there"/> there!
    </speak>
    """)

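Once the marks are on disk, a player can use them for the highlighting mentioned in the question. The following is a minimal sketch, assuming the JSON has the list-of-objects shape shown in the demo.json output above (keys `sec` and `name`); the helpers `load_marks` and `active_mark` are hypothetical names, not part of the Google library.

```python
import bisect
import json


def load_marks(text):
    """Parse the marks JSON and return it sorted by time."""
    return sorted(json.loads(text), key=lambda m: m["sec"])


def active_mark(marks, position_sec):
    """Return the name of the last mark at or before position_sec, or None."""
    times = [m["sec"] for m in marks]
    # bisect_right counts how many marks have already been reached.
    reached = bisect.bisect_right(times, position_sec)
    return marks[reached - 1]["name"] if reached else None


if __name__ == "__main__":
    demo = '[{"sec": 0.43, "name": "here"}, {"sec": 0.92, "name": "there"}]'
    marks = load_marks(demo)
    for pos in (0.1, 0.5, 1.0):
        print(pos, active_mark(marks, pos))
```

Calling `active_mark` with the current playback position on each UI tick gives the section to highlight; a binary search keeps the lookup cheap even with many marks.
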
[Comments]:

This saved my day, thanks a lot!
