What audio formats are supported by Microsoft cognitive services SST? Why does 16-bit PCM x.wav succeed while 32-bit PCM y.wav doesn't?

Posted 2023-03-10

【Asked】: 2019-08-02 13:43:13 【Question】:

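I am trying to use Microsoft Cognitive Services through the Python API for speech-to-text. I have two files, harvard.wav and Optagelse_0.wav, that I want to transcribe, but transcription only succeeds for harvard.wav.

The file harvard.wav has the following properties: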

'filename': 'harvard.wav', 'nb_streams': '1', 'format_name': 'wav', 'format_long_name': 'WAV / WAVE (Waveform Audio)', 'start_time': 'N/A', 'duration': '18.356190', 'size': '3249924.000000', 'bit_rate': '1411200.000000', 'TAG': 'encoder': 'Adobe Audition CC 2018.0 (Windows)', 'date': '2018-03-03', 'creation_time': '18\\:52\\:53', 'time_reference': '0', 'index': '0', 'codec_name': 'pcm_s16le', 'codec_long_name': 'PCM signed 16-bit little-endian', 'codec_type': 'audio', 'codec_time_base': '1/44100', 'codec_tag_string': '[1][0][0][0]', 'codec_tag': '0x0001', 'sample_rate': '44100.000000', 'channels': '2', 'bits_per_sample': '16', 'avg_frame_rate': '0/0', 'time_base': '1/44100'

whereas Optagelse_0.wav has:

'filename': 'Optagelse_0.wav', 'nb_streams': '1', 'format_name': 'wav', 'format_long_name': 'WAV / WAVE (Waveform Audio)', 'start_time': 'N/A', 'duration': '29.056000', 'size': '5578796.000000', 'bit_rate': '1536000.000000', 'index': '0', 'codec_name': 'pcm_s32le', 'codec_long_name': 'PCM signed 32-bit little-endian', 'codec_type': 'audio', 'codec_time_base': '1/48000', 'codec_tag_string': '[1][0][0][0]', 'codec_tag': '0x0001', 'sample_rate': '48000.000000', 'channels': '1', 'bits_per_sample': '32', 'avg_frame_rate': '0/0', 'time_base': '1/48000'

Following "What audio formats are supported by Azure Cognitive Services' Speech Service (SST)?", I tried changing the sample rate of harvard.wav, but it made no difference.

import azure.cognitiveservices.speech as speechsdk

# Placeholders: substitute your own subscription key and service region.
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_config = speechsdk.audio.AudioConfig(filename='sound.wav')
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = speech_recognizer.recognize_once()

# Check the result
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))

I expected the transcription to be printed, but instead I got this error:

Speech Recognition canceled: CancellationReason.Error
Error details: Invalid parameter or unsupported audio format in the request. Response text:{"Duration":0,"Offset":0,"RecognitionStatus":"BadRequest"}

【Comments】:

【Answer 1】:

According to the Azure documentation, you need 16-bit PCM, whereas Optagelse.wav is 32-bit. That is your problem: "'codec_long_name': 'PCM signed 32-bit little-endian'"
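
The bit depth can also be checked from Python before a file is sent to the service. The following is a minimal sketch using only the standard-library `wave` module; the function names `describe_wav`, `is_16_bit_pcm` and `write_silence` are hypothetical helpers introduced here for illustration, and the two temporary files merely imitate the headers of harvard.wav and Optagelse_0.wav as reported in the question.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_16_bit_pcm(path):
    """True if the file stores 16-bit samples, the depth the answer calls for."""
    return describe_wav(path)[1] == 16


def write_silence(path, sample_width, sample_rate, channels, frames=100):
    """Hypothetical helper: write a short silent WAV so the check can be tried."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 is 16-bit, 4 is 32-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames * channels * sample_width))


# Stand-ins with the same header values as the two files in the question.
with tempfile.TemporaryDirectory() as tmp:
    like_harvard = os.path.join(tmp, "like_harvard.wav")
    like_optagelse = os.path.join(tmp, "like_optagelse.wav")
    write_silence(like_harvard, 2, 44100, 2)    # 16-bit, 44.1 kHz, stereo
    write_silence(like_optagelse, 4, 48000, 1)  # 32-bit, 48 kHz, mono

    harvard_info = describe_wav(like_harvard)
    optagelse_info = describe_wav(like_optagelse)
    harvard_ok = is_16_bit_pcm(like_harvard)
    optagelse_ok = is_16_bit_pcm(like_optagelse)

print(harvard_info, harvard_ok)
print(optagelse_info, optagelse_ok)
```

This only inspects the sample width; it does not guarantee that the service will accept the file.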

The OP was able to convert the audio file from 32-bit to 16-bit using ffmpeg:

 ffmpeg -i Optagelse.wav -acodec pcm_s16le Opt_pcm_16.wav
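
Where ffmpeg is not at hand, a comparable bit-depth reduction can be sketched in pure Python. This is an illustration under stated assumptions, not what the OP ran: it assumes integer (not floating-point) 32-bit PCM input, keeps the sample rate and channel count unchanged, and simply keeps the 16 most significant bits of each sample without dithering. The function name `convert_32_to_16_bit` and the temporary file names are hypothetical.

```python
import array
import os
import tempfile
import wave


def convert_32_to_16_bit(src_path, dst_path):
    """Rewrite a 32-bit integer PCM WAV as 16-bit PCM, keeping rate and channels."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 4:
            raise ValueError("expected 32-bit PCM input")
        channels = src.getnchannels()
        rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    samples = array.array("i")  # signed int, 4 bytes on common platforms
    if samples.itemsize != 4:
        raise RuntimeError("this sketch assumes a 4-byte C int")
    samples.frombytes(raw)

    # Keep the 16 most significant bits of every sample (no dithering).
    narrowed = array.array("h", (s >> 16 for s in samples))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(2)  # 2 bytes per sample is 16-bit
        dst.setframerate(rate)
        dst.writeframes(narrowed.tobytes())


# Try it on a tiny synthetic 32-bit mono file at 48 kHz.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in_32.wav")
    dst_path = os.path.join(tmp, "out_16.wav")

    source = array.array("i", [0, 65536, -65536, 2**31 - 1, -(2**31)])
    with wave.open(src_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(4)
        wav.setframerate(48000)
        wav.writeframes(source.tobytes())

    convert_32_to_16_bit(src_path, dst_path)

    with wave.open(dst_path, "rb") as wav:
        out_width = wav.getsampwidth()
        out_rate = wav.getframerate()
        result = array.array("h")
        result.frombytes(wav.readframes(wav.getnframes()))

converted = result.tolist()
print(out_width * 8, out_rate, converted)
```

The whole file is read into memory at once, which is fine for clips of a few tens of seconds like the ones in the question; longer recordings would be better processed in chunks.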

【Comments】:

I managed to change the codec of 'Optagelse.wav' to pcm_s16le with the ffmpeg command `ffmpeg -i Optagelse.wav -acodec pcm_s16le Opt_pcm_16.wav`, and now I am getting the audio transcription.
