If audio is already the input, why do we need Google's WaveNet model to generate audio?

I have spent a lot of time trying to understand how Google's WaveNet works (it is also used in their Deep Voice models), but I am still confused about some very basic aspects. I am referring to this TensorFlow implementation of WaveNet.

Page 2 of the paper says:

"在本文中,我们介绍了一种新的直接在原始音频波形上操作的生成模型。".

If we already have the raw audio waveform, why do we need WaveNet? Isn't the waveform exactly what the model is supposed to generate?

When I print out the model, it shows that the input to the input_convolution kernel is just one float value, since the kernel's shape is 1x1x128. What does that one float in the input represent? Am I missing something?

`inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]`
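For intuition about that shape, here is a minimal sketch of what such a layer does (my own illustration, not code from the repository), assuming TensorFlow's conv1d kernel convention of filter_width x in_channels x out_channels, so that 1x1x128 would mean one scalar input channel expanded to 128 feature channels at every timestep:

```python
import tensorflow as tf

# Hypothetical stand-in for inference/input_convolution/kernel:0,
# shaped (filter_width=1, in_channels=1, out_channels=128).
kernel = tf.random.normal([1, 1, 128])

# A batch of one waveform: 16000 timesteps, 1 channel, i.e. each
# timestep carries a single float (one raw audio sample).
waveform = tf.random.normal([1, 16000, 1])

# The 1x1 convolution maps that single float per timestep to 128
# channels; it widens the feature dimension without touching time.
features = tf.nn.conv1d(waveform, kernel, stride=1, padding="SAME")
print(features.shape)  # (1, 16000, 128)
```

On this reading, the "1 float" would simply be the audio sample value at each timestep, fed in one step at a time during generation.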

More of the layers are shown below.

---------
Variables: name (type shape) [size]
---------
inference/ConvTranspose1D_layer_0/kernel:0 (float32_ref 1x11x80x80) [70400, bytes: 281600]
inference/ConvTranspose1D_layer_0/bias:0 (float32_ref 80) [80, bytes: 320]
inference/ConvTranspose1D_layer_1/kernel:0 (float32_ref 1x25x80x80) [160000, bytes: 640000]
inference/ConvTranspose1D_layer_1/bias:0 (float32_ref 80) [80, bytes: 320]
inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]
inference/input_convolution/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
Answer

Generative networks usually operate on conditional probabilities: they produce a new_element given the old_element(s). In mathematical terms:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

as defined in the Google paper. As you can see, the network has to start from something (the x1...xt-1, i.e. the past values); it cannot generate from nothing. You can think of it as the network needing a theme that tells it what genre you are interested in: heavy metal and country have slightly different vibes.

If you like, you can generate this seed waveform yourself: a sine wave, white noise, or something more complex. Once you run the network, it starts outputting new values, which in turn become its own input.
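As a rough sketch of that feedback loop (`predict_next_sample` is a hypothetical stand-in for the WaveNet forward pass, not the repository's actual API):

```python
import numpy as np

def predict_next_sample(history: np.ndarray) -> float:
    """Hypothetical stand-in for a trained WaveNet: a real model
    would sample the next value from p(x_t | x_1, ..., x_{t-1})."""
    # Toy rule: a decaying echo of the last value plus a bit of noise.
    return 0.9 * float(history[-1]) + 0.01 * np.random.randn()

# Seed waveform you supply yourself, e.g. one cycle of a sine wave.
seed = np.sin(np.linspace(0.0, 2.0 * np.pi, 100))

samples = list(seed)
for _ in range(1000):
    # Each newly generated value is appended to the history and
    # becomes part of the network's input at the next step.
    samples.append(predict_next_sample(np.asarray(samples)))

audio = np.asarray(samples, dtype=np.float32)
print(audio.shape)  # (1100,)
```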
