If we already have audio as input, why do we need Google's WaveNet model to generate audio?
I have spent a lot of time trying to understand how Google's WaveNet works (it is also used in their DeepVoice model), but I am still confused about some very basic aspects. I am referring to this Tensorflow implementation of WaveNet.

Page 2 of the paper says:

"In this paper we introduce a new generative model operating directly on the raw audio waveform."

If we already have the raw audio waveform, why do we need WaveNet? Isn't the waveform what the model is supposed to generate?

When I print out the model, it shows that the input is just a single float value going into the input_convolution kernel, since its shape is 1x1x128. What does that 1 float in the input represent? Am I missing something?

`inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]`

More layers below:
---------
Variables: name (type shape) [size]
---------
inference/ConvTranspose1D_layer_0/kernel:0 (float32_ref 1x11x80x80) [70400, bytes: 281600]
inference/ConvTranspose1D_layer_0/bias:0 (float32_ref 80) [80, bytes: 320]
inference/ConvTranspose1D_layer_1/kernel:0 (float32_ref 1x25x80x80) [160000, bytes: 640000]
inference/ConvTranspose1D_layer_1/bias:0 (float32_ref 80) [80, bytes: 320]
inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]
inference/input_convolution/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
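To make the shapes concrete, here is a minimal numpy sketch (my own, not code from the linked repo) of how a kernel with the 1x1x128 shape above would be applied to a waveform of shape [batch, time, 1]:

```python
import numpy as np

# Hypothetical sketch, not code from the linked repo.
kernel = np.random.randn(1, 1, 128).astype(np.float32)  # (width, in_ch, out_ch), as in input_convolution
bias = np.zeros(128, dtype=np.float32)

waveform = np.random.uniform(-1, 1, (1, 16000, 1)).astype(np.float32)  # [batch, time, channels=1]

# With a kernel width of 1 and one input channel, the convolution reduces
# to a per-sample matrix multiply: every single float x_t becomes a
# 128-dimensional feature vector.
features = waveform @ kernel[0] + bias
print(features.shape)  # (1, 16000, 128)
```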
Answer
Generative networks typically operate on conditional probabilities: they produce a new_element given the old_element(s). In mathematical terms, as defined in Google's paper:

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

As you can see, the network needs to start from something (the x_1, ..., x_{t-1}, the past values); it cannot start from nothing. You can think of it as the network needing a theme that tells it what genre you are interested in; heavy metal and country have slightly different vibes.
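As a toy sketch of that factorization (toy_conditional below is a made-up stand-in for the trained network, not WaveNet itself), the joint probability is the product of the per-step conditionals, which in log space becomes a sum:

```python
import numpy as np

def toy_conditional(past_samples):
    # Hypothetical stand-in for the network: a uniform distribution over
    # 256 mu-law quantization levels, regardless of the past samples.
    return np.full(256, 1.0 / 256)

# p(x) = prod_t p(x_t | x_1, ..., x_{t-1}); in log space this is a sum.
x = np.random.randint(0, 256, size=100)  # a quantized "waveform"
log_p = sum(np.log(toy_conditional(x[:t])[x[t]]) for t in range(len(x)))
print(log_p)  # log-likelihood of the whole sequence
```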
If you like, you can generate this seed waveform yourself: a sine wave, white noise, or something more complex. Once you run the network, it starts producing new values that eventually become its own input.
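In code, that feedback loop looks roughly like the sketch below; dummy_next_sample is a made-up stand-in for a trained WaveNet so that the loop is runnable:

```python
import numpy as np

def dummy_next_sample(context):
    # Made-up stand-in for a trained WaveNet: a decaying echo of the
    # last sample plus a little noise, just so the loop runs end to end.
    return 0.9 * context[-1] + 0.01 * np.random.randn()

# A seed waveform you provide yourself, e.g. one cycle of a sine wave.
seed = np.sin(2 * np.pi * np.arange(100) / 100).tolist()

generated = list(seed)
for _ in range(16000):  # generate one second of audio at 16 kHz
    generated.append(dummy_next_sample(generated))
```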