Mismatching dimensions of input/output in the WaveNet model for text-to-speech generation?

Question

I have been trying to understand how speech generation works, particularly in Google's WaveNet model. I was referring to the original WaveNet paper and this implementation:

I find the model's inputs and outputs confusing, and some of the layer dimensions do not seem to match what I understood from the WaveNet paper. Am I misinterpreting something?

What is the input to the WaveNet? Isn't it a mel-spectrum input rather than just 1 floating-point value of raw audio? E.g. the input kernel is shown with shape 1x1x128. Isn't the input to the input_convolution layer the mel-spectrum frames, i.e. 80 float values * 10,000 max_decoder_steps, so that in_channels for this conv1d layer should be 80 instead of 1?

     inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]

Is there a reason for the upsampling stride values to be [11, 25]? Are the specific numbers 11 and 25 special, or do they affect other shapes/dimensions?

inference/ConvTranspose1D_layer_0/kernel:0 (float32_ref 1x11x80x80) [70400, bytes: 281600]
inference/ConvTranspose1D_layer_1/kernel:0 (float32_ref 1x25x80x80) [160000, bytes: 640000]

Why is the number of input channels 128 in residual_block_causal_conv and 80 in residual_block_cin_conv? What exactly are their inputs (e.g. the mel-spectrum, or just a raw floating-point value)? Is the wavenet-vocoder generating just 1 float value per input mel-spectrum frame of 80 floats?

inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]

I was able to print the whole WaveNet network using print(tf.trainable_variables()), but the model still seems very confusing.

EDIT: below are some of the initial layers printed out using TensorFlow. I am not sure why it doesn't show the dilation of 2 for residual_block_causal_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 3x128x256).

>>>slim.model_analyzer.analyze_vars(model_vars, print_info=True)
---------
Variables: name (type shape) [size]
---------
inference/ConvTranspose1D_layer_0/kernel:0 (float32_ref 1x11x80x80) [70400, bytes: 281600]
inference/ConvTranspose1D_layer_0/bias:0 (float32_ref 80) [80, bytes: 320]
inference/ConvTranspose1D_layer_1/kernel:0 (float32_ref 1x25x80x80) [160000, bytes: 640000]
inference/ConvTranspose1D_layer_1/bias:0 (float32_ref 80) [80, bytes: 320]
inference/input_convolution/kernel:0 (float32_ref 1x1x128) [128, bytes: 512]
inference/input_convolution/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_0/residual_block_causal_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_0/residual_block_cin_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_skip_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_0/residual_block_out_conv_ResidualConv1DGLU_0/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 3x128x256) [98304, bytes: 393216]
inference/ResidualConv1DGLU_1/residual_block_causal_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x80x256) [20480, bytes: 81920]
inference/ResidualConv1DGLU_1/residual_block_cin_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 256) [256, bytes: 1024]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_skip_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/kernel:0 (float32_ref 1x128x128) [16384, bytes: 65536]
inference/ResidualConv1DGLU_1/residual_block_out_conv_ResidualConv1DGLU_1/bias:0 (float32_ref 128) [128, bytes: 512]

Nikolay Shmyrev · Accepted Answer · 2020-06-14T21:04:35.100

The WaveNet design is well hidden in the paper on purpose, so let me explain at least the basics.

What is the input to the WaveNet? Isn't it a mel-spectrum input rather than just 1 floating-point value of raw audio?

It is never actually a single floating-point value. Most practical implementations use mu-law encoding, which quantizes the input to 256 one-hot values (this particular model uses 128 for speed, at some cost in quality). And the network does not see a single value: a large chunk of history is fed through the dilated convolutions.
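As a rough illustration of the mu-law quantization mentioned above, here is a minimal sketch in plain Python. The function names (mulaw_encode, mulaw_decode, one_hot) are made up for this example and are not taken from the linked implementation, and the rounding convention is an assumption; real implementations may differ in detail.

```python
import math

def mulaw_encode(x, channels=256):
    """Map one float sample in [-1, 1] to an integer class in [0, channels - 1]."""
    mu = channels - 1
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    # Logarithmic companding: small amplitudes get finer resolution.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest class index.
    return int(math.floor((compressed + 1.0) / 2.0 * mu + 0.5))

def mulaw_decode(label, channels=256):
    """Approximate inverse of mulaw_encode: class index back to a float sample."""
    mu = channels - 1
    compressed = 2.0 * label / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

def one_hot(label, channels=256):
    """One-hot vector of length `channels` for a single class index."""
    return [1.0 if i == label else 0.0 for i in range(channels)]

if __name__ == "__main__":
    for sample in (-1.0, -0.1, 0.0, 0.3, 1.0):
        label = mulaw_encode(sample)
        print(sample, label, round(mulaw_decode(label), 4))
```

Passing channels=128 gives the coarser quantization that the answer attributes to this model.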

Also, in most cases the input is conditioned on mel values. For every 0.025 s window, with consecutive windows separated by a hop of 275 samples, we have one mel frame (80 floats), and we then generate speech sample by sample inside this window.

cin in the code stands for "conditioned input". It is also an input to the network, but it changes less frequently, which is why it is upsampled.

Is there a reason for the upsampling stride values to be [11, 25]? Are the specific numbers 11 and 25 special, or do they affect other shapes/dimensions?

11 x 25 is equal to 275, the hop size of the mel windows. See the comment here:

https://github.com/Rayhane-mamah/Tacotron-2/blob/ab5cb08a931fc842d3892ebeb27c8b8734ddd4b8/hparams.py#L55
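The arithmetic behind the two upsampling layers can be sketched as follows. This assumes each transposed convolution stretches the time axis by exactly its stride, with no extra edge samples; the helper name upsampled_length is hypothetical and not part of the repository.

```python
UPSAMPLE_SCALES = [11, 25]  # strides of ConvTranspose1D_layer_0 and _1
HOP_SIZE = 275              # audio samples between consecutive mel frames

def upsampled_length(num_frames, scales):
    """Time-axis length after applying each upsampling stride in turn."""
    length = num_frames
    for stride in scales:
        length *= stride
    return length

if __name__ == "__main__":
    # One mel frame (80 floats) is stretched to cover 275 audio samples,
    # so every sample being generated has an 80-dim conditioning vector.
    print(upsampled_length(1, UPSAMPLE_SCALES))    # 275
    print(upsampled_length(100, UPSAMPLE_SCALES))  # 27500
```

Any other pair of factors whose product equals the hop size would line up the same way, so under this reading 11 and 25 matter only through their product.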

Why is the number of input channels 128 in residual_block_causal_conv and 80 in residual_block_cin_conv? What exactly are their inputs (e.g. the mel-spectrum, or just a raw floating-point value)? Is the wavenet-vocoder generating just 1 float value per input mel-spectrum frame of 80 floats?

The one with 80 input channels takes the mel spectrum. 128 is the causal convolution dimension (the mu-law encoding dimension).

This blog has some more realistic pictures: https://mc.ai/wavenet-a-network-good-to-know/
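To read the channel counts off the listing in the question, here is a small sketch that checks each reported size against its shape and splits the 3-D kernels into (width, in_channels, out_channels), which is the layout TensorFlow uses for conv1d filters. The short dictionary keys and the helper describe_conv1d are made up for this example; the shapes and sizes are copied from the question.

```python
from math import prod

# (shape, reported size) pairs copied from the variable listing in the question.
KERNELS = {
    "input_convolution": ((1, 1, 128), 128),
    "ConvTranspose1D_layer_0": ((1, 11, 80, 80), 70400),
    "ConvTranspose1D_layer_1": ((1, 25, 80, 80), 160000),
    "residual_block_causal_conv": ((3, 128, 256), 98304),
    "residual_block_cin_conv": ((1, 80, 256), 20480),
    "residual_block_skip_conv": ((1, 128, 128), 16384),
    "residual_block_out_conv": ((1, 128, 128), 16384),
}

def describe_conv1d(shape):
    """Split a conv1d kernel shape into (width, in_channels, out_channels)."""
    width, in_channels, out_channels = shape
    return {"width": width, "in_channels": in_channels, "out_channels": out_channels}

if __name__ == "__main__":
    for name, (shape, size) in KERNELS.items():
        # The reported size is just the product of the kernel dimensions.
        assert prod(shape) == size, name
        if len(shape) == 3:
            print(name, describe_conv1d(shape))
```

Read this way, the cin conv has 80 input channels (one per mel bin) and the causal conv has 128, matching the two numbers discussed above.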

Which layer above does the mu-law encoding? If it is `input_convolution`, shouldn't `input_convolution/kernel` take the 80 mels of one spectrum frame as input? It currently shows a conv1d with channel_in 1, filter_size 1 and ch_out 128. I understand that `cin_conv` refers to the local conditioning conv and takes mel frames as input (so its input channels are 80), but then what is the 'main' input to WaveNet, i.e. the input to `input_convolution`, given that it appears to be just 1 float value and not the mel frame? — Joe Black, Jun 14 '20 at 23:07
Are the upsampling layers `ConvTranspose1D_layer_0` and `ConvTranspose1D_layer_1` used to expand 1 mel frame (80 floats) into 275 mel frames, each with 80 values? Also, are the `gate_channels 256` split into two groups of 128 channels, one going through $sigmoid$ and the other through $tanh$, each outputting 128 channels? Which param in hparams.py selects mu-law 128 or 256, and what does `input_type="raw"` mean? — Joe Black, Jun 14 '20 at 23:31
