athena.models.tacotron2

tacotron2 implementation

Module Contents

Classes

Tacotron2 – An implementation of Tacotron2
class athena.models.tacotron2.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2.

Reference: NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

default_config
_pad_and_reshape(self, outputs, ori_lens, reverse=False)
Parameters:
  • outputs – true labels, shape: [batch, y_steps, feat_dim]
  • ori_lens – scalar
Returns:
  • reshaped_outputs – the outputs reshaped to match reduction_factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
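The shape change described above can be illustrated with a small NumPy sketch. This is not the Athena implementation: the function name `pad_and_reshape`, the zero-padding of the time axis to a multiple of `reduction_factor`, and the omission of the `ori_lens` and `reverse` arguments are all assumptions made for illustration.

```python
import numpy as np

def pad_and_reshape(outputs, reduction_factor):
    """Group every `reduction_factor` consecutive frames into one decoder step."""
    batch, y_steps, feat_dim = outputs.shape
    # Assumption: zero-pad the time axis up to the next multiple of reduction_factor.
    remainder = y_steps % reduction_factor
    if remainder:
        pad = reduction_factor - remainder
        outputs = np.pad(outputs, ((0, 0), (0, pad), (0, 0)))
        y_steps += pad
    # [batch, y_steps, feat_dim] -> [batch, y_steps / r, feat_dim * r]
    return outputs.reshape(batch, y_steps // reduction_factor, feat_dim * reduction_factor)

frames = np.arange(2 * 5 * 3, dtype=np.float32).reshape(2, 5, 3)
grouped = pad_and_reshape(frames, reduction_factor=2)
print(grouped.shape)  # (2, 3, 6)
```

With 5 input frames and a reduction factor of 2, the time axis is padded to 6 frames and folded into 3 steps of 6 features each.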

call(self, samples, training: bool = None)

call model

initialize_input_y(self, y)
Parameters:
  • y – the true label, shape: [batch, y_steps, feat_dim]
Returns:
  • y0 – y with one step of zeros padded at the start, shape: [batch, y_steps+1, feat_dim]
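A minimal NumPy sketch of the padding described above, assuming the zero step is simply concatenated in front of the time axis. The standalone function below is hypothetical and only mirrors the documented shapes, not the actual Athena code.

```python
import numpy as np

def initialize_input_y(y):
    """Prepend one all-zero step so the decoder has an input for its first step."""
    batch, _, feat_dim = y.shape
    start_step = np.zeros((batch, 1, feat_dim), dtype=y.dtype)
    # [batch, y_steps, feat_dim] -> [batch, y_steps + 1, feat_dim]
    return np.concatenate([start_step, y], axis=1)

y = np.ones((2, 4, 3), dtype=np.float32)
y0 = initialize_input_y(y)
print(y0.shape)  # (2, 5, 3)
```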
initialize_states(self, encoder_output, input_length)
Parameters:
  • encoder_output – encoder outputs, shape: [batch, x_steps, eunits]
  • input_length – shape: [batch]
Returns:
  • prev_rnn_states – initial states of rnns in decoder, [rnn layers, 2, batch, dunits]
  • prev_attn_weight – initial attention weights, [batch, x_steps]
  • prev_context – initial context, [batch, eunits]
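The documented shapes can be reproduced with the following NumPy sketch. How Athena actually fills these tensors is not stated here, so the values are assumptions: zero RNN states, attention spread uniformly over the non-padded input positions, and a context computed as the attention-weighted sum of the encoder outputs. The `rnn_layers` and `dunits` arguments are hypothetical stand-ins for values that would come from the model config.

```python
import numpy as np

def initialize_states(encoder_output, input_length, rnn_layers=2, dunits=8):
    """Build initial decoder states with the documented shapes (values are assumptions)."""
    batch, x_steps, eunits = encoder_output.shape
    dtype = encoder_output.dtype
    # One (hidden, cell) pair per LSTM layer: [rnn_layers, 2, batch, dunits]
    prev_rnn_states = np.zeros((rnn_layers, 2, batch, dunits), dtype=dtype)
    # Assumption: uniform attention over valid (non-padded) positions: [batch, x_steps]
    mask = (np.arange(x_steps)[None, :] < input_length[:, None]).astype(dtype)
    prev_attn_weight = mask / mask.sum(axis=1, keepdims=True)
    # Attention-weighted sum of encoder outputs: [batch, eunits]
    prev_context = np.einsum("bx,bxe->be", prev_attn_weight, encoder_output)
    return prev_rnn_states, prev_attn_weight, prev_context

encoder_output = np.ones((2, 5, 4), dtype=np.float32)
input_length = np.array([5, 3])
states, attn, context = initialize_states(encoder_output, input_length)
print(states.shape, attn.shape, context.shape)
```

The sketch assumes every entry of `input_length` is at least 1; a zero length would divide by zero when normalizing the mask.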

time_propagate(self, encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters:
  • encoder_output – encoder output (batch, x_steps, eunits).
  • input_length – (batch,)
  • prev_y – one step of true labels or predicted labels (batch, feat_dim).
  • prev_rnn_states – previous rnn states [layers, 2, states] for lstm
  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]
  • prev_context – previous context vector: [batch, attn_dim]
  • training – if it is training mode
Returns:
  • out – shape: [batch, feat_dim]
  • logit – shape: [batch, reduction_factor]
  • current_rnn_states – [rnn_layers, 2, batch, dunits]
  • attn_weight – [batch, x_steps]

get_loss(self, outputs, samples, training=None)

get loss

synthesize(self, samples)

Synthesize acoustic features from the input texts

Parameters:
  • samples – the data source to be synthesized
Returns:
  • after_outs – the corresponding synthesized acoustic features
  • attn_weights_stack – the corresponding attention weights
_synthesize_post_net(self, before_outs, logits_stack)