athena.models.tts_transformer
Speech transformer implementation.
Module Contents
Classes
TTSTransformer | TTS version of SpeechTransformer.
-
class athena.models.tts_transformer.TTSTransformer(data_descriptions, config=None)
Bases: athena.models.tacotron2.Tacotron2
TTS version of SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself.
Reference: Neural Speech Synthesis with Transformer Network
-
default_config
-
static _create_masks(y, output_length, x)
Generate a square mask for the sequence. Masked positions are filled with float(1.0); unmasked positions are filled with float(0.0).
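The mask convention described above (1.0 for masked, 0.0 for unmasked) can be illustrated with a small standard-library sketch. The helper names `look_ahead_mask` and `padding_mask` are hypothetical and are not part of Athena. The real method operates on TensorFlow tensors, and its exact combination of masks is not shown in this page.

```python
def look_ahead_mask(size):
    """Square mask: position q may not attend to any later position k > q.

    Masked entries are 1.0 and unmasked entries are 0.0, matching the
    convention stated in the docstring above.
    """
    return [[1.0 if k > q else 0.0 for k in range(size)] for q in range(size)]


def padding_mask(lengths, max_len):
    """Per-sequence mask: steps at or beyond the true length are masked (1.0)."""
    return [[0.0 if t < n else 1.0 for t in range(max_len)] for n in lengths]


# Hypothetical example values, chosen only for illustration.
square = look_ahead_mask(3)
padded = padding_mask([3, 1], max_len=3)
for row in square:
    print(row)
```

Under this convention a mask can be scaled by a large negative number and added to the attention logits, so masked positions receive almost no weight after the softmax.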
-
call(self, samples, training: bool = None)
-
time_propagate(self, encoder_output, memory_mask, outs, step)
Synthesize the frames for one decoding step.
Parameters:
    encoder_output: the encoder output, shape: [batch, x_steps, eunits]
    memory_mask: the encoder output mask, shape: [batch, 1, 1, x_steps]
    outs (TensorArray): previous outputs
    step: the current step number
Returns:
    out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
    logit: new stop token prediction logit, shape: [batch, reduction_factor]
    attention_weights (list): the corresponding attention weights; each element in the list holds the attention weights of one decoder layer, shape: [batch, num_heads, seq_len_q, seq_len_k]
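Each step returns reduction_factor frames flattened into one vector of length feat_dim * reduction_factor. The sketch below shows how such a vector could be split back into individual frames. It assumes the frames are stored contiguously, one after another, which this page does not confirm. `split_step_output` is a hypothetical helper, not an Athena function.

```python
def split_step_output(out_row, feat_dim, reduction_factor):
    """Split one flat step output into `reduction_factor` frames of `feat_dim` values.

    Assumption: frames are laid out contiguously in the flat vector.
    """
    expected = feat_dim * reduction_factor
    if len(out_row) != expected:
        raise ValueError(f"expected {expected} values, got {len(out_row)}")
    return [out_row[i * feat_dim:(i + 1) * feat_dim] for i in range(reduction_factor)]


# Hypothetical values: 2 frames of 3 features each for a single batch item.
flat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
frames = split_step_output(flat, feat_dim=3, reduction_factor=2)
print(frames)
```

The stop token logit has one entry per frame in the step, which is why its shape is [batch, reduction_factor].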
-
synthesize(self, samples)
Synthesize acoustic features from the input texts.
Parameters:
    samples: the data source to be synthesized
Returns:
    after_outs: the corresponding synthesized acoustic features
    attn_weights_stack: the corresponding attention weights
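Synthesis with an autoregressive decoder typically calls a one-step function repeatedly until the stop token fires or a step limit is reached. The following is a minimal standard-library sketch of that control flow, not Athena's implementation. `greedy_synthesize`, `toy_step`, the 0.5 threshold, and `max_steps` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Map a stop token logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def greedy_synthesize(step_fn, max_steps, stop_threshold=0.5):
    """Call `step_fn(outs, step)` until it signals a stop or `max_steps` is reached.

    `step_fn` is a hypothetical stand-in for a one-step decoder; it returns
    (frame, stop_logit) given the frames produced so far.
    """
    outs = []
    for step in range(max_steps):
        frame, stop_logit = step_fn(outs, step)
        outs.append(frame)
        if sigmoid(stop_logit) > stop_threshold:
            break
    return outs


def toy_step(outs, step):
    # Toy decoder: the stop logit turns positive at step 3.
    return [float(step)], float(step) - 2.5


frames = greedy_synthesize(toy_step, max_steps=10)
print(len(frames))
```

The step limit guards against a decoder that never predicts a stop token, in which case the loop would otherwise not terminate.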
-