athena

module

Subpackages

Package Contents

Classes

SpeechRecognitionDatasetBuilder SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder SpeechRecognitionDatasetKaldiIOBuilder
SpeechSynthesisDatasetBuilder SpeechSynthesisDatasetBuilder
SpeechDatasetBuilder SpeechDatasetBuilder
SpeechDatasetKaldiIOBuilder SpeechDatasetKaldiIOBuilder
SpeakerRecognitionDatasetBuilder SpeakerRecognitionDatasetBuilder
SpeakerVerificationDatasetBuilder SpeakerVerificationDatasetBuilder
LanguageDatasetBuilder LanguageDatasetBuilder
FeatureNormalizer Feature Normalizer
TextFeaturizer The main text featurizer interface
PositionalEncoding positional encoding can be used in transformer
Collapse4D collapse4d can be used in cnn-lstm for speech processing
TdnnLayer An implementation of a TDNN layer
Gelu Gaussian Error Linear Unit.
MultiHeadAttention Multi-head attention
BahdanauAttention the Bahdanau Attention
HanAttention Refer to [Hierarchical Attention Networks for Document Classification]
MatchAttention Refer to [Learning Natural Language Inference with LSTM]
Transformer A transformer model. User is able to modify the attributes as needed. The architecture
TransformerEncoder TransformerEncoder is a stack of N encoder layers
TransformerDecoder TransformerDecoder is a stack of N decoder layers
TransformerEncoderLayer TransformerEncoderLayer is made up of self-attn and feedforward network.
TransformerDecoderLayer TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
ResnetBasicBlock Basic block of resnet
BaseModel Base class for model.
SpeechTransformer Standard implementation of a SpeechTransformer. Model mainly consists of three parts:
SpeechTransformer2 Decoder for SpeechTransformer2; works with two-pass scheduled sampling
Tacotron2 An implementation of Tacotron2
TTSTransformer TTS version of SpeechTransformer. Model mainly consists of three parts:
FastSpeech Reference: Fastspeech: Fast, robust and controllable text to speech
MaskedPredictCoding implementation for MPC pretrain model
DeepSpeechModel a sample implementation of CTC model
MtlTransformerCtc In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to
RNNLM Standard implementation of a RNNLM. Model mainly consists of embedding layer,
NeuralTranslateTransformer This is an example of seq2seq model using transformer
SpeakerResnet A sample implementation of resnet 34
BaseSolver Base Solver.
HorovodSolver A multi-processor solver based on Horovod
DecoderSolver DecoderSolver
SynthesisSolver SynthesisSolver
CTCLoss CTC LOSS
Seq2SeqSparseCategoricalCrossentropy Seq2SeqSparseCategoricalCrossentropy LOSS
CTCAccuracy CTCAccuracy
Seq2SeqSparseCategoricalAccuracy Seq2SeqSparseCategoricalAccuracy
Checkpoint A wrapper for Tensorflow checkpoint
WarmUpLearningSchedule WarmUp Learning rate schedule for Adam
WarmUpAdam WarmUpAdam Implementation
ExponentialDecayLearningRateSchedule ExponentialDecayLearningRateSchedule
ExponentialDecayAdam ExponentialDecayAdam Implementation
HParams Class to hold a set of hyperparameters as name-value pairs.
BeamSearchDecoder Beam search decoding used in seq2seq decoder layer

Functions

make_positional_encoding(position, d_model) generate a positional encoding list
collapse4d(x, name=None) reshape from [N T D C] -> [N T D*C]
gelu(x) Gaussian Error Linear Unit.
register_and_parse_hparams(default_config: dict, config=None, **kwargs) register default config and parse
generate_square_subsequent_mask(size) Generate a square mask for the sequence. The masked positions are filled with float(1.0).
get_wave_file_length(wave_file) get the wave file length(duration) in ms
set_default_summary_writer(summary_directory=None)
class athena.SpeechRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechRecognitionDatasetBuilder

default_config
num_class

@property

Returns:the max_index of the vocabulary + 1
Return type:int
speaker_list

@property

Returns:the speaker list
Return type:list
audio_featurizer_func

return the audio_featurizer function

sample_type

@property

Returns:sample_type of the dataset:
{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}
Return type:dict
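
Compared with sample_shape, every entry in sample_signature gains a leading batch dimension, which suggests that variable-length fields are padded to a common length within a batch. The sketch below illustrates that padding in plain Python. The helper name pad_batch and the zero padding value are assumptions made for this example, not part of athena.

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length label sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

# Three "output" label sequences of shape [None] become one [batch, max_len] table,
# while "output_length" keeps the true length of each sequence.
outputs, output_lengths = pad_batch([[5, 2, 9], [7], [3, 3]])
```

Keeping the original lengths next to the padded table is what lets a loss function ignore the padded positions later.
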
reload_config(self, config)

reload the config

preprocess_data(self, file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

load_csv(self, file_path)

load csv file
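
As an illustration of the entries described under preprocess_data, the sketch below parses a small in-memory table into (wav_filename, wav_length_ms, transcript, speaker) tuples. The tab delimiter, the header row and the sample values are assumptions made for this example; check the actual data files for the real layout.

```python
import csv
import io

# Hypothetical file contents; the tab delimiter and header names are assumptions.
CSV_TEXT = (
    "wav_filename\twav_length_ms\ttranscript\tspeaker\n"
    "a.wav\t1200\thello world\tspk1\n"
    "b.wav\t800\tgood morning\tspk2\n"
)

def load_entries(text):
    """Return a list of (wav_filename, wav_length_ms, transcript, speaker) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["wav_filename"], int(row["wav_length_ms"]), row["transcript"], row["speaker"])
        for row in reader
    ]

entries = load_entries(CSV_TEXT)
```
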

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
}
Return type:dict
__len__(self)

return the number of data samples

filter_sample_by_unk(self)

filter samples which contain unk

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type:entries
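
The half-open range [min_length, max_length) means that an entry whose length equals min_length is kept, while one whose length equals max_length is dropped. A minimal sketch of such a filter follows. The function name and the four-field entry layout are hypothetical and only serve the illustration.

```python
def filter_by_length(entries, min_length, max_length):
    """Keep entries whose wav_len lies in the half-open range [min_length, max_length)."""
    return [entry for entry in entries if min_length <= entry[1] < max_length]

# Hypothetical entries: (wav_filename, wav_len_ms, transcript, speaker).
entries = [
    ("a.wav", 500, "hi", "spk1"),
    ("b.wav", 2000, "hello there", "spk1"),
    ("c.wav", 10000, "a long utterance", "spk2"),
]
kept = filter_by_length(entries, min_length=2000, max_length=10000)
```
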
filter_sample_by_output_length(self)

filter samples by output length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

default_config
num_class

return the max_index of the vocabulary + 1

speaker_list

return the speaker list

audio_featurizer_func

return the audio_featurizer function

sample_type
sample_shape
sample_signature
reload_config(self, config)

reload the config

preprocess_data(self, file_dir, apply_sort_filter=True)

Generate a list of tuples (feat_key, speaker).

load_scps(self, file_dir)

load kaldi-format feats.scp, labels.scp and utt2spk (optional)
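
Kaldi script files are plain text with one record per line: a key, whitespace, then a value such as an archive path with a byte offset. utt2spk uses the same two-column layout to map an utterance to a speaker. The sketch below parses both and pairs each feature key with a speaker. The file contents and the "global" fallback speaker are assumptions for illustration, and the layout of labels.scp is not shown because this page does not describe it.

```python
def parse_scp(text):
    """Map each key to its value; lines have the form '<key> <rest-of-line>'."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        table[key] = value
    return table

# Hypothetical file contents for illustration only.
feats_scp = "utt1 /data/feats.ark:5\nutt2 /data/feats.ark:1024\n"
utt2spk = "utt1 spkA\nutt2 spkB\n"

feats = parse_scp(feats_scp)
speakers = parse_scp(utt2spk)
# Pair each feature key with its speaker; the fallback name is an assumption.
entries = [(key, speakers.get(key, "global")) for key in feats]
```
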

__getitem__(self, index)
__len__(self)

return the number of data samples

filter_sample_by_unk(self)

filter samples which contain unk

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type:entries
filter_sample_by_output_length(self)

filter samples by output length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeechSynthesisDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

default_config
num_class

@property

Returns:the max_index of the vocabulary
Return type:int
speaker_list

return the speaker list

audio_featurizer_func

return the audio_featurizer function

feat_dim

return the number of feature dims

sample_type

@property

Returns:sample_type of the dataset:
{
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "speaker": tf.TensorShape([])
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}
Return type:dict
reload_config(self, config)

reload the config

preprocess_data(self, file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

load_csv(self, file_path)

load csv file

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input": text,
    "input_length": text_length,
    "output_length": audio_feat_length,
    "output": audio_feat,
    "speaker": self.speakers_dict[speaker]
}
Return type:dict
__len__(self)

return the number of data samples

filter_sample_by_unk(self)

filter samples which contain unk

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcript, speaker)
Return type:entries
filter_sample_by_output_length(self)

filter samples by output length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechDatasetBuilder

default_config
num_class

@property

Returns:the target dim
Return type:int
speaker_list

return the speaker list

audio_featurizer_func

return the audio_featurizer function

sample_type

@property

Returns:sample_type of the dataset:
{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input": tf.TensorShape(
    [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}
Return type:dict
reload_config(self, config)

reload the config

preprocess_data(self, file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

load_csv(self, file_path)

load csv file

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}
Return type:dict
__len__(self)

return the number of data samples

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Parameters:
  • self.hparams.input_length_range – [min_len, max_len]
  • min_len – the minimal length (ms)
  • max_len – the maximal length (ms)
Returns:a filtered list of tuples (wav_filename, wav_len, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeechDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechDatasetKaldiIOBuilder

default_config
num_class

return the max_index of the vocabulary

speaker_list

return the speaker list

audio_featurizer_func

return the audio_featurizer function

sample_type
sample_shape
sample_signature
reload_config(self, config)

reload the config

preprocess_data(self, file_dir, apply_sort_filter=True)

generate a list of tuples (feat_key, speaker).

load_scps(self, file_dir)

load kaldi-format feats.scp and utt2spk (optional)

__getitem__(self, index)
__len__(self)

return the number of data samples

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeakerRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeakerRecognitionDatasetBuilder

default_config
num_class

@property

Returns:the number of speakers
Return type:int
sample_type

@property

Returns:sample_type of the dataset:
{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None])
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}
Return type:dict
reload_config(self, config)

reload the config

preprocess_data(self, data_csv_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker_id, speaker_name).

cut_features(self, feature)

cut acoustic features

load_csv(self, data_csv_path)

load csv file

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input": feat,
    "input_length": feat_length,
    "output_length": 1,
    "output": spkid
}
Return type:dict
__len__(self)

return the number of data samples

filter_sample_by_input_length(self)

filter samples by input length

The length of filtered samples will be in [min_length, max_length)

Returns:a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type:entries
compute_cmvn_if_necessary(self, is_necessary=True)

compute cmvn file

class athena.SpeakerVerificationDatasetBuilder(config=None)

Bases: athena.data.datasets.speaker_recognition.SpeakerRecognitionDatasetBuilder

SpeakerVerificationDatasetBuilder

sample_type

@property

Returns:sample_type of the dataset:
{
    "input_a": tf.float32,
    "input_b": tf.float32,
    "output": tf.int32
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input_a": tf.TensorShape([None, dim, nc]),
    "input_b": tf.TensorShape([None, dim, nc]),
    "output": tf.TensorShape([None])
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input_a": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_b": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}
Return type:dict
preprocess_data(self, data_csv_path)

generate a list of tuples (wav_filename_a, speaker_a, wav_filename_b, speaker_b, label).

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input_a": feat_a,
    "input_b": feat_b,
    "output": [label]
}
Return type:dict
class athena.LanguageDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

LanguageDatasetBuilder

default_config
num_class

@property

Returns:the max_index of the vocabulary
Return type:int
input_vocab_size

@property

Returns:the input vocab size
Return type:int
sample_type

@property

Returns:sample_type of the dataset:
{
    "input": tf.int32,
    "input_length": tf.int32,
    "output": tf.int32,
    "output_length": tf.int32,
}
Return type:dict
sample_shape

@property

Returns:sample_shape of the dataset:
{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "output_length": tf.TensorShape([]),
}
Return type:dict
sample_signature

@property

Returns:sample_signature of the dataset:
{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}
Return type:dict
load_csv(self, file_path)

load csv file

__getitem__(self, index)

get a sample

Parameters:index (int) – index of the entries
Returns:sample:
{
    "input": input_labels,
    "input_length": input_length,
    "output": output_labels,
    "output_length": output_length,
}
Return type:dict
__len__(self)

return the number of data samples

class athena.FeatureNormalizer(cmvn_file=None)

Feature Normalizer

__call__(self, feat_date, speaker, reverse=False)
apply_cmvn(self, feat_data, speaker, reverse=False)

TODO: docstring

compute_cmvn(self, entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)

Compute cmvn for filtered entries
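
CMVN (cepstral mean and variance normalization) subtracts a per-dimension mean and divides by a per-dimension standard deviation. The sketch below computes those statistics over a list of frames and applies them in both directions, mirroring the reverse flag in the apply_cmvn signature. It is a simplified illustration: it pools all frames together and ignores the speakers argument, and the eps term is an assumption added for numerical safety.

```python
import math

def compute_cmvn(frames):
    """Return per-dimension (mean, variance) over a list of feature frames."""
    count = len(frames)
    dim = len(frames[0])
    sums = [0.0] * dim
    square_sums = [0.0] * dim
    for frame in frames:
        for d, value in enumerate(frame):
            sums[d] += value
            square_sums[d] += value * value
    mean = [s / count for s in sums]
    var = [sq / count - m * m for sq, m in zip(square_sums, mean)]
    return mean, var

def apply_cmvn(frames, mean, var, reverse=False, eps=1e-12):
    """Normalize frames, or undo the normalization when reverse is True."""
    std = [math.sqrt(v + eps) for v in var]
    if reverse:
        return [[x * s + m for x, s, m in zip(f, std, mean)] for f in frames]
    return [[(x - m) / s for x, s, m in zip(f, std, mean)] for f in frames]

frames = [[1.0, 10.0], [3.0, 30.0]]
mean, var = compute_cmvn(frames)
normalized = apply_cmvn(frames, mean, var)
restored = apply_cmvn(normalized, mean, var, reverse=True)
```

The reverse direction is useful for synthesis models, where predicted features have to be mapped back to the original scale.
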

compute_cmvn_kaldiio(self, entries, speakers, kaldi_io_feats, feature_dim)

Compute cmvn for filtered entries using kaldi-format data

load_cmvn(self)

TODO: docstring

save_cmvn(self)

TODO: docstring

class athena.TextFeaturizer(config=None)

The main text featurizer interface

supported_model
default_config
model_type

the model type

unk_index

return the unk index

load_model(self, model_file)

load model

delete_punct(self, tokens)

delete punctuation tokens

__len__(self)
encode(self, texts)

Convert a sentence to a list of ids, with special tokens added.

decode(self, sequences)

Convert a list of ids to a sentence
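
To make the encode/decode round trip and the role of unk_index concrete, here is a toy character-level vocabulary. ToyVocab is hypothetical and is not athena's TextFeaturizer; in particular it adds no special tokens, which encode is documented to do.

```python
class ToyVocab:
    """A hypothetical character-level vocabulary used only for illustration."""

    def __init__(self, symbols, unk="<unk>"):
        self.itos = [unk] + list(symbols)
        self.stoi = {s: i for i, s in enumerate(self.itos)}
        self.unk_index = self.stoi[unk]

    def __len__(self):
        return len(self.itos)

    def encode(self, text):
        # Symbols missing from the vocabulary map to unk_index.
        return [self.stoi.get(ch, self.unk_index) for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

vocab = ToyVocab("abc")
ids = vocab.encode("cabz")
text = vocab.decode(ids)
```

Samples whose encoded ids contain unk_index are the ones a method such as filter_sample_by_unk would drop.
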

athena.make_positional_encoding(position, d_model)

generate a positional encoding list
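
A minimal sketch of the sinusoidal encoding from "Attention Is All You Need", which is presumably what this function computes. The function name below is hypothetical, and athena may arrange the sine and cosine terms differently (for example, concatenated rather than interleaved).

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return a [position][d_model] table of sinusoidal encodings."""
    table = []
    for pos in range(position):
        row = []
        for i in range(d_model):
            # Each sin/cos pair (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(position=4, d_model=6)
```
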

athena.collapse4d(x, name=None)

reshape from [N T D C] -> [N T D*C] using tf.shape(x), which yields a tensor holding the dynamic shape, instead of the static x.shape
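
The reshape merges the last two axes so that each time step carries one flat feature vector. The sketch below shows the same index arithmetic on nested Python lists; collapse4d_lists is a hypothetical stand-in, not the TensorFlow implementation.

```python
def collapse4d_lists(x):
    """Merge the last two axes of a nested list shaped [N][T][D][C]."""
    return [
        [[value for d_row in frame for value in d_row] for frame in utterance]
        for utterance in x
    ]

# N=1, T=2, D=2, C=3
x = [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]]
y = collapse4d_lists(x)
```
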

athena.gelu(x)

Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters:x – float Tensor to perform activation.
Returns:x with the GELU activation applied.
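
For reference, the GELU paper defines the activation as x times the standard normal CDF and also gives a tanh approximation. The scalar sketch below implements both. Which form athena uses is not stated here, so treat the function names as illustrative.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation given in the GELU paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

values = [gelu_exact(v) for v in (-1.0, 0.0, 1.0)]
```
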
class athena.PositionalEncoding(d_model, max_position=800, scale=False)

Bases: tensorflow.keras.layers.Layer

positional encoding can be used in transformer

call(self, x)

call function

class athena.Collapse4D

Bases: tensorflow.keras.layers.Layer

collapse4d can be used in cnn-lstm for speech processing; it reshapes from [N T D C] -> [N T D*C]

call(self, x)
class athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)

Bases: tensorflow.keras.layers.Layer

An implementation of a TDNN layer

Parameters:
  • context – an int of left and right context, or a list of context indexes, e.g. -2, 0, 2
  • output_dim – the dim of the linear transform
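
A TDNN layer applies a linear transform to frames gathered at fixed offsets around each time step. The sketch below shows only the gathering (splicing) step for a list of context indexes such as -2, 0, 2. Dropping time steps that lack full context is an assumption of this example; the actual layer may pad instead, and splice_context is a hypothetical helper.

```python
def splice_context(frames, context):
    """Concatenate frames at the given offsets around each valid time step."""
    left = max(0, -min(context))
    right = max(0, max(context))
    spliced = []
    for t in range(left, len(frames) - right):
        row = []
        for offset in context:
            row.extend(frames[t + offset])
        spliced.append(row)
    return spliced

# Five frames with one feature each, spliced with context offsets -2, 0, 2.
frames = [[0], [1], [2], [3], [4]]
spliced = splice_context(frames, [-2, 0, 2])
```

The linear transform to output_dim would then be applied to each spliced row.
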

call(self, x, training=None, mask=None)
class athena.Gelu

Bases: tensorflow.keras.layers.Layer

Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters:x – float Tensor to perform activation.
Returns:x with the GELU activation applied.
call(self, x)
class athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Multi-head attention

Multi-head attention consists of four parts:
  • Linear layers and split into heads.
  • Scaled dot-product attention.
  • Concatenation of heads.
  • Final linear layer.

Each multi-head attention block gets three inputs: Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. Scaled dot-product attention is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.

split_heads(self, x, batch_size)

Split the last dimension into (num_heads, depth).

Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
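
The reshape and transpose can be pictured on nested lists: each token vector of size d_model is cut into num_heads slices of size depth, and the slices are regrouped by head. This is a plain-Python sketch of the shape change only. Unlike the method above it takes num_heads rather than batch_size, since lists carry their own length.

```python
def split_heads(x, num_heads):
    """Reshape [batch][seq_len][d_model] into [batch][num_heads][seq_len][depth]."""
    depth = len(x[0][0]) // num_heads
    return [
        [
            [token[h * depth:(h + 1) * depth] for token in sequence]
            for h in range(num_heads)
        ]
        for sequence in x
    ]

# batch=1, seq_len=2, d_model=4, split into 2 heads of depth 2
x = [[[1, 2, 3, 4], [5, 6, 7, 8]]]
heads = split_heads(x, num_heads=2)
```
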

call(self, v, k, q, mask)

call function
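
The core of each head is scaled dot-product attention: softmax(Q K^T / sqrt(depth)) applied to V. The single-head sketch below works on plain lists. It assumes that a mask value of 1.0 marks a masked position and 0.0 an unmasked one, and that masking is done by adding a large negative number to the score; both are assumptions about the convention, not a statement of how this layer is implemented.

```python
import math

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention over lists: q [Tq][d], k [Tk][d], v [Tk][dv]."""
    depth = len(k[0])
    output = []
    for i, query in enumerate(q):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(depth) for key in k]
        if mask is not None:
            # Assumption: 1.0 marks a masked position, 0.0 an unmasked one.
            scores = [s - 1e9 * mask[i][j] for j, s in enumerate(scores)]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([sum(w * row[c] for w, row in zip(weights, v)) for c in range(len(v[0]))])
    return output

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
causal_mask = [[0.0, 1.0], [0.0, 0.0]]  # the first query cannot see the second key
out = scaled_dot_product_attention(q, k, v, mask=causal_mask)
```

With the mask applied, the first query attends only to the first key, so its output equals the first value row.
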

class athena.BahdanauAttention(units, input_dim=1024)

Bases: tensorflow.keras.Model

the Bahdanau Attention

call(self, query, values)

call function

class athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)

wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE):

Input shape: (Batch size, steps, features)
Output shape: (Batch size, features)

build(self, input_shape)

build in keras layer

call(self, inputs, training=None, mask=None)

call function in keras

compute_output_shape(self, input_shape)

compute output shape

_masked_softmax(self, logits, mask, axis)

Compute softmax with input mask.

class athena.MatchAttention(config, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)

wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE):

Input shape: (Batch size, steps, features)
Output shape: (Batch size, steps, features)

call(self, tensors)

Attention layer.

class athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None)

Bases: tensorflow.keras.layers.Layer

A transformer model. Users can modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with corresponding parameters.

Parameters:
  • d_model – the number of expected features in the encoder/decoder inputs (default=512).
  • nhead – the number of heads in the multiheadattention models (default=8).
  • num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).
  • num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
  • dropout – the dropout value (default=0.1).
  • activation – the activation function of encoder/decoder intermediate layer, relu or gelu (default=gelu).
  • custom_encoder – custom encoder (default=None).
  • custom_decoder – custom decoder (default=None).
Examples::
>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_model(src, tgt)

Note: A full example applying the PyTorch nn.Transformer module to a word language model is available at https://github.com/pytorch/examples/tree/master/word_language_model

call(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)

Take in and process masked source/target sequences.

Parameters:
  • src – the sequence to the encoder (required).
  • tgt – the sequence to the decoder (required).
  • src_mask – the additive mask for the src sequence (optional).
  • tgt_mask – the additive mask for the tgt sequence (optional).
  • memory_mask – the additive mask for the encoder output (optional).
  • src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
  • tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).
  • memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).
Shape:
  • src: (N, S, E).
  • tgt: (N, T, E).
  • src_mask: (N, S).
  • tgt_mask: (N, T).
  • memory_mask: (N, S).
  • output: (N, T, E).

where S is the source sequence length, T is the target sequence length, N is the batch size, and E is the feature number.

Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float('-inf') and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

Examples

>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
class athena.TransformerEncoder(encoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerEncoder is a stack of N encoder layers

Parameters:
  • encoder_layer – an instance of the TransformerEncoderLayer() class (required).
  • num_layers – the number of sub-encoder-layers in the encoder (required).
  • norm – the layer normalization component (optional).
Examples::
>>> num_layers = 6
>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.normal((10, 32, 512))
>>> out = transformer_encoder(src)
call(self, src, src_mask=None, training=None)

Pass the input through the encoder layers in turn.

Parameters:
  • src – the sequence to the encoder (required).
  • src_mask – the mask for the src sequence (optional).
Shape:
see the docs in Transformer class.
set_unidirectional(self, uni=False)

whether to apply triangular masks to make the transformer unidirectional

class athena.TransformerDecoder(decoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerDecoder is a stack of N decoder layers

Parameters:
  • decoder_layer – an instance of the TransformerDecoderLayer() class (required).
  • num_layers – the number of sub-decoder-layers in the decoder (required).
  • norm – the layer normalization component (optional).
Examples::
>>> num_layers = 6
>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_decoder(tgt, memory)
call(self, tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)

Pass the inputs (and mask) through the decoder layer in turn.

Parameters:
  • tgt – the sequence to the decoder (required).
  • memory – the sequence from the last layer of the encoder (required).
  • tgt_mask – the mask for the tgt sequence (optional).
  • memory_mask – the mask for the memory sequence (optional).
Shape:
see the docs in Transformer class.
class athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None)

Bases: tensorflow.keras.layers.Layer

TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters:
  • d_model – the number of expected features in the input (required).
  • nhead – the number of heads in the multiheadattention models (required).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
  • dropout – the dropout value (default=0.1).
  • activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples::
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
call(self, src, src_mask=None, training=None)

Pass the input through the encoder layer.

Parameters:
  • src – the sequence to the encoder layer (required).
  • src_mask – the mask for the src sequence (optional).
Shape:
see the docs in Transformer class.
set_unidirectional(self, uni=False)

whether to apply triangular masks to make the transformer unidirectional

class athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')

Bases: tensorflow.keras.layers.Layer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.

Parameters:
  • d_model – the number of expected features in the input (required).
  • nhead – the number of heads in the multiheadattention models (required).
  • dim_feedforward – the dimension of the feedforward network model (default=2048).
  • dropout – the dropout value (default=0.1).
  • activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples::
>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
call(self, tgt, memory, tgt_mask=None, memory_mask=None, training=None)

Pass the inputs (and mask) through the decoder layer.

Parameters:
  • tgt – the sequence to the decoder layer (required).
  • memory – the sequence from the last layer of the encoder (required).
  • tgt_mask – the mask for the tgt sequence (optional).
  • memory_mask – the mask for the memory sequence (optional).
Shape:
see the docs in Transformer class.
class athena.ResnetBasicBlock(num_filter, stride=1)

Bases: tensorflow.keras.layers.Layer

Basic block of resnet. Reference: the paper “Deep residual learning for image recognition”

call(self, inputs)

call model

make_downsample_layer(self, num_filter, stride)

perform downsampling using conv layer with stride != 1

class athena.BaseModel(**kwargs)

Bases: tensorflow.keras.Model

Base class for model.

call(self, samples, training=None)

call model

get_loss(self, outputs, samples, training=None)

get loss

compute_logit_length(self, samples)

compute the logit length

reset_metrics(self)

reset the metrics

prepare_samples(self, samples)

prepare samples for special data. Be careful: do not change the shape of samples

restore_from_pretrained_model(self, pretrained_model, model_type='')

restore from pretrained model

decode(self, samples, hparams, decoder)

decode interface

class athena.SpeechTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself

default_config
call(self, samples, training: bool = None)

call model

static _create_masks(x, input_length, y)

Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).

compute_logit_length(self, samples)

used to get the logit length

time_propagate(self, history_logits, history_predictions, step, enc_outputs)

TODO: docstring

last_predictions: the predictions of the last time_step, [beam_size]
history_predictions: the predictions of history from 0 to time_step, [beam_size, time_steps]
states: (step)

decode(self, samples, hparams, decoder, return_encoder=False)

Beam search decoding.

Parameters:
  • samples – the data source to be decoded
  • hparams – decoding configs are included here
  • decoder – it contains the main decoding operations
  • return_encoder – if it is True, encoder_output and input_mask will be returned
Returns:
predictions: the corresponding decoding results, shape: [batch_size, seq_length]; it will be returned only if return_encoder is False
encoder_output: the encoder output computed in decode mode, shape: [batch_size, seq_length, hsize]
input_mask: it is masked by input length, shape: [batch_size, 1, 1, seq_length]
encoder_output and input_mask will be returned only if return_encoder is True
restore_from_pretrained_model(self, pretrained_model, model_type='')

restore from pretrained model

deploy(self)

deployment function

inference_one_step(self, enc_outputs, cur_input, inner_packed_states_array)

Callback function for the WFST decoder.

Parameters:
  • enc_outputs – outputs and mask of encoder
  • cur_input – input sequence for transformer, type: list
  • inner_packed_states_array – inner states that need to be recorded, type: tuple
Returns:

scores: log scores for all labels
inner_packed_states_array: inner states for the next iteration

class athena.SpeechTransformer2(data_descriptions, config=None)

Bases: athena.models.speech_transformer.SpeechTransformer

Decoder for SpeechTransformer2; works for two-pass scheduled sampling

call(self, samples, training: bool = None)

call model

mix_target_sequence(self, gold_token, predicted_token, training, top_k=5)

Mix gold tokens and predictions.

Parameters:
  • gold_token – true labels
  • predicted_token – predictions by first pass
Returns:

mix of the gold_token and predicted_token

class athena.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2. Reference: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

default_config
_pad_and_reshape(self, outputs, ori_lens, reverse=False)
Parameters:
  • outputs – true labels, shape: [batch, y_steps, feat_dim]
  • ori_lens – scalar
Returns:

reshaped_outputs: it has to be reshaped to match reduction_factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
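
The shape change described for _pad_and_reshape can be illustrated with a small pure-Python sketch for a single utterance. The helper name `group_frames` and the zero-padding of trailing frames are assumptions made for illustration; this is not Athena's implementation.

```python
def group_frames(frames, reduction_factor):
    """Group consecutive frames: [y_steps, feat_dim] -> [y_steps / r, feat_dim * r].

    Hypothetical helper; trailing frames are zero-padded (an assumption).
    """
    feat_dim = len(frames[0])
    # Pad with zero frames so the number of steps is a multiple of r.
    remainder = len(frames) % reduction_factor
    if remainder:
        frames = frames + [[0.0] * feat_dim] * (reduction_factor - remainder)
    grouped = []
    for start in range(0, len(frames), reduction_factor):
        step = []
        for frame in frames[start:start + reduction_factor]:
            step.extend(frame)  # concatenate r frames along the feature axis
        grouped.append(step)
    return grouped


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # y_steps=3, feat_dim=2
grouped = group_frames(frames, reduction_factor=2)
print(grouped)  # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 0.0, 0.0]]
```

With reduction_factor=2, three frames of width 2 become two steps of width 4.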

call(self, samples, training: bool = None)

call model

initialize_input_y(self, y)
Parameters:
  • y – the true label, shape: [batch, y_steps, feat_dim]
Returns:

y0: zeros will be padded as one step to the start step, shape: [batch, y_steps+1, feat_dim]
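
A minimal sketch of the padding described for initialize_input_y, using nested lists instead of tensors. The helper name `prepend_zero_frame` is hypothetical and only illustrates the documented shape change.

```python
def prepend_zero_frame(y):
    """Prepend one all-zero step: [batch, y_steps, feat_dim] -> [batch, y_steps + 1, feat_dim]."""
    feat_dim = len(y[0][0])
    return [[[0.0] * feat_dim] + list(seq) for seq in y]


y = [[[1.0, 2.0], [3.0, 4.0]]]  # batch=1, y_steps=2, feat_dim=2
y0 = prepend_zero_frame(y)
print(y0)  # [[[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]]
```
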
initialize_states(self, encoder_output, input_length)
Parameters:
  • encoder_output – encoder outputs, shape: [batch, x_step, eunits]
  • input_length – shape: [batch]
Returns:

prev_rnn_states: initial states of rnns in decoder, [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]

time_propagate(self, encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters:
  • encoder_output – encoder output (batch, x_steps, eunits).
  • input_length – (batch,)
  • prev_y – one step of true labels or predicted labels (batch, feat_dim).
  • prev_rnn_states – previous rnn states [layers, 2, states] for lstm
  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]
  • prev_context – previous context vector: [batch, attn_dim]
  • training – if it is training mode
Returns:

out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]

get_loss(self, outputs, samples, training=None)

get loss

synthesize(self, samples)

Synthesize acoustic features from the input texts.

Parameters:
  • samples – the data source to be synthesized
Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
_synthesize_post_net(self, before_outs, logits_stack)
class athena.TTSTransformer(data_descriptions, config=None)

Bases: athena.models.tacotron2.Tacotron2

TTS version of SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network

default_config
static _create_masks(y, output_length, x)

Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).

call(self, samples, training: bool = None)
time_propagate(self, encoder_output, memory_mask, outs, step)

Synthesize one step frames.

Parameters:
  • encoder_output – the encoder output, shape: [batch, x_steps, eunits]
  • memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]
  • outs (TensorArray) – previous outputs
  • step – the current step number
Returns:

out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights, each element in the list represents the attention weights of one decoder layer, shape: [batch, num_heads, seq_len_q, seq_len_k]
synthesize(self, samples)

Synthesize acoustic features from the input texts.

Parameters:
  • samples – the data source to be synthesized
Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
class athena.FastSpeech(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Reference: Fastspeech: Fast, robust and controllable text to speech (http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)

default_config
set_teacher_model(self, teacher_model, teacher_type)
restore_from_pretrained_model(self, pretrained_model, model_type='')
get_loss(self, outputs, samples, training=None)
_feedforward_decoder(self, encoder_output, duration_indexes, duration_sequences, output_length, training)
call(self, samples, training: bool = None)
synthesize(self, samples)
class athena.MaskedPredictCoding(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Implementation of the MPC pretraining model.

Parameters:
  • num_filters – an int, i.e. the number of filters in cnn
  • d_model – an int, i.e. dimension of model
  • num_heads – number of heads in transformer
  • num_encoder_layers – number of layers in encoder
  • dff – an int, i.e. dimension of model
  • rate – rate of dropout layers
  • chunk_size – number of consecutive masks, i.e. 1 or 3
  • keep_probability – probability not to be masked
  • mode – train mode, i.e. MPC: pretrain
  • max_pool_layers – index of max pool layers in encoder, default is -1

default_config
call(self, samples, training: bool = None)

Used for training.

Parameters:
  • samples – a dict, including keys: ‘input’, ‘input_length’, ‘output_length’, ‘output’. input: acoustic features, Tensor, shape is (batch, time_len, dim, 1), i.e. f-bank
Returns:

MPC outputs to fit acoustic features
encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
get_loss(self, logits, samples, training=None)

Get MPC loss.

Parameters:
  • logits – MPC output
Returns:

MPC L1 loss
compute_logit_length(self, samples)
generate_mpc_mask(self, input_data)

Generate mask for pretraining.

Parameters:
  • input_data – acoustic features, i.e. F-bank
Returns:

mask tensor
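
To make the chunk_size and keep_probability parameters concrete, here is a hedged pure-Python sketch of chunk-wise masking along the time axis. The helper `chunk_mask`, its seed argument, and the convention that 0.0 marks a masked frame are all assumptions for illustration; the actual generate_mpc_mask may differ.

```python
import random


def chunk_mask(seq_len, chunk_size, keep_probability, seed=0):
    """Return a 0/1 list of length seq_len; 0.0 marks a masked frame.

    Hypothetical helper: frames are masked in runs of chunk_size,
    and each run is kept with probability keep_probability.
    """
    rng = random.Random(seed)
    mask = []
    while len(mask) < seq_len:
        keep = 1.0 if rng.random() < keep_probability else 0.0
        mask.extend([keep] * chunk_size)
    return mask[:seq_len]


mask = chunk_mask(seq_len=10, chunk_size=3, keep_probability=0.85)
print(mask)
```

Multiplying the acoustic features by such a mask zeroes out whole runs of consecutive frames, which the model is then trained to reconstruct.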
prepare_samples(self, samples)

Hook for special data preparation. Be careful not to change the shape of samples.

class athena.DeepSpeechModel(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

a sample implementation of CTC model

default_config
call(self, samples, training=None)

call function

compute_logit_length(self, samples)

used to get the logit length

class athena.MtlTransformerCtc(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.

SUPPORTED_MODEL
default_config
call(self, samples, training=None)

call function in keras layers

get_loss(self, outputs, samples, training=None)

get loss used for training

compute_logit_length(self, samples)

compute the logit length

reset_metrics(self)

reset the metrics

restore_from_pretrained_model(self, pretrained_model, model_type='')

A more general-purpose interface for pretrained model restoration

Parameters:
  • pretrained_model – checkpoint path of mpc model
  • model_type – the type of pretrained model to restore
decode(self, samples, hparams, decoder)

Initialization of the model for decoding; decoder is called here to create predictions.

Parameters:
  • samples – the data source to be decoded
  • hparams – decoding configs are included here
  • decoder – it contains the main decoding operations
Returns:

predictions: the corresponding decoding results
deploy(self)

deployment function

class athena.RNNLM(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a RNNLM. Model mainly consists of an embedding layer, rnn layers (with dropout), and the fully connected layer, which are all included in self.model_for_rnn

default_config
call(self, samples, training: bool = None)

call model

save_model(self, path)

Save the model and its current weights. path is an h5 file name, like ‘my_model.h5’. Usage: new_model = tf.keras.models.load_model(path)

get_loss(self, logits, samples, training=None)

get loss

class athena.NeuralTranslateTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

This is an example of seq2seq model using transformer

default_config
call(self, samples, training=None)
static _create_masks(x, y)
class athena.SpeakerResnet(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

A sample implementation of resnet 34. Reference: “Deep residual learning for image recognition”. The implementation is the same as the standard resnet with 34 weighted layers, except that it uses only 1/4 of the filters to reduce computation.

config:

task: “speaker_identification” or “speaker_verification”
default_config
call(self, samples, training=None)

call model

init_loss(self, loss)

initialize loss function

get_loss(self, outputs, samples, training=None)
get_eer(self, outputs, samples, training=False)

get equal error rates

make_resnet_block_layer(self, num_filter, num_blocks, stride=1)
class athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: tensorflow.keras.Model

Base Solver.

default_config
static initialize_devices(visible_gpu_idx=None)

initialize hvd devices; should be called first

static clip_by_norm(grads, norm)

clip norm using tf.clip_by_norm
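
For readers unfamiliar with the operation, the following pure-Python sketch mirrors what tf.clip_by_norm does to one tensor: values are rescaled only when their L2 norm exceeds the threshold. Treating each gradient independently and representing gradients as flat lists are assumptions made for illustration.

```python
import math


def clip_by_norm(values, clip_norm):
    """Rescale a flat list of floats so its L2 norm is at most clip_norm."""
    l2_norm = math.sqrt(sum(v * v for v in values))
    if l2_norm <= clip_norm:
        return list(values)  # already within the limit, unchanged
    scale = clip_norm / l2_norm
    return [v * scale for v in values]


grads = [[3.0, 4.0], [0.3, 0.4]]  # two flattened gradients (norms 5.0 and 0.5)
clipped = [clip_by_norm(g, clip_norm=1.0) for g in grads]
print(clipped)
```

The first gradient is scaled down to unit norm, while the second is left as it is.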

train_step(self, samples)

train the model 1 step

train(self, dataset, total_batches=-1)

Update the model in 1 epoch

evaluate_step(self, samples)

evaluate the model 1 step

evaluate(self, dataset, epoch)

evaluate the model

class athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: athena.solver.BaseSolver

A multi-processor solver based on Horovod

static initialize_devices(visible_gpu_idx=None)

initialize hvd devices; should be called first

train_step(self, samples)

train the model 1 step

train(self, dataset, total_batches=-1)

Update the model in 1 epoch

evaluate(self, dataset, epoch=0)

evaluate the model

class athena.DecoderSolver(model, config=None, lm_model=None)

Bases: athena.solver.BaseSolver

DecoderSolver

default_config
decode(self, dataset, rank_size=1)

decode the model

class athena.SynthesisSolver(model, data_descriptions=None, config=None)

Bases: athena.solver.BaseSolver

SynthesisSolver

default_config
synthesize(self, dataset)

synthesize using vocoder on dataset

class athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')

Bases: tensorflow.keras.losses.Loss

CTC loss implemented with TensorFlow

__call__(self, logits, samples, logit_length=None)
class athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)

Bases: tensorflow.keras.losses.CategoricalCrossentropy

Seq2SeqSparseCategoricalCrossentropy loss. CategoricalCrossentropy calculated at each character for each sequence in a batch

__call__(self, logits, samples, logit_length=None)
class athena.CTCAccuracy(name='CTCAccuracy')

Bases: athena.metrics.CharactorAccuracy

CTCAccuracy. Inherits CharactorAccuracy and implements CTC accuracy calculation

__call__(self, logits, samples, logit_length=None)

Accumulate errors and counts, logit_length is the output length of encoder

class athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')

Bases: athena.metrics.CharactorAccuracy

Seq2SeqSparseCategoricalAccuracy. Inherits CharactorAccuracy and implements Attention accuracy calculation

__call__(self, logits, samples, logit_length=None)

Accumulate errors and counts

class athena.Checkpoint(checkpoint_directory=None, model=None, **kwargs)

Bases: tensorflow.train.Checkpoint

A wrapper for Tensorflow checkpoint

Parameters:
  • checkpoint_directory – the directory for checkpoint
  • summary_directory – the directory for summary used in Tensorboard
  • __init__ – provide the optimizer and model
  • __call__ – save the model

Example

transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
optimizer = tf.keras.optimizers.Adam()
ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
                  transformer=transformer, optimizer=optimizer)
solver = BaseSolver(transformer)
for epoch in dataset:
    ckpt()
_compare_and_save_best(self, loss, metrics, save_path)

compare and save the best model with best_loss and N best metrics

compute_nbest_avg(self, model_avg_num)

restore n-best avg checkpoint

__call__(self, loss=None, metrics=None)
restore_from_best(self)

restore from the best model

class athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

WarmUp Learning rate schedule for Adam

Used as:
optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule(512),
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Args:
model_dim: the model dimension, which is related to the total number of model parameters
warmup_steps: the number of iterations after which the learning rate reaches its highest value
Returns:the learning rate

Idea from the paper: Attention Is All You Need
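
As a rough illustration, the schedule from the cited paper can be written in a few lines of plain Python. This sketch follows the paper's formula scaled by k; whether Athena's __call__ matches it term by term is an assumption, and the decay_steps and decay_rate arguments are left out on purpose.

```python
def warmup_learning_rate(step, model_dim=512, warmup_steps=4000, k=1.0):
    """Warm-up schedule from "Attention Is All You Need" (requires step >= 1).

    The rate grows linearly for warmup_steps steps, then decays as step ** -0.5.
    """
    return k * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, warmup_learning_rate(step))
```

The learning rate peaks at step == warmup_steps and falls off on either side.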

__call__(self, step)
class athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

ExponentialDecayLearningRateSchedule

Used as:
optimizer = tf.keras.optimizers.Adam(learning_rate=ExponentialDecayLearningRateSchedule(0.01, 100))
Args:
initial_lr, decay_steps
Returns:initial_lr * (0.5 ** (step // decay_steps))
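
The documented return value can be checked with a small pure-Python sketch. It implements only the formula stated in the Returns line; the decay_rate, start_decay_steps and final_lr arguments of the constructor are not described here, so they are deliberately ignored, and the function name is hypothetical.

```python
def exponential_decay_lr(step, initial_lr=0.005, decay_steps=10000):
    """Halve the learning rate every decay_steps steps (documented formula only)."""
    return initial_lr * (0.5 ** (step // decay_steps))


for step in (0, 9999, 10000, 25000):
    print(step, exponential_decay_lr(step))
```

Because of the integer division, the rate stays constant within each block of decay_steps steps and drops by half at each boundary.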
__call__(self, step)
class athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

ExponentialDecayAdam Implementation

default_config
class athena.HParams(model_structure=None, **kwargs)

Bases: object

Class to hold a set of hyperparameters as name-value pairs.

A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.

You first create a HParams object by specifying the names and values of the hyperparameters.

To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:

```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate     # ==> 0.1
hparams.num_hidden_units  # ==> 100
```

Hyperparameters have a type, which is inferred from the type of the value passed at construction time. The currently supported types are: integer, float, boolean, string, and list of integer, float, boolean, or string.

You can override hyperparameter values by calling the parse() method, passing a string of comma separated name=value pairs. This is intended to make it possible to override any hyperparameter values from a single command-line flag to which the user passes ‘hyper-param=value’ pairs. It avoids having to define one flag for each hyperparameter.

The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.

Example:

```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse
parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = HParams(learning_rate=0.1, num_hidden_units=100,
                      activations=['relu', 'tanh'])

    # Override hyperparameter values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed `--hparams=learning_rate=0.3` on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate     # ==> 0.3
    hparams.num_hidden_units  # ==> 100
    hparams.activations       # ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```

_HAS_DYNAMIC_ATTRIBUTES = True
add_hparam(self, name, value)

Adds {name, value} pair to hyperparameters.

Parameters:
  • name – Name of the hyperparameter.
  • value – Value of the hyperparameter. Can be one of the following types: int, float, string, int list, float list, or string list.
Raises:

ValueError – if one of the arguments is invalid.

set_hparam(self, name, value)

Set the value of an existing hyperparameter.

This function verifies that the type of the value matches the type of the existing hyperparameter.

Parameters:
  • name – Name of the hyperparameter.
  • value – New value of the hyperparameter.
Raises:
  • KeyError – If the hyperparameter doesn’t exist.
  • ValueError – If there is a type mismatch.
del_hparam(self, name)

Removes the hyperparameter with key ‘name’.

Does nothing if it isn’t present.

Parameters:name – Name of the hyperparameter.
parse(self, values, ignore_unknown=False)

Override existing hyperparameter values, parsing new values from a string.

See parse_values for more detail on the allowed format for values.

Parameters:
  • values – String. Comma separated list of name=value pairs where ‘value’ must follow the syntax described above.
Returns:

The HParams instance.

Raises:
  • ValueError – If values cannot be parsed or a hyperparameter in values doesn’t exist.
override_from_dict(self, values_dict)

Override existing hyperparameter values, parsing new values from a dictionary.

Parameters:

values_dict – Dictionary of name:value pairs.

Returns:

The HParams instance.

Raises:
  • KeyError – If a hyperparameter in values_dict doesn’t exist.
  • ValueError – If values_dict cannot be parsed.
set_model_structure(self, model_structure)
get_model_structure(self)
to_json(self, indent=None, separators=None, sort_keys=False)

Serializes the hyperparameters into JSON.

Parameters:
  • indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.
  • separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').
  • sort_keys – If True, the output dictionaries will be sorted by key.
Returns:

A JSON string.
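
The indent, separators and sort_keys arguments have the same names and meanings as those of Python's json.dumps. Assuming to_json forwards them unchanged (an assumption, not confirmed by this page), their effect on a plain dictionary of hyperparameters looks like this:

```python
import json

# A plain dict standing in for the hyperparameter values.
values = {"num_hidden_units": 100, "learning_rate": 0.1}

# Defaults: insertion order is kept, separators are ', ' and ': '.
compact = json.dumps(values, indent=None, separators=None, sort_keys=False)

# Custom separators without spaces, keys sorted alphabetically.
sorted_tight = json.dumps(values, separators=(",", ":"), sort_keys=True)

# Pretty-printed with a two-space indent.
pretty = json.dumps(values, indent=2, sort_keys=True)

print(compact)       # {"num_hidden_units": 100, "learning_rate": 0.1}
print(sorted_tight)  # {"learning_rate":0.1,"num_hidden_units":100}
print(pretty)
```
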

parse_json(self, values_json)

Override existing hyperparameter values, parsing new values from a json object.

Parameters:

values_json – String containing a json object of name:value pairs.

Returns:

The HParams instance.

Raises:
  • KeyError – If a hyperparameter in values_json doesn’t exist.
  • ValueError – If values_json cannot be parsed.
values(self)

Return the hyperparameter values as a Python dictionary.

Returns:A dictionary with hyperparameter names as keys. The values are the hyperparameter values.
get(self, key, default=None)

Returns the value of key if it exists, else default.

__contains__(self, key)
__str__(self)

Return str(self).

__repr__(self)

Return repr(self).

static _get_kind_name(param_type, is_list)

Returns the field name given parameter type and is_list.

Parameters:
  • param_type – Data type of the hparam.
  • is_list – Whether this is a list.
Returns:

A string representation of the field name.

Raises:

ValueError – If parameter type is not recognized.

instantiate(self)
append(self, hp)
athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)

register default config and parse

athena.generate_square_subsequent_mask(size)

Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
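
A small pure-Python sketch of such a mask, using nested lists instead of a tensor. It assumes that "subsequent" means the positions after the current one, so that row i masks every column j > i; the function name is hypothetical.

```python
def square_subsequent_mask(size):
    """Return a size x size mask: 1.0 above the diagonal (masked), 0.0 elsewhere."""
    return [[1.0 if col > row else 0.0 for col in range(size)] for row in range(size)]


mask = square_subsequent_mask(3)
for row in mask:
    print(row)
# [0.0, 1.0, 1.0]
# [0.0, 0.0, 1.0]
# [0.0, 0.0, 0.0]
```

Each row corresponds to one target position, which may attend to itself and to earlier positions only.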

athena.get_wave_file_length(wave_file)

get the wave file length(duration) in ms

Parameters:wave_file – the path of wave file
Returns:the length(ms) of the wave file
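
One way to compute such a duration with only the standard library is sketched below. The helper `wave_length_ms` is hypothetical and uses Python's wave module on PCM files; it is not necessarily how athena.get_wave_file_length is implemented.

```python
import os
import tempfile
import wave


def wave_length_ms(wave_file):
    """Return the duration of a PCM WAV file in whole milliseconds."""
    with wave.open(wave_file, "rb") as wav:
        return wav.getnframes() * 1000 // wav.getframerate()


# Usage: write half a second of silence at 16 kHz, then measure it.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 8000)  # 8000 frames of silence

duration_ms = wave_length_ms(path)
print(duration_ms)  # 500
```
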
athena.set_default_summary_writer(summary_directory=None)
class athena.BeamSearchDecoder(num_class, sos, eos, beam_size)

Beam search decoding used in the seq2seq decoder layer. This layer is used for evaluation.

static build_decoder(hparams, num_class, sos, eos, decoder_one_step, lm_model=None)

Allocate the time propagating function of the decoder, initialize the decoder

Parameters:
  • hparams – the decoding configs are included here
  • num_class – the size of the vocab
  • sos – the start symbol index
  • eos – the end symbol index
  • decoder_one_step – the time propagating function of the decoder
  • lm_model – the initialized language model
Returns:

beam_search_decoder: the initialized beam search decoder

set_lm_model(self, lm_model)

Set the lm_model.

Parameters:
  • lm_model – lm_model

set_ctc_scorer(self, ctc_scorer)

Set the ctc_scorer.

Parameters:
  • ctc_scorer – the ctc scorer

beam_search_score(self, candidate_holder, encoder_outputs)

Call the time propagating function, fetch the acoustic score at the current step

If needed, call the auxiliary scorer and update cand_states in candidate_holder

Parameters:
  • candidate_holder – its cand_seqs and cand_logits are needed by the transformer decoder to calculate the output. type: CandidateHolder
  • encoder_outputs – the encoder outputs from the transformer encoder. type: tuple, (encoder_outputs, input_mask)
deal_with_completed(self, completed_scores, completed_seqs, completed_length, new_scores, candidate_holder, max_seq_len)
Add the newly calculated completed seqs with their scores to the completed seqs,
then select the top beam_size most probable completed seqs with their corresponding scores.
Parameters:
  • completed_scores – the scores of completed_seqs
  • completed_seqs – historical top beam_size probable completed seqs
  • completed_length – the length of completed_seqs
  • new_scores – the current time step scores
  • candidate_holder
  • max_seq_len – the maximum acceptable output length
Returns:

new_completed_scores: new top probable scores
completed_seqs: new top probable completed seqs
completed_length: new top probable seq length

deal_with_uncompleted(self, new_scores, new_cand_logits, new_states, candidate_holder)
Select the top probable candidate seqs from the new predictions with their scores,
then update candidate_holder based on the top probable candidates.
Parameters:
  • new_scores – the current time step prediction scores
  • new_cand_logits – historical prediction scores
  • new_states – updated states
  • candidate_holder
Returns:

candidate_holder: cand_seqs, cand_logits, cand_states, cand_scores, cand_parents will be updated here and sent to next time step
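
The selection step described for deal_with_uncompleted can be sketched in plain Python as follows. The function `expand_beam` and its arguments are hypothetical; the real method additionally updates cand_seqs, cand_logits and cand_states inside the CandidateHolder, which is not shown.

```python
import heapq


def expand_beam(cand_scores, step_log_probs, beam_size):
    """One beam-search step: return (score, parent, token) for the best extensions.

    cand_scores: accumulated log score of each live candidate, [beam]
    step_log_probs: log probability of each token given each candidate, [beam][vocab]
    """
    extensions = [
        (cand_scores[parent] + log_prob, parent, token)
        for parent, row in enumerate(step_log_probs)
        for token, log_prob in enumerate(row)
    ]
    # Keep the beam_size extensions with the highest accumulated score.
    return heapq.nlargest(beam_size, extensions)


top = expand_beam(
    cand_scores=[-0.5, -1.0],
    step_log_probs=[[-0.1, -2.0], [-0.3, -3.0]],
    beam_size=2,
)
for score, parent, token in top:
    print(parent, token, round(score, 3))
```

Each surviving entry records which previous candidate it extends (the parent) and which token was appended, which is the information needed to rebuild the sequences at the next time step.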

__call__(self, cand_seqs, cand_states, init_states, encoder_outputs)
Parameters:
  • cand_seqs – TensorArray list, element shape: [beam]
  • cand_states – [history_predictions]
  • init_states – state list
  • encoder_outputs – (encoder_outputs, memory_mask, …)
Returns:

completed_seqs: the sequence with highest score