athena¶
module
Subpackages¶
athena.data
athena.data.datasets
athena.data.datasets.base
athena.data.datasets.language_set
athena.data.datasets.preprocess
athena.data.datasets.speaker_recognition
athena.data.datasets.speaker_recognition_test
athena.data.datasets.speech_recognition
athena.data.datasets.speech_recognition_kaldiio
athena.data.datasets.speech_recognition_test
athena.data.datasets.speech_set
athena.data.datasets.speech_set_kaldiio
athena.data.datasets.speech_synthesis
athena.data.feature_normalizer
athena.data.text_featurizer
athena.layers
athena.models
athena.tools
athena.transform
athena.transform.feats
athena.transform.feats.ops
athena.transform.feats.base_frontend
athena.transform.feats.cmvn
athena.transform.feats.cmvn_test
athena.transform.feats.fbank
athena.transform.feats.fbank_pitch
athena.transform.feats.fbank_pitch_test
athena.transform.feats.fbank_test
athena.transform.feats.framepow
athena.transform.feats.framepow_test
athena.transform.feats.mel_spectrum
athena.transform.feats.mel_spectrum_test
athena.transform.feats.mfcc
athena.transform.feats.mfcc_test
athena.transform.feats.pitch
athena.transform.feats.pitch_test
athena.transform.feats.read_wav
athena.transform.feats.read_wav_test
athena.transform.feats.spectrum
athena.transform.feats.spectrum_test
athena.transform.feats.write_wav
athena.transform.feats.write_wav_test
athena.transform.audio_featurizer
athena.utils
Submodules¶
Package Contents¶
Classes¶
SpeechRecognitionDatasetBuilder |
SpeechRecognitionDatasetBuilder |
SpeechRecognitionDatasetKaldiIOBuilder |
SpeechRecognitionDatasetKaldiIOBuilder |
SpeechSynthesisDatasetBuilder |
SpeechSynthesisDatasetBuilder |
SpeechDatasetBuilder |
SpeechDatasetBuilder |
SpeechDatasetKaldiIOBuilder |
SpeechDatasetKaldiIOBuilder |
SpeakerRecognitionDatasetBuilder |
SpeakerRecognitionDatasetBuilder |
SpeakerVerificationDatasetBuilder |
SpeakerVerificationDatasetBuilder |
LanguageDatasetBuilder |
LanguageDatasetBuilder |
FeatureNormalizer |
Feature Normalizer |
TextFeaturizer |
The main text featurizer interface |
PositionalEncoding |
positional encoding can be used in transformer |
Collapse4D |
collapse4d can be used in CNN-LSTM models for speech processing |
TdnnLayer |
An implementation of a TDNN layer |
Gelu |
Gaussian Error Linear Unit. |
MultiHeadAttention |
Multi-head attention |
BahdanauAttention |
the Bahdanau Attention |
HanAttention |
Refer to [Hierarchical Attention Networks for Document Classification] |
MatchAttention |
Refer to [Learning Natural Language Inference with LSTM] |
Transformer |
A transformer model. User is able to modify the attributes as needed. The architecture |
TransformerEncoder |
TransformerEncoder is a stack of N encoder layers |
TransformerDecoder |
TransformerDecoder is a stack of N decoder layers |
TransformerEncoderLayer |
TransformerEncoderLayer is made up of self-attn and feedforward network. |
TransformerDecoderLayer |
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network. |
ResnetBasicBlock |
Basic block of resnet |
BaseModel |
Base class for model. |
SpeechTransformer |
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: |
SpeechTransformer2 |
Decoder for SpeechTransformer2 works for two-pass scheduled sampling |
Tacotron2 |
An implementation of Tacotron2 |
TTSTransformer |
TTS version of SpeechTransformer. Model mainly consists of three parts: |
FastSpeech |
Reference: Fastspeech: Fast, robust and controllable text to speech |
MaskedPredictCoding |
implementation for MPC pretrain model |
DeepSpeechModel |
a sample implementation of CTC model |
MtlTransformerCtc |
In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to |
RNNLM |
Standard implementation of an RNNLM. Model mainly consists of embedding layer, |
NeuralTranslateTransformer |
This is an example of seq2seq model using transformer |
SpeakerResnet |
A sample implementation of resnet 34 |
BaseSolver |
Base Solver. |
HorovodSolver |
A multi-processor solver based on Horovod |
DecoderSolver |
DecoderSolver |
SynthesisSolver |
SynthesisSolver |
CTCLoss |
CTC LOSS |
Seq2SeqSparseCategoricalCrossentropy |
Seq2SeqSparseCategoricalCrossentropy LOSS |
CTCAccuracy |
CTCAccuracy |
Seq2SeqSparseCategoricalAccuracy |
Seq2SeqSparseCategoricalAccuracy |
Checkpoint |
A wrapper for Tensorflow checkpoint |
WarmUpLearningSchedule |
WarmUp Learning rate schedule for Adam |
WarmUpAdam |
WarmUpAdam Implementation |
ExponentialDecayLearningRateSchedule |
ExponentialDecayLearningRateSchedule |
ExponentialDecayAdam |
ExponentialDecayAdam Implementation |
HParams |
Class to hold a set of hyperparameters as name-value pairs. |
BeamSearchDecoder |
Beam search decoding used in seq2seq decoder layer |
Functions¶
make_positional_encoding(position, d_model) |
generate a positional encoding list |
collapse4d(x, name=None) |
reshape from [N T D C] -> [N T D*C] |
gelu(x) |
Gaussian Error Linear Unit. |
register_and_parse_hparams(default_config: dict, config=None, **kwargs) |
register default config and parse |
generate_square_subsequent_mask(size) |
Generate a square mask for the sequence. The masked positions are filled with float(1.0). |
get_wave_file_length(wave_file) |
get the wave file length(duration) in ms |
set_default_summary_writer(summary_directory=None) |
-
class
athena.SpeechRecognitionDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechRecognitionDatasetBuilder
-
default_config¶
-
num_class¶ @property
Returns: the max_index of the vocabulary + 1
Return type: int
-
speaker_list¶ @property
Returns: the speaker list
Return type: list
-
audio_featurizer_func¶ return the audio_featurizer function
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
Return type: dict
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, file_path)¶ generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
-
load_csv(self, file_path)¶ load csv file
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, }
Return type: dict
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_unk(self)¶ filter samples which contain unk
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type: entries
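The half-open range [min_length, max_length) can be illustrated with a minimal sketch. The helper name `filter_by_length`, its arguments, and the four-field entry layout are assumptions made for illustration; they are not taken from the library.

```python
# Hypothetical helper (not part of athena): keep entries whose length
# lies in the half-open interval [min_length, max_length).
def filter_by_length(entries, min_length, max_length, length_index=1):
    """Return entries with min_length <= length < max_length."""
    return [e for e in entries if min_length <= e[length_index] < max_length]


# Illustrative entries: (wav_filename, wav_len_ms, transcript, speaker)
entries = [
    ("a.wav", 200, "hi", "spk1"),
    ("b.wav", 5000, "hello there", "spk2"),
    ("c.wav", 30000, "a long utterance", "spk1"),
]

# 30000 is excluded because the upper bound is exclusive.
kept = filter_by_length(entries, min_length=500, max_length=30000)
```

An entry whose length equals min_length is kept, while one whose length equals max_length is dropped.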
-
filter_sample_by_output_length(self)¶ filter samples by output length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder
-
default_config¶
-
num_class¶ return the max_index of the vocabulary + 1
-
speaker_list¶ return the speaker list
-
audio_featurizer_func¶ return the audio_featurizer function
-
sample_type¶
-
sample_shape¶
-
sample_signature¶
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, file_dir, apply_sort_filter=True)¶ Generate a list of tuples (feat_key, speaker).
-
load_scps(self, file_dir)¶ load kaldi-format feats.scp, labels.scp and utt2spk (optional)
-
__getitem__(self, index)¶
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_unk(self)¶ filter samples which contain unk
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type: entries
-
filter_sample_by_output_length(self)¶ filter samples by output length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeechSynthesisDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechSynthesisDatasetBuilder
-
default_config¶
-
num_class¶ @property
Returns: the max_index of the vocabulary
Return type: int
-
speaker_list¶ return the speaker list
-
audio_featurizer_func¶ return the audio_featurizer function
-
feat_dim¶ return the number of feature dims
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input": tf.int32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.float32, "speaker": tf.int32 }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None, feature_dim]), "speaker": tf.TensorShape([]) }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, feature_dim), dtype=tf.float32), "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32) }
Return type: dict
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, file_path)¶ generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
-
load_csv(self, file_path)¶ load csv file
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input": text, "input_length": text_length, "output_length": audio_feat_length, "output": audio_feat, "speaker": self.speakers_dict[speaker] }
Return type: dict
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_unk(self)¶ filter samples which contain unk
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcript, speaker)
Return type: entries
-
filter_sample_by_output_length(self)¶ filter samples by output length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeechDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechDatasetBuilder
-
default_config¶
-
num_class¶ @property
Returns: the target dim
Return type: int
-
speaker_list¶ return the speaker list
-
audio_featurizer_func¶ return the audio_featurizer function
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input": tf.float32, "input_length": tf.int32, "output": tf.float32, "output_length": tf.int32, }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input": tf.TensorShape([None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None, None]), "output_length": tf.TensorShape([]), }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
Return type: dict
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, file_path)¶ generate a list of tuples (wav_filename, wav_length_ms, speaker).
-
load_csv(self, file_path)¶ load csv file
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input": input_data, "input_length": input_data.shape[0], "output": output_data, "output_length": output_data.shape[0], }
Return type: dict
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
The range is read from self.hparams.input_length_range = [min_len, max_len]:
- min_len – the minimal length (ms)
- max_len – the maximal length (ms)
Returns: a filtered list of tuples (wav_filename, wav_len, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeechDatasetKaldiIOBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechDatasetKaldiIOBuilder
-
default_config¶
-
num_class¶ return the max_index of the vocabulary
-
speaker_list¶ return the speaker list
-
audio_featurizer_func¶ return the audio_featurizer function
-
sample_type¶
-
sample_shape¶
-
sample_signature¶
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, file_dir, apply_sort_filter=True)¶ generate a list of tuples (feat_key, speaker).
-
load_scps(self, file_dir)¶ load kaldi-format feats.scp and utt2spk (optional)
-
__getitem__(self, index)¶
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeakerRecognitionDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeakerRecognitionDatasetBuilder
-
default_config¶
-
num_class¶ @property
Returns: the number of speakers
Return type: int
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32 }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]) }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
Return type: dict
-
reload_config(self, config)¶ reload the config
-
preprocess_data(self, data_csv_path)¶ generate a list of tuples (wav_filename, wav_length_ms, speaker_id, speaker_name).
-
cut_features(self, feature)¶ cut acoustic features
-
load_csv(self, data_csv_path)¶ load csv file
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input": feat, "input_length": feat_length, "output_length": 1, "output": spkid }
Return type: dict
-
__len__(self)¶ return the number of data samples
-
filter_sample_by_input_length(self)¶ filter samples by input length
The length of filtered samples will be in [min_length, max_length)
Returns: a filtered list of tuples (wav_filename, wav_len, transcripts, speed, speaker)
Return type: entries
-
compute_cmvn_if_necessary(self, is_necessary=True)¶ compute cmvn file
-
-
class
athena.SpeakerVerificationDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.speaker_recognition.SpeakerRecognitionDatasetBuilder
SpeakerVerificationDatasetBuilder
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input_a": tf.float32, "input_b": tf.float32, "output": tf.int32 }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input_a": tf.TensorShape([None, dim, nc]), "input_b": tf.TensorShape([None, dim, nc]), "output": tf.TensorShape([None]) }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input_a": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_b": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
Return type: dict
-
preprocess_data(self, data_csv_path)¶ generate a list of tuples (wav_filename_a, speaker_a, wav_filename_b, speaker_b, label).
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input_a": feat_a, "input_b": feat_b, "output": [label] }
Return type: dict
-
-
class
athena.LanguageDatasetBuilder(config=None)¶ Bases:
athena.data.datasets.base.BaseDatasetBuilder
LanguageDatasetBuilder
-
default_config¶
-
num_class¶ @property
Returns: the max_index of the vocabulary
Return type: int
-
input_vocab_size¶ @property
Returns: the input vocab size
Return type: int
-
sample_type¶ @property
Returns: sample_type of the dataset: { "input": tf.int32, "input_length": tf.int32, "output": tf.int32, "output_length": tf.int32, }
Return type: dict
-
sample_shape¶ @property
Returns: sample_shape of the dataset: { "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "output_length": tf.TensorShape([]), }
Return type: dict
-
sample_signature¶ @property
Returns: sample_signature of the dataset: { "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
Return type: dict
-
load_csv(self, file_path)¶ load csv file
-
__getitem__(self, index)¶ get a sample
Parameters: index (int) – index of the entries
Returns: sample: { "input": input_labels, "input_length": input_length, "output": output_labels, "output_length": output_length, }
Return type: dict
-
__len__(self)¶ return the number of data samples
-
-
class
athena.FeatureNormalizer(cmvn_file=None)¶ Feature Normalizer
-
__call__(self, feat_date, speaker, reverse=False)¶
-
apply_cmvn(self, feat_data, speaker, reverse=False)¶ TODO: docstring
-
compute_cmvn(self, entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)¶ Compute cmvn for filtered entries
-
compute_cmvn_kaldiio(self, entries, speakers, kaldi_io_feats, feature_dim)¶ Compute cmvn for filtered entries using kaldi-format data
-
load_cmvn(self)¶ TODO: docstring
-
save_cmvn(self)¶ TODO: docstring
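Since several of these methods are undocumented, the following is only a generic sketch of cepstral mean and variance normalization (CMVN) in NumPy. It assumes that `reverse=True` undoes the normalization, and it ignores the per-speaker argument that the signatures above accept. The function names and the `eps` parameter are illustrative, not the library's.

```python
import numpy as np


def compute_cmvn_stats(feats):
    """Per-dimension mean and variance over all frames; feats is [T, D]."""
    return feats.mean(axis=0), feats.var(axis=0)


def apply_cmvn_sketch(feats, mean, var, reverse=False, eps=1e-20):
    """Normalize to zero mean / unit variance, or undo it when reverse=True."""
    std = np.sqrt(var + eps)  # eps guards against division by zero
    if reverse:
        return feats * std + mean
    return (feats - mean) / std


rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # 100 frames, 4 dims
mean, var = compute_cmvn_stats(feats)
normalized = apply_cmvn_sketch(feats, mean, var)
restored = apply_cmvn_sketch(normalized, mean, var, reverse=True)
```

After normalization each feature dimension has approximately zero mean and unit variance, and the reverse pass recovers the original features.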
-
-
class
athena.TextFeaturizer(config=None)¶ The main text featurizer interface
-
supported_model¶
-
default_config¶
-
model_type¶ the model type
-
unk_index¶ return the unk index
-
load_model(self, model_file)¶ load model
-
delete_punct(self, tokens)¶ delete punctuation tokens
-
__len__(self)¶
-
encode(self, texts)¶ Convert a sentence to a list of ids, with special tokens added.
-
decode(self, sequences)¶ Convert a list of ids to a sentence
-
-
athena.make_positional_encoding(position, d_model)¶ generate a positional encoding list
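As a rough illustration, the sinusoidal encoding from "Attention Is All You Need" can be written in NumPy as below. Whether the library function uses exactly this interleaving of sines and cosines is an assumption; the function name `sinusoidal_positional_encoding` is hypothetical.

```python
import numpy as np


def sinusoidal_positional_encoding(position, d_model):
    """Return a [position, d_model] array of sine/cosine position codes."""
    pos = np.arange(position)[:, np.newaxis]   # [position, 1]
    i = np.arange(d_model)[np.newaxis, :]      # [1, d_model]
    # Each pair of dimensions (2k, 2k+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                 # [position, d_model]
    encoding = np.zeros((position, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return encoding


pe = sinusoidal_positional_encoding(position=50, d_model=16)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which makes a quick sanity check.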
-
athena.collapse4d(x, name=None)¶ reshape from [N T D C] -> [N T D*C] using tf.shape(x), which generates a tensor, instead of x.shape
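The index layout of this reshape can be checked with a small NumPy sketch. This only illustrates how the last two axes are merged; it is not the TensorFlow implementation, where tf.shape(x) is what lets the reshape work when some dimensions are unknown at graph-construction time. The name `collapse4d_np` is hypothetical.

```python
import numpy as np


def collapse4d_np(x):
    """Reshape [N, T, D, C] to [N, T, D*C] by merging the last two axes."""
    n, t, d, c = x.shape
    return x.reshape(n, t, d * c)


x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
y = collapse4d_np(x)
```

Element `x[n, t, d, c]` ends up at `y[n, t, d * C + c]`, so channels of the same feature bin stay adjacent.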
-
athena.gelu(x)¶ Gaussian Error Linear Unit. This is a smoother version of the ReLU. Original paper: https://arxiv.org/abs/1606.08415
Parameters: x – float Tensor to perform activation.
Returns: x with the GELU activation applied.
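For reference, the two forms of GELU given in the paper can be sketched in plain Python on scalars. Which form the library uses is not stated here, so both are shown; the function names are illustrative.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU from the paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


values = [gelu_exact(v) for v in (-3.0, 0.0, 3.0)]
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, which is why it is described as a smoother ReLU.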
-
class
athena.PositionalEncoding(d_model, max_position=800, scale=False)¶ Bases:
tensorflow.keras.layers.Layer
positional encoding can be used in transformer
-
call(self, x)¶ call function
-
-
class
athena.Collapse4D¶ Bases:
tensorflow.keras.layers.Layer
collapse4d can be used in CNN-LSTM models for speech processing; it reshapes from [N T D C] -> [N T D*C]
-
call(self, x)¶
-
-
class
athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)¶ Bases:
tensorflow.keras.layers.Layer
An implementation of a TDNN layer
Parameters: - context – an int giving the left and right context, or a list of context indexes, e.g. (-2, 0, 2).
- output_dim – the dimension of the linear transform.
-
call(self, x, training=None, mask=None)¶
-
-
class
athena.Gelu¶ Bases:
tensorflow.keras.layers.Layer
Gaussian Error Linear Unit. This is a smoother version of the ReLU. Original paper: https://arxiv.org/abs/1606.08415
Parameters: x – float Tensor to perform activation.
Returns: x with the GELU activation applied.
-
call(self, x)¶
-
-
class
athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)¶ Bases:
tensorflow.keras.layers.Layer
Multi-head attention
Multi-head attention consists of four parts:
- Linear layers and split into heads.
- Scaled dot-product attention.
- Concatenation of heads.
- Final linear layer.
Each multi-head attention block gets three inputs: Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. Scaled dot-product attention is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.
Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.
-
split_heads(self, x, batch_size)¶ Split the last dimension into (num_heads, depth).
Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
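The reshape-and-transpose step can be sketched in NumPy as follows. This is an illustration of the shapes only; the extra `num_heads` argument is an assumption, since the layer itself would take it from its constructor.

```python
import numpy as np


def split_heads_np(x, batch_size, num_heads):
    """[batch, seq_len, d_model] -> [batch, num_heads, seq_len, depth]."""
    d_model = x.shape[-1]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    depth = d_model // num_heads
    # Split the last dimension into (num_heads, depth) ...
    x = x.reshape(batch_size, -1, num_heads, depth)
    # ... then move the head axis in front of the sequence axis.
    return x.transpose(0, 2, 1, 3)


x = np.zeros((2, 7, 16))
heads = split_heads_np(x, batch_size=2, num_heads=4)
```

With d_model=16 and 4 heads, each head works on a depth of 4, so the total amount of computation matches a single full-width head.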
-
call(self, v, k, q, mask)¶ call function
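The scaled dot-product step applied to each head can be sketched in NumPy as below. The mask convention used here (1.0 marks a position to hide, implemented by adding a large negative number to the score) is an assumption for illustration and may not match the layer's own convention.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v, mask=None):
    """Return (output, weights) for q, k, v shaped [..., seq_len, depth]."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9  # assumed: 1.0 marks masked positions
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 2, 3, 4))  # [batch, heads, seq_len, depth]
k = rng.normal(size=(1, 2, 3, 4))
v = rng.normal(size=(1, 2, 3, 4))
look_ahead = 1.0 - np.tril(np.ones((3, 3)))  # hide future positions
output, weights = scaled_dot_product_attention(q, k, v, mask=look_ahead)
```

With the look-ahead mask, the first query position can only attend to itself, so its output equals the first value vector.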
-
class
athena.BahdanauAttention(units, input_dim=1024)¶ Bases:
tensorflow.keras.Model
the Bahdanau Attention
-
call(self, query, values)¶ call function
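Bahdanau (additive) attention scores each time step with a small feed-forward network. The NumPy sketch below shows the usual formulation; the weight names `w1`, `w2`, `v`, and the shapes chosen are assumptions for illustration and are not read from the library.

```python
import numpy as np


def additive_attention(query, values, w1, w2, v):
    """query: [B, H], values: [B, T, Hv] -> context [B, Hv], weights [B, T, 1]."""
    q = query[:, np.newaxis, :]                 # [B, 1, H]
    score = np.tanh(q @ w1 + values @ w2) @ v   # [B, T, 1]
    score = score - score.max(axis=1, keepdims=True)
    weights = np.exp(score)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over time
    context = (weights * values).sum(axis=1)    # weighted sum of values
    return context, weights


rng = np.random.default_rng(0)
batch, steps, hidden, value_dim, units = 2, 5, 8, 6, 4
query = rng.normal(size=(batch, hidden))
values = rng.normal(size=(batch, steps, value_dim))
w1 = rng.normal(size=(hidden, units))      # projects the query
w2 = rng.normal(size=(value_dim, units))   # projects the values
v = rng.normal(size=(units, 1))            # reduces each step to a score
context, weights = additive_attention(query, values, w1, w2, v)
```

Here `units` plays the role of the width of the scoring network, analogous to the `units` constructor argument above.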
-
-
class
athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)¶ Bases:
tensorflow.keras.layers.Layer
Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
Input shape: (Batch size, steps, features)
Output shape: (Batch size, features)
-
build(self, input_shape)¶ build in keras layer
-
call(self, inputs, training=None, mask=None)¶ call function in keras
-
compute_output_shape(self, input_shape)¶ compute output shape
-
_masked_softmax(self, logits, mask, axis)¶ Compute softmax with input mask.
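A masked softmax of this kind can be sketched in NumPy as follows. The convention that 1 marks a valid position and 0 a padded one is an assumption, as is the small epsilon in the denominator; the library's private method may differ.

```python
import numpy as np


def masked_softmax(logits, mask, axis=-1):
    """Softmax over `axis` that assigns zero weight where mask == 0."""
    mask = mask.astype(logits.dtype)
    shifted = logits - np.max(logits, axis=axis, keepdims=True)  # stability
    exp = np.exp(shifted) * mask  # zero out padded positions
    return exp / (np.sum(exp, axis=axis, keepdims=True) + 1e-9)


logits = np.array([[1.0, 2.0, 3.0, 4.0]])
mask = np.array([[1, 1, 0, 0]])  # last two steps are padding
weights = masked_softmax(logits, mask)
```

The remaining weights are renormalized over the valid positions only, so padding never contributes to the pooled output.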
-
-
class
athena.MatchAttention(config, **kwargs)¶ Bases:
tensorflow.keras.layers.Layer
Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)
wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
Input shape: (Batch size, steps, features)
Output shape: (Batch size, steps, features)
-
call(self, tensors)¶ Attention layer.
-
-
class
athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None)¶ Bases:
tensorflow.keras.layers.Layer
A transformer model. User is able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with corresponding parameters.
Parameters: - d_model – the number of expected features in the encoder/decoder inputs (default=512).
- nhead – the number of heads in the multiheadattention models (default=8).
- num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).
- num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).
- dim_feedforward – the dimension of the feedforward network model (default=2048).
- dropout – the dropout value (default=0.1).
- activation – the activation function of encoder/decoder intermediate layer, relu or gelu (default=gelu).
- custom_encoder – custom encoder (default=None).
- custom_decoder – custom decoder (default=None).
- Examples::
>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example to apply nn.Transformer module for the word language model is available in https://github.com/pytorch/examples/tree/master/word_language_model
-
call(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)¶ Take in and process masked source/target sequences.
Parameters: - src – the sequence to the encoder (required).
- tgt – the sequence to the decoder (required).
- src_mask – the additive mask for the src sequence (optional).
- tgt_mask – the additive mask for the tgt sequence (optional).
- memory_mask – the additive mask for the encoder output (optional).
- src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
- tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).
- memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).
- Shape:
- src: \((N, S, E)\).
- tgt: \((N, T, E)\).
- src_mask: \((N, S)\).
- tgt_mask: \((N, T)\).
- memory_mask: \((N, S)\).
Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float(‘-inf’) and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.
- output: \((N, T, E)\).
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
-
class
athena.TransformerEncoder(encoder_layers)¶ Bases:
tensorflow.keras.layers.Layer
TransformerEncoder is a stack of N encoder layers
Parameters: - encoder_layers – a list of TransformerEncoderLayer() instances, one per layer (required).
- Examples::
>>> num_layers = 6
>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.uniform((10, 32, 512))
>>> out = transformer_encoder(src)
-
call(self, src, src_mask=None, training=None)¶ Pass the input through the encoder layers in turn.
Parameters: - src – the sequence to the encoder (required).
- src_mask – the mask for the src sequence (optional).
- Shape:
- see the docs in Transformer class.
-
set_unidirectional(self, uni=False)¶ whether to apply triangular masks to make the transformer unidirectional
-
class
athena.TransformerDecoder(decoder_layers)¶ Bases:
tensorflow.keras.layers.Layer
TransformerDecoder is a stack of N decoder layers
Parameters: - decoder_layers – a list of TransformerDecoderLayer() instances, one per layer (required).
- Examples::
>>> num_layers = 6
>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.uniform((10, 32, 512))
>>> tgt = tf.random.uniform((20, 32, 512))
>>> out = transformer_decoder(tgt, memory)
-
call(self, tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)¶ Pass the inputs (and mask) through the decoder layer in turn.
Parameters: - tgt – the sequence to the decoder (required).
- memory – the sequence from the last layer of the encoder (required).
- tgt_mask – the mask for the tgt sequence (optional).
- memory_mask – the mask for the memory sequence (optional).
- Shape:
- see the docs in Transformer class.
-
class
athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None)¶ Bases:
tensorflow.keras.layers.Layer
TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters: - d_model – the number of expected features in the input (required).
- nhead – the number of heads in the multiheadattention models (required).
- dim_feedforward – the dimension of the feedforward network model (default=2048).
- dropout – the dropout value (default=0.1).
- activation – the activation function of intermediate layer, relu or gelu (default=gelu).
- Examples::
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
-
call(self, src, src_mask=None, training=None)¶ Pass the input through the encoder layer.
Parameters: - src – the sequence to the encoder layer (required).
- src_mask – the mask for the src sequence (optional).
- Shape:
- see the docs in Transformer class.
-
set_unidirectional(self, uni=False)¶ whether to apply triangular masks to make the transformer unidirectional
-
class
athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')¶ Bases:
tensorflow.keras.layers.Layer
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters: - d_model – the number of expected features in the input (required).
- nhead – the number of heads in the multiheadattention models (required).
- dim_feedforward – the dimension of the feedforward network model (default=2048).
- dropout – the dropout value (default=0.1).
- activation – the activation function of intermediate layer, relu or gelu (default=gelu).
- Examples::
>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
-
call(self, tgt, memory, tgt_mask=None, memory_mask=None, training=None)¶ Pass the inputs (and mask) through the decoder layer.
Parameters: - tgt – the sequence to the decoder layer (required).
- memory – the sequence from the last layer of the encoder (required).
- tgt_mask – the mask for the tgt sequence (optional).
- memory_mask – the mask for the memory sequence (optional).
- Shape:
- see the docs in Transformer class.
-
class
athena.ResnetBasicBlock(num_filter, stride=1)¶ Bases:
tensorflow.keras.layers.Layer
Basic block of ResNet. Reference: “Deep residual learning for image recognition”.
-
call(self, inputs)¶ call model
-
make_downsample_layer(self, num_filter, stride)¶ perform downsampling using conv layer with stride != 1
-
-
class
athena.BaseModel(**kwargs)¶ Bases:
tensorflow.keras.Model
Base class for model.
-
call(self, samples, training=None)¶ call model
-
get_loss(self, outputs, samples, training=None)¶ get loss
-
compute_logit_length(self, samples)¶ compute the logit length
-
reset_metrics(self)¶ reset the metrics
-
prepare_samples(self, samples)¶ Prepare samples for models that need special data handling. Take care not to change the shape of samples.
-
restore_from_pretrained_model(self, pretrained_model, model_type='')¶ restore from pretrained model
-
decode(self, samples, hparams, decoder)¶ decode interface
-
-
class
athena.SpeechTransformer(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself.
-
default_config¶
-
call(self, samples, training: bool = None)¶ call model
-
static
_create_masks(x, input_length, y)¶ Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
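The mask convention stated above (1.0 for masked positions, 0.0 for unmasked ones) can be illustrated for the padding part of such a mask. The sketch uses plain Python lists; the function name `padding_mask` is hypothetical, and the real method also has to produce the look-ahead mask for y and the 4-D shapes expected by the attention layers, which are not reproduced here.

```python
def padding_mask(lengths, max_len):
    """One row per sequence: 0.0 for real frames, 1.0 for padded frames."""
    return [[0.0 if t < n else 1.0 for t in range(max_len)] for n in lengths]


# Two sequences padded to length 4, with 3 and 1 real frames respectively.
pad_mask = padding_mask([3, 1], max_len=4)
for row in pad_mask:
    print(row)
```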
-
compute_logit_length(self, samples)¶ Compute the logit length.
-
time_propagate(self, history_logits, history_predictions, step, enc_outputs)¶ TODO: docstring.
last_predictions: the predictions of the last time step, shape: [beam_size]
history_predictions: the predictions from step 0 to the current time step, shape: [beam_size, time_steps]
states: (step)
-
decode(self, samples, hparams, decoder, return_encoder=False)¶ Beam search decoding.
Parameters: - samples – the data source to be decoded
- hparams – decoding configs are included here
- decoder – it contains the main decoding operations
- return_encoder – if True, encoder_output and input_mask will be returned
Returns: - predictions: the corresponding decoding results, shape: [batch_size, seq_length]; returned only if return_encoder is False
- encoder_output: the encoder output computed in decode mode, shape: [batch_size, seq_length, hsize]
- input_mask: masked by input length, shape: [batch_size, 1, 1, seq_length]; encoder_output and input_mask are returned only if return_encoder is True
-
restore_from_pretrained_model(self, pretrained_model, model_type='')¶ restore from pretrained model
-
deploy(self)¶ deployment function
-
inference_one_step(self, enc_outputs, cur_input, inner_packed_states_array)¶ Callback function for the WFST decoder.
Parameters: - enc_outputs – outputs and mask of encoder
- cur_input – input sequence for transformer, type: list
- inner_packed_states_array – inner states that need to be recorded, type: tuple
Returns: - scores: log scores for all labels
- inner_packed_states_array: inner states for the next iteration
-
-
class
athena.SpeechTransformer2(data_descriptions, config=None)¶ Bases:
athena.models.speech_transformer.SpeechTransformer
Decoder for SpeechTransformer2; works for two-pass scheduled sampling.
-
call(self, samples, training: bool = None)¶ call model
-
mix_target_sequence(self, gold_token, predicted_token, training, top_k=5)¶ Mix the gold tokens and the predictions.
Parameters: - gold_token – true labels
- predicted_token – predictions by the first pass
Returns: mix of the gold_token and predicted_token
-
-
class
athena.Tacotron2(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
An implementation of Tacotron2. Reference: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
-
default_config¶
-
_pad_and_reshape(self, outputs, ori_lens, reverse=False)¶
Parameters: - outputs – true labels, shape: [batch, y_steps, feat_dim]
- ori_lens – scalar
Returns: reshaped_outputs, reshaped to match reduction_factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
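The shape change described for _pad_and_reshape can be illustrated for a single utterance with plain Python lists. This is only a sketch: the helper name `pad_and_group` is hypothetical, and padding with zero frames when y_steps is not a multiple of reduction_factor is an assumption, not something this page states.

```python
def pad_and_group(frames, reduction_factor):
    """Group every `reduction_factor` frames of one utterance into one wider frame.

    frames: list of y_steps frames, each a list of feat_dim floats.
    Returns ceil(y_steps / reduction_factor) frames of width
    feat_dim * reduction_factor.
    """
    feat_dim = len(frames[0])
    # Assumption: pad with zero frames so the step count divides evenly.
    remainder = len(frames) % reduction_factor
    if remainder:
        frames = frames + [[0.0] * feat_dim] * (reduction_factor - remainder)
    grouped = []
    for start in range(0, len(frames), reduction_factor):
        wide = []
        for frame in frames[start:start + reduction_factor]:
            wide.extend(frame)
        grouped.append(wide)
    return grouped


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # y_steps=3, feat_dim=2
grouped = pad_and_group(frames, reduction_factor=2)
print(grouped)
```

Each output frame carries reduction_factor consecutive input frames, so the decoder needs fewer steps to cover the same utterance.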
-
call(self, samples, training: bool = None)¶ call model
-
initialize_input_y(self, y)¶
Parameters: y – the true label, shape: [batch, y_steps, feat_dim]
Returns: y0, in which zeros are padded as one step before the start step, shape: [batch, y_steps+1, feat_dim]
-
initialize_states(self, encoder_output, input_length)¶
Parameters: - encoder_output – encoder outputs, shape: [batch, x_step, eunits]
- input_length – shape: [batch]
Returns: - prev_rnn_states: initial states of rnns in decoder, [rnn layers, 2, batch, dunits]
- prev_attn_weight: initial attention weights, [batch, x_steps]
- prev_context: initial context, [batch, eunits]
-
time_propagate(self, encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)¶
Parameters: - encoder_output – encoder output (batch, x_steps, eunits).
- input_length – (batch,)
- prev_y – one step of true labels or predicted labels (batch, feat_dim).
- prev_rnn_states – previous rnn states [layers, 2, states] for lstm
- prev_attn_weight – previous attention weights, shape: [batch, x_steps]
- prev_context – previous context vector: [batch, attn_dim]
- training – whether the model is in training mode
Returns: - out: shape: [batch, feat_dim]
- logit: shape: [batch, reduction_factor]
- current_rnn_states: [rnn_layers, 2, batch, dunits]
- attn_weight: [batch, x_steps]
-
get_loss(self, outputs, samples, training=None)¶ get loss
-
synthesize(self, samples)¶ Synthesize acoustic features from the input texts.
Parameters: samples – the data source to be synthesized
Returns: - after_outs: the corresponding synthesized acoustic features
- attn_weights_stack: the corresponding attention weights
-
_synthesize_post_net(self, before_outs, logits_stack)¶
-
-
class
athena.TTSTransformer(data_descriptions, config=None)¶ Bases:
athena.models.tacotron2.Tacotron2
TTS version of SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network.
-
default_config¶
-
static
_create_masks(y, output_length, x)¶ Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
-
call(self, samples, training: bool = None)¶
-
time_propagate(self, encoder_output, memory_mask, outs, step)¶ Synthesize one step of frames.
Parameters: - encoder_output – the encoder output, shape: [batch, x_steps, eunits]
- memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]
- outs (TensorArray) – previous outputs
- step – the current step number
Returns: - out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
- logit: new stop token prediction logit, shape: [batch, reduction_factor]
- attention_weights (list): the corresponding attention weights; each element of the list holds the attention weights of one decoder layer, shape: [batch, num_heads, seq_len_q, seq_len_k]
-
synthesize(self, samples)¶ Synthesize acoustic features from the input texts.
Parameters: samples – the data source to be synthesized
Returns: - after_outs: the corresponding synthesized acoustic features
- attn_weights_stack: the corresponding attention weights
-
-
class
athena.FastSpeech(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
Reference: FastSpeech: Fast, robust and controllable text to speech (http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)
-
default_config¶
-
set_teacher_model(self, teacher_model, teacher_type)¶
-
restore_from_pretrained_model(self, pretrained_model, model_type='')¶
-
get_loss(self, outputs, samples, training=None)¶
-
_feedforward_decoder(self, encoder_output, duration_indexes, duration_sequences, output_length, training)¶
-
call(self, samples, training: bool = None)¶
-
synthesize(self, samples)¶
-
-
class
athena.MaskedPredictCoding(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
Implementation of the MPC pretraining model.
Parameters: - num_filters – an int, the number of filters in the CNN
- d_model – an int, the dimension of the model
- num_heads – number of heads in the transformer
- num_encoder_layers – number of layers in the encoder
- dff – an int, the dimension of the feed-forward network
- rate – dropout rate
- chunk_size – number of consecutive masked frames, e.g. 1 or 3
- keep_probability – probability that a frame is not masked
- mode – training mode, e.g. MPC: pretrain
- max_pool_layers – index of max pool layers in the encoder, default is -1
-
default_config¶
-
call(self, samples, training: bool = None)¶ Used for training.
Parameters: samples – a dict including the keys ‘input’, ‘input_length’, ‘output_length’ and ‘output’. input: acoustic features (e.g. f-bank), Tensor, shape is (batch, time_len, dim, 1)
Returns: - MPC outputs to fit acoustic features
- encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
-
get_loss(self, logits, samples, training=None)¶ Get the MPC loss.
Parameters: logits – MPC output
Returns: MPC L1 loss
-
compute_logit_length(self, samples)¶
-
generate_mpc_mask(self, input_data)¶ Generate the mask for pretraining.
Parameters: input_data – acoustic features, e.g. F-bank
Returns: mask tensor
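How chunk_size and keep_probability could interact when building such a mask can be sketched in plain Python. The function name `chunk_mask`, the `seed` argument, and the convention that 1.0 means "kept" and 0.0 means "masked" are assumptions made for illustration; the actual tensor layout used by generate_mpc_mask is not described on this page.

```python
import random


def chunk_mask(num_frames, chunk_size, keep_probability, seed=0):
    """Draw one keep/mask decision per chunk of consecutive frames.

    Returns a list of length num_frames with 1.0 for kept frames and
    0.0 for masked frames (convention assumed for this sketch).
    """
    rng = random.Random(seed)
    mask = []
    for _ in range(0, num_frames, chunk_size):
        # All frames in a chunk share the same decision.
        keep = 1.0 if rng.random() < keep_probability else 0.0
        mask.extend([keep] * chunk_size)
    # The last chunk may overshoot; trim back to the real length.
    return mask[:num_frames]


mask = chunk_mask(num_frames=10, chunk_size=3, keep_probability=0.85)
print(mask)
```

With chunk_size=1 every frame is decided independently; larger chunks hide longer spans, which makes the reconstruction task harder.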
-
prepare_samples(self, samples)¶ Prepare samples for models that need special data handling. Take care not to change the shape of samples.
-
-
class
athena.DeepSpeechModel(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
A sample implementation of a CTC model.
-
default_config¶
-
call(self, samples, training=None)¶ call function
-
compute_logit_length(self, samples)¶ Compute the logit length.
-
-
class
athena.MtlTransformerCtc(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
In speech recognition, adding a CTC loss to an attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.
-
SUPPORTED_MODEL¶
-
default_config¶
-
call(self, samples, training=None)¶ call function in keras layers
-
get_loss(self, outputs, samples, training=None)¶ get loss used for training
-
compute_logit_length(self, samples)¶ compute the logit length
-
reset_metrics(self)¶ reset the metrics
-
restore_from_pretrained_model(self, pretrained_model, model_type='')¶ A more general-purpose interface for pretrained model restoration
Parameters: - pretrained_model – checkpoint path of mpc model
- model_type – the type of pretrained model to restore
-
decode(self, samples, hparams, decoder)¶ Initialize the model for decoding; the decoder is called here to create predictions.
Parameters: - samples – the data source to be decoded
- hparams – decoding configs are included here
- decoder – it contains the main decoding operations
Returns: predictions, the corresponding decoding results
-
deploy(self)¶ deployment function
-
-
class
athena.RNNLM(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
Standard implementation of an RNNLM. The model mainly consists of an embedding layer, RNN layers (with dropout), and a fully connected layer, which are all included in self.model_for_rnn.
-
default_config¶
-
call(self, samples, training: bool = None)¶ call model
-
save_model(self, path)¶ Save the model and its current weights. path is an h5 file name, like ‘my_model.h5’.
Usage: new_model = tf.keras.models.load_model(path)
-
get_loss(self, logits, samples, training=None)¶ get loss
-
-
class
athena.NeuralTranslateTransformer(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
This is an example of a seq2seq model using a transformer.
-
default_config¶
-
call(self, samples, training=None)¶
-
static
_create_masks(x, y)¶
-
-
class
athena.SpeakerResnet(data_descriptions, config=None)¶ Bases:
athena.models.base.BaseModel
A sample implementation of ResNet-34. Reference: “Deep residual learning for image recognition”. The implementation is the same as the standard ResNet with 34 weighted layers, except that it uses only 1/4 of the filters to reduce computation.
config: task: “speaker_identification” or “speaker_verification”
-
default_config¶
-
call(self, samples, training=None)¶ call model
-
init_loss(self, loss)¶ initialize loss function
-
get_loss(self, outputs, samples, training=None)¶
-
get_eer(self, outputs, samples, training=False)¶ get equal error rates
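For readers unfamiliar with the metric, the equal error rate is the operating point where the false acceptance rate and the false rejection rate coincide. The sketch below computes it from raw trial scores in plain Python; the function name `equal_error_rate` and the input format (a score list plus 0/1 labels) are assumptions for illustration and do not reflect how get_eer receives its outputs and samples.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same speaker) trials, 0 for non-target trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is >= threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(eer)
```

With a finite trial list the two rates rarely match exactly, so the sketch reports the midpoint at the threshold where they are closest.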
-
make_resnet_block_layer(self, num_filter, num_blocks, stride=1)¶
-
-
class
athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶ Bases:
tensorflow.keras.Model
Base Solver.
-
default_config¶
-
static
initialize_devices(visible_gpu_idx=None)¶ Initialize Horovod (hvd) devices; this should be called first.
-
static
clip_by_norm(grads, norm)¶ clip norm using tf.clip_by_norm
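The rule that tf.clip_by_norm applies to one tensor can be written out for a flat list of floats: values are left alone when their L2 norm is within the limit and rescaled to exactly that norm otherwise. This is a plain-Python sketch of the rule for one gradient, not the solver's code, which operates on a list of gradient tensors.

```python
import math


def clip_by_norm(values, clip_norm):
    """Rescale `values` so that their L2 norm does not exceed `clip_norm`."""
    l2 = math.sqrt(sum(v * v for v in values))
    if l2 <= clip_norm:
        return list(values)  # already within the limit
    return [v * clip_norm / l2 for v in values]


clipped = clip_by_norm([3.0, 4.0], clip_norm=1.0)  # original norm is 5.0
print(clipped)
```

The direction of the gradient is preserved; only its magnitude is capped.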
-
train_step(self, samples)¶ train the model 1 step
-
train(self, dataset, total_batches=-1)¶ Update the model in 1 epoch
-
evaluate_step(self, samples)¶ evaluate the model 1 step
-
evaluate(self, dataset, epoch)¶ evaluate the model
-
-
class
athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶ Bases:
athena.solver.BaseSolver
A multi-process solver based on Horovod.
-
static
initialize_devices(visible_gpu_idx=None)¶ Initialize Horovod (hvd) devices; this should be called first.
-
train_step(self, samples)¶ train the model 1 step
-
train(self, dataset, total_batches=-1)¶ Update the model in 1 epoch
-
evaluate(self, dataset, epoch=0)¶ evaluate the model
-
static
-
class
athena.DecoderSolver(model, config=None, lm_model=None)¶ Bases:
athena.solver.BaseSolver
DecoderSolver
-
default_config¶
-
decode(self, dataset, rank_size=1)¶ decode the model
-
-
class
athena.SynthesisSolver(model, data_descriptions=None, config=None)¶ Bases:
athena.solver.BaseSolver
SynthesisSolver
-
default_config¶
-
synthesize(self, dataset)¶ synthesize using vocoder on dataset
-
-
class
athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')¶ Bases:
tensorflow.keras.losses.Loss
CTC loss, implemented with TensorFlow.
-
__call__(self, logits, samples, logit_length=None)¶
-
-
class
athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)¶ Bases:
tensorflow.keras.losses.CategoricalCrossentropy
Seq2SeqSparseCategoricalCrossentropy loss: CategoricalCrossentropy calculated at each character for each sequence in a batch.
-
__call__(self, logits, samples, logit_length=None)¶
-
-
class
athena.CTCAccuracy(name='CTCAccuracy')¶ Bases:
athena.metrics.CharactorAccuracy
CTCAccuracy inherits CharactorAccuracy and implements the CTC accuracy calculation.
-
__call__(self, logits, samples, logit_length=None)¶ Accumulate errors and counts, logit_length is the output length of encoder
-
-
class
athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')¶ Bases:
athena.metrics.CharactorAccuracy
Seq2SeqSparseCategoricalAccuracy inherits CharactorAccuracy and implements the attention accuracy calculation.
-
__call__(self, logits, samples, logit_length=None)¶ Accumulate errors and counts
-
-
class
athena.Checkpoint(checkpoint_directory=None, model=None, **kwargs)¶ Bases:
tensorflow.train.Checkpoint
A wrapper for TensorFlow checkpoint.
Parameters: - checkpoint_directory – the directory for checkpoint
- summary_directory – the directory for summary used in TensorBoard
- __init__ – provide the optimizer and model
- __call__ – save the model
Example
    transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
    optimizer = tf.keras.optimizers.Adam()
    ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
                      transformer=transformer, optimizer=optimizer)
    solver = BaseSolver(transformer)
    for epoch in dataset:
        ckpt()
-
_compare_and_save_best(self, loss, metrics, save_path)¶ compare and save the best model with best_loss and N best metrics
-
compute_nbest_avg(self, model_avg_num)¶ restore n-best avg checkpoint
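Averaging the n best checkpoints means taking the element-wise mean of each variable across the selected checkpoints. A minimal sketch with plain dictionaries follows; the function name `average_checkpoints` and the dict-of-lists representation are hypothetical stand-ins for real checkpoint files, and how compute_nbest_avg selects and loads its checkpoints is not shown on this page.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of several checkpoints.

    checkpoints: list of {variable_name: list of floats}; all entries are
    assumed to share the same variable names and shapes.
    """
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged


best = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
avg = average_checkpoints(best)
print(avg)
```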
-
__call__(self, loss=None, metrics=None)¶
-
restore_from_best(self)¶ restore from the best model
-
class
athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)¶ Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
WarmUp learning rate schedule for Adam.
Used as:
    optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule(512),
                                         beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Args:
    model_dim – the model dimension, which relates to the total number of model parameters
    warmup_steps – the number of steps at which the learning rate reaches its peak
Returns: the learning rate
Idea from the paper: Attention Is All You Need
-
__call__(self, step)¶
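The schedule from “Attention Is All You Need” raises the learning rate linearly for warmup_steps steps and then decays it with the inverse square root of the step. The sketch below evaluates that formula in plain Python, under the assumption that WarmUpLearningSchedule follows the paper and scales the result by k; the function name `warmup_lr` is hypothetical, and decay_steps and decay_rate are left out because this page does not say how they enter.

```python
def warmup_lr(step, model_dim=512, warmup_steps=4000, k=1.0):
    """Learning rate after `step` updates, following the paper's formula:
    k * model_dim^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # guard against 0 ** -0.5
    return k * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for s in (1000, 4000, 16000):
    print(s, warmup_lr(s))
```

The two arguments of min() are equal at step == warmup_steps, which is where the learning rate peaks.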
-
class
athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)¶ Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
-
default_config¶
-
-
class
athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)¶ Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
ExponentialDecayLearningRateSchedule
Used as:
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=ExponentialDecayLearningRateSchedule(0.01, 100))
Args:
    initial_lr, decay_steps
Returns: initial_lr * (0.5 ** (step // decay_steps))
-
__call__(self, step)¶
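The documented return value, initial_lr * (0.5 ** (step // decay_steps)), is a staircase decay. The sketch below evaluates it in plain Python using the constructor defaults from the signature. How start_decay_steps and final_lr take part is an assumption here (decay starts only after start_decay_steps, and the rate never drops below final_lr), as is the function name `exponential_decay_lr`.

```python
def exponential_decay_lr(step, initial_lr=0.005, decay_steps=10000,
                         decay_rate=0.5, start_decay_steps=30000,
                         final_lr=1e-5):
    """Staircase exponential decay with a delayed start and a floor."""
    if step < start_decay_steps:
        return initial_lr  # assumption: no decay before start_decay_steps
    exponent = (step - start_decay_steps) // decay_steps
    # Assumption: final_lr acts as a lower bound.
    return max(initial_lr * decay_rate ** exponent, final_lr)


for s in (0, 30000, 40000, 50000, 10 ** 7):
    print(s, exponential_decay_lr(s))
```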
-
class
athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)¶ Bases:
tensorflow.keras.optimizers.Adam
ExponentialDecayAdam Implementation
-
default_config¶
-
-
class
athena.HParams(model_structure=None, **kwargs)¶ Bases:
object
Class to hold a set of hyperparameters as name-value pairs.
A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.
You first create a HParams object by specifying the names and values of the hyperparameters.
To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:
```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate     # ==> 0.1
hparams.num_hidden_units  # ==> 100
```
Hyperparameters have a type, which is inferred from the type of the value passed at construction time. The currently supported types are: integer, float, boolean, string, and list of integer, float, boolean, or string.
You can override hyperparameter values by calling the [parse()](#HParams.parse) method, passing a string of comma separated name=value pairs. This is intended to make it possible to override any hyperparameter values from a single command-line flag to which the user passes ‘hyper-param=value’ pairs. It avoids having to define one flag for each hyperparameter.
The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.
Example:
```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse
parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = HParams(learning_rate=0.1, num_hidden_units=100,
                      activations=['relu', 'tanh'])

    # Override hyperparameters values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed --hparams=learning_rate=0.3 on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate     # ==> 0.3
    hparams.num_hidden_units  # ==> 100
    hparams.activations       # ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```
-
_HAS_DYNAMIC_ATTRIBUTES= True¶
-
add_hparam(self, name, value)¶ Adds {name, value} pair to hyperparameters.
Parameters: - name – Name of the hyperparameter.
- value – Value of the hyperparameter. Can be one of the following types: int, float, string, int list, float list, or string list.
Raises: ValueError – if one of the arguments is invalid.
-
set_hparam(self, name, value)¶ Set the value of an existing hyperparameter.
This function verifies that the type of the value matches the type of the existing hyperparameter.
Parameters: - name – Name of the hyperparameter.
- value – New value of the hyperparameter.
Raises: - KeyError – If the hyperparameter doesn’t exist.
- ValueError – If there is a type mismatch.
-
del_hparam(self, name)¶ Removes the hyperparameter with key ‘name’.
Does nothing if it isn’t present.
Parameters: name – Name of the hyperparameter.
-
parse(self, values, ignore_unknown=False)¶ Override existing hyperparameter values, parsing new values from a string.
See parse_values for more detail on the allowed format for values.
Parameters: values – String. Comma separated list of name=value pairs where ‘value’ must follow the syntax described above.
Returns: The HParams instance.
Raises: ValueError – If values cannot be parsed or a hyperparameter in values doesn’t exist.
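The override behaviour of parse() can be approximated in a few lines of plain Python: split the string on commas, look up each name among the existing hyperparameters, and convert the text to the type of the current value. This is a simplified sketch, not the HParams parser; the function name `parse_overrides` is hypothetical, and list-valued hyperparameters are deliberately not handled.

```python
def parse_overrides(defaults, text):
    """Return a copy of `defaults` updated from 'name=value,name=value'.

    Each value is converted to the type of the existing default.
    Raises ValueError for unknown names or values that cannot be converted.
    """
    result = dict(defaults)
    for pair in filter(None, text.split(",")):
        name, _, raw = pair.partition("=")
        name, raw = name.strip(), raw.strip()
        if name not in result:
            raise ValueError(f"Unknown hyperparameter: {name}")
        kind = type(result[name])
        if kind is bool:
            # bool("false") would be True, so compare the text instead.
            result[name] = raw.lower() in ("true", "1")
        else:
            result[name] = kind(raw)
    return result


defaults = {"learning_rate": 0.1, "num_hidden_units": 100, "use_bias": True}
overridden = parse_overrides(defaults, "learning_rate=0.3,use_bias=false")
print(overridden)
```

Inferring the target type from the existing value is what lets a single string flag carry overrides for integers, floats, booleans and strings alike.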
-
override_from_dict(self, values_dict)¶ Override existing hyperparameter values, parsing new values from a dictionary.
Parameters: values_dict – Dictionary of name:value pairs.
Returns: The HParams instance.
Raises: - KeyError – If a hyperparameter in values_dict doesn’t exist.
- ValueError – If values_dict cannot be parsed.
-
set_model_structure(self, model_structure)¶
-
get_model_structure(self)¶
-
to_json(self, indent=None, separators=None, sort_keys=False)¶ Serializes the hyperparameters into JSON.
Parameters: - indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.
- separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').
- sort_keys – If True, the output dictionaries will be sorted by key.
Returns: A JSON string.
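The three parameters have the same names and descriptions as those of the standard library's json.dumps. Assuming to_json forwards them unchanged (the names suggest this, but this page does not state it), their effect can be previewed on an ordinary dictionary:

```python
import json

# A plain dict standing in for the hyperparameter values.
values = {"learning_rate": 0.1, "activations": ["relu", "tanh"]}

# No whitespace after separators, keys in sorted order.
compact = json.dumps(values, separators=(",", ":"), sort_keys=True)

# Two-space pretty printing.
pretty = json.dumps(values, indent=2, sort_keys=True)

# indent=0 inserts newlines but no leading spaces.
newlines_only = json.dumps(values, indent=0, sort_keys=True)

print(compact)
print(pretty)
```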
-
parse_json(self, values_json)¶ Override existing hyperparameter values, parsing new values from a json object.
Parameters: values_json – String containing a json object of name:value pairs.
Returns: The HParams instance.
Raises: - KeyError – If a hyperparameter in values_json doesn’t exist.
- ValueError – If values_json cannot be parsed.
-
values(self)¶ Return the hyperparameter values as a Python dictionary.
Returns: A dictionary with hyperparameter names as keys. The values are the hyperparameter values.
-
get(self, key, default=None)¶ Returns the value of key if it exists, else default.
-
__contains__(self, key)¶
-
__str__(self)¶ Return str(self).
-
__repr__(self)¶ Return repr(self).
-
static
_get_kind_name(param_type, is_list)¶ Returns the field name given parameter type and is_list.
Parameters: - param_type – Data type of the hparam.
- is_list – Whether this is a list.
Returns: A string representation of the field name.
Raises: ValueError– If parameter type is not recognized.
-
instantiate(self)¶
-
append(self, hp)¶
-
-
athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)¶ register default config and parse
-
athena.generate_square_subsequent_mask(size)¶ Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
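A small plain-Python sketch shows what such a mask looks like. It assumes the usual look-ahead convention, in which row i is the query position and every later position j > i is masked; the function name `square_subsequent_mask` is a stand-in, and the real function returns a tensor rather than nested lists.

```python
def square_subsequent_mask(size):
    """Row i is the query position; 1.0 masks every later position j > i."""
    return [[1.0 if j > i else 0.0 for j in range(size)] for i in range(size)]


look_ahead = square_subsequent_mask(3)
for row in look_ahead:
    print(row)
```

Each position may attend to itself and to earlier positions only, which is what makes decoding autoregressive.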
-
athena.get_wave_file_length(wave_file)¶ get the wave file length(duration) in ms
Parameters: wave_file – the path of wave file Returns: the length(ms) of the wave file
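The same quantity can be computed with the standard library's wave module: the number of frames divided by the sample rate. The helper name `wave_length_ms` is hypothetical, the sketch only handles PCM WAV files, and the library's own implementation may read the file differently. The demo writes half a second of silence to a temporary file so that the block runs without any audio data.

```python
import os
import tempfile
import wave


def wave_length_ms(path):
    """Duration of a PCM WAV file in milliseconds."""
    with wave.open(path, "rb") as wav:
        return int(1000 * wav.getnframes() / wav.getframerate())


# Demo: 8000 frames of 16-bit mono silence at 16 kHz.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 8000)

duration_ms = wave_length_ms(path)
print(duration_ms)
```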
-
athena.set_default_summary_writer(summary_directory=None)¶
-
class
athena.BeamSearchDecoder(num_class, sos, eos, beam_size)¶ Beam search decoding used in the seq2seq decoder layer. This layer is used for evaluation.
-
static
build_decoder(hparams, num_class, sos, eos, decoder_one_step, lm_model=None)¶ Allocate the time propagating function of the decoder, initialize the decoder
Parameters: - hparams – the decoding configs are included here
- num_class – the size of the vocab
- sos – the start symbol index
- eos – the end symbol index
- decoder_one_step – the time propagating function of the decoder
- lm_model – the initialized language model
Returns: the initialized beam search decoder
Return type: beam_search_decoder
-
set_lm_model(self, lm_model)¶ Set the lm_model.
Parameters: lm_model – lm_model
-
set_ctc_scorer(self, ctc_scorer)¶ Set the ctc_scorer.
Parameters: ctc_scorer – the ctc scorer
-
beam_search_score(self, candidate_holder, encoder_outputs)¶ Call the time propagating function, fetch the acoustic score at the current step
If needed, call the auxiliary scorer and update cand_states in candidate_holder
Parameters: - candidate_holder – its cand_seqs and cand_logits are needed by the transformer decoder to calculate the output. type: CandidateHolder
- encoder_outputs – the encoder outputs from the transformer encoder. type: tuple, (encoder_outputs, input_mask)
-
deal_with_completed(self, completed_scores, completed_seqs, completed_length, new_scores, candidate_holder, max_seq_len)¶
- Add the newly completed seqs with their scores to the completed seqs
- Select the top beam_size probable completed seqs with their corresponding scores
Parameters: - completed_scores – the scores of completed_seqs
- completed_seqs – historical top beam_size probable completed seqs
- completed_length – the length of completed_seqs
- new_scores – the current time step scores
- candidate_holder –
- max_seq_len – the maximum acceptable output length
Returns: - new_completed_scores: new top probable scores
- completed_seqs: new top probable completed seqs
- completed_length: new top probable seq length
-
deal_with_uncompleted(self, new_scores, new_cand_logits, new_states, candidate_holder)¶ - select top probable candidate seqs from new predictions with its scores
- update candidate_holder based on top probable candidates
Parameters: - new_scores – the current time step prediction scores
- new_cand_logits – historical prediction scores
- new_states – updated states
- candidate_holder –
Returns: candidate_holder, in which cand_seqs, cand_logits, cand_states, cand_scores and cand_parents are updated and sent to the next time step
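The selection step described for deal_with_uncompleted, keeping only the most probable candidates after each expansion, is the core of beam search. The sketch below shows one such step in plain Python. The function name `expand_beam` and the data layout (a list of (score, tokens) pairs plus one row of log probabilities per candidate) are assumptions for illustration; the decoder itself works on tensors held in a CandidateHolder.

```python
import heapq


def expand_beam(candidates, step_log_probs, beam_size):
    """Extend every candidate by every token and keep the best `beam_size`.

    candidates: list of (score, tokens) pairs.
    step_log_probs[i][v]: log probability of token v following candidate i.
    """
    expanded = []
    for (score, tokens), log_probs in zip(candidates, step_log_probs):
        for token, log_prob in enumerate(log_probs):
            expanded.append((score + log_prob, tokens + [token]))
    # Highest accumulated log probability first.
    return heapq.nlargest(beam_size, expanded, key=lambda item: item[0])


candidates = [(-0.1, [0]), (-2.0, [1])]
step_log_probs = [[-0.5, -1.0], [-0.1, -3.0]]
beam = expand_beam(candidates, step_log_probs, beam_size=2)
print([tokens for _, tokens in beam])
```

Here both survivors descend from the first candidate, which shows how a weak hypothesis can be dropped from the beam entirely.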
-
__call__(self, cand_seqs, cand_states, init_states, encoder_outputs)¶
Parameters: - cand_seqs – TensorArray list, element shape: [beam]
- cand_states – [history_predictions]
- init_states – state list
- encoder_outputs – (encoder_outputs, memory_mask, …)
Returns: completed_seqs, the sequence with the highest score
-
static