athena.layers.attention

Attention layers.

Module Contents

Classes

ScaledDotProductAttention – Calculate the attention weights.
MultiHeadAttention – Multi-head attention.
BahdanauAttention – Bahdanau attention.
HanAttention – Refer to [Hierarchical Attention Networks for Document Classification].
MatchAttention – Refer to [Learning Natural Language Inference with LSTM].
LocationAttention – Location-aware attention.
StepwiseMonotonicAttention – Stepwise monotonic attention.

class athena.layers.attention.ScaledDotProductAttention(unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Calculate the attention weights.

q, k, v must have matching leading dimensions.

k and v must have the same penultimate dimension, i.e. seq_len_k = seq_len_v. The mask has a different shape depending on its type (padding or look-ahead), but it must be broadcastable for addition.

Parameters:
  • q – query shape == (…, seq_len_q, depth)
  • k – key shape == (…, seq_len_k, depth)
  • v – value shape == (…, seq_len_v, depth_v)
  • mask – Float tensor with shape broadcastable to (…, seq_len_q, seq_len_k). Defaults to None.
Returns:

output, attention_weights

call(self, q, k, v, mask)

Compute the attention output and the attention weights from q, k, v, and an optional mask.
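
The shape rules above can be illustrated with a minimal NumPy sketch of scaled dot-product attention. It is not the Athena implementation. The function name and the mask convention (1.0 marks a position to hide, and the mask is scaled by -1e9 before being added to the logits) are assumptions chosen to match the "broadcastable for addition" requirement.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Hypothetical NumPy sketch; mask uses 1.0 for positions to hide (assumption)."""
    depth = q.shape[-1]
    # Similarity of every query with every key: (..., seq_len_q, seq_len_k)
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth)
    if mask is not None:
        # Hidden positions receive a very large negative logit.
        logits = logits + mask * -1e9
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of values: (..., seq_len_q, depth_v)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 8))   # (batch, seq_len_q, depth)
k = rng.normal(size=(2, 5, 8))   # (batch, seq_len_k, depth)
v = rng.normal(size=(2, 5, 4))   # (batch, seq_len_v, depth_v)

# Padding mask that hides the last key position, broadcastable to (2, 3, 5).
padding_mask = np.zeros((2, 1, 5))
padding_mask[:, :, -1] = 1.0

output, attention_weights = scaled_dot_product_attention(q, k, v, padding_mask)
```

Here output has shape (2, 3, 4) and attention_weights has shape (2, 3, 5), with zero weight on the masked key position.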

class athena.layers.attention.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Multi-head attention

Multi-head attention consists of four parts:

  • Linear layers and split into heads.
  • Scaled dot-product attention.
  • Concatenation of heads.
  • Final linear layer.

Each multi-head attention block gets three inputs: Q (query), K (key), and V (value). These are put through linear (Dense) layers and split into multiple heads. The ScaledDotProductAttention defined above is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output of each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.

Q, K, and V are split into multiple heads, rather than kept as one single attention head, because this allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as that of single-head attention with full dimensionality.

Parameters:
  • d_model (int) – dimensionality of the model; the last dimension is split across num_heads heads.
  • num_heads (int) – number of attention heads.
  • unidirectional (bool) – defaults to False.
  • look_ahead (int) – defaults to 0.

split_heads(self, x, batch_size)

Split the last dimension into (num_heads, depth).

Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
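
The reshape and transpose described above can be sketched in NumPy as follows. This is an illustration only: num_heads is passed explicitly here (the layer presumably stores it), and merge_heads is a hypothetical helper that shows the inverse step used when heads are concatenated.

```python
import numpy as np

def split_heads(x, batch_size, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    depth = x.shape[-1] // num_heads
    x = x.reshape(batch_size, -1, num_heads, depth)
    return x.transpose(0, 2, 1, 3)

def merge_heads(x, batch_size):
    """Hypothetical inverse: (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(batch_size, x.shape[1], -1)

x = np.arange(2 * 5 * 8, dtype=np.float32).reshape(2, 5, 8)  # d_model = 8
heads = split_heads(x, batch_size=2, num_heads=4)             # depth = 2
restored = merge_heads(heads, batch_size=2)
```

Head 1 of the first position holds features 2 and 3 of the original vector, and merging the heads recovers x exactly.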

call(self, v, k, q, mask)

Apply multi-head attention to v, k, and q using the given mask. Note the argument order: v, k, q.
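
The four parts listed in the class description can be put together in a short NumPy sketch. It is a simplified illustration, not the Athena layer: the Dense layers are replaced by plain weight matrices without biases, the weights are random, and the mask convention (1.0 hides a position) is an assumption.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(v, k, q, mask, weights, num_heads):
    """Hypothetical sketch; weights = (wq, wk, wv, wo), each (d_model, d_model)."""
    wq, wk, wv, wo = weights
    batch_size = q.shape[0]

    def split(x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = x.reshape(batch_size, x.shape[1], num_heads, -1)
        return x.transpose(0, 2, 1, 3)

    # 1. Linear projections, then split into heads.
    qh, kh, vh = split(q @ wq), split(k @ wk), split(v @ wv)
    # 2. Scaled dot-product attention, applied to all heads at once.
    depth = qh.shape[-1]
    logits = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(depth)
    if mask is not None:
        logits = logits + mask * -1e9
    attn = softmax(logits)                      # (batch, heads, seq_q, seq_k)
    heads = attn @ vh                           # (batch, heads, seq_q, depth)
    # 3. Concatenate the heads back into d_model features.
    concat = heads.transpose(0, 2, 1, 3).reshape(batch_size, q.shape[1], -1)
    # 4. Final linear projection.
    return concat @ wo, attn

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)]
x = rng.normal(size=(2, 6, d_model))

# Look-ahead mask: 1.0 above the diagonal hides future positions.
look_ahead_mask = np.triu(np.ones((6, 6)), k=1)
out, attn = multi_head_attention(x, x, x, look_ahead_mask, weights, num_heads)
```

Each head works on depth = d_model / num_heads = 4 features, so the cost is comparable to one head over all 16 features.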

class athena.layers.attention.BahdanauAttention(units, input_dim=1024)

Bases: tensorflow.keras.Model

Bahdanau attention.

call(self, query, values)

Compute attention over values for the given query.
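
For orientation, the usual additive (Bahdanau) formulation can be sketched in NumPy as below. This is the textbook form, score = v^T tanh(W1 query + W2 values), and not necessarily what this class does internally. The weight names w1, w2, v, the small dimensions, and the returned pair (context_vector, attention_weights) are all assumptions.

```python
import numpy as np

def bahdanau_attention(query, values, w1, w2, v):
    """Hypothetical sketch of additive attention.

    query:  (batch, hidden)
    values: (batch, steps, input_dim)
    """
    # Add a time axis so the query broadcasts over all steps.
    query_with_time_axis = query[:, None, :]                   # (batch, 1, hidden)
    # One scalar score per step: (batch, steps, 1)
    score = np.tanh(query_with_time_axis @ w1 + values @ w2) @ v
    # Softmax over the step axis.
    score = score - score.max(axis=1, keepdims=True)
    attention_weights = np.exp(score)
    attention_weights = attention_weights / attention_weights.sum(axis=1, keepdims=True)
    # Weighted sum of the values: (batch, input_dim)
    context_vector = (attention_weights * values).sum(axis=1)
    return context_vector, attention_weights

rng = np.random.default_rng(0)
units, hidden, input_dim = 10, 6, 8
w1 = rng.normal(size=(hidden, units))
w2 = rng.normal(size=(input_dim, units))
v = rng.normal(size=(units, 1))

query = rng.normal(size=(2, hidden))
values = rng.normal(size=(2, 5, input_dim))
context_vector, attention_weights = bahdanau_attention(query, values, w1, w2, v)
```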

class athena.layers.attention.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).

Wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE).

Input shape: (Batch size, steps, features)
Output shape: (Batch size, features)

build(self, input_shape)

Keras build method.

call(self, inputs, training=None, mask=None)

Keras call method.
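
The attention pooling from the referenced paper, which maps (Batch size, steps, features) to (Batch size, features), can be sketched in NumPy as follows. The names W, b, and u mirror the constructor's regularizer arguments, but their shapes and the exact computation inside this layer are assumptions. The sketch also returns the weights alpha for inspection.

```python
import numpy as np

def han_attention(inputs, W, b, u):
    """Hypothetical sketch of HAN attention pooling.

    inputs: (batch, steps, features)
    W: (features, features), b: (features,), u: (features,)
    """
    hidden = np.tanh(inputs @ W + b)                  # (batch, steps, features)
    logits = hidden @ u                               # (batch, steps)
    # Softmax over the steps gives one weight per step.
    logits = logits - logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    # Weighted sum over steps collapses the step axis.
    output = (alpha[..., None] * inputs).sum(axis=1)  # (batch, features)
    return output, alpha

rng = np.random.default_rng(0)
features = 6
inputs = rng.normal(size=(3, 4, features))
W = rng.normal(size=(features, features))
b = np.zeros(features)
u = rng.normal(size=features)
pooled, alpha = han_attention(inputs, W, b, u)
```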

compute_output_shape(self, input_shape)

Compute the output shape.

_masked_softmax(self, logits, mask, axis)

Compute softmax with input mask.
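
One common way to compute a softmax with an input mask is sketched below in NumPy. The mask convention (1 keeps a position, 0 removes it) and the small epsilon in the denominator are assumptions, not details taken from this method.

```python
import numpy as np

def masked_softmax(logits, mask, axis):
    """Hypothetical sketch: softmax over `axis`, ignoring positions where mask == 0."""
    mask = mask.astype(logits.dtype)
    # Subtract the maximum for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    # Zero out masked positions before normalizing.
    exps = np.exp(shifted) * mask
    return exps / (exps.sum(axis=axis, keepdims=True) + 1e-10)

logits = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
probs = masked_softmax(logits, mask, axis=1)
```

In the first row the third position is masked, so the probability mass is shared between the first two positions only.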

class athena.layers.attention.MatchAttention(config, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170).

Wrap with tf.variable_scope(name, reuse=tf.AUTO_REUSE).

Input shape: (Batch size, steps, features)
Output shape: (Batch size, steps, features)

call(self, tensors)

Attention layer.

class athena.layers.attention.LocationAttention(attn_dim, conv_channel, aconv_filts, scaling=1.0)

Bases: tensorflow.keras.layers.Layer

Location-aware attention.

Reference: Attention-Based Models for Speech Recognition (https://arxiv.org/pdf/1506.07503.pdf)

compute_score(self, value, value_length, query, accum_attn_weight)
Parameters:
  • value – shape: [batch, x_steps, eunits]
  • value_length – the length of value, shape: [batch]
  • query – previous RNN state, shape: [batch, dunits]
  • accum_attn_weight – previous accumulated attention weights, shape: [batch, x_steps]
Returns:

the attention scores

initialize_weights(self, value_length, max_len)
Parameters:
  • value_length – the length of value, shape: [batch]
  • max_len – the maximum length
Returns:

initialized_weights – attention weights initialized to uniform distributions, shape: [batch, max_len]
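
A uniform initialization of this shape can be sketched in NumPy as below. The sketch assumes that "uniform" means uniform over the valid (unpadded) positions of each sequence, with zero weight on padding. The source does not spell this out, and the function name is hypothetical.

```python
import numpy as np

def initialize_uniform_weights(value_length, max_len):
    """Hypothetical sketch: weight 1 / length on valid steps, 0 on padding."""
    positions = np.arange(max_len)[None, :]                    # (1, max_len)
    valid = (positions < value_length[:, None]).astype(np.float32)
    return valid / value_length[:, None].astype(np.float32)   # (batch, max_len)

uniform_weights = initialize_uniform_weights(np.array([2, 4]), max_len=4)
# uniform_weights is [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
```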

call(self, attn_inputs, prev_states, training=True)
Parameters:
  • attn_inputs (tuple) – contains 2 items:
      value, shape: [batch, x_steps, eunits]
      value_length, shape: [batch]
  • prev_states (tuple) – contains 3 items:
      query: previous RNN state, shape: [batch, dunits]
      accum_attn_weight: previous accumulated attention weights, shape: [batch, x_steps]
      prev_attn_weight: previous attention weights, shape: [batch, x_steps]
  • training – whether the layer is in the training step
Returns:

attn_c – attended vector, shape: [batch, eunits]
attn_weight – attention scores, shape: [batch, x_steps]
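
One decoding step of location-aware attention, in the spirit of the referenced paper, can be sketched in NumPy as follows. It is a rough illustration with the shapes documented above, not the Athena code. The parameter dictionary (conv, W, V, U, w, b), the filter width, the use of the accumulated weights as the convolution input, and the way scaling multiplies the scores before the softmax are all assumptions.

```python
import numpy as np

def location_attention_step(value, value_length, query, accum_attn_weight,
                            params, scaling=1.0):
    """Hypothetical sketch of a single location-aware attention step."""
    x_steps = value.shape[1]
    filters = params["conv"]                  # (width, conv_channel), width is odd
    width = filters.shape[0]
    pad = width // 2
    # 1-D convolution over the accumulated weights, written as sliding windows.
    padded = np.pad(accum_attn_weight, ((0, 0), (pad, pad)))
    windows = np.stack([padded[:, i:i + x_steps] for i in range(width)], axis=-1)
    location = windows @ filters              # (batch, x_steps, conv_channel)
    # Additive energy combining encoder output, decoder state, and location features.
    energy = np.tanh(
        value @ params["V"]                   # (batch, x_steps, attn_dim)
        + (query @ params["W"])[:, None, :]   # (batch, 1, attn_dim)
        + location @ params["U"]              # (batch, x_steps, attn_dim)
        + params["b"]
    )
    score = energy @ params["w"]              # (batch, x_steps)
    # Padded steps get a very low score so they receive no weight.
    valid = np.arange(x_steps)[None, :] < value_length[:, None]
    score = np.where(valid, scaling * score, -1e9)
    score = score - score.max(axis=1, keepdims=True)
    attn_weight = np.exp(score)
    attn_weight = attn_weight / attn_weight.sum(axis=1, keepdims=True)
    attn_c = (attn_weight[..., None] * value).sum(axis=1)   # (batch, eunits)
    return attn_c, attn_weight

rng = np.random.default_rng(0)
batch, x_steps, eunits, dunits = 2, 5, 6, 4
attn_dim, conv_channel, width = 8, 3, 3
params = {
    "conv": rng.normal(size=(width, conv_channel)),
    "W": rng.normal(size=(dunits, attn_dim)),
    "V": rng.normal(size=(eunits, attn_dim)),
    "U": rng.normal(size=(conv_channel, attn_dim)),
    "w": rng.normal(size=attn_dim),
    "b": np.zeros(attn_dim),
}
value = rng.normal(size=(batch, x_steps, eunits))
value_length = np.array([5, 3])
query = rng.normal(size=(batch, dunits))
accum_attn_weight = np.full((batch, x_steps), 1.0 / x_steps)

attn_c, attn_weight = location_attention_step(
    value, value_length, query, accum_attn_weight, params)
```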

class athena.layers.attention.StepwiseMonotonicAttention(attn_dim, conv_channel, aconv_filts, sigmoid_noise=2.0, score_bias_init=0.0, mode='soft')

Bases: athena.layers.attention.LocationAttention

Stepwise monotonic attention.

Reference: Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS (https://arxiv.org/pdf/1906.00672.pdf)

build(self, _)

A modified energy function is used, and its parameters are defined here. Reference: Online and Linear-Time Attention by Enforcing Monotonic Alignments (https://arxiv.org/pdf/1704.00784.pdf).

initialize_weights(self, value_length, max_len)
Parameters:
  • value_length – the length of value, shape: [batch]
  • max_len – the maximum length
Returns:

initialized_weights – attention weights initialized to Dirac distributions, shape: [batch, max_len]

Examples

An initialized_weights of shape [2, 4]:

[[1, 0, 0, 0],
 [1, 0, 0, 0]]
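
The example above can be reproduced with a small NumPy sketch that puts all of the weight on the first step of every sequence. The function name is hypothetical, and value_length is used only for the batch size here, which is an assumption about the behavior.

```python
import numpy as np

def initialize_dirac_weights(value_length, max_len):
    """Hypothetical sketch: one-hot weights focused on the first step."""
    batch = value_length.shape[0]
    weights = np.zeros((batch, max_len), dtype=np.float32)
    weights[:, 0] = 1.0
    return weights

dirac_weights = initialize_dirac_weights(np.array([3, 4]), max_len=4)
```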

step_monotonic_function(self, sigmoid_probs, prev_weights)

Hard mode can only be used in the synthesis step.

Parameters:
  • sigmoid_probs – sigmoid probabilities, shape: [batch, x_steps]
  • prev_weights – previous attention weights, shape: [batch, x_steps]
Returns:

weights – new attention weights, shape: [batch, x_steps]

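
The stepwise monotonic update can be sketched in NumPy as below. The soft branch follows the recurrence in the referenced paper: each step keeps its weight with probability p and passes the remainder to the next step. The hard branch, which thresholds the stay probability at 0.5 on a one-hot alignment, is an assumption about how hard mode behaves. The function name is hypothetical.

```python
import numpy as np

def step_monotonic(sigmoid_probs, prev_weights, mode="soft"):
    """Hypothetical sketch of one stepwise monotonic attention update."""
    if mode == "soft":
        stay = prev_weights * sigmoid_probs
        # Mass leaving step j-1 arrives at step j; mass leaving the last step is dropped.
        move = np.pad(prev_weights * (1.0 - sigmoid_probs), ((0, 0), (1, 0)))[:, :-1]
        return stay + move
    # Hard mode: prev_weights is one-hot, so either stay or move exactly one step.
    shifted = np.pad(prev_weights, ((0, 0), (1, 0)))[:, :-1]
    stay_prob = (sigmoid_probs * prev_weights).sum(axis=1, keepdims=True)
    return np.where(stay_prob > 0.5, prev_weights, shifted)

prev_weights = np.array([[1.0, 0.0, 0.0, 0.0]])
sigmoid_probs = np.array([[0.25, 0.5, 0.5, 0.5]])

soft_weights = step_monotonic(sigmoid_probs, prev_weights, mode="soft")
hard_weights = step_monotonic(sigmoid_probs, prev_weights, mode="hard")
# soft_weights is [[0.25, 0.75, 0.0, 0.0]]; hard_weights is [[0.0, 1.0, 0.0, 0.0]]
```

In both modes the alignment can only stay where it is or advance by one step, which is what makes it stepwise monotonic.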
call(self, attn_inputs, prev_states, training=True)
Parameters:
  • attn_inputs (tuple) – contains 2 items:
      value, shape: [batch, x_steps, eunits]
      value_length, shape: [batch]
  • prev_states (tuple) – contains 3 items:
      query: previous RNN state, shape: [batch, dunits]
      accum_attn_weight: previous accumulated attention weights, shape: [batch, x_steps]
      prev_attn_weight: previous attention weights, shape: [batch, x_steps]
  • training – whether the layer is in the training step
Returns:

attn_c – attended vector, shape: [batch, eunits]
attn_weight – attention scores, shape: [batch, x_steps]