athena.transform.feats.spectrum

This module extracts spectrum features per frame.

Module Contents

Classes

Spectrum: Compute spectrum features of every frame in speech, return a float tensor with size (num_frames, num_frequencies).
class athena.transform.feats.spectrum.Spectrum(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute spectrum features of every frame in speech, return a float tensor with size (num_frames, num_frequencies).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains the following optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the data. (int, default = 1)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
    is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = false)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 2)
    dither: Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]

Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
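
The documented defaults can be collected in a plain dict to see how overrides might be layered on top of them. This is a minimal sketch that does not import athena: merge_config and seconds_to_samples are hypothetical helpers written for illustration, and the 16000 Hz sample rate is the default stated for call(). Only the parameter names and default values come from the list above.

```python
# Documented defaults for Spectrum.params (values copied from the parameter list).
DEFAULTS = {
    "window_length": 0.025,
    "frame_length": 0.010,
    "snip_edges": 1,
    "raw_energy": 1,
    "preEph_coeff": 0.97,
    "window_type": "povey",
    "remove_dc_offset": True,
    "is_fbank": False,
    "output_type": 2,
    "dither": 1.0,
}


def merge_config(overrides=None):
    """Hypothetical helper: overlay user overrides on the documented defaults."""
    merged = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            # Reject misspelled names instead of silently ignoring them.
            raise KeyError(f"unknown Spectrum parameter: {key}")
        merged[key] = value
    return merged


def seconds_to_samples(seconds, sample_rate=16000):
    """Hypothetical helper: convert a duration in seconds to whole samples."""
    return int(round(seconds * sample_rate))


# Ask for a plain power spectrum with a Hann window; keep everything else default.
config = merge_config({"output_type": 1, "window_type": "hann"})
window_samples = seconds_to_samples(config["window_length"])
hop_samples = seconds_to_samples(config["frame_length"])
print(window_samples, hop_samples)
```

With athena installed, a dict like config would presumably be passed to the constructor, since the class signature is Spectrum(config: dict); that step is left out here because it is not exercised by this sketch.
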
call(self, audio_data, sample_rate=None)

Calculate the power spectrum or log power spectrum of audio data.

Parameters:
  • audio_data – the audio signal from which to compute the spectrum. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed; default is 16 kHz.
Returns:

spectrum: A float tensor of size (num_frames, num_frequencies) containing power spectrum (output_type=1) or log power spectrum (output_type=2) of every frame in speech.
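
To make the output shape and the output_type values concrete, here is a rough pure-Python sketch. It is not athena's implementation: it assumes Kaldi-style framing for snip_edges=1 (the trailing partial frame is dropped), uses a naive DFT with no preemphasis, windowing, dither, or FFT padding, and applies an arbitrary 1e-12 floor before the logarithm. The function names are hypothetical.

```python
import cmath
import math


def num_frames(num_samples, window_samples, hop_samples):
    """Frame count when the trailing partial frame is dropped (assumed snip_edges=1)."""
    if num_samples < window_samples:
        return 0
    return 1 + (num_samples - window_samples) // hop_samples


def frame_spectrum(frame, output_type=2):
    """Naive DFT of one frame, keeping the n // 2 + 1 non-negative frequency bins."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        power = abs(coeff) ** 2
        if output_type == 1:
            bins.append(power)  # power spectrum
        elif output_type == 2:
            # Log-power spectrum; the 1e-12 floor is an assumption to avoid log(0).
            bins.append(math.log(max(power, 1e-12)))
        elif output_type == 3:
            bins.append(abs(coeff))  # magnitude spectrum
        else:
            raise ValueError(f"unsupported output_type: {output_type}")
    return bins


# One second of 16 kHz audio with a 400-sample window and a 160-sample hop.
frames = num_frames(16000, 400, 160)

# A unit impulse has a flat spectrum, which makes the three output types easy to compare.
impulse = [1.0, 0.0, 0.0, 0.0]
power_bins = frame_spectrum(impulse, output_type=1)
log_bins = frame_spectrum(impulse, output_type=2)
magnitude_bins = frame_spectrum(impulse, output_type=3)
print(frames, power_bins, log_bins, magnitude_bins)
```

The real feature extractor may pad each frame to a longer FFT size, in which case num_frequencies is derived from that FFT size rather than from the window length used in this sketch.
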

dim(self)

dim