athena.transform.feats

Package Contents

Classes

ReadWav – Read audio samples from a wav file; return the sample data and sample rate.
Spectrum – Compute spectrum features of every frame in speech; return a float tensor with size (num_frames, num_frequencies).
MelSpectrum – Compute filter banks by applying triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands.
Framepow – Compute power of every frame in speech. Return a float tensor with shape (1 * num_frames).
Pitch – Compute pitch features of every frame in speech; return a float tensor with size (num_frames, 2).
Mfcc – Compute mfcc features of every frame in speech; return a float tensor with size (num_channels, num_frames, num_frequencies).
WriteWav – Encode audio data (input) using sample rate (input); return a write wav operation.
Fbank – Compute filter banks by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands.
CMVN – Do CMVN on features.
FbankPitch – Compute Fbank and Pitch features separately and concatenate them. Return a tensor with shape (num_frames, dim_features).

Functions

compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False) – Compute CMVN on a feature.

class athena.transform.feats.ReadWav(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Read audio sample from wav file, return sample data and sample rate.

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains one optional parameter: audio_channels (int, default = 1).
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, wavfile, speed=1.0)

Get audio data and sample rate from a wav file.

Parameters:
  • wavfile – file path of the wav file.
  • speed – speed of sample channels wanted. (float, default = 1.0)
Returns: Two values. The first is a tensor of audio data. The second is the sample rate of the input wav file, which is a tensor with float dtype.
class athena.transform.feats.Spectrum(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute spectrum features of every frame in speech, return a float tensor with size (num_frames, num_frequencies).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains the following optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
    is_fbank: If true, compute power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = false)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 2)
    dither: Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate=None)

Calculate the power spectrum or log power spectrum of audio data.

Parameters:
  • audio_data – the audio signal from which to compute the spectrum. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed; default is 16 kHz.
Returns:

spectrum: A float tensor of size (num_frames, num_frequencies) containing the power spectrum (output_type=1) or log power spectrum (output_type=2) of every frame in speech.

dim(self)

dim
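
The num_frames dimension of the returned tensor follows from window_length, frame_length and snip_edges. The sketch below shows one way to count frames, assuming Kaldi-style framing; the function name num_frames, the boolean form of snip_edges, and the rounding of window and hop sizes to whole samples are assumptions for illustration and are not taken from the athena source.

```python
def num_frames(num_samples, sample_rate=16000, window_length=0.025,
               frame_length=0.010, snip_edges=True):
    """Count analysis frames for a signal of num_samples samples.

    Assumes Kaldi-style framing (hypothetical helper, not the athena API).
    """
    win = int(round(window_length * sample_rate))  # samples per window
    hop = int(round(frame_length * sample_rate))   # samples per hop
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < win:
            return 0
        return 1 + (num_samples - win) // hop
    # Otherwise the signal is padded, so the count depends only on the hop.
    return (num_samples + hop // 2) // hop


if __name__ == "__main__":
    one_second = 16000
    print(num_frames(one_second))                    # frames with edge snipping
    print(num_frames(one_second, snip_edges=False))  # frames with padding
```

With the default 25 ms window and 10 ms hop, snipping drops the incomplete trailing frame, while padding yields roughly one frame per hop.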

class athena.transform.feats.MelSpectrum(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute filter banks by applying triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands. Return a float tensor with shape (num_frames, num_channels).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains thirteen optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If True, the last frame (shorter than window_length) will be cut off. If False, half of frame_length of data will be padded to the data. (bool, default = True)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.0)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "hann")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = false)
    is_fbank: If true, compute power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 3)
    upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
    lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
    filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
    dither: Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 0)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)

Calculate the log mel spectrum of audio data.

Parameters:
  • audio_data – the audio signal from which to compute the spectrum. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed; default is 16 kHz.
Returns: A float tensor of size (num_frames, num_channels) containing mel spectrum features of every frame in speech.
dim(self)

dim

num_channels(self)

number of channels
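
To illustrate what applying triangular filters on a Mel scale means, here is a minimal pure-Python sketch. It assumes the common mel formula mel = 1127 ln(1 + f / 700) and evenly spaced filters in mel space; the helper names (hz_to_mel, mel_filterbank, apply_filterbank) and the n_fft argument are hypothetical, and the actual athena kernel may differ in details such as normalization.

```python
import math


def hz_to_mel(hz):
    """Convert Hz to mel (assumed formula: 1127 * ln(1 + f / 700))."""
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_filterbank(num_bins, n_fft, sample_rate, low_hz=20.0, high_hz=0.0):
    """Build num_bins triangular filters over n_fft // 2 + 1 FFT bins."""
    nyquist = sample_rate / 2.0
    if high_hz <= 0:
        high_hz = nyquist + high_hz  # non-positive value is an offset from Nyquist
    mel_lo, mel_hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (mel_hi - mel_lo) / (num_bins + 1)  # spacing between filter edges
    filters = []
    for b in range(num_bins):
        left, center, right = (mel_lo + (b + k) * step for k in range(3))
        row = []
        for i in range(n_fft // 2 + 1):
            mel = hz_to_mel(i * sample_rate / n_fft)
            if left < mel <= center:
                row.append((mel - left) / (center - left))    # rising slope
            elif center < mel < right:
                row.append((right - mel) / (right - center))  # falling slope
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def apply_filterbank(filters, spectrum_frame):
    """Weight one spectrum frame by every filter and sum per channel."""
    return [sum(w * s for w, s in zip(row, spectrum_frame)) for row in filters]


if __name__ == "__main__":
    fb = mel_filterbank(num_bins=23, n_fft=512, sample_rate=16000)
    flat_spectrum = [1.0] * (512 // 2 + 1)
    print(len(apply_filterbank(fb, flat_spectrum)))  # one value per channel
```

Each frame of the spectrum is reduced from num_frequencies values to filterbank_channel_count values, which is why the output shape is (num_frames, num_channels).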

class athena.transform.feats.Framepow(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute power of every frame in speech. Return a float tensor with shape (1 * num_frames).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains four optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If True, the last frame (shorter than window_length) will be cut off. If False, half of frame_length of data will be padded to the data. (int, default = True)
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.

call(self, audio_data, sample_rate)

Calculate the power of every frame in speech.

Parameters:
  • audio_data – the audio signal from which to compute the frame power. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed; default is 16 kHz.
Returns: A float tensor of size (1 * num_frames) containing the power of every frame in speech.
dim(self)

dim
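
As a rough illustration of per-frame power, the sketch below slices a signal into overlapping frames, optionally removes the DC offset, and sums the squared samples. Taking power as a plain sum of squares, and dropping the incomplete last frame, are assumptions; the function name frame_power is hypothetical and this is not the athena implementation.

```python
def frame_power(samples, sample_rate=16000, window_length=0.025,
                frame_length=0.010, remove_dc_offset=True):
    """Return one power value per frame (sum of squared samples, assumed)."""
    win = int(round(window_length * sample_rate))  # samples per window
    hop = int(round(frame_length * sample_rate))   # samples per hop
    powers = []
    # Only frames that fit entirely in the signal are used (edge snipping).
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        if remove_dc_offset:
            mean = sum(frame) / win
            frame = [x - mean for x in frame]  # subtract the per-frame mean
        powers.append(sum(x * x for x in frame))
    return powers


if __name__ == "__main__":
    constant = [1.0] * 16000
    print(frame_power(constant)[:3])                          # DC removed
    print(frame_power(constant, remove_dc_offset=False)[:3])  # DC kept
```

A constant signal has no power once its mean is removed, which shows why remove_dc_offset matters for recordings with a bias.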

class athena.transform.feats.Pitch(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute pitch features of every frame in speech, return a float tensor with size (num_frames, 2).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains nineteen optional parameters:

    delta-pitch: Smallest relative change in pitch that our algorithm measures. (float, default = 0.005)
    window_length: Frame length in seconds. (float, default = 0.025)
    frame_length: Frame shift in seconds. (float, default = 0.010)
    frames-per-chunk: Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats); you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm). (int, default = 0)
    lowpass-cutoff: Cutoff frequency for the lowpass filter (Hz). (float, default = 1000)
    lowpass-filter-width: Integer that determines the filter width of the lowpass filter; more gives a sharper filter. (int, default = 1)
    max-f0: Maximum F0 to search for (Hz). (float, default = 400)
    max-frames-latency: Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
    min-f0: Minimum F0 to search for (Hz). (float, default = 50)
    nccf-ballast: Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
    nccf-ballast-online: This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)
    penalty-factor: Cost factor for F0 change. (float, default = 0.1)
    preemphasis-coefficient: Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
    recompute-frame: Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
    resample-frequency: Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff. (float, default = 4000)
    simulate-first-pass-online: If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
    snip-edges: If this is set to false, the incomplete frames near the ending edge won't be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)
    soft-min-f0: Minimum f0, applied in a soft way; must not exceed min-f0. (float, default = 10)
    upsample-filter-width: Integer that determines filter width when upsampling NCCF. (int, default = 5)
    add-delta-pitch: If true, the time derivative of log-pitch is added to output features. (bool, default = true)
    add-pov-feature: If true, the warped NCCF is added to output features. (bool, default = true)
    add-raw-log-pitch: If true, log(pitch) is added to output features. (bool, default = false)
    delay: Number of frames by which the pitch information is delayed. (int, default = 0)
    delta-pitch-noise-stddev: Standard deviation for noise we add to the delta log-pitch (before scaling); should be about the same as the delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values. (float, default = 0.005)
    delta-pitch-scale: Term to scale the final delta log-pitch feature. (float, default = 10)
    delta-window: Number of frames on each side of the central frame, to use for the delta window. (int, default = 2)
    normalization-left-context: Left-context (in frames) for moving window normalization. (int, default = 75)
    normalization-right-context: Right-context (in frames) for moving window normalization. (int, default = 75)
    pitch-scale: Scaling factor for the final normalized log-pitch value. (float, default = 2)
    pov-offset: This can be used to add an offset to the POV feature. Intended for use in online decoding as a substitute for CMN. (float, default = 0)
    pov-scale: Scaling factor for the final POV (probability of voicing) feature. (float, default = 2)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)

Calculate pitch features of audio data.

Parameters:
  • audio_data – the audio signal from which to compute pitch features. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed.
Returns: A float tensor of size (num_frames, 2) containing pitch and POV features of every frame in speech.
dim(self)

dim
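
Several of the options above refer to the NCCF (normalized cross-correlation function) and its ballast. The sketch below shows the basic idea in pure Python: correlate a window with a lagged copy of itself, normalize by the energies, and add a ballast term to the denominator so that quiet frames score lower. How the real extractor derives that term from nccf-ballast is not described in this reference, so the ballast is passed in directly here; the function name and arguments are hypothetical.

```python
import math


def nccf(frame, lag, window, ballast=0.0):
    """Normalized cross-correlation of frame[:window] with a lagged copy.

    ballast is added inside the square root of the denominator
    (simplified; an assumption for illustration only).
    """
    a = frame[:window]
    b = frame[lag:lag + window]
    inner = sum(x * y for x, y in zip(a, b))
    norm = sum(x * x for x in a) * sum(y * y for y in b)
    denom = math.sqrt(norm + ballast)
    return inner / denom if denom > 0 else 0.0


if __name__ == "__main__":
    period = 40  # samples per cycle of the test tone
    tone = [math.sin(2 * math.pi * n / period) for n in range(200)]
    for lag in (20, 40):
        print(lag, round(nccf(tone, lag, window=80), 3))
    # A large ballast pulls the score toward zero, as for a quiet frame.
    print(round(nccf(tone, 40, window=80, ballast=1e4), 3))
```

The lag with the highest NCCF is a candidate pitch period; a full tracker then searches lags between 1/max-f0 and 1/min-f0 and smooths the choice over time.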

class athena.transform.feats.Mfcc(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute mfcc features of every frame in speech, return a float tensor with size (num_channels, num_frames, num_frequencies).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains the following optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
    is_fbank: If true, compute power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
    upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
    lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
    filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
    coefficient_count: Number of cepstra in MFCC computation. (int, default = 13)
    cepstral_lifter: Constant that controls scaling of MFCCs. (float, default = 22)
    use_energy: Use energy (not C0) in MFCC computation. (bool, default = True)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)

Calculate mfcc features of audio data.

Parameters:
  • audio_data – the audio signal from which to compute mfcc features. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed.
Returns: A float tensor of size (num_channels, num_frames, num_frequencies) containing mfcc features of every frame in speech.
dim(self)

dim
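
The coefficient_count and cepstral_lifter options act on the last stage of MFCC computation. The following sketch shows that stage in isolation, assuming an unnormalized DCT-II over the log mel energies and the Kaldi-style lifter 1 + (Q / 2) sin(pi n / Q); both choices, and the helper names dct_ii and lifter, are assumptions for illustration rather than a description of the athena kernel.

```python
import math


def dct_ii(log_mel, num_ceps):
    """First num_ceps coefficients of an unnormalized DCT-II."""
    n = len(log_mel)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(log_mel))
        for k in range(num_ceps)
    ]


def lifter(ceps, cepstral_lifter=22.0):
    """Scale cepstra by 1 + (Q / 2) * sin(pi * n / Q) (assumed form)."""
    if cepstral_lifter == 0:
        return list(ceps)  # liftering disabled
    q = cepstral_lifter
    return [c * (1.0 + 0.5 * q * math.sin(math.pi * k / q))
            for k, c in enumerate(ceps)]


if __name__ == "__main__":
    # One frame of 23 log mel energies (made-up values for the demo).
    frame = [math.log(1.0 + b) for b in range(23)]
    mfcc = lifter(dct_ii(frame, num_ceps=13), cepstral_lifter=22.0)
    print(len(mfcc))  # one value per cepstral coefficient
```

The lifter leaves coefficient 0 unchanged and boosts the higher-order cepstra, which are otherwise much smaller in magnitude.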

class athena.transform.feats.WriteWav(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Encode audio data (input) using sample rate (input), return a write wav operation.

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains one optional parameter:

    sample_rate: the sample rate of the signal being processed. (int, default = 16000)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, filename, audio_data, sample_rate)

Write a wav file using audio_data (a tensor).

Parameters:
  • filename – file path of the wav file.
  • audio_data – a tensor containing the data of a wav file.
  • sample_rate – the sample rate of the signal being processed.
Returns: A write wav operation.

class athena.transform.feats.Fbank(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute filter banks by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands. Return a float tensor with shape (num_channels, num_frames, num_frequencies).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains the following optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
    is_fbank: If true, compute power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
    is_log10: If true, apply log10 to fbank. If false, apply the natural logarithm. (bool, default = false)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
    upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
    lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
    filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
    dither: Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)

Calculate fbank features of audio data.

Parameters:
  • audio_data – the audio signal from which to compute fbank features. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed; default is 16 kHz.
Returns: A float tensor of size (num_channels, num_frames, num_frequencies) containing fbank features of every frame in speech.
dim(self)

Return dimension of fbank.

num_channels(self)

Return number of channels of fbank.
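
Two of the Fbank options, preEph_coeff and window_type, describe per-frame preprocessing that happens before the spectrum is taken. The sketch below illustrates both under the assumption that they follow the Kaldi definitions: preemphasis y[n] = x[n] - c x[n-1] with the first sample using itself as predecessor, and the "povey" window as a Hann window raised to the power 0.85. The function names are hypothetical.

```python
import math


def preemphasize(frame, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1] to one frame.

    The first sample uses itself as its predecessor (Kaldi convention,
    assumed here).
    """
    if not frame:
        return []
    head = [frame[0] - coeff * frame[0]]
    tail = [frame[i] - coeff * frame[i - 1] for i in range(1, len(frame))]
    return head + tail


def povey_window(n):
    """Hann window raised to the power 0.85 (assumed 'povey' definition)."""
    return [(0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1))) ** 0.85
            for i in range(n)]


def window_frame(frame, coeff=0.97):
    """Preemphasize a frame, then taper it with the povey window."""
    emphasized = preemphasize(frame, coeff)
    return [x * w for x, w in zip(emphasized, povey_window(len(frame)))]


if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
    out = window_frame(frame)
    print(len(out), out[0], out[-1])  # the window tapers both ends to zero
```

Preemphasis flattens the spectral tilt of speech, and the tapered window reduces leakage between frequency bins when the frame is transformed.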

class athena.transform.feats.CMVN(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Do CMVN on features.

classmethod params(cls, config=None)

set params

call(self, audio_feature, speed=1.0)

implementation func

dim(self)

dim

athena.transform.feats.compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False)

Compute cmvn on feature.
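
CMVN (cepstral mean and variance normalization) shifts each feature dimension to zero mean and scales it to unit variance. The sketch below is a minimal pure-Python version, assuming that statistics are taken over the frames of the given utterance when mean and variance are not supplied; the exact meaning of local_cmvn is not stated in this reference, so it is not modeled. The function name cmvn and the eps term are assumptions.

```python
import math


def cmvn(features, mean=None, variance=None, eps=1e-8):
    """Normalize a list of frames (each a list of floats) per dimension.

    If mean or variance is None, it is estimated from the frames themselves
    (assumed behaviour, for illustration only).
    """
    num_frames, dim = len(features), len(features[0])
    if mean is None:
        mean = [sum(f[d] for f in features) / num_frames for d in range(dim)]
    if variance is None:
        variance = [
            sum((f[d] - mean[d]) ** 2 for f in features) / num_frames
            for d in range(dim)
        ]
    # eps guards against division by zero for constant dimensions.
    return [
        [(f[d] - mean[d]) / math.sqrt(variance[d] + eps) for d in range(dim)]
        for f in features
    ]


if __name__ == "__main__":
    feats = [[1.0, 10.0], [3.0, 30.0]]
    for row in cmvn(feats):
        print([round(v, 3) for v in row])
```

Passing precomputed mean and variance corresponds to global CMVN, where statistics gathered over a training set are applied to every utterance.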

class athena.transform.feats.FbankPitch(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute Fbank and Pitch features separately and concatenate them. Return a tensor with shape (num_frames, dim_features).

classmethod params(cls, config=None)

Set params.

Parameters:
  • config – contains twenty-nine optional parameters:

    window_length: Window length in seconds. (float, default = 0.025)
    frame_length: Hop length in seconds. (float, default = 0.010)
    snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
    raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
    preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
    window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
    remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
    is_fbank: If true, compute power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
    output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
    upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
    lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
    filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
    dither: Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
    delta-pitch: Smallest relative change in pitch that our algorithm measures. (float, default = 0.005)
    frames-per-chunk: Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats); you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm). (int, default = 0)
    lowpass-cutoff: Cutoff frequency for the lowpass filter (Hz). (float, default = 1000)
    lowpass-filter-width: Integer that determines the filter width of the lowpass filter; more gives a sharper filter. (int, default = 1)
    max-f0: Maximum F0 to search for (Hz). (float, default = 400)
    max-frames-latency: Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
    min-f0: Minimum F0 to search for (Hz). (float, default = 50)
    nccf-ballast: Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
    nccf-ballast-online: This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)
    penalty-factor: Cost factor for F0 change. (float, default = 0.1)
    preemphasis-coefficient: Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
    recompute-frame: Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
    resample-frequency: Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff. (float, default = 4000)
    simulate-first-pass-online: If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
    soft-min-f0: Minimum f0, applied in a soft way; must not exceed min-f0. (float, default = 10)
    upsample-filter-width: Integer that determines filter width when upsampling NCCF. (int, default = 5)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)

Calculate fbank and pitch features of a wav and concatenate them.

Parameters:
  • audio_data – the audio signal from which to compute features. Should be a (1, N) tensor.
  • sample_rate – the sample rate of the signal being processed.
Returns: A tensor with shape (num_frames, dim_features), containing the fbank and pitch features of every frame in speech.
dim(self)
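
The concatenation performed by FbankPitch can be pictured as joining, frame by frame, the fbank vector and the pitch vector, so that dim_features is the sum of the two widths. The sketch below shows this with plain Python lists; the function name concat_features and the requirement that both inputs have the same number of frames are assumptions for illustration.

```python
def concat_features(fbank, pitch):
    """Join two per-frame feature lists along the feature axis.

    fbank: list of num_frames lists, each of length fbank_dim
    pitch: list of num_frames lists, each of length pitch_dim
    Returns a list of num_frames lists of length fbank_dim + pitch_dim.
    """
    if len(fbank) != len(pitch):
        raise ValueError("fbank and pitch must have the same number of frames")
    return [f + p for f, p in zip(fbank, pitch)]


if __name__ == "__main__":
    num_frames = 3
    fbank = [[0.0] * 23 for _ in range(num_frames)]  # 23 mel channels
    pitch = [[0.0] * 2 for _ in range(num_frames)]   # pitch and POV
    combined = concat_features(fbank, pitch)
    print(len(combined), len(combined[0]))  # num_frames, dim_features
```

Both feature streams must use the same window_length, frame_length and edge handling for the frame counts to agree.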