athena.transform.feats
Submodules
- athena.transform.feats.base_frontend
- athena.transform.feats.cmvn
- athena.transform.feats.cmvn_test
- athena.transform.feats.fbank
- athena.transform.feats.fbank_pitch
- athena.transform.feats.fbank_pitch_test
- athena.transform.feats.fbank_test
- athena.transform.feats.framepow
- athena.transform.feats.framepow_test
- athena.transform.feats.mel_spectrum
- athena.transform.feats.mel_spectrum_test
- athena.transform.feats.mfcc
- athena.transform.feats.mfcc_test
- athena.transform.feats.pitch
- athena.transform.feats.pitch_test
- athena.transform.feats.read_wav
- athena.transform.feats.read_wav_test
- athena.transform.feats.spectrum
- athena.transform.feats.spectrum_test
- athena.transform.feats.write_wav
- athena.transform.feats.write_wav_test
Package Contents
Classes
ReadWav | Read audio samples from a wav file; return the sample data and sample rate. |
Spectrum | Compute spectrum features of every frame in speech; return a float tensor with size (num_frames, num_frequencies). |
MelSpectrum | Apply triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands. |
Framepow | Compute the power of every frame in speech; return a float tensor with shape (1 * num_frames). |
Pitch | Compute pitch features of every frame in speech; return a float tensor with size (num_frames, 2). |
Mfcc | Compute MFCC features of every frame in speech; return a float tensor with size (num_channels, num_frames, num_frequencies). |
WriteWav | Encode audio data (input) using a sample rate (input); return a write-wav operation. |
Fbank | Apply triangular filters on a Mel scale to the power spectrum to extract frequency bands. |
CMVN | Apply CMVN (cepstral mean and variance normalization) to features. |
FbankPitch | Compute Fbank and Pitch features separately and concatenate them; return a tensor with shape (num_frames, dim_features). |
Functions
compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False) | Compute CMVN on a feature. |
class athena.transform.feats.ReadWav(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Read audio samples from a wav file; return the sample data and sample rate.
classmethod params(cls, config=None)
Set params.
Parameters: config – contains one optional parameter: audio_channels (int, default = 1).
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, wavfile, speed=1.0)
Get audio data and sample rate from a wav file.
Parameters: - wavfile – file path of the wav file.
- speed – speed of the sample channels wanted. (float, default = 1.0)
Returns: Two values. The first is a tensor of audio data. The second is the sample rate of the input wav file, as a tensor with float dtype.
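To make the shape of the two return values concrete, here is a minimal standard-library sketch of reading a 16-bit PCM wav file. It is not Athena's implementation: the name `read_wav_sketch`, the scaling of samples to [-1, 1), and the per-channel list layout are all assumptions made for illustration.

```python
import struct
import wave


def read_wav_sketch(path):
    """Hypothetical helper: return (per-channel samples, sample rate as float)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    # Decode little-endian signed 16-bit integers.
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Assumption: scale to floats in [-1, 1).
    samples = [value / 32768.0 for value in ints]
    # De-interleave so each channel gets its own list.
    data = [samples[c::channels] for c in range(channels)]
    return data, float(sample_rate)
```

The sample rate is returned as a float to mirror the "float dtype" noted for the second return value above.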
class athena.transform.feats.Spectrum(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute spectrum features of every frame in speech; return a float tensor with size (num_frames, num_frequencies).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains ten optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
- raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
- preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
- window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
- is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = false)
- output_type: If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 2)
- dither: Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate=None)
Calculate the power spectrum or log power spectrum of audio data.
Parameters: - audio_data – the audio signal from which to compute the spectrum. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with; default is 16 kHz.
Returns: spectrum: A float tensor of size (num_frames, num_frequencies) containing the power spectrum (output_type=1) or log power spectrum (output_type=2) of every frame in speech.
dim(self)
Return the dimension of the feature.
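As a rough guide to how window_length, frame_length, and snip_edges determine num_frames, the sketch below counts frames for a signal of a given length. It assumes the Kaldi framing convention, which this page does not confirm for Athena; the function name `num_frames_sketch` and its boolean `snip_edges` argument (rather than the 1/2 integer encoding documented above) are hypothetical.

```python
def num_frames_sketch(num_samples, sample_rate=16000,
                      window_length=0.025, frame_length=0.010,
                      snip_edges=True):
    """Hypothetical helper: frame count under Kaldi-style framing (assumed)."""
    window_size = int(round(window_length * sample_rate))  # samples per window
    hop_size = int(round(frame_length * sample_rate))      # samples per hop
    if snip_edges:
        # Only windows that fit entirely inside the signal are kept.
        if num_samples < window_size:
            return 0
        return 1 + (num_samples - window_size) // hop_size
    # Assumed padded mode: frame count depends on the hop size only.
    return (num_samples + hop_size // 2) // hop_size
```

Under these assumptions, one second of 16 kHz audio gives 98 frames with edge snipping and 100 frames without it.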
class athena.transform.feats.MelSpectrum(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute filter banks by applying triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands. Return a float tensor with shape (num_frames, num_channels).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains thirteen optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If True, the last frame (shorter than window_length) will be cut off. If False, half of frame_length of data will be padded to the data. (bool, default = True)
- raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
- preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.0)
- window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "hann")
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = false)
- is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
- output_type: If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 3)
- upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
- lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
- filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
- dither: Dithering constant (0.0 means no dither). (float, default = 0) [adds robustness to training]
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate the log mel spectrum of audio data.
Parameters: - audio_data – the audio signal from which to compute the spectrum. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with; default is 16 kHz.
Returns: A float tensor of size (num_frames, num_channels) containing mel spectrum features of every frame in speech.
dim(self)
Return the dimension of the feature.
num_channels(self)
Return the number of channels.
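To illustrate how lower_frequency_limit, upper_frequency_limit, and filterbank_channel_count place the triangular filters, the sketch below computes filter centre frequencies. It assumes the common mel formula 1127 * ln(1 + f / 700) and evenly spaced centres on the mel axis; this page does not state which mel variant Athena uses, and the function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    # Assumed mel formula (a widely used variant, not confirmed for Athena).
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_centers_sketch(sample_rate=16000, lower_frequency_limit=20.0,
                       upper_frequency_limit=0.0,
                       filterbank_channel_count=23):
    """Hypothetical helper: centre frequency (Hz) of each triangular filter."""
    nyquist = sample_rate / 2.0
    # A limit <= 0 is interpreted as an offset from the Nyquist frequency.
    if upper_frequency_limit > 0:
        high = upper_frequency_limit
    else:
        high = nyquist + upper_frequency_limit
    low_mel = hz_to_mel(lower_frequency_limit)
    high_mel = hz_to_mel(high)
    step = (high_mel - low_mel) / (filterbank_channel_count + 1)
    return [mel_to_hz(low_mel + (i + 1) * step)
            for i in range(filterbank_channel_count)]
```

With the documented defaults, the centres are dense at low frequencies and grow further apart toward the Nyquist frequency, which is the point of the mel warping.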
class athena.transform.feats.Framepow(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute the power of every frame in speech. Return a float tensor with shape (1 * num_frames).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains four optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If True, the last frame (shorter than window_length) will be cut off. If False, half of frame_length of data will be padded to the data. (bool, default = True)
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate the power of every frame in speech.
Parameters: - audio_data – the audio signal from which to compute the power. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with; default is 16 kHz.
Returns: A float tensor of size (1 * num_frames) containing the power of every frame in speech.
dim(self)
Return the dimension of the feature.
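The idea of per-frame power can be sketched in a few lines of plain Python. This version takes the sum of squared samples in each full window and optionally removes the per-frame mean first; whether Athena sums, averages, or applies a log is not stated on this page, so treat the scaling and the name `frame_power_sketch` as assumptions.

```python
def frame_power_sketch(samples, window_size, hop_size, remove_dc_offset=True):
    """Hypothetical helper: sum of squares of each full frame."""
    powers = []
    # Only frames that fit entirely in the signal (edge-snipping behaviour).
    for start in range(0, len(samples) - window_size + 1, hop_size):
        frame = samples[start:start + window_size]
        if remove_dc_offset:
            mean = sum(frame) / window_size
            frame = [x - mean for x in frame]
        powers.append(sum(x * x for x in frame))
    return powers
```

A constant signal shows the effect of remove_dc_offset: its power is zero once the mean is subtracted, and non-zero otherwise.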
class athena.transform.feats.Pitch(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute pitch features of every frame in speech; return a float tensor with size (num_frames, 2).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains the following optional parameters:
- delta-pitch: Smallest relative change in pitch that our algorithm measures. (float, default = 0.005)
- window_length: Frame length in seconds. (float, default = 0.025)
- frame_length: Frame shift in seconds. (float, default = 0.010)
- frames-per-chunk: Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats); you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm). (int, default = 0)
- lowpass-cutoff: Cutoff frequency for the lowpass filter (Hz). (float, default = 1000)
- lowpass-filter-width: Integer that determines the filter width of the lowpass filter; more gives a sharper filter. (int, default = 1)
- max-f0: Maximum F0 to search for (Hz). (float, default = 400)
- max-frames-latency: Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
- min-f0: Minimum F0 to search for (Hz). (float, default = 50)
- nccf-ballast: Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
- nccf-ballast-online: Useful mainly for debugging; it affects how the NCCF ballast is computed. (bool, default = false)
- penalty-factor: Cost factor for F0 change. (float, default = 0.1)
- preemphasis-coefficient: Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
- recompute-frame: Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
- resample-frequency: Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff. (float, default = 4000)
- simulate-first-pass-online: If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
- snip-edges: If this is set to false, the incomplete frames near the ending edge won't be snipped, so that the number of frames is the file size divided by the frame shift. This makes different types of features give the same number of frames. (bool, default = true)
- soft-min-f0: Minimum F0, applied in a soft way; must not exceed min-f0. (float, default = 10)
- upsample-filter-width: Integer that determines the filter width when upsampling NCCF. (int, default = 5)
- add-delta-pitch: If true, the time derivative of log-pitch is added to the output features. (bool, default = true)
- add-pov-feature: If true, the warped NCCF is added to the output features. (bool, default = true)
- add-raw-log-pitch: If true, log(pitch) is added to the output features. (bool, default = false)
- delay: Number of frames by which the pitch information is delayed. (int, default = 0)
- delta-pitch-noise-stddev: Standard deviation for noise we add to the delta log-pitch (before scaling); should be about the same as the delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values. (float, default = 0.005)
- delta-pitch-scale: Term to scale the final delta log-pitch feature. (float, default = 10)
- delta-window: Number of frames on each side of the central frame to use for the delta window. (int, default = 2)
- normalization-left-context: Left context (in frames) for moving-window normalization. (int, default = 75)
- normalization-right-context: Right context (in frames) for moving-window normalization. (int, default = 75)
- pitch-scale: Scaling factor for the final normalized log-pitch value. (float, default = 2)
- pov-offset: This can be used to add an offset to the POV feature. Intended for use in online decoding as a substitute for CMN. (float, default = 0)
- pov-scale: Scaling factor for the final POV (probability of voicing) feature. (float, default = 2)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate pitch features of audio data.
Parameters: - audio_data – the audio signal from which to compute pitch. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with.
Returns: A float tensor of size (num_frames, 2) containing pitch and POV features of every frame in speech.
dim(self)
Return the dimension of the feature.
class athena.transform.feats.Mfcc(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute MFCC features of every frame in speech; return a float tensor with size (num_channels, num_frames, num_frequencies).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains the following optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
- raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
- preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
- window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
- is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
- output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
- upper_frequency_limit: High cutoff frequency for mel bins (if < 0, offset from Nyquist). (float, default = 0)
- lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
- filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
- coefficient_count: Number of cepstra in MFCC computation. (int, default = 13)
- cepstral_lifter: Constant that controls scaling of MFCCs. (float, default = 22)
- use_energy: Use energy (not C0) in MFCC computation. (bool, default = True)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate MFCC features of audio data.
Parameters: - audio_data – the audio signal from which to compute MFCC. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with.
Returns: A float tensor of size (num_channels, num_frames, num_frequencies) containing MFCC features of every frame in speech.
dim(self)
Return the dimension of the feature.
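The cepstral_lifter parameter is described only as a constant that controls the scaling of MFCCs. One common convention (used in Kaldi and HTK) weights coefficient i by 1 + (Q / 2) * sin(pi * i / Q), with Q = cepstral_lifter. The sketch below computes those weights; that Athena follows this exact formula is an assumption, and `lifter_weights_sketch` is a hypothetical name.

```python
import math


def lifter_weights_sketch(coefficient_count=13, cepstral_lifter=22.0):
    """Hypothetical helper: per-coefficient liftering weights (assumed formula)."""
    return [
        1.0 + 0.5 * cepstral_lifter * math.sin(math.pi * i / cepstral_lifter)
        for i in range(coefficient_count)
    ]
```

Under this convention, coefficient 0 is left unscaled and higher-order coefficients are boosted, which evens out the magnitudes of the cepstra.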
class athena.transform.feats.WriteWav(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Encode audio data (input) using a sample rate (input); return a write-wav operation.
classmethod params(cls, config=None)
Set params.
Parameters: config – contains one optional parameter:
- sample_rate: the sample rate of the signal we are working with. (int, default = 16000)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, filename, audio_data, sample_rate)
Write a wav file using audio_data (a tensor).
Parameters: - filename – file path of the wav file.
- audio_data – a tensor containing the data of a wav file.
- sample_rate – the sample rate of the signal we are working with.
Returns: A write-wav operation.
class athena.transform.feats.Fbank(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute filter banks by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands. Return a float tensor with shape (num_channels, num_frames, num_frequencies).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains the following optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
- raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
- preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
- window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
- is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
- is_log10: If true, apply log10 to fbank. If false, apply the natural log. (bool, default = false)
- output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
- upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
- lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
- filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
- dither: Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate fbank features of audio data.
Parameters: - audio_data – the audio signal from which to compute fbank. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with; default is 16 kHz.
Returns: A float tensor of size (num_channels, num_frames, num_frequencies) containing fbank features of every frame in speech.
dim(self)
Return the dimension of fbank.
num_channels(self)
Return the number of channels of fbank.
class athena.transform.feats.CMVN(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Apply CMVN to features.
classmethod params(cls, config=None)
Set params.
call(self, audio_feature, speed=1.0)
Implementation function.
dim(self)
Return the dimension of the feature.
athena.transform.feats.compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False)
Compute CMVN on a feature.
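Since the entry above gives only the signature, the following plain-Python sketch shows what mean and variance normalization of a feature matrix generally looks like. How Athena's compute_cmvn actually combines mean, variance, and local_cmvn is not documented here, so the branching below, the `eps` term, and the name `cmvn_sketch` are assumptions for illustration only.

```python
import math


def cmvn_sketch(features, mean=None, variance=None, local_cmvn=False,
                eps=1e-8):
    """Hypothetical helper. features: list of frames, each a list of floats."""
    dim = len(features[0])
    if local_cmvn:
        # Assumption: "local" means statistics come from this utterance alone.
        count = len(features)
        mean = [sum(f[d] for f in features) / count for d in range(dim)]
        variance = [sum((f[d] - mean[d]) ** 2 for f in features) / count
                    for d in range(dim)]
    elif mean is None or variance is None:
        # Assumption: with no statistics available, features pass through.
        return [list(f) for f in features]
    # Subtract the mean and divide by the standard deviation per dimension.
    return [[(f[d] - mean[d]) / math.sqrt(variance[d] + eps)
             for d in range(dim)]
            for f in features]
```

After normalization each feature dimension has approximately zero mean and unit variance over the statistics used, which reduces channel and speaker mismatch between utterances.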
class athena.transform.feats.FbankPitch(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute Fbank and Pitch features separately and concatenate them. Return a tensor with shape (num_frames, dim_features).
classmethod params(cls, config=None)
Set params.
Parameters: config – contains twenty-nine optional parameters:
- window_length: Window length in seconds. (float, default = 0.025)
- frame_length: Hop length in seconds. (float, default = 0.010)
- snip_edges: If 1, the last frame (shorter than window_length) will be cut off. If 2, half of frame_length of data will be padded to the data. (int, default = 1)
- raw_energy: If 1, compute frame energy before preemphasis and windowing. If 2, compute frame energy after preemphasis and windowing. (int, default = 1)
- preEph_coeff: Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
- window_type: Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
- remove_dc_offset: Subtract mean from waveform on each frame. (bool, default = true)
- is_fbank: If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
- output_type: If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
- upper_frequency_limit: High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
- lower_frequency_limit: Low cutoff frequency for mel bins. (float, default = 20)
- filterbank_channel_count: Number of triangular mel-frequency bins. (float, default = 23)
- dither: Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]
- delta-pitch: Smallest relative change in pitch that our algorithm measures. (float, default = 0.005)
- frames-per-chunk: Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats); you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm). (int, default = 0)
- lowpass-cutoff: Cutoff frequency for the lowpass filter (Hz). (float, default = 1000)
- lowpass-filter-width: Integer that determines the filter width of the lowpass filter; more gives a sharper filter. (int, default = 1)
- max-f0: Maximum F0 to search for (Hz). (float, default = 400)
- max-frames-latency: Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
- min-f0: Minimum F0 to search for (Hz). (float, default = 50)
- nccf-ballast: Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
- nccf-ballast-online: Useful mainly for debugging; it affects how the NCCF ballast is computed. (bool, default = false)
- penalty-factor: Cost factor for F0 change. (float, default = 0.1)
- preemphasis-coefficient: Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
- recompute-frame: Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
- resample-frequency: Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff. (float, default = 4000)
- simulate-first-pass-online: If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
- soft-min-f0: Minimum F0, applied in a soft way; must not exceed min-f0. (float, default = 10)
- upsample-filter-width: Integer that determines the filter width when upsampling NCCF. (int, default = 5)
Returns: An object of class HParams, which is a set of hyperparameters as name-value pairs.
call(self, audio_data, sample_rate)
Calculate the concatenated fbank and pitch features of a wav.
Parameters: - audio_data – the audio signal from which to compute the features. Should be a (1, N) tensor.
- sample_rate – the sample rate of the signal we are working with.
Returns: A tensor with shape (num_frames, dim_features), containing fbank and pitch features of every frame in speech.
dim(self)
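As an illustration of how a config dict for this class might be laid out, the sketch below uses only keys and defaults listed in the parameter list above. The dimension arithmetic assumes that the concatenation simply appends the two Pitch columns to filterbank_channel_count Fbank columns; this page does not confirm that, and `expected_dim_sketch` is a hypothetical helper rather than part of Athena.

```python
# Keys and values are copied from the documented parameter list (defaults).
fbank_pitch_config = {
    "window_length": 0.025,
    "frame_length": 0.010,
    "filterbank_channel_count": 23,
    "lower_frequency_limit": 20,
    "upper_frequency_limit": 0,
    "dither": 1,
}

# Assumption: Pitch contributes 2 columns, matching its (num_frames, 2) output.
PITCH_COLUMNS = 2


def expected_dim_sketch(config):
    """Hypothetical helper: guess dim_features for the concatenated output."""
    return int(config["filterbank_channel_count"]) + PITCH_COLUMNS
```

With the defaults this gives 25 features per frame under the stated assumption; checking the value returned by dim() on a real instance is the reliable way to confirm it.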