athena.transform.feats.pitch

This module extracts pitch features per frame.

Module Contents

Classes

Pitch: Compute pitch features of every frame in speech, return a float tensor with size (num_frames, 2).
class athena.transform.feats.pitch.Pitch(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute pitch features of every frame in speech, return a float tensor with size (num_frames, 2).

classmethod params(cls, config=None)

Set params.

Parameters:config – contains nineteen optional parameters:

delta-pitch: Smallest relative change in pitch that our algorithm
measures (float, default = 0.005)

window_length: Frame length in seconds (float, default = 0.025)
frame_length: Frame shift in seconds (float, default = 0.010)
frames-per-chunk: Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats); you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm) (int, default = 0)
lowpass-cutoff: Cutoff frequency for the lowpass filter (Hz).
(float, default = 1000)
lowpass-filter-width: Integer that determines filter width of lowpass filter,
more gives sharper filter (int, default = 1)

max-f0: Maximum F0 to search for (Hz) (float, default = 400)
max-frames-latency: Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true) (int, default = 0)

min-f0: Minimum F0 to search for (Hz) (float, default = 50)
nccf-ballast: Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
nccf-ballast-online: This is useful mainly for debug; it affects how the NCCF
ballast is computed. (bool, default = false)

penalty-factor: Cost factor for F0 change. (float, default = 0.1)
preemphasis-coefficient: Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
recompute-frame: Only relevant for online pitch extraction, or for
compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
resample-frequency: Frequency that we down-sample the signal to. Must be
more than twice lowpass-cutoff (float, default = 4000)
simulate-first-pass-online: If true, compute-kaldi-pitch-feats will output features
that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
snip-edges: If this is set to false, the incomplete frames near the
ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)
soft-min-f0: Minimum F0, applied in a soft way; must not exceed min-f0.
(float, default = 10)
upsample-filter-width: Integer that determines filter width when upsampling
NCCF. (int, default = 5)
add-delta-pitch: If true, time derivative of log-pitch is added to
output features. (bool, default = true)
add-pov-feature: If true, the warped NCCF is added to output features.
(bool, default = true)
add-raw-log-pitch: If true, log(pitch) is added to output features.
(bool, default = false)
delay: Number of frames by which the pitch information is
delayed. (int, default = 0)
delta-pitch-noise-stddev: Standard deviation for noise we add to the delta
log-pitch (before scaling); should be about the same as delta-pitch option to pitch creation. The purpose is to get rid of peaks in the delta-pitch caused by discretization of pitch values. (float, default = 0.005)
delta-pitch-scale: Term to scale the final delta log-pitch feature.
(float, default = 10)
delta-window: Number of frames on each side of central frame,
to use for delta window. (int, default = 2)
normalization-left-context: Left-context (in frames) for moving window
normalization. (int, default = 75)
normalization-right-context: Right-context (in frames) for moving window
normalization. (int, default = 75)
pitch-scale: Scaling factor for the final normalized log-pitch
value. (float, default = 2)
pov-offset: This can be used to add an offset to the POV feature.
Intended for use in online decoding as a substitute for CMN. (float, default = 0)
pov-scale: Scaling factor for final POV (probability of voicing)
feature. (float, default = 2)
Returns:An object of class HParams, which is a set of hyperparameters as name-value pairs.
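Several of the options above constrain one another: resample-frequency must be more than twice lowpass-cutoff, and soft-min-f0 must not exceed min-f0. The following is a minimal sketch of checking those constraints on a plain dict before handing it to params. The helper check_pitch_config is hypothetical and not part of Athena, the underscore spelling of the keys is an assumption, and the min_f0 < max_f0 check is a sanity check added here rather than a rule stated above. The default values are copied from the list above.

```python
# Defaults copied from the parameter list above. The underscore spelling
# of the keys is an assumption made for this sketch.
DEFAULTS = {
    "lowpass_cutoff": 1000.0,
    "resample_frequency": 4000.0,
    "min_f0": 50.0,
    "max_f0": 400.0,
    "soft_min_f0": 10.0,
}


def check_pitch_config(config=None):
    """Hypothetical helper: merge user values over the defaults and validate."""
    merged = dict(DEFAULTS)
    merged.update(config or {})

    # Documented constraint: resample-frequency > 2 * lowpass-cutoff.
    if merged["resample_frequency"] <= 2 * merged["lowpass_cutoff"]:
        raise ValueError("resample_frequency must be more than twice lowpass_cutoff")

    # Documented constraint: soft-min-f0 must not exceed min-f0.
    if merged["soft_min_f0"] > merged["min_f0"]:
        raise ValueError("soft_min_f0 must not exceed min_f0")

    # Sanity check added for this sketch (not stated in the list above).
    if merged["min_f0"] >= merged["max_f0"]:
        raise ValueError("min_f0 must be below max_f0")

    return merged
```

Calling check_pitch_config({"lowpass_cutoff": 2000.0}) raises ValueError, because the default resample frequency of 4000 Hz is then no longer more than twice the cutoff.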
call(self, audio_data, sample_rate)

Calculate pitch features of audio data.

Parameters:
audio_data – the audio signal from which to compute pitch features. Should be a (1, N) tensor.
sample_rate – the sample rate of the signal we are working with.
Returns:A float tensor of size (num_frames, 2) containing pitch and POV features of every frame in speech.
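The num_frames dimension of the returned tensor depends on window_length, frame_length and snip-edges. As a rough sketch, assuming Kaldi-style framing applied at the input sample rate, the count can be estimated as below. The function estimate_num_frames is hypothetical and is not Athena's implementation; the real extractor works on a resampled signal, so its count may differ slightly.

```python
def estimate_num_frames(num_samples, sample_rate, window_length=0.025,
                        frame_length=0.010, snip_edges=True):
    """Rough frame-count estimate (assumed Kaldi-style framing, not Athena's code).

    window_length is the frame length in seconds and frame_length is the
    frame shift in seconds, following the naming used in params above.
    """
    window = int(round(window_length * sample_rate))  # samples per frame
    shift = int(round(frame_length * sample_rate))    # samples between frame starts
    if snip_edges:
        # Only frames that fit completely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Incomplete frames at the end are kept: signal length over frame shift, rounded.
    return int(num_samples / shift + 0.5)
```

For one second of 16 kHz audio with the default window and shift, this estimate gives 98 frames with snip_edges=True and 100 frames with snip_edges=False.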
dim(self)

Return the feature dimension.