# WER are we?

WER are we? An attempt at tracking state-of-the-art results and their corresponding code on speech recognition. Feel free to correct! (Inspired by wer_are_we.)

## HKUST

(Possibly trained on more data than HKUST.)

| CER Test | Paper | Published | Notes | Code |
| :------- | :---- | :-------- | :---- | :--- |
| 21.2% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC pre-training on 10,000 hours of unlabeled speech | athena-team/Athena |
| 22.75% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
| 23.09% | CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition | February 2020 | CIF + SAN-based models (AM + LM) + speed perturbation + SpecAugment | None |
| 23.5% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
| 23.67% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |
| 24.12% | Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping | February 2019 | SAA model + SAN-LM (joint training) + speed perturbation | None |
| 27.67% | Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin | February 2019 | Extended-RNA + RNN-LM (joint training) | None |
| 28.0% | Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM | June 2017 | CTC-attention MTL + joint decoding (one-pass) + VGG net + RNN-LM (separate) + speed perturbation | espnet/espnet |
| 29.9% | Joint CTC/attention decoding for end-to-end speech recognition | 2017 | CTC-attention MTL-large + joint decoding (one-pass) + speed perturbation | espnet/espnet |

## AISHELL-1

| CER Dev | CER Test | Paper | Published | Notes | Code |
| :------ | :------- | :---- | :-------- | :---- | :--- |
| None | 6.6% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
| None | 6.34% | CAT: CRF-Based ASR Toolkit | November 2019 | VGG + BLSTM + CTC-CRF + 3-gram LM + speed perturbation | thu-spmi/CAT |
| 6.0% | 6.7% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
| None | 7.43% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |

## THCHS-30

| CER, word task (0 dB white / car / cafeteria) | PER, phone task (0 dB white / car / cafeteria) | Paper | Published | Notes | Code |
| :-------------------------------------------- | :--------------------------------------------- | :---- | :-------- | :---- | :--- |
| 75.01% / 32.13% / 56.37% | 46.95% / 15.96% / 32.56% | THCHS-30: A Free Chinese Speech Corpus | December 2015 | DNN + DAE-based noise cancellation | kaldi-asr/kaldi |
| 65.87% / 25.07% / 51.92% | 39.80% / 11.48% / 30.55% | None | None | DNN + DAE-based noise cancellation | kaldi-asr/kaldi |

## LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Published | Notes | Code |
| :------------- | :------------- | :---- | :-------- | :---- | :--- |
| 5.83% | 12.69% | Humans (reported in Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Human performance | None |
| 2.0% | 4.1% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring + 60k hours of unlabeled speech | facebookresearch/wav2letter |
| 2.3% | 4.9% | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | October 2019 | Transformer AM (chenones) + 4-gram LM + neural LM rescoring (data augmentation: speed perturbation and SpecAugment) | None |
| 2.3% | 5.0% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | HMM-DNN + lattice-based sMBR + LSTM LM + Transformer LM rescoring (no data augmentation) | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 2.3% | 5.2% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring | facebookresearch/wav2letter |
| 2.2% | 5.8% | State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | October 2019 | Multi-stream self-attention in hybrid ASR + 4-gram LM + neural LM rescoring (no data augmentation) | s-omranpour/ConvolutionalSpeechRecognition (not official) |
| 2.5% | 5.8% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell | DemisEom/SpecAugment (not official) |
| 3.2% | 7.6% | From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition | October 2019 | LC-BLSTM AM (chenones) + 4-gram LM (data augmentation: speed perturbation and SpecAugment) | None |
| 3.19% | 7.64% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + dense TDNN-LSTM across two kinds of trees + n-gram LM + neural LM rescoring | None |
| 2.44% | 8.29% | Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System | September 2019, Interspeech | encoder-attention-decoder + Transformer LM | None |
| 3.80% | 8.76% | Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks | September 2018, Interspeech | 17-layer TDNN-F + iVectors | kaldi-asr/kaldi |
| 2.8% | 9.3% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | encoder-attention-decoder + BPE + Transformer LM (no data augmentation) | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 3.26% | 10.47% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM | None |
| 3.82% | 12.76% | Improved training of end-to-end attention models for speech recognition | September 2018, Interspeech | encoder-attention-decoder end-to-end model | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 4.28% | None | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations | kaldi-asr/kaldi |
| 4.83% | None | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors | kaldi-asr/kaldi |
| 5.15% | 12.73% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters, trained on 11,940 h | PaddlePaddle/DeepSpeech |
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | HMM-DNN + pNorm* | kaldi-asr/kaldi |
| 4.8% | 14.5% | Letter-Based Speech Recognition with Gated ConvNets | December 2017 | (Gated) ConvNet for AM going to letters + 4-gram LM | None |
| 8.01% | 22.49% | same paper (Kaldi recipe) | 2015 | HMM-(SAT)GMM | kaldi-asr/kaldi |
| None | 12.51% | Audio Augmentation for Speech Recognition | 2015 | TDNN + pNorm + speed up/down speech | kaldi-asr/kaldi |

## WSJ

(Possibly trained on more data than WSJ.)

| WER eval'92 | WER eval'93 | Paper | Published | Notes | Code |
| :---------- | :---------- | :---- | :-------- | :---- | :--- |
| 5.03% | 8.08% | Humans (reported in Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Human performance | None |
| 2.9% | None | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN LF-MMI trained (biphone) | None |
| 3.10% | None | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters | PaddlePaddle/DeepSpeech |
| 3.47% | None | Deep Recurrent Neural Networks for Acoustic Modelling | April 2015 | TC-DNN-BLSTM-DNN | None |
| 3.5% | 6.8% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM | None |
| 3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | Test set on open vocabulary (i.e. harder); model = HMM-DNN + pNorm* | kaldi-asr/kaldi |
| 4.1% | None | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN E2E LF-MMI trained (word n-gram) | None |
| 5.6% | None | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | 2014 | CNN over raw speech (wav) | None |
| 5.7% | 8.7% | End-to-end Speech Recognition from the Raw Waveform | June 2018 | End-to-end CNN on the waveform | None |

## TIMIT

(So far, all results trained on TIMIT and tested on the core test set.)

| PER | Paper | Published | Notes | Code |
| :-- | :---- | :-------- | :---- | :--- |
| 13.8% | The PyTorch-Kaldi Speech Recognition Toolkit | February 2019 | MLP + Li-GRU + MLP on MFCC + FBANK + fMLLR | mravanelli/pytorch-kaldi |
| 14.9% | Light Gated Recurrent Units for Speech Recognition | March 2018 | Removing the reset gate in the GRU, using ReLU activations instead of tanh, and batch normalization | mravanelli/pytorch-kaldi |
| 16.5% | Phone recognition with hierarchical convolutional deep maxout networks | September 2015 | Hierarchical maxout CNN + dropout | None |
| 16.5% | A Regularization Post Layer: An Additional Way how to Make Deep Neural Networks Robust | 2017 | DBN with last-layer regularization | None |
| 16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | 2014 | CNN in time and frequency + dropout; 17.6% w/o dropout | None |
| 16.8% | An investigation into instantaneous frequency estimation methods for improved speech recognition features | November 2017 | DNN-HMM with MFCC + IFCC features | None |
| 17.3% | Segmental Recurrent Neural Networks for End-to-end Speech Recognition | March 2016 | RNN-CRF on 24(x3) MFSC | None |
| 17.6% | Attention-Based Models for Speech Recognition | June 2015 | Bi-RNN + attention | Alexander-H-Liu/End-to-end-ASR-Pytorch (not official) |
| 17.7% | Speech Recognition with Deep Recurrent Neural Networks | March 2013 | Bi-LSTM + skip connections w/ RNN transducer (18.4% with CTC only) | 1ytic/warp-rnnt (not official) |
| 18.0% | Learning Filterbanks from Raw Speech for Phone Recognition | October 2017 | Complex ConvNets on raw speech w/ mel-fbanks init | facebookresearch/wav2letter, facebookresearch/tdfbanks |
| 18.8% | WaveNet: A Generative Model for Raw Audio | September 2016 | WaveNet architecture with a mean-pooling layer after the residual block + a few non-causal conv layers | ibab/tensorflow-wavenet (not official) |
| 23% | Deep Belief Networks for Phone Recognition | 2009 | (first, modern) HMM-DBN | None |

## Hub5’00 Evaluation (Switchboard / CallHome)

(Possibly trained on more data than SWB, but test set = full Hub5’00.)

| WER (SWB) | WER (CH) | Paper | Published | Notes | Code |
| :-------- | :------- | :---- | :-------- | :---- | :--- |
| 5.0% | 9.1% | The CAPIO 2017 Conversational Speech Recognition System | December 2017 | 2 dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from the previous CAPIO paper & AM adaptation using parameter averaging (5.6% SWB / 10.5% CH single systems) | None |
| 5.1% | 9.9% | Language Modeling with Highway LSTM | September 2017 | HW-LSTM LM trained on Switchboard + Fisher + Gigaword + Broadcast News + Conversations; AM from the previous IBM paper | None |
| 5.1% | None | The Microsoft 2017 Conversational Speech Recognition System | August 2017 | ~2016 system + character-based, dialog-session-aware (turns of speech) LSTM LM | None |
| 5.3% | 10.1% | Deep Learning-based Telephony Speech Recognition in the Wild | August 2017 | Ensemble of 3 CNN-bLSTMs (5.7% SWB / 11.3% CH single systems) | None |
| 5.5% | 10.3% | English Conversational Telephone Speech Recognition by Humans and Machines | March 2017 | ResNet + BiLSTM acoustic model with 40-d FMLLR + i-vector inputs, trained on SWB + Fisher + CH; n-gram + model M + LSTM + strided (à trous) conv-based LM trained on Switchboard + Fisher + Gigaword + Broadcast | None |
| 6.3% | 11.9% | The Microsoft 2016 Conversational Speech Recognition System | September 2016 | VGG/ResNet/LACE/BiLSTM acoustic model trained on SWB + Fisher + CH; n-gram + RNN-LM language model trained on Switchboard + Fisher + Gigaword + Broadcast | None |
| 6.6% | 12.2% | The IBM 2016 English Conversational Telephone Speech Recognition System | June 2016 | RNN + VGG + LSTM acoustic model trained on SWB + Fisher + CH; n-gram + "model M" + NNLM language model | None |
| 6.8% | 14.1% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell | DemisEom/SpecAugment (not official) |
| 8.5% | 13% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher | kaldi-asr/kaldi |
| 9.2% | 13.3% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively trained on SWBD only) | kaldi-asr/kaldi |
| 12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters); 25.9% WER if trained only on SWB | mozilla/DeepSpeech (not official) |
| 11% | 17.1% | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors | kaldi-asr/kaldi |
| 12.6% | 18.4% | Sequence-discriminative training of deep neural networks | 2013 | HMM-DNN + sMBR | kaldi-asr/kaldi |
| 12.9% | 19.3% | Audio Augmentation for Speech Recognition | 2015 | HMM-TDNN + pNorm + speed up/down speech | kaldi-asr/kaldi |
| 15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | June 2014 | DNN + dropout | None |
| 10.4% | None | Joint Training of Convolutional and Non-Convolutional Neural Networks | 2014 | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/i-vectors concatenated in a DNN | None |
| 11.5% | None | Deep Convolutional Neural Networks for LVCSR | 2013 | CNN | None |
| 12.2% | None | Very Deep Multilingual Convolutional Neural Networks for LVCSR | September 2015 | Deep CNN (10 conv, 4 FC layers), multi-scale feature maps | None |
| 11.8% | 25.7% | Improved training of end-to-end attention models for speech recognition | September 2018, Interspeech | encoder-attention-decoder end-to-end model, trained on 300h SWB | rwth-i6/returnn, rwth-i6/returnn-experiments |

## Lexicon

- WER: word error rate (a computation sketch follows this list)
- CER: character error rate (the same computation over characters; the common metric for the Mandarin tasks above)
- PER: phone error rate
- LM: language model
- HMM: hidden Markov model
- GMM: Gaussian mixture model
- DNN: deep neural network
- CNN: convolutional neural network
- DBN: deep belief network (RBM-based DNN)
- TDNN-F: a factored form of time delay neural networks (TDNN)
- RNN: recurrent neural network
- LSTM: long short-term memory
- CTC: connectionist temporal classification (usage sketched after this list)
- MMI: maximum mutual information
- MPE: minimum phone error
- sMBR: state-level minimum Bayes risk
- SAT: speaker adaptive training
- MLLR: maximum likelihood linear regression
- LDA: (in this context) linear discriminant analysis
- MFCC: Mel-frequency cepstral coefficients
- FB/FBANKS/MFSC: Mel-frequency spectral coefficients
- IFCC: instantaneous frequency cosine coefficients (https://github.com/siplabiith/IFCC-Feature-Extraction)
- VGG: very deep convolutional neural network from the Visual Geometry Group; an architecture of two 3x3 convolutions followed by one pooling layer, repeated (a block is sketched after this list)
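
As a reference for the metrics above, here is a minimal Python sketch of how WER is typically computed: the Levenshtein (edit) distance between the reference and hypothesis word sequences, normalized by the number of reference words. CER and PER are the same computation over characters and phones. The function names are illustrative, not taken from any toolkit listed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]

def wer(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Same as WER but over characters (as in the HKUST/AISHELL-1 tables)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```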
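
For CTC, here is a minimal sketch using PyTorch's built-in `torch.nn.CTCLoss`; the tensor shapes and the blank-index convention follow the PyTorch documentation, while the dimensions themselves are made up for illustration.

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 28, 12  # input frames, batch size, labels (incl. blank), target length

log_probs = torch.randn(T, N, C).log_softmax(dim=-1)     # (T, N, C) per-frame log-probabilities
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # label 0 is reserved for the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all alignments of each target sequence to the input frames.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```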
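
Finally, a sketch of the repeating VGG unit described in the last bullet, again in PyTorch; the 2x2 max-pooling and ReLU activations are the standard choice and are assumed here rather than taken from any particular paper above.

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels):
    """Two 3x3 convolutions followed by one pooling: the unit VGG repeats."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
```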