With the addition of the TED-LIUM 3 corpus and positive results from the auto-review process, the r20190609 release of the English Zamia-Speech models for Kaldi has been trained on the largest amount of audio material yet (over 1100 hours):
    zamia_en       0:05:38
    voxforge_en  102:07:05
    cv_corpus_v1 252:31:11
    librispeech  450:49:09
    ljspeech      23:13:54
    m_ailabs_en  106:28:20
    tedlium3     210:13:30
Additionally, 400 hours of noise-augmented audio derived from the above corpora were used (background noise and phone codecs):
    voxforge_en_noisy    22:01:40
    librispeech_noisy   119:03:26
    cv_corpus_v1_noisy   78:57:16
    cv_corpus_v1_phone   61:38:33
    zamia_en_noisy        0:02:08
    voxforge_en_phone    18:02:35
    librispeech_phone   106:35:33
    zamia_en_phone        0:01:11
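The background-noise part of this augmentation boils down to mixing a noise signal into the speech at a target signal-to-noise ratio. A minimal sketch of that mixing step (the real pipeline, with recorded noise corpora and phone-codec simulation, is more involved; the function name and SNR handling here are illustrative, not Zamia-Speech's actual code):

```python
import numpy as np

def mix_noise(speech, noise, snr_db):
    """Mix a noise signal into a speech signal at a target SNR in dB.

    speech, noise: 1-D float sample arrays; the noise is tiled/truncated
    to match the length of the speech signal.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Scale the noise so that p_speech / (scale^2 * p_noise) == 10^(snr_db/10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))

    return speech + scale * noise
```

Running this over each corpus with a pool of noise recordings and a range of SNRs yields the `*_noisy` variants listed above.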
In total, this release has been trained on over 1500 hours of audio material; training took over six weeks on a GeForce GTX 1080 Ti GPU.
Word error rates on the test set:

    %WER 10.64 exp/nnet3_chain/tdnn_250/decode_test/wer_8_0.0
    %WER  8.84 exp/nnet3_chain/tdnn_f/decode_test/wer_8_0.0
    %WER  5.80 exp/nnet3_chain/tdnn_fl/decode_test/wer_9_0.0
The tdnn_250 model is the smallest one, meant for embedded applications (i.e. RPi-3 class hardware); tdnn_f is our regular model; tdnn_fl is the tdnn_f model adapted to a larger language model. The gap between the latter two illustrates the importance of language model domain adaptation, by the way.
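For reference, %WER is the word-level edit (Levenshtein) distance between the reference transcript and the decoder hypothesis, divided by the number of reference words. A minimal sketch of the metric (Kaldi's own scoring tool additionally breaks the errors down into substitutions, insertions, and deletions):

```python
def wer(ref, hyp):
    """Word error rate in percent: word-level Levenshtein distance
    between reference and hypothesis, divided by reference length."""
    r, h = ref.split(), hyp.split()

    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j

    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution

    return 100.0 * dp[len(r)][len(h)] / len(r)
```

For example, one substituted word in a six-word reference yields a WER of about 16.7%.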