After 25 years of research, Microsoft’s speech recognition technology has finally reached human parity: the system now has a word error rate of only 5.1%, a reduction of 12% over the last year. The achievement was made possible with a convolutional neural network combined with bidirectional long short-term memory (CNN-BLSTM).
We reduced our error rate by about 12 percent compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models. We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long short-term memory) model for improved acoustic modeling. In addition, we now combine predictions from multiple acoustic models at both the frame/senone and word levels.
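The quoted passage mentions fusing multiple acoustic models at two different levels. As a rough illustration only (not Microsoft's actual system), frame/senone-level combination can be thought of as averaging per-frame senone posteriors across models, while word-level combination resembles ROVER-style voting over the words in each model's hypothesis. A minimal sketch under those assumptions:

```python
from collections import Counter

def combine_frame_level(posteriors_per_model):
    """Average per-frame senone posterior vectors across models.

    Each model contributes a list of frames; each frame is a list of
    senone posteriors. Output has the same shape as one model's input.
    """
    n_models = len(posteriors_per_model)
    n_frames = len(posteriors_per_model[0])
    n_senones = len(posteriors_per_model[0][0])
    return [
        [sum(model[t][s] for model in posteriors_per_model) / n_models
         for s in range(n_senones)]
        for t in range(n_frames)
    ]

def combine_word_level(hypotheses):
    """Majority vote per word position across equal-length hypotheses
    (a simplified, ROVER-like word-level combination)."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Two toy models, 2 frames, 3 senones each (illustrative numbers only)
m1 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
m2 = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]]
fused = combine_frame_level([m1, m2])

# Three toy hypotheses; voting recovers the majority word at each position
best = combine_word_level([["the", "cat", "sat"],
                           ["the", "cap", "sat"],
                           ["the", "cat", "sat"]])
# → ["the", "cat", "sat"]
```

Real systems align hypotheses of different lengths before voting and weight models by confidence; this sketch assumes aligned, equal-length outputs for clarity.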
Source: [H]ardOCP – Microsoft’s Speech Recognition Technology Has Hit Human-Level Accuracy