With the rise of Artificial Intelligence, the field of Automatic Speech Recognition (ASR) has seen tremendous progress, reshaping voice-activated technologies and human-computer interaction. ASR enables machines to convert spoken language into text, a capability that underpins applications ranging from virtual assistants to transcription services. Researchers continue to refine the underlying algorithms, driven by the need for more precise and efficient ASR systems.
In recent research from NVIDIA, a team of researchers studied the limitations of Connectionist Temporal Classification (CTC) models. CTC models have become a leading choice for high-accuracy ASR pipelines because they excel at interpreting temporal sequences and the subtleties of spoken language. Accurate as they are, however, their performance has been constrained by conventional CPU-based beam search decoding.
Decoding is an essential stage in accurately transcribing spoken words. The simplest approach, greedy search, uses the acoustic model to select the most likely output token at each time step. This approach struggles to incorporate contextual biases and outside data such as external language models; beam search decoding addresses these shortcomings by tracking multiple candidate hypotheses, but it is computationally expensive on CPUs.
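To make the baseline concrete, here is a minimal sketch of greedy CTC decoding: pick the most probable token at each time step, collapse consecutive repeats, and drop the CTC blank symbol. This is a generic illustration, not NVIDIA's implementation, and the function and variable names are assumptions.

```python
def ctc_greedy_decode(log_probs, blank_id=0):
    """log_probs: list of per-timestep score lists, one entry per vocabulary token."""
    # 1. Pick the argmax token at every time step.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    # 2. Collapse consecutive repeats, then remove blanks.
    decoded = []
    prev = None
    for tok in best_path:
        if tok != prev and tok != blank_id:
            decoded.append(tok)
        prev = tok
    return decoded

# Example: vocabulary = [blank, 'a', 'b']; the per-frame argmaxes are
# [1, 1, 0, 2], which collapse to the token sequence [1, 2] ("ab").
frames = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.2, 0.7, 0.1],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.2, 0.7],    # 'b'
]
print(ctc_greedy_decode(frames))  # -> [1, 2]
```

Because each time step is decided independently, there is no place in this loop to inject an external language model or bias toward expected words, which is exactly the gap beam search decoding fills.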
To overcome these challenges, the team proposed a GPU-accelerated Weighted Finite State Transducer (WFST) beam search decoder designed to integrate smoothly with existing CTC models. The decoder improves pipeline throughput, reduces latency, and supports features such as on-the-fly composition for utterance-specific word boosting. Its higher throughput and lower latency make it especially well-suited to streaming inference.
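To illustrate the idea behind word boosting, the toy beam search below adds a score bonus whenever a hypothesis completes a boosted word, conceptually mimicking on-the-fly composition with a small biasing graph. This is a deliberately simplified sketch (it omits the blank/repeat path merging of true CTC prefix beam search and has nothing to do with the riva-asrlib-decoder API); all names are hypothetical.

```python
import math

def beam_search_with_boost(frames, boosted=None, beam_width=4, blank="_"):
    """frames: per-timestep dicts mapping token -> log probability.
    boosted: dict mapping word -> additive log-score bonus."""
    boosted = boosted or {}
    beams = {"": 0.0}  # hypothesis text -> best score so far
    for frame in frames:
        next_beams = {}
        for text, score in beams.items():
            for tok, lp in frame.items():
                new_text = text if tok == blank else text + tok
                new_score = score + lp
                # Reward a hypothesis the moment it completes a boosted word,
                # analogous to composing with an utterance-specific biasing graph.
                for word, bonus in boosted.items():
                    if new_text.endswith(word) and not text.endswith(word):
                        new_score += bonus
                if new_score > next_beams.get(new_text, -math.inf):
                    next_beams[new_text] = new_score
        # Prune to the top beam_width hypotheses.
        beams = dict(sorted(next_beams.items(), key=lambda kv: -kv[1])[:beam_width])
    return max(beams.items(), key=lambda kv: kv[1])[0]

f1 = {"a": math.log(0.5), "b": math.log(0.3), "_": math.log(0.2)}
f2 = {"a": math.log(0.6), "b": math.log(0.2), "_": math.log(0.2)}
print(beam_search_with_boost([f1, f2]))                      # -> "aa"
print(beam_search_with_boost([f1, f2], boosted={"b": 2.0}))  # -> "ba"
```

The second call shows the effect of boosting: the acoustically less likely token "b" wins once the bonus is applied, which is how utterance-specific terms (names, jargon) can be recovered even when the acoustic model underrates them.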
The team evaluated the decoder in both offline and online settings. Compared with the state-of-the-art CPU decoder, the GPU-accelerated decoder delivered up to seven times higher throughput in the offline scenario and more than eight times lower latency in the online streaming scenario, while maintaining the same or lower word error rates. These findings show that pairing CTC models with the proposed GPU-accelerated WFST beam search decoder significantly improves both efficiency and accuracy.
In conclusion, this approach overcomes the performance constraints of CPU-based beam search decoding for CTC models. The proposed GPU-accelerated decoder is the fastest beam search decoder for CTC models in both offline and online contexts, improving throughput, lowering latency, and supporting advanced features. To ease integration with Python-based machine learning frameworks, the team has released pre-built DLPack-based Python bindings, making the solution more usable and accessible to Python developers. The code, a CUDA WFST decoder packaged as a C++ and Python library, is available at https://github.com/nvidia-riva/riva-asrlib-decoder.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.