AI

Matching Latent Encoding for Audio-Text based Keyword Spotting

1 Mins read

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown high-quality results, but the key challenge of how to semantically align two embeddings for multi-word keywords of different sequence lengths remains largely unsolved. In this paper, we propose an audio-text-based end-to-end model architecture for flexible keyword spotting (KWS), which builds upon learned audio and text embeddings. Our architecture uses a novel dynamic programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence into the same length as the word-based text sequence using the monotonic alignment of spoken content. Our proposed model consists of an encoder block to get audio and text embeddings, a projector block to project individual embeddings to a common latent space, and an audio-text aligner containing a novel DSP algorithm, which aligns the audio and text embeddings to determine if the spoken content is the same as the text. Experimental results show that our DSP is more effective than other partitioning schemes, and the proposed architecture outperformed the state-of-the-art results on the public dataset in terms of Area Under the ROC Curve (AUC) and Equal-Error-Rate (EER) by 14.4 % and 28.9%, respectively.


Source link

Related posts
AI

Patronus AI Introduces the Industry's First Multimodal LLM-as-a-Judge (MLLM-as-a-Judge): Designed to Evaluate and Optimize AI Systems that Convert Image Inputs into Text Outputs

2 Mins read
​In recent years, the integration of image generation technologies into various platforms has opened new avenues for enhancing user experiences. However, as…
AI

Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to Beat GPT 3.5 and GPT-4o mini on a Suite of Multi-Skill Benchmarks

2 Mins read
The rapid evolution of artificial intelligence (AI) has ushered in a new era of large language models (LLMs) capable of understanding and…
AI

This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregressive and Diffusion Models for Scalable and Efficient Text Generation

3 Mins read
Traditional language models rely on autoregressive approaches, which generate text sequentially, ensuring high-quality outputs at the expense of slow inference speeds. In…

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *