Neural generative models have transformed the way we create and consume digital content. They can generate high-quality images, produce long spans of coherent text, and even synthesize speech and audio. Among the different approaches, diffusion-based generative models have gained prominence and shown strong results across a wide range of tasks.
During training, a diffusion model learns to map a predefined noise distribution to the target data distribution. At inference time, the model repeatedly predicts the noise present in its current estimate and removes it, gradually turning pure noise into a sample from the target distribution. Diffusion models can operate on different data representations, such as raw pixels or latent representations.
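To make this concrete, here is a minimal sketch of a single reverse-diffusion (denoising) step in PyTorch. The noise-prediction network `eps_model` and the schedule tensors are hypothetical placeholders, not the implementation of any of the models discussed in this article.

```python
import torch

# Minimal sketch of one reverse-diffusion (denoising) step, assuming a
# DDPM-style noise schedule. `eps_model`, `alphas`, `alphas_cumprod`, and
# `betas` are hypothetical placeholders, not any specific model's code.
def ddpm_step(x_t, t, eps_model, alphas, alphas_cumprod, betas):
    eps_pred = eps_model(x_t, t)  # network predicts the noise added at step t
    alpha_t, alpha_bar_t, beta_t = alphas[t], alphas_cumprod[t], betas[t]
    # Mean of x_{t-1} given x_t and the predicted noise (standard DDPM update).
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    if t > 0:
        mean = mean + torch.sqrt(beta_t) * torch.randn_like(x_t)  # stochastic sampling noise
    return mean
```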
State-of-the-art models such as Stable Diffusion, DALL·E, and Midjourney tackle text-to-image synthesis. Although interest in X-to-Y generation has increased in recent years, audio-to-image models remain largely unexplored.
The motivation for using audio signals rather than text prompts is that images and audio are naturally connected in the context of videos. By contrast, although text-based generative models can produce remarkable images, textual descriptions are not inherently tied to the image and typically have to be added manually. Audio signals can also represent complex scenes and objects, such as different variations of the same instrument (e.g., classic, acoustic, or electric guitar) or different settings for the same object (e.g., a classic guitar recorded in a studio versus at a live show). Manually annotating such detailed information for distinct objects is labor-intensive, which makes scaling it up challenging.
Previous studies have proposed several methods for generating images from audio inputs, primarily using Generative Adversarial Networks (GANs) to synthesize images conditioned on audio recordings. However, there are notable differences between that work and the proposed method. Some methods generated MNIST digits exclusively and did not extend to general audio sounds; others handled general audio but produced low-quality images.
To overcome these limitations, the authors propose a deep learning (DL) model for audio-to-image generation. Its overview is depicted in the figure below.
The approach leverages a pre-trained text-to-image generation model and a pre-trained audio representation model, learning only an adaptation layer that maps the output of the audio model into the input space of the text-to-image model. Drawing on recent work on textual inversion, a dedicated audio token is introduced: the audio representation is mapped into an embedding vector, which is forwarded to the network as a continuous representation that acts like the embedding of a new word.
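As a rough illustration of this adaptation idea, the sketch below projects a placeholder audio embedding into the word-embedding space of a frozen text encoder, where it can stand in for the dedicated audio token. The module names, dimensions, and placeholder prompt are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch: a small trainable projector maps a frozen audio embedding
# into the word-embedding space of a frozen text-to-image model, where it
# plays the role of a dedicated "audio token". Dimensions, module names,
# and the placeholder prompt are illustrative assumptions.
class AudioToTokenAdapter(nn.Module):
    def __init__(self, audio_dim=768, token_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_emb):
        # audio_emb: (batch, audio_dim) representation of the audio clip
        return self.proj(audio_emb)  # (batch, token_dim) pseudo word embedding

# Usage sketch: the returned vector would replace the embedding of a
# placeholder token (e.g., "<audio>") in a prompt such as "a photo of <audio>"
# before the prompt is passed through the frozen text encoder.
adapter = AudioToTokenAdapter()
audio_emb = torch.randn(1, 768)        # stand-in for a real audio embedding
audio_token_emb = adapter(audio_emb)   # injected at the placeholder position
```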
The Audio Embedder uses a pre-trained audio classification network to capture the audio's representation. Typically, only the last layer of such a discriminative network is used, but it tends to discard audio details that are irrelevant to the classification task. To address this, the approach combines hidden states from earlier layers with the last hidden layer, resulting in a temporal embedding of the audio signal.
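A hedged sketch of that idea is shown below: hidden states from an earlier layer are concatenated with the last hidden layer and projected per frame, preserving the temporal structure of the clip. The chosen layers, shapes, and linear projection are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hedged sketch: concatenate hidden states from an earlier layer of a frozen
# audio classifier with its last hidden layer, then project them per frame.
# Shapes, the chosen layers, and the linear projection are assumptions.
def temporal_audio_embedding(early_hidden, last_hidden, proj):
    # early_hidden, last_hidden: (batch, time, dim) per-frame hidden states
    combined = torch.cat([early_hidden, last_hidden], dim=-1)  # (batch, time, 2 * dim)
    return proj(combined)  # (batch, time, out_dim) temporal embedding of the clip

proj = nn.Linear(2 * 768, 768)
early = torch.randn(1, 50, 768)   # stand-in hidden states from an early layer
last = torch.randn(1, 50, 768)    # stand-in last-layer hidden states
frames = temporal_audio_embedding(early, last, proj)  # (1, 50, 768)
```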
Sample results produced by the presented model are reported below.
This was the summary of AudioToken, a novel Audio-to-Image (A2I) synthesis model. If you are interested, you can learn more about this technique in the links below.
Check out the Paper.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.