Text-to-Image Diffusion Models represent a groundbreaking approach to generating images from textual prompts. They leverage the power of deep learning and probabilistic modeling to capture the subtle relationships between language and visual concepts. By conditioning a generative model on textual descriptions, these models learn to synthesize realistic images that faithfully depict the given input.
At the heart of Text-to-Image Diffusion Models lies the concept of diffusion, a process inspired by statistical physics. The key idea is to iteratively refine an initially noisy image, gradually making it more realistic and coherent by repeatedly applying a learned denoising network. By extending this principle to text-to-image synthesis, researchers have achieved remarkable results, enabling the creation of high-resolution, detailed images from text prompts with impressive fidelity and diversity.
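To make the iterative refinement concrete, here is a minimal sketch of a DDPM-style reverse (denoising) loop. The tiny `denoiser` network and the noise schedule are toy placeholders, not the actual Stable Diffusion components, and a real text-to-image model would also condition the denoiser on the timestep and a text embedding.

```python
import torch
import torch.nn as nn

# Toy noise-prediction network standing in for the real U-Net (illustrative only;
# a real denoiser is also conditioned on the timestep and the text prompt).
denoiser = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(), nn.Conv2d(32, 3, 3, padding=1)
)

T = 50                                        # number of denoising steps (toy value)
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 3, 64, 64)                 # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_hat = denoiser(x)                     # predict the noise present in the current image
    # DDPM posterior mean: remove the predicted noise, rescaled by the schedule.
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise   # stochastic step toward a cleaner image
```

Each pass removes a little of the predicted noise, so the sample drifts from pure noise toward a coherent image over the course of the loop.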
However, training such models poses significant challenges. Generating high-quality images from textual descriptions requires navigating a vast and complex space of possible visual interpretations, making it difficult to ensure stability during the learning process. Stable Diffusion makes this tractable by performing the diffusion process in the compressed latent space of a pre-trained autoencoder, guiding the model to capture the underlying semantics of the text and generate coherent images without sacrificing diversity. The result is more reliable and controlled image generation, empowering artists, designers, and developers to produce captivating visual content with greater precision and control.
A major drawback of Stable Diffusion is that its extensive architecture demands significant computational resources and leads to prolonged inference times. To address this, several methods have been proposed to improve the efficiency of Stable Diffusion Models (SDMs). Some reduce the number of denoising steps by distilling a pre-trained diffusion model into a similar model that samples in fewer steps. Others apply post-training quantization to lower the precision of the model's weights and activations. Both directions yield smaller models, lower memory requirements, and better computational efficiency.
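As a rough illustration of the first idea, the Hugging Face diffusers library lets you trade quality for speed at inference time by lowering `num_inference_steps` and swapping in a faster scheduler. This is not the distillation procedure itself, just a hedged sketch of how fewer sampling steps are used in practice; the model ID, prompt, and step count below are example values.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Example checkpoint; any Stable Diffusion v1.x model ID works similarly.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A faster multistep solver produces usable images in far fewer steps
# than the ~50 typically used with the default scheduler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=20,   # fewer denoising steps -> faster inference
).images[0]
image.save("lighthouse.png")
```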
However, the reduction achievable by these techniques is not substantial. Therefore, other solutions must be explored, such as the removal of architectural elements in diffusion models.
The work presented in this article reflects this motivation and unveils the significant potential of classical architectural compression techniques in achieving smaller and faster diffusion models. The pre-training pipeline is depicted in the figure below.
The procedure removes multiple residual and attention blocks from the U-Net architecture of a Stable Diffusion Model (SDM) and pre-trains the compact (or student) model using feature-level knowledge distillation (KD).
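A hedged sketch of what a feature-level distillation objective could look like in PyTorch follows. The toy networks, the way intermediate activations are collected, and the unit loss weights are placeholders for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the teacher U-Net and the block-removed student (illustrative only).
teacher = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
                        nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
                        nn.Conv2d(32, 4, 3, padding=1))
student = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
                        nn.Conv2d(32, 4, 3, padding=1))

def forward_with_features(net, x):
    """Run the network and collect intermediate activations for feature-level KD."""
    feats = []
    for layer in net:
        x = layer(x)
        if isinstance(layer, nn.SiLU):
            feats.append(x)
    return x, feats

x = torch.randn(2, 4, 32, 32)              # noisy latents (toy batch)
target_noise = torch.randn_like(x)         # the noise the denoiser should predict

with torch.no_grad():
    t_out, t_feats = forward_with_features(teacher, x)
s_out, s_feats = forward_with_features(student, x)

task_loss = F.mse_loss(s_out, target_noise)   # usual denoising objective
out_kd = F.mse_loss(s_out, t_out)             # output-level distillation
# Feature-level distillation on activation pairs whose shapes still match.
feat_kd = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats) if s.shape == t.shape)
loss = task_loss + out_kd + feat_kd           # loss weights omitted for simplicity
loss.backward()
```

The student is thus supervised not only by the denoising target but also by the teacher's outputs and intermediate features, which is what lets a much smaller network inherit the teacher's behavior.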
The block removal yields some intriguing insights about the U-Net's down, up, and mid stages.
In the down and up stages, the approach removes redundant residual and cross-attention blocks from the U-Net while preserving the crucial spatial information processing. Mirroring the DistilBERT recipe, the remaining blocks can be initialized with the teacher's pre-trained weights, resulting in a more efficient and compact model.
Surprisingly, removing the entire mid stage from the original U-Net has little impact on generation quality while significantly reducing the parameter count. This trade-off between compute efficiency and generation quality makes mid-stage removal a viable optimization.
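The mid-stage removal and teacher-weight initialization can be illustrated with a small, hedged sketch: copy into the pruned student every teacher parameter whose name and shape still match, and leave the rest at their fresh initialization. The toy modules below are placeholders for the real U-Nets.

```python
import torch.nn as nn

# Toy "U-Net" with down, mid, and up parts; the student simply drops the mid part.
class ToyUNet(nn.Module):
    def __init__(self, with_mid=True):
        super().__init__()
        self.down = nn.Conv2d(4, 32, 3, padding=1)
        self.mid = nn.Conv2d(32, 32, 3, padding=1) if with_mid else nn.Identity()
        self.up = nn.Conv2d(32, 4, 3, padding=1)

    def forward(self, x):
        return self.up(self.mid(self.down(x)))

teacher = ToyUNet(with_mid=True)
student = ToyUNet(with_mid=False)          # mid stage removed

# Initialize the student from the teacher wherever parameter names and shapes line up.
teacher_sd = teacher.state_dict()
student_sd = student.state_dict()
copied = {k: v for k, v in teacher_sd.items()
          if k in student_sd and v.shape == student_sd[k].shape}
student_sd.update(copied)
student.load_state_dict(student_sd)
print(f"Copied {len(copied)}/{len(student_sd)} tensors from the teacher.")
```

The same idea scales to the real SDM U-Net: blocks that survive the pruning inherit the teacher's weights, and distillation then fine-tunes the whole compact model.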
According to the authors, after distilling knowledge from the teacher, each student model achieves high-quality text-to-image (T2I) synthesis. Compared to Stable Diffusion, which has 1.04 billion parameters and an FID score of 13.05, the BK-SDM-Base model (0.76 billion parameters) achieves an FID score of 15.76, the BK-SDM-Small model (0.66 billion parameters) reaches 16.98, and the BK-SDM-Tiny model (0.50 billion parameters) reaches 17.12 (lower FID is better).
Qualitative results are reported to visually compare the proposed models with state-of-the-art approaches.
In summary, this article covered a novel compression technique for Text-to-Image (T2I) diffusion models based on the targeted removal of architectural elements and feature-level knowledge distillation.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.