
Baidu AI Researchers Introduce VideoGen: A New Text-to-Video Generation Approach That Can Generate High-Definition Video With High Frame Fidelity


Text-to-image (T2I) generation systems such as DALL-E 2, Imagen, CogView, and Latent Diffusion have come a long way in recent years. Text-to-video (T2V) generation, by contrast, remains difficult: it demands high-quality visual content together with temporally smooth, realistic motion that matches the text. In addition, large-scale datasets of text-video pairs are scarce.

Recent research from Baidu Inc. introduces VideoGen, a method for generating a high-quality, temporally smooth video from a textual description. To guide T2V generation, the researchers first produce a high-quality reference image with a T2I model. They then use a cascaded latent video diffusion module that generates a sequence of high-resolution, smooth latent representations conditioned on the reference image and the text description. When necessary, they also employ a flow-based approach to upsample the latent sequence in time. Finally, the team trained a video decoder to map the sequence of latent representations to an actual video.
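
The overall data flow can be summarized in a minimal sketch. Everything below is illustrative: the function and component names are hypothetical stand-ins chosen for this article, not the paper's released code or API.

```python
import torch

def videogen_pipeline(prompt, t2i, latent_video_diffusion,
                      temporal_upsampler, video_decoder,
                      upsample_time=False):
    """Data flow of the stages described above; all component names
    are hypothetical stand-ins, not the paper's actual API."""
    ref_image = t2i(prompt)                              # T2I reference image guides appearance
    latents = latent_video_diffusion(ref_image, prompt)  # smooth latent sequence
    if upsample_time:
        latents = temporal_upsampler(latents)            # flow-based frame interpolation
    return video_decoder(latents)                        # latents -> RGB frames (no text needed)

# Stand-in callables just to show shapes flowing end to end:
video = videogen_pipeline(
    "a dog surfing a wave",
    t2i=lambda p: torch.rand(1, 4, 32, 32),
    latent_video_diffusion=lambda img, p: torch.rand(16, 4, 32, 32),
    temporal_upsampler=lambda z: z.repeat_interleave(2, dim=0),
    video_decoder=lambda z: torch.rand(z.shape[0], 3, 256, 256),
)
print(video.shape)  # torch.Size([32, 3, 256, 256])
```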

Creating a reference image with a T2I model has two distinct advantages:

  1. It improves the resulting video’s visual quality. The method exploits the T2I model to draw on the much larger corpus of image-text pairs, which is more diverse and information-rich than available video-text datasets. Compared with Imagen Video, which uses image-text pairs for joint training, this approach is also more efficient during the training phase.
  2. A cascaded latent video diffusion model can be guided by the reference image, freeing it to learn video dynamics rather than visual content. The team regards this as an advantage over methods that only reuse the T2I model’s parameters (see the conditioning sketch after this list).
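
One plausible way to realize such reference-image guidance is channel-wise conditioning, a common pattern in latent diffusion models; this is an assumption for illustration, not a detail confirmed by the article.

```python
import torch

# Illustrative only: condition a video diffusion model on a reference image
# by concatenating its latent with each noisy frame latent along the channel
# axis before denoising. All shapes and names here are assumptions.
T, C, H, W = 16, 4, 32, 32
noisy_latents = torch.randn(T, C, H, W)   # noisy video latents at one diffusion step
ref_latent = torch.randn(1, C, H, W)      # latent of the T2I reference image

cond_input = torch.cat(
    [noisy_latents, ref_latent.expand(T, -1, -1, -1)],  # broadcast to all frames
    dim=1,                                              # channels: C -> 2C
)
# cond_input would then feed the denoising network; with appearance supplied
# by the reference image, capacity can go toward motion dynamics instead.
print(cond_input.shape)  # torch.Size([16, 8, 32, 32])
```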

The team also notes that their video decoder does not need the textual description to produce a video from the latent representation sequence. This lets them train the video decoder on a larger data pool that includes both video-text pairs and unlabeled (unpaired) videos. As a result, the high-quality video data improves the smoothness and realism of the generated video’s motion.
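
Because reconstruction needs no captions, a text-free decoder training step might look like the following sketch. The stand-in encoder/decoder modules and the reconstruction loss are assumptions for illustration, not the paper's actual training recipe.

```python
import torch
import torch.nn as nn

# Stand-in modules: any latent encoder/decoder pair would do here.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # frames -> latents
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latents -> frames
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

video = torch.rand(16, 3, 256, 256)  # T frames from an unlabeled clip
with torch.no_grad():
    latents = encoder(video)         # no text involved anywhere
recon = decoder(latents)
loss = nn.functional.mse_loss(recon, video)  # pure reconstruction objective
loss.backward()
opt.step()
```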

As the findings suggest, VideoGen represents a significant improvement over previous text-to-video generation methods in both qualitative and quantitative evaluations.


Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 30k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.


Dhanshree Shenwai is a Computer Science engineer with solid experience in FinTech companies spanning the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world, making everyone’s life easier.


