Researchers from UCLA and Apple Introduce STIV: A Scalable AI Framework for Text and Image Conditioned Video Generation
Video generation has improved with models like Sora, which uses the Diffusion Transformer (DiT) architecture. While text-to-video (T2V) models have advanced, they…