Title: Using Instruction-Following LLM Hidden States as Conditioning for Video Diffusion Model
Conference: ECAI-2025
Tags: ARTIFICIAL INTELLIGENCE, CLIP SCORE, DIFFUSION, FVD, GENERATIVE AI, HIDDEN STATES, LARGE LANGUAGE MODEL, LATENT, MULTIMODAL, PROMPT ENGINEERING, UNET, VARIATIONAL AUTOENCODER, VIDEO GENERATION
Abstract: Video generation has applications in several fields, and with the advent of Generative AI, extensive research is being conducted on video generation using AI. In this project, we experiment with using LLM hidden states as conditioning to train a video latent diffusion model, studying their ability to pass richer semantic information about the video samples. We perform a comparative study of the context-retention abilities of LLMs for embeddings and hidden states separately. We build a pipeline with three major components: the LLM, a custom Bridge Network, and the Diffusion UNet. We conduct our study on two datasets: Captioned Moving MNIST and a subset of the Sakuga-42M dataset. We conclude by evaluating our model variants on standard benchmarks and metrics and stating our findings, which could serve as grounds for future work.
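The abstract's three-component pipeline (LLM, Bridge Network, Diffusion UNet) can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual implementation: the Bridge Network here is a hypothetical two-layer MLP, and the hidden-state and conditioning dimensions are made-up placeholders. NumPy stands in for a deep-learning framework to keep the sketch self-contained.

```python
import numpy as np

# Hypothetical dimensions (assumptions; the abstract does not state sizes):
D_LLM = 4096   # hidden size of the instruction-following LLM
D_COND = 768   # conditioning width expected by the UNet's cross-attention

rng = np.random.default_rng(0)

# Bridge Network sketch: a two-layer MLP projecting LLM hidden states
# into the diffusion UNet's conditioning space.
W1 = rng.standard_normal((D_LLM, 1024)) * 0.02
b1 = np.zeros(1024)
W2 = rng.standard_normal((1024, D_COND)) * 0.02
b2 = np.zeros(D_COND)

def bridge(hidden_states: np.ndarray) -> np.ndarray:
    """Map LLM hidden states (batch, seq, D_LLM) to UNet conditioning (batch, seq, D_COND)."""
    h = np.maximum(hidden_states @ W1 + b1, 0.0)  # ReLU
    return h @ W2 + b2

# Example: last-layer hidden states for a batch of 2 prompts, 16 tokens each.
llm_hidden = rng.standard_normal((2, 16, D_LLM))
cond = bridge(llm_hidden)
print(cond.shape)  # (2, 16, 768)
```

In a full system, `cond` would be passed to the UNet's cross-attention layers in place of the usual text-encoder embeddings; the comparative study described above would swap `llm_hidden` between final embeddings and intermediate hidden states.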