LLM-grounded Video Diffusion Models: Improving Text-to-Video Generation with Large Language Models

Our method, LLM-grounded Video Diffusion Models (LVD), improves text-to-video generation by using a large language model to generate dynamic scene layouts from text and then guiding video diffusion models with these layouts, achieving realistic video generation that aligns with complex input prompts.
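
To make the idea of a dynamic scene layout concrete, here is a minimal sketch of how such a layout could be represented as data: each object phrase is paired with one bounding box per frame, so boxes stay linked across frames. The field names and the six-frame example are illustrative assumptions, not the project's actual schema.

```python
# Illustrative data structure for a dynamic scene layout (DSL).
# Field names and the 6-frame example are assumptions for clarity.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]

@dataclass
class ObjectTrack:
    phrase: str          # e.g. "a brown bear"
    boxes: List[Box]     # one box per frame, indexed in frame order

@dataclass
class DynamicSceneLayout:
    prompt: str
    tracks: List[ObjectTrack]

# Example: a single object moving from left to right over 6 frames.
dsl = DynamicSceneLayout(
    prompt="a brown bear walking from left to right",
    tracks=[
        ObjectTrack(
            phrase="a brown bear",
            boxes=[(0.05 + 0.12 * t, 0.40, 0.30 + 0.12 * t, 0.80) for t in range(6)],
        )
    ],
)
```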

While state-of-the-art open-source text-to-video models still struggle with seemingly simple requirements, such as faithfully depicting the object dynamics described in the text prompt, LLM-grounded Video Diffusion Models (LVD) enables text-to-video diffusion models to generate videos that are much more aligned with complex input text prompts.

Our method LVD improves text-to-video diffusion models by turning text-to-video generation into a two-stage pipeline. In stage 1, we introduce an LLM as a spatiotemporal planner that creates plans for video generation in the form of a dynamic scene layout (DSL). A DSL consists of object bounding boxes that are linked across frames. In stage 2, we condition the video generation on both the text and the DSL with our DSL-grounded video generator. Both stages are training-free: the LLM and the video diffusion model are used off the shelf, without any additional training or fine-tuning.
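
Because stage 2 must honor the DSL without any training, it suggests a guidance-style mechanism applied at sampling time. The sketch below shows one assumed form of such a mechanism: an energy term that rewards cross-attention mass falling inside each object's per-frame box, whose gradient could nudge the latent at each denoising step. The function names and the exact energy are illustrative assumptions, not the project's released code.

```python
# Assumed sketch of DSL-grounded guidance: an energy that is low when the
# cross-attention for an object phrase concentrates inside its planned box.
import torch

def box_mask(box, h, w):
    """Binary (h, w) mask covering a normalized (x_min, y_min, x_max, y_max) box."""
    mask = torch.zeros(h, w)
    x0, y0, x1, y1 = box
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def layout_energy(attn_maps, boxes):
    """
    attn_maps: (frames, h, w) cross-attention maps for one object phrase.
    boxes: per-frame boxes for that phrase, taken from the DSL.
    Returns a scalar that decreases as attention moves inside the boxes.
    """
    energy = 0.0
    for t, box in enumerate(boxes):
        a = attn_maps[t]
        m = box_mask(box, *a.shape)
        inside = (a * m).sum()
        total = a.sum() + 1e-8
        energy = energy + (1.0 - inside / total)  # penalize attention outside the box
    return energy

# During sampling, the gradient of this energy with respect to the noisy latent
# could be used to adjust each denoising step toward the planned layout, while
# keeping both the LLM and the diffusion model's weights frozen.
```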
