codi-2.github.io - CoDi-2: Interleaved and In-Context Any-to-Any Generation


Example domain paragraphs

We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) not only to understand complex modality-interleaved instructions and in-context examples, but also to autoregressively generate grounded and coherent multimodal outputs in the continuous feature space.
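
To make the alignment idea concrete, here is a minimal PyTorch sketch of projecting modality-encoder features into an LLM's embedding space for encoding, and projecting LLM hidden states back into a continuous feature space that a diffusion decoder could condition on. The module name, dimensions, and tensor shapes are illustrative assumptions, not CoDi-2's actual implementation.

```python
# Minimal sketch (PyTorch) of aligning modality features with an LLM's
# embedding space for both encoding and generation. Names and dimensions
# are illustrative, not CoDi-2's actual implementation.
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    def __init__(self, feat_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Input projection: modality encoder features -> LLM embedding space.
        self.to_llm = nn.Linear(feat_dim, llm_dim)
        # Output projection: LLM hidden states -> continuous features that a
        # diffusion decoder could be conditioned on (not discrete tokens).
        self.to_decoder = nn.Linear(llm_dim, feat_dim)

    def encode(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, feat_dim) -> (batch, num_patches, llm_dim)
        return self.to_llm(modality_feats)

    def decode(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, llm_dim) -> (batch, seq_len, feat_dim)
        return self.to_decoder(llm_hidden)


if __name__ == "__main__":
    aligner = ModalityAligner()
    image_feats = torch.randn(2, 256, 1024)        # hypothetical vision features
    llm_tokens = aligner.encode(image_feats)       # interleave with text embeddings
    cond = aligner.decode(torch.randn(2, 8, 4096)) # conditioning for a diffusion decoder
    print(llm_tokens.shape, cond.shape)
```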

CoDi-2 comprises a multimodal large language model that encompasses encoders and decoders for both audio and vision inputs, as well as a large language model. This architecture supports decoding image and audio outputs with diffusion models. During training, our approach uses the pixel loss obtained from the diffusion models alongside the token loss, following the standard causal generation loss.
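
The training objective above can be read as a sum of a standard causal token loss and a diffusion-based pixel loss. Below is a hedged sketch under that reading; the epsilon-prediction formulation, tensor shapes, and loss weighting are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of a combined objective: causal token loss plus a
# diffusion noise-prediction loss standing in for the "pixel loss".
# All tensors here are placeholders, not the paper's training code.
import torch
import torch.nn.functional as F


def combined_loss(token_logits, token_targets, noise_pred, noise, pixel_weight=1.0):
    """token_logits: (B, T, V); token_targets: (B, T) with -100 for ignored positions.
    noise_pred / noise: (B, C, H, W) from a diffusion decoder (epsilon prediction)."""
    # Standard causal LM loss over next-token predictions.
    token_loss = F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_targets.reshape(-1),
        ignore_index=-100,
    )
    # Diffusion loss: MSE between predicted and true noise.
    pixel_loss = F.mse_loss(noise_pred, noise)
    return token_loss + pixel_weight * pixel_loss


if __name__ == "__main__":
    B, T, V = 2, 16, 32000
    loss = combined_loss(
        torch.randn(B, T, V),
        torch.randint(0, V, (B, T)),
        torch.randn(B, 4, 64, 64),
        torch.randn(B, 4, 64, 64),
    )
    print(loss.item())
```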

Our model shows strong abilities on the following example task types, which present a unique approach to prompting models to generate or transform in-context multimodal content, including instructions, images, audio, video, and combinations thereof. A. Zero-Shot Prompting. Zero-shot prompting tasks require the model to reason and generate new content without any prior examples. B. One-Shot/Few-Shot Prompting. One-shot or few-shot prompting provides the model with one or a few examples to learn from before performing a similar task.
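
As an illustration of the prompting regimes above, the following sketch represents interleaved zero-shot and one-shot multimodal prompts as ordered, typed segments. The `Segment` schema, the example instructions, and the file paths are hypothetical and not CoDi-2's interface.

```python
# Illustrative representation of interleaved multimodal prompts as an
# ordered list of typed segments. Schema and paths are hypothetical.
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class Segment:
    kind: Literal["text", "image", "audio", "video"]
    content: str  # raw text, or a path/URI for non-text modalities


# A. Zero-shot prompting: an instruction and a query input, no prior examples.
zero_shot: List[Segment] = [
    Segment("text", "Add falling snow to this photo."),
    Segment("image", "examples/street.png"),
]

# B. One-shot prompting: one (input, output) exemplar before the query input.
one_shot: List[Segment] = [
    Segment("image", "examples/day.png"),
    Segment("text", "becomes"),
    Segment("image", "examples/night.png"),
    Segment("text", "Now apply the same transformation:"),
    Segment("image", "examples/query.png"),
]

if __name__ == "__main__":
    for seg in one_shot:
        print(seg.kind, seg.content)
```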
