Description: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
An overview of MUGEN.
Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun. We made substantial modifications to make the game richer by introducing audio and enabling new interactions. W