VideoXum: Cross-modal Visual and Textural Summarization of Videos



Abstract

Video summarization aims to distill the most important information from a source video into either an abridged video clip or a textual narrative. Existing methods often treat the generation of video and text summaries as independent tasks, neglecting the semantic correlation between visual and textual summarization; that is, they study only a single output modality rather than producing coherent video and text outputs together. In this work, we first introduce a new task, cross-modal video summarization (V2X-SUM), in which a long source video is summarized into a shortened video and a semantically aligned text narrative.
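To make the task's input/output contract concrete, here is a minimal, hypothetical sketch (all names are ours, not from the paper): a V2X-SUM model maps one long video to a pair of outputs, a set of selected frames for the shortened clip and a textual narrative.

```python
from dataclasses import dataclass


@dataclass
class CrossModalSummary:
    # Indices of frames kept for the shortened video clip.
    keyframe_indices: list[int]
    # Textual narrative; should be semantically aligned with the clip.
    text_summary: str


def v2x_sum(num_frames: int, keep_ratio: float = 0.15) -> CrossModalSummary:
    """Trivial stand-in for a V2X-SUM model: uniformly sample frames.

    A real model would score frame saliency and generate text jointly;
    this stub only illustrates the joint, two-modality output format.
    """
    step = max(1, round(1 / keep_ratio))
    kept = list(range(0, num_frames, step))
    return CrossModalSummary(kept, "<generated narrative>")


print(v2x_sum(num_frames=300))
```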

In this study, we first propose VideoXum, an enriched large-scale dataset for cross-modal video summarization. The dataset is built on ActivityNet Captions, a large-scale public video captioning benchmark. We hire workers to annotate ten shortened video summaries for each long source video according to the corresponding captions. VideoXum contains 14K long videos with 140K pairs of aligned video and text summaries.
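The dataset layout implied above (one long source video paired with ten aligned video/text summary annotations) might be represented as follows. This is a sketch under our own assumptions; the field names are illustrative, not the released schema.

```python
from dataclasses import dataclass


@dataclass
class SummaryAnnotation:
    # (start, end) times in seconds of the segments kept in the video summary.
    segments: list[tuple[float, float]]
    # Caption-derived textual summary aligned with those segments.
    text: str


@dataclass
class VideoXumEntry:
    video_id: str                         # source video identifier
    duration: float                       # length of the source video, seconds
    annotations: list[SummaryAnnotation]  # ten annotators -> ten summary pairs


# 14K entries x 10 annotations each = 140K aligned video-text summary pairs.
```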

Illustration of our V2X-SUM task. A long source video (bottom) can be summarized into a shortened video and a text narrative (top). The video and text summaries should be semantically aligned.
