DeVAn: Dense Video Annotation for Video-Language Models

Tingkai Liu¹, Yunzhe Tao¹, Haogeng Liu²,³, Qihang Fan²,³, Ding Zhou¹
Huaibo Huang², Ran He², Hongxia Yang¹

¹ByteDance, Inc.
²MAIS & CRIPAC, Institute of Automation, Chinese Academy of Sciences, China
³School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

Overview

We present DeVAn (Dense Video Annotation), a novel human-annotated dataset for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips, each 20-60 seconds long, covering a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given a video from the dataset and its corresponding ASR information, we evaluate visual-language models on caption or summary generation grounded in both the visual and auditory content of the video. Models are also evaluated on caption- and summary-based retrieval tasks; the summary-based retrieval task requires identifying a target video given excerpts of a summary. Because paragraph-length video summarization is a new task, we compared existing evaluation metrics on their alignment with human preferences and found that model-based metrics provide more semantically oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks.
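To make the overlap-based side of that metric comparison concrete, the following is a minimal sketch of ROUGE-L scored against several human references. It is not the evaluation code used for DeVAn. The whitespace tokenization, the default beta of 1.0, and taking the maximum over the references are all assumptions made for illustration, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between one candidate and one reference string.

    Assumption: lowercase whitespace tokenization; beta=1.0 gives plain F1.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


def multi_ref_rouge_l(candidate, references, beta=1.0):
    """Assumption: score against each reference and keep the best match."""
    return max(rouge_l(candidate, ref, beta) for ref in references)
```

Because the score depends only on shared word order, a faithful paraphrase of a human summary can still score low, which is the kind of gap that model-based metrics such as BLEURT are meant to close.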

Statistics

DeVAn is a multi-modal dataset containing 8.5K video clips carefully selected from previously published YouTube-based video datasets (YouTube-8M and YT-Temporal-1B) that integrate visual and auditory information. Over the span of 10 months, a team of 24 human annotators (college- and graduate-level students) created 5 short captions (1 sentence each) and 5 long summaries (3-10 sentences each) for every video clip. The result is a rich and comprehensive human-annotated dataset that serves as a robust ground truth for subsequent model training and evaluation.

Examples

Leaderboard

		Caption							Summary
		Generation Metrics				Retrieval Metrics			Generation Metrics				Retrieval Metrics
Model	Audio	BLEU-4	ROUGE-L	CIDEr	BLEURT	R@1	R@5	R@10	BLEU-4	ROUGE-L	CIDEr	BLEURT	R@1	R@5	R@10
Human (Avg)	Raw	6.3	32.1	53.9	50.5	-	-	-	15.7	34.5	36.9	55.6	-	-	-
Human (Min)	Raw	4.5	29.5	47.1	48.6	-	-	-	12.4	32.1	30.9	53.6	-	-	-
ImageBind-LLM	N/A	0.3	20.0	2.1	34.0	-	-	-	1.5	22.7	1.1	45.8	-	-	-
Video-LLaMA2-Instruct 13B	N/A	0.1	7.9	0.0	47.2	-	-	-	0.5	18.2	0.0	39.9	-	-	-
Video-LLaMA2-Instruct 13B	Raw	0.1	7.9	0.0	47.1	-	-	-	0.5	18.2	0.0	40.0	-	-	-
Video-LLaMA2-Instruct 7B	N/A	0.1	10.8	0.0	43.6	-	-	-	0.5	19.1	0.0	43.9	-	-	-
Video-LLaMA2-Instruct 7B	Raw	0.1	10.8	0.0	43.6	-	-	-	0.5	19.1	0.1	43.9	-	-	-
VideoChatGPT	N/A	0.4	19.9	2.0	40.5	-	-	-	2.9	24.4	5.8	46.7	-	-	-
VideoCoCa	N/A	0.2	13.2	2.3	17.6	32%	50%	58%	0.9	16.4	3.3	23.9	25%	41%	48%
VideoCoCa	ASR	0.8	20.3	9.2	21.9	36%	53%	59%	2.0	21.6	5.5	22.9	27%	42%	48%
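The R@1, R@5 and R@10 columns are recall-at-K scores for text-to-video retrieval. The sketch below shows one common way to compute them from a query-by-video similarity matrix. It is an illustration, not the DeVAn evaluation script. The names `similarity` and `target_idx` are hypothetical, and the ranking rule (a video outranks the target only if its score is strictly higher) is an assumption.

```python
import numpy as np


def recall_at_k(similarity, target_idx, ks=(1, 5, 10)):
    """Fraction of text queries whose target video ranks in the top K.

    similarity: (num_queries, num_videos) array of query-video scores.
    target_idx: for each query, the column index of its ground-truth video.
    """
    sim = np.asarray(similarity, dtype=float)
    target_idx = np.asarray(target_idx)
    # Score each query assigns to its own ground-truth video.
    target_scores = sim[np.arange(sim.shape[0]), target_idx]
    # Zero-based rank: number of videos scoring strictly higher than the target.
    ranks = (sim > target_scores[:, None]).sum(axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}
```

Passing `target_idx` separately, rather than assuming a square matrix, allows several captions or summary excerpts to point at the same video, as happens when each clip has 5 annotations.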
