Video-RAG

Visually-aligned Retrieval-Augmented
Long Video Comprehension

Yongdong Luo¹, Xiawu Zheng¹, Xiao Yang¹, Guilin Li¹, Haojia Lin¹,
Jinfa Huang², Jiayi Ji¹, Fei Chao¹, Jiebo Luo², Rongrong Ji¹
¹Xiamen University, ²University of Rochester

Video Demo

Introduction

Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to their limited context length. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs requires extensive high-quality data and substantial GPU resources, while GPT-based agents rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio transcription, optical character recognition, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight, with low computing overhead thanks to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our method surpasses proprietary models such as Gemini-1.5-Pro and GPT-4o when paired with a 72B model.
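
As a rough illustration of the auxiliary-text extraction step, the sketch below uses Whisper for ASR and EasyOCR for on-screen text. These are merely examples of the kind of open-source tools the pipeline relies on, not necessarily the exact ones used by Video-RAG, and the frame handling and confidence threshold are assumptions.

```python
# Illustrative extraction of visually-aligned auxiliary texts.
# NOTE: Whisper and EasyOCR are example open-source tools, not necessarily the
# exact ones used by Video-RAG; the confidence threshold is an assumption.

import whisper   # pip install openai-whisper
import easyocr   # pip install easyocr


def extract_asr(video_path: str) -> list[str]:
    """Transcribe the audio track into timestamped text snippets."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    return [
        f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text'].strip()}"
        for seg in result["segments"]
    ]


def extract_ocr(frames) -> list[str]:
    """Read on-screen text from sampled video frames (RGB NumPy arrays)."""
    reader = easyocr.Reader(["en"])
    texts = []
    for frame in frames:
        # readtext returns (bounding_box, text, confidence) tuples.
        texts += [text for _, text, conf in reader.readtext(frame) if conf > 0.5]
    return texts
```

The extracted snippets are then retrieved against the query and passed to the LVLM as plain text, which is what makes the approach plug-and-play.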

Experimental results on Video-MME

By default, this leaderboard is sorted by results with Video-RAG.

The results are continuously updated!

Each cell reports accuracy without subtitles / with subtitles / with Video-RAG (Ours).

| Model | Organization | LLM Params | Frames | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | Google | - | 1/0.5 fps | 75.0 / 81.3 / - | 81.7 / 84.5 / - | 74.3 / 81.0 / - | 67.4 / 77.4 / - |
| LLaVA-Video | Bytedance & NTU S-Lab | 72B | 64 | 68.6 / 75.9 / 77.4 | 78.8 / 81.8 / 82.8 | 68.5 / 73.8 / 76.3 | 58.7 / 72.2 / 73.1 |
| GPT-4o | OpenAI | - | 384 | 71.9 / 77.2 / - | 80.0 / 82.8 / - | 70.3 / 76.6 / - | 65.3 / 72.1 / - |
| Qwen2-VL | Alibaba | 72B | 32 | 64.9 / 71.9 / 72.9 | 75.0 / 76.7 / 77.4 | 63.3 / 69.9 / 70.2 | 56.3 / 69.2 / 71.0 |
| Long-LLaVA | Amazon | 7B | 64 | 52.9 / 57.1 / 62.6 | 61.9 / 66.2 / 67.1 | 51.4 / 54.7 / 60.4 | 45.4 / 50.3 / 60.1 |
| LongVA | NTU S-Lab | 7B | 128 | 52.6 / 54.3 / 60.4 | 61.1 / 61.6 / 66.2 | 50.4 / 53.6 / 58.1 | 46.2 / 47.6 / 56.8 |
| LLaVA-NeXT-Video | NTU | 7B | 16 | 43.0 / 47.7 / 50.0 | 49.4 / 51.8 / 56.6 | 43.0 / 46.4 / 47.4 | 36.7 / 44.9 / 46.0 |
| Video-LLaVA | PKU | 7B | 8 | 39.9 / 41.6 / 45.0 | 45.3 / 46.1 / 49.5 | 38.0 / 40.7 / 43.0 | 36.2 / 38.1 / 42.5 |

Video-RAG

Highlights

Performance comparison of Video-RAG (applied to LLaVA-Video-72B) against Gemini-1.5-Pro and GPT-4o across various benchmarks, including the Video-MME sub-tasks (showing only those on which Video-RAG outperforms Gemini-1.5-Pro), LongVideoBench, and MLVU.

Examples

We apply Video-RAG to LLaVA-Video for visualization.

Framework of Video-RAG

In the query decoupling phase, the LVLM is prompted to generate a retrieval request for auxiliary texts. Next, in the auxiliary text generation and retrieval phase, the video is processed in parallel to extract three types of textual information (OCR, ASR, and object detection), and the text relevant to the request is retrieved as auxiliary text. Finally, in the integration and generation phase, the auxiliary texts are combined with the query and the video to generate the response.
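
A minimal sketch of these three phases is given below, under several assumptions: `lvlm` stands for any LVLM wrapper exposing a hypothetical `generate(frames, prompt)` method, `extractors` are callables wrapping the external tools (ASR, OCR, object detection), and a naive keyword-overlap scorer stands in for the actual retriever.

```python
# Sketch of the three-phase pipeline. `lvlm.generate(frames, prompt)` is a
# hypothetical interface; `extractors` wrap external tools (ASR, OCR, detection);
# word-overlap ranking is a stand-in for the actual retriever.

from typing import Callable, List


def decouple_query(lvlm, frames, query: str) -> str:
    # Phase 1: ask the LVLM what auxiliary information would help answer the query.
    prompt = (
        f"Question about the video: {query}\n"
        "Describe the speech, on-screen text, or objects needed to answer it."
    )
    return lvlm.generate(frames=frames, prompt=prompt)


def retrieve(request: str, candidates: List[str], top_k: int = 5) -> List[str]:
    # Phase 2 (retrieval): rank candidate texts by word overlap with the request.
    req = set(request.lower().split())
    ranked = sorted(candidates, key=lambda c: len(req & set(c.lower().split())), reverse=True)
    return ranked[:top_k]


def video_rag(lvlm, frames, query: str, extractors: List[Callable[[], List[str]]]) -> str:
    request = decouple_query(lvlm, frames, query)
    # Phase 2 (generation): each extractor can run over the video in parallel.
    candidates = [text for extract in extractors for text in extract()]
    aux_texts = retrieve(request, candidates)
    # Phase 3: integrate the retrieved auxiliary texts with the query and frames.
    prompt = "Auxiliary information:\n" + "\n".join(aux_texts) + f"\n\nQuestion: {query}"
    return lvlm.generate(frames=frames, prompt=prompt)
```

Because retrieval happens in a single turn, the added cost is a few tool calls plus one retrieval pass, which is what keeps the pipeline lightweight.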

Visually-aligned Visualization

Grad-CAM visualizations of the last hidden state heatmap, along with t-SNE visualizations of the user's query and keyframe features, for the first example shown in the Examples section. As demonstrated, the retrieved auxiliary texts aid cross-modality alignment by helping the model attend more to query-relevant keyframes, leading to more robust and accurate answers to the user's query.
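
For readers who want to reproduce the t-SNE part of this analysis, a minimal sketch follows. It assumes the query embedding and per-keyframe embeddings have already been extracted (e.g., from the LVLM's last hidden state) as NumPy arrays, and it omits the Grad-CAM heatmaps, which require hooking the model's gradients.

```python
# Sketch of the t-SNE visualization of query vs. keyframe features.
# Assumes embeddings are already extracted; Grad-CAM heatmaps are omitted.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_query_frame_tsne(query_feat: np.ndarray, frame_feats: np.ndarray) -> None:
    """Project one query feature (d,) and N keyframe features (N, d) into 2D."""
    feats = np.vstack([query_feat[None, :], frame_feats])
    # Perplexity must be smaller than the number of points.
    tsne = TSNE(n_components=2, perplexity=min(30, len(feats) - 1), random_state=0)
    coords = tsne.fit_transform(feats)
    plt.scatter(coords[1:, 0], coords[1:, 1], label="keyframes")
    plt.scatter(coords[0, 0], coords[0, 1], marker="*", s=200, label="query")
    plt.legend()
    plt.show()
```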

Citation

@misc{luo2024videoragvisuallyalignedretrievalaugmentedlong,
    title={Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension}, 
    author={Yongdong Luo and Xiawu Zheng and Xiao Yang and Guilin Li and Haojia Lin and Jinfa Huang and Jiayi Ji and Fei Chao and Jiebo Luo and Rongrong Ji},
    year={2024},
    eprint={2411.13093},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.13093}, 
  }