Video-RAG

Visually-aligned Retrieval-Augmented
Long Video Comprehension

Yongdong Luo¹, Xiawu Zheng¹, Xiao Yang¹, Guilin Li¹, Haojia Lin¹,
Jinfa Huang², Jiayi Ji¹, Fei Chao¹, Jiebo Luo², Rongrong Ji¹
¹Xiamen University, ²University of Rochester

Video Demo

Introduction

Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to their limited context length. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs requires extensive high-quality data and substantial GPU resources, while GPT-based agents rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio transcription, optical character recognition, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight, with low computing overhead thanks to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our method surpasses proprietary models such as Gemini-1.5-Pro and GPT-4o when paired with a 72B model.
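
As a rough illustration of the auxiliary-text extraction step, the sketch below uses Whisper for ASR and EasyOCR for on-screen text. These are merely examples of the kind of open-source tools the pipeline relies on, not necessarily the exact ones used by Video-RAG, and the frame handling and confidence threshold are assumptions.

```python
# Illustrative extraction of visually-aligned auxiliary texts.
# NOTE: Whisper and EasyOCR are example open-source tools, not necessarily the
# exact ones used by Video-RAG; the confidence threshold is an assumption.

import whisper   # pip install openai-whisper
import easyocr   # pip install easyocr


def extract_asr(video_path: str) -> list[str]:
    """Transcribe the audio track into timestamped text snippets."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    return [
        f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text'].strip()}"
        for seg in result["segments"]
    ]


def extract_ocr(frames) -> list[str]:
    """Read on-screen text from sampled video frames (RGB NumPy arrays)."""
    reader = easyocr.Reader(["en"])
    texts = []
    for frame in frames:
        # readtext returns (bounding_box, text, confidence) tuples.
        texts += [text for _, text, conf in reader.readtext(frame) if conf > 0.5]
    return texts
```

The extracted snippets are then retrieved against the query and passed to the LVLM as plain text, which is what makes the approach plug-and-play.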

Experimental results on Video-MME

By default, this leaderboard is sorted by results with Video-RAG.

The results are continuously updated!

Each cell reports accuracy without subtitles / with subtitles / with Video-RAG (Ours).

| Model | Organization | LLM Params | Frames | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | Google | - | 1/0.5 fps | 75.0 / 81.3 / - | 81.7 / 84.5 / - | 74.3 / 81.0 / - | 67.4 / 77.4 / - |
| LLaVA-Video | Bytedance & NTU S-Lab | 72B | 64 | 68.6 / 75.9 / 77.4 | 78.8 / 81.8 / 82.8 | 68.5 / 73.8 / 76.3 | 58.7 / 72.2 / 73.1 |
| GPT-4o | OpenAI | - | 384 | 71.9 / 77.2 / - | 80.0 / 82.8 / - | 70.3 / 76.6 / - | 65.3 / 72.1 / - |
| Qwen2-VL | Alibaba | 72B | 32 | 64.9 / 71.9 / 72.9 | 75.0 / 76.7 / 77.4 | 63.3 / 69.9 / 70.2 | 56.3 / 69.2 / 71.0 |
| Long-LLaVA | Amazon | 7B | 64 | 52.9 / 57.1 / 62.6 | 61.9 / 66.2 / 67.1 | 51.4 / 54.7 / 60.4 | 45.4 / 50.3 / 60.1 |
| LongVA | NTU S-Lab | 7B | 128 | 52.6 / 54.3 / 60.4 | 61.1 / 61.6 / 66.2 | 50.4 / 53.6 / 58.1 | 46.2 / 47.6 / 56.8 |
| LLaVA-NeXT-Video | NTU | 7B | 16 | 43.0 / 47.7 / 50.0 | 49.4 / 51.8 / 56.6 | 43.0 / 46.4 / 47.4 | 36.7 / 44.9 / 46.0 |
| Video-LLaVA | PKU | 7B | 8 | 39.9 / 41.6 / 45.0 | 45.3 / 46.1 / 49.5 | 38.0 / 40.7 / 43.0 | 36.2 / 38.1 / 42.5 |

Video-RAG

Highlights

Performance comparison of Video-RAG (applied to LLaVA-Video-72B) against Gemini-1.5-Pro and GPT-4o across various benchmarks, including the Video-MME sub-tasks (showing only those on which Video-RAG outperforms Gemini-1.5-Pro), LongVideoBench, and MLVU.

Examples

We apply Video-RAG to LLaVA-Video for visualization.

Framework of Video-RAG

In the query decoupling phase, the LVLM is prompted to generate a retrieval request for auxiliary texts. Next, in the auxiliary text generation and retrieval phase, the video is processed in parallel to extract three types of textual information (OCR, ASR, and object detection), and the text relevant to the request is retrieved as auxiliary text. Finally, in the integration and generation phase, the auxiliary texts are combined with the query and the video to generate the response.
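
A minimal sketch of these three phases is given below, under several assumptions: `lvlm` stands for any LVLM wrapper exposing a hypothetical `generate(frames, prompt)` method, `extractors` are callables wrapping the external tools (ASR, OCR, object detection), and a naive keyword-overlap scorer stands in for the actual retriever.

```python
# Sketch of the three-phase pipeline. `lvlm.generate(frames, prompt)` is a
# hypothetical interface; `extractors` wrap external tools (ASR, OCR, detection);
# word-overlap ranking is a stand-in for the actual retriever.

from typing import Callable, List


def decouple_query(lvlm, frames, query: str) -> str:
    # Phase 1: ask the LVLM what auxiliary information would help answer the query.
    prompt = (
        f"Question about the video: {query}\n"
        "Describe the speech, on-screen text, or objects needed to answer it."
    )
    return lvlm.generate(frames=frames, prompt=prompt)


def retrieve(request: str, candidates: List[str], top_k: int = 5) -> List[str]:
    # Phase 2 (retrieval): rank candidate texts by word overlap with the request.
    req = set(request.lower().split())
    ranked = sorted(candidates, key=lambda c: len(req & set(c.lower().split())), reverse=True)
    return ranked[:top_k]


def video_rag(lvlm, frames, query: str, extractors: List[Callable[[], List[str]]]) -> str:
    request = decouple_query(lvlm, frames, query)
    # Phase 2 (generation): each extractor can run over the video in parallel.
    candidates = [text for extract in extractors for text in extract()]
    aux_texts = retrieve(request, candidates)
    # Phase 3: integrate the retrieved auxiliary texts with the query and frames.
    prompt = "Auxiliary information:\n" + "\n".join(aux_texts) + f"\n\nQuestion: {query}"
    return lvlm.generate(frames=frames, prompt=prompt)
```

Because retrieval happens in a single turn, the added cost is a few tool calls plus one retrieval pass, which is what keeps the pipeline lightweight.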

Visually-aligned Visualization

Grad-CAM visualizations of the last hidden state heatmap, along with t-SNE visualizations of the user's query and keyframe features, for the first example shown in the Examples section. As demonstrated, the retrieved auxiliary texts aid cross-modality alignment by helping the model attend more to query-relevant keyframes, leading to more robust and accurate answers to the user's query.
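
For readers who want to reproduce the t-SNE part of this analysis, a minimal sketch follows. It assumes the query embedding and per-keyframe embeddings have already been extracted (e.g., from the LVLM's last hidden state) as NumPy arrays, and it omits the Grad-CAM heatmaps, which require hooking the model's gradients.

```python
# Sketch of the t-SNE visualization of query vs. keyframe features.
# Assumes embeddings are already extracted; Grad-CAM heatmaps are omitted.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_query_frame_tsne(query_feat: np.ndarray, frame_feats: np.ndarray) -> None:
    """Project one query feature (d,) and N keyframe features (N, d) into 2D."""
    feats = np.vstack([query_feat[None, :], frame_feats])
    # Perplexity must be smaller than the number of points.
    tsne = TSNE(n_components=2, perplexity=min(30, len(feats) - 1), random_state=0)
    coords = tsne.fit_transform(feats)
    plt.scatter(coords[1:, 0], coords[1:, 1], label="keyframes")
    plt.scatter(coords[0, 0], coords[0, 1], marker="*", s=200, label="query")
    plt.legend()
    plt.show()
```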

Citation

@misc{luo2024videoragvisuallyalignedretrievalaugmentedlong,
    title={Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension}, 
    author={Yongdong Luo and Xiawu Zheng and Xiao Yang and Guilin Li and Haojia Lin and Jinfa Huang and Jiayi Ji and Fei Chao and Jiebo Luo and Rongrong Ji},
    year={2024},
    eprint={2411.13093},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.13093}, 
  }