CogVLM2-Video Open-Source Video Understanding Model - Video Understanding Within One Minute, Strong Results on Question Answering Tasks

Cogvlm2 Video Llama3 Chat

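Developed by THUDM

CogVLM2-Video is a high-performance video understanding model that achieves state-of-the-art results on multiple video question answering tasks and can achieve video understanding within one minute.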

Video-to-Text

Transformers

Language: English | License: Other | #VideoQuestionAnswering #MultimodalUnderstanding #TemporalGrounding

Downloads: 2,384

Release date: 7/3/2024

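Model Overview

The model focuses on video understanding tasks. It offers strong temporal grounding and event analysis capabilities and supports in-depth question answering and analysis of video content.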

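Model Features

Efficient video understanding

Completes video content understanding within one minute, with high processing efficiency

Precise temporal grounding

Accurately locates the time at which a specific event occurs in a video

Strong multi-task performance

Performs well on multiple benchmarks, including MVBench and VideoChatGPT-Bench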

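Model Capabilities

Video content analysis

Temporal understanding of events

Object motion trajectory tracking

Human action recognition

Video question answering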

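Use Cases

Video content analysis

Sports event analysis

Analyze key actions and scoring moments in basketball game footage

Identifies key actions such as shots and passes, together with their timestamps

Wildlife behavior research

Analyze behavior patterns in wildlife footage

Identifies specific animal behaviors and when they occur

Intelligent surveillance

Anomalous event detection

Recognize abnormal behavior in surveillance footage

Detects abnormal behavior and locates when it occurs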

🚀 CogVLM2-Video-Llama3-Chat

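CogVLM2-Video-Llama3-Chat performs strongly on multiple video question answering tasks and can achieve video understanding within one minute. The project provides example videos that demonstrate its video understanding and video temporal grounding capabilities.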

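🚀 Quick Start

This repository provides the chat version of the model, which supports single-round chat. See our GitHub repository to install the Python package dependencies and run model inference.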

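✨ Key Features

CogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks.
It can achieve video understanding within one minute.
Example videos are provided to demonstrate video understanding and video temporal grounding.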

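📊 Benchmarks

Performance Chart

The chart shows the performance of CogVLM2-Video on MVBench, VideoChatGPT-Bench, and the zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA). VCG-* refers to VideoChatGPT-Bench, ZS-* refers to the zero-shot VideoQA datasets, and MV-* refers to the main categories in MVBench.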

Quantitative Evaluation

Performance on VideoChatGPT-Bench and Zero-Shot VideoQA Datasets

Model	VCG-AVG	VCG-CI	VCG-DO	VCG-CU	VCG-TU	VCG-CO	ZS-AVG
IG-VLM GPT4V	3.17	3.40	2.80	3.61	2.89	3.13	65.70
ST-LLM	3.15	3.23	3.05	3.74	2.93	2.81	62.90
ShareGPT4Video	N/A	N/A	N/A	N/A	N/A	N/A	46.50
VideoGPT+	3.28	3.27	3.18	3.74	2.83	3.39	61.20
VideoChat2_HD_mistral	3.10	3.40	2.91	3.72	2.65	2.84	57.70
PLLaVA-34B	3.32	3.60	3.20	3.90	2.67	3.25	68.10
CogVLM2-Video	3.41	3.49	3.46	3.87	2.98	3.23	66.60
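
The VCG-AVG column appears to be the unweighted mean of the five VideoChatGPT-Bench sub-scores (CI, DO, CU, TU, CO). That relationship is not stated in the source, so the sketch below treats it as an assumption and checks it against the numbers in the table. The names `VCG_ROWS` and `check_vcg_averages` are illustrative only.

```python
# Sub-scores (CI, DO, CU, TU, CO) and the reported VCG-AVG, copied from the table above.
# ShareGPT4Video is omitted because its VCG scores are not available.
VCG_ROWS = {
    "IG-VLM GPT4V": ([3.40, 2.80, 3.61, 2.89, 3.13], 3.17),
    "ST-LLM": ([3.23, 3.05, 3.74, 2.93, 2.81], 3.15),
    "VideoGPT+": ([3.27, 3.18, 3.74, 2.83, 3.39], 3.28),
    "VideoChat2_HD_mistral": ([3.40, 2.91, 3.72, 2.65, 2.84], 3.10),
    "PLLaVA-34B": ([3.60, 3.20, 3.90, 2.67, 3.25], 3.32),
    "CogVLM2-Video": ([3.49, 3.46, 3.87, 2.98, 3.23], 3.41),
}


def check_vcg_averages(rows, tolerance=0.005):
    """Recompute each model's mean sub-score and compare it with the reported average.

    Assumption: VCG-AVG is the unweighted mean of the five sub-scores,
    rounded to two decimals (hence the 0.005 tolerance).
    """
    recomputed = {}
    for model, (scores, reported) in rows.items():
        value = sum(scores) / len(scores)
        # Small epsilon guards against floating-point noise at the tolerance edge.
        if abs(value - reported) > tolerance + 1e-9:
            raise ValueError(f"{model}: recomputed {value:.3f}, reported {reported}")
        recomputed[model] = value
    return recomputed


recomputed = check_vcg_averages(VCG_ROWS)
best = max(recomputed, key=recomputed.get)
print(best, round(recomputed[best], 2))
```

Every row passes the check, which is consistent with the assumption, and CogVLM2-Video has the highest recomputed VCG average among the listed models.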

Performance on the MVBench Dataset

Model	AVG	AA	AC	AL	AP	AS	CO	CI	EN	ER	FA	FP	MA	MC	MD	OE	OI	OS	ST	SC	UA
IG-VLM GPT4V	43.7	72.0	39.0	40.5	63.5	55.5	52.0	11.0	31.0	59.0	46.5	47.5	22.5	12.0	12.0	18.5	59.0	29.5	83.5	45.0	73.5
ST-LLM	54.9	84.0	36.5	31.0	53.5	66.0	46.5	58.5	34.5	41.5	44.0	44.5	78.5	56.5	42.5	80.5	73.5	38.5	86.5	43.0	58.5
ShareGPT4Video	51.2	79.5	35.5	41.5	39.5	49.5	46.5	51.5	28.5	39.0	40.0	25.5	75.0	62.5	50.5	82.5	54.5	32.5	84.5	51.0	54.5
VideoGPT+	58.7	83.0	39.5	34.0	60.0	69.0	50.0	60.0	29.5	44.0	48.5	53.0	90.5	71.0	44.0	85.5	75.5	36.0	89.5	45.0	66.5
VideoChat2_HD_mistral	62.3	79.5	60.0	87.5	50.0	68.5	93.5	71.5	36.5	45.0	49.5	87.0	40.0	76.0	92.0	53.0	62.0	45.5	36.0	44.0	69.5
PLLaVA-34B	58.1	82.0	40.5	49.5	53.0	67.5	66.5	59.0	39.5	63.5	47.0	50.0	70.0	43.0	37.5	68.5	67.5	36.5	91.0	51.5	79.0
CogVLM2-Video	62.3	85.5	41.5	31.5	65.5	79.5	58.5	77.0	28.5	42.5	54.0	57.0	91.5	73.0	48.0	91.0	78.0	36.0	91.5	47.0	68.5
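
CogVLM2-Video and VideoChat2_HD_mistral report the same MVBench average (62.3), so the per-category columns are where the two differ. The following sketch compares the two rows category by category. It assumes the AVG column is the unweighted mean over the 20 categories, which the source does not state; the helper names are illustrative.

```python
CATEGORIES = "AA AC AL AP AS CO CI EN ER FA FP MA MC MD OE OI OS ST SC UA".split()

# Per-category scores copied from the MVBench table above.
COGVLM2_VIDEO = [85.5, 41.5, 31.5, 65.5, 79.5, 58.5, 77.0, 28.5, 42.5, 54.0,
                 57.0, 91.5, 73.0, 48.0, 91.0, 78.0, 36.0, 91.5, 47.0, 68.5]
VIDEOCHAT2_HD = [79.5, 60.0, 87.5, 50.0, 68.5, 93.5, 71.5, 36.5, 45.0, 49.5,
                 87.0, 40.0, 76.0, 92.0, 53.0, 62.0, 45.5, 36.0, 44.0, 69.5]


def mean(scores):
    """Unweighted mean (assumed to be how the AVG column is computed)."""
    return sum(scores) / len(scores)


def categories_led_by(first, second, names):
    """Names of the categories where `first` scores strictly higher than `second`."""
    return [name for name, a, b in zip(names, first, second) if a > b]


cog_leads = categories_led_by(COGVLM2_VIDEO, VIDEOCHAT2_HD, CATEGORIES)
vc2_leads = categories_led_by(VIDEOCHAT2_HD, COGVLM2_VIDEO, CATEGORIES)

print(f"CogVLM2-Video mean {mean(COGVLM2_VIDEO):.3f}, leads in {cog_leads}")
print(f"VideoChat2_HD_mistral mean {mean(VIDEOCHAT2_HD):.3f}, leads in {vc2_leads}")
```

On these numbers each model leads in 10 of the 20 categories, so the tied average hides quite different category profiles.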

💻 Usage Examples

Basic Usage

# In each template, `prompt` on the right-hand side is the raw benchmark question; apply only the template for the benchmark being evaluated.
# For MVBench
prompt = "Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, select the best option that accurately addresses the question.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Short Answer:"
# For VideoChatGPT-Bench
prompt = "Carefully watch the video and pay attention to the cause and sequence of events, the detail and movement of objects, and the action and pose of persons. Based on your observations, comprehensively answer the following question. Your answer should be long and cover all the related aspects\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
# For zero-shot VideoQA
prompt = "The input consists of a sequence of key frames from a video. Answer the question comprehensively including all the possible verbs and nouns that can discribe the events, followed by significant events, characters, or objects that appear throughout the frames.\n " + f"{prompt.replace('Short Answer.', '')}\n" + "Answer:"
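
The three templates above share one shape: an instruction, the question with any trailing `Short Answer.` marker removed, and an answer cue. A minimal sketch of a helper that selects the template by benchmark is shown below. The function name `build_prompt` and the keys `mvbench`, `videochatgpt_bench`, and `zero_shot_qa` are hypothetical and not part of the model's API; the prompt strings are copied verbatim from the templates above, including the original spelling.

```python
# Shared opening used by the MVBench and VideoChatGPT-Bench templates.
_WATCH = (
    "Carefully watch the video and pay attention to the cause and sequence of "
    "events, the detail and movement of objects, and the action and pose of persons. "
)

# benchmark key (hypothetical) -> (instruction, answer cue)
TEMPLATES = {
    "mvbench": (
        _WATCH + "Based on your observations, select the best option that "
        "accurately addresses the question.\n ",
        "Short Answer:",
    ),
    "videochatgpt_bench": (
        _WATCH + "Based on your observations, comprehensively answer the following "
        "question. Your answer should be long and cover all the related aspects\n ",
        "Answer:",
    ),
    "zero_shot_qa": (
        "The input consists of a sequence of key frames from a video. Answer the "
        "question comprehensively including all the possible verbs and nouns that "
        "can discribe the events, followed by significant events, characters, or "
        "objects that appear throughout the frames.\n ",
        "Answer:",
    ),
}


def build_prompt(question: str, benchmark: str) -> str:
    """Wrap a raw benchmark question in the template for the given benchmark."""
    if benchmark not in TEMPLATES:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    instruction, answer_cue = TEMPLATES[benchmark]
    # Same cleanup as the original snippets: drop the 'Short Answer.' marker.
    cleaned = question.replace("Short Answer.", "")
    return instruction + f"{cleaned}\n" + answer_cue


example = build_prompt("What is the person doing? Short Answer.", "mvbench")
print(example)
```

Keeping the question separate from the template avoids the self-referencing `prompt = ... prompt ...` pattern, where running two templates in sequence would wrap an already wrapped prompt.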