VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B: Open-Source Multimodal Model

Developed by OpenGVLab

A multimodal video-text model built on InternVideo2-1B and Qwen2.5-7B. It uses only 16 tokens per frame and supports input sequences of up to about 10,000 frames.

Video-text-to-text

Transformers

Language: English. License: Apache-2.0. Tags: #long-video-understanding #efficient-video-tokens #multimodal-QA

Downloads: 193

Release date: February 19, 2025

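Model Overview

This is an efficient multimodal video-text model focused on video understanding and text generation. It is particularly suited to analyzing long videos.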

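Model Features

Efficient video processing

Uses only 16 tokens per frame, which significantly reduces compute requirements.

Very long context

Extends the context window to 128k with YaRN, supporting about 10,000 input frames.

Multimodal understanding

Combines vision and language models for in-depth understanding of video content.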

Model Capabilities

Video content understanding

Long-video analysis

Multimodal reasoning

Video question answering

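Use Cases

Video content analysis

Long-video summarization

Extracts key information from, and summarizes, videos that run for several hours.

Reaches 64.5% accuracy on LongVideoBench.

Video question answering

Answers complex questions about video content.

Reaches 73.4% accuracy on the MLVU dataset.

Multimodal understanding

Video scene understanding

Recognizes and analyzes scenes, actions, and objects in videos.

Reaches 76.3% accuracy on Perception Test.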

🚀 🦜VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B⚡

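VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B is built on InternVideo2-1B and Qwen2.5-7B and uses only 16 tokens per frame. By using YaRN to extend the context window to 128k (Qwen2's native context window is 32k), our model supports input sequences of up to about 10,000 frames.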
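The relationship between tokens per frame and context length can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes "32k" and "128k" mean 32,768 and 131,072 tokens, that every frame costs exactly 16 tokens, and that no further compression is applied. The function names are illustrative and not part of the model's API. Under these assumptions the extended window holds 8,192 frames, the same order of magnitude as the stated figure of about 10,000; the authors' exact accounting is not given here and may differ.

```python
# Back-of-the-envelope token budget. TOKENS_PER_FRAME and the 32k/128k
# window sizes come from the text; reading "k" as 1024 is an assumption.
TOKENS_PER_FRAME = 16
NATIVE_CONTEXT = 32 * 1024      # "32k" read as 32,768 tokens
EXTENDED_CONTEXT = 128 * 1024   # "128k" read as 131,072 tokens


def max_frames(context_tokens: int, text_tokens: int = 0) -> int:
    """Frames that fit after reserving `text_tokens` for prompt and answer."""
    if text_tokens > context_tokens:
        raise ValueError("text_tokens exceeds the context window")
    return (context_tokens - text_tokens) // TOKENS_PER_FRAME


def visual_tokens(num_frames: int) -> int:
    """Visual tokens consumed by `num_frames` frames at 16 tokens each."""
    return num_frames * TOKENS_PER_FRAME


print(max_frames(NATIVE_CONTEXT))    # 2048
print(max_frames(EXTENDED_CONTEXT))  # 8192
print(visual_tokens(512))            # 8192
```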

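⚠️ Important Note

Because the training corpus is mainly English, the model has only basic Chinese comprehension. For best performance, interact with it in English.

🚀 Quick Start

Install dependencies

First, install the required modules, plus FlashAttention-2 if you want it (it is marked optional below). A simple installation example: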

pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
# optional
pip install flash-attn --no-build-isolation
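After installation, it can help to confirm that the packages are importable before loading the model. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the mapping simply pairs each pip name above with its import name (`opencv-python` imports as `cv2`, `flash-attn` as `flash_attn`).

```python
import importlib.util

# pip name -> import name for the packages installed above.
REQUIRED = {
    "transformers": "transformers",
    "av": "av",
    "imageio": "imageio",
    "decord": "decord",
    "opencv-python": "cv2",
}
OPTIONAL = {"flash-attn": "flash_attn"}


def missing_packages(packages: dict[str, str]) -> list[str]:
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```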

Using the model

from transformers import AutoModel, AutoTokenizer
import torch

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False # whether to use global compression
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
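The single-turn and multi-turn calls above differ only in whether `chat_history` is passed. For more than two questions, the pattern can be wrapped in a loop. The helper below is a hypothetical convenience wrapper, not part of the model's API; it uses only the `model.chat(...)` keyword arguments shown above and assumes the call keeps returning an `(output, chat_history)` pair when `return_history=True`.

```python
from typing import Any


def ask_all(model: Any, tokenizer: Any, video_path: str, questions: list[str],
            generation_config: dict, max_num_frames: int = 512) -> list[str]:
    """Ask `questions` in order, carrying chat_history from turn to turn.

    Hypothetical helper that only wraps the model.chat(...) call shown above.
    """
    answers = []
    chat_history = None
    for question in questions:
        kwargs = dict(
            video_path=video_path,
            tokenizer=tokenizer,
            user_prompt=question,
            return_history=True,
            max_num_frames=max_num_frames,
            generation_config=generation_config,
        )
        # The first turn omits chat_history, as in the single-turn example.
        if chat_history is not None:
            kwargs["chat_history"] = chat_history
        output, chat_history = model.chat(**kwargs)
        answers.append(output)
    return answers
```

With the objects defined above, a call would look like `ask_all(model, tokenizer, video_path, [question1, question2], generation_config)`.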

✨ Key Features

Built on InternVideo2-1B and Qwen2.5-7B, using only 16 tokens per frame.
Uses YaRN to extend the context window to 128k, supporting input sequences of up to about 10,000 frames.
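Going from a 32k native window to 128k corresponds to a YaRN scale factor of 4. The sketch below computes that factor and builds a `rope_scaling` dictionary in the shape the Qwen2.5 model cards document for enabling YaRN. Whether this checkpoint's own configuration uses exactly these keys and values is an assumption, and the function name is illustrative.

```python
NATIVE_CONTEXT = 32_768    # Qwen2 native window ("32k")
TARGET_CONTEXT = 131_072   # extended window ("128k")


def yarn_rope_scaling(native: int, target: int) -> dict:
    """Build a YaRN rope_scaling dict; key names follow the Qwen2.5 docs."""
    if target < native:
        raise ValueError("target context must be at least the native context")
    return {
        "type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


print(yarn_rope_scaling(NATIVE_CONTEXT, TARGET_CONTEXT))
```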

📈 性能表现

Model	MVBench	LongVideoBench	VideoMME (w/o subtitles)	Max input frames
VideoChat-Flash-Qwen2_5-2B@448	70.0	58.3	57.0	10000
VideoChat-Flash-Qwen2-7B@224	73.2	64.2	64.0	10000
VideoChat-Flash-Qwen2_5-7B-1M@224	73.4	66.5	63.5	50000
VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224	74.3	64.5	65.1	10000
VideoChat-Flash-Qwen2-7B@448	74.0	64.7	65.3	10000
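To compare the variants programmatically, the scores can be loaded into a small dictionary. This is only an illustrative sketch: the numbers are copied from the table above, and the helper name `best_per_benchmark` is hypothetical.

```python
# Scores copied from the table: (MVBench, LongVideoBench, VideoMME w/o subtitles).
BENCHMARKS = ("MVBench", "LongVideoBench", "VideoMME")
SCORES = {
    "VideoChat-Flash-Qwen2_5-2B@448": (70.0, 58.3, 57.0),
    "VideoChat-Flash-Qwen2-7B@224": (73.2, 64.2, 64.0),
    "VideoChat-Flash-Qwen2_5-7B-1M@224": (73.4, 66.5, 63.5),
    "VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224": (74.3, 64.5, 65.1),
    "VideoChat-Flash-Qwen2-7B@448": (74.0, 64.7, 65.3),
}


def best_per_benchmark(scores: dict) -> dict:
    """Map each benchmark name to the model with the highest score on it."""
    return {
        name: max(scores, key=lambda model: scores[model][column])
        for column, name in enumerate(BENCHMARKS)
    }


for benchmark, model in best_per_benchmark(SCORES).items():
    print(benchmark, "->", model)
```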

📄 License

This project is released under the Apache-2.0 license.

✏️ Citation

@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}