---
license: mit
pipeline_tag: video-text-to-text
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
language:
---
# InternVideo2-Chat-8B

[📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]
To further enrich the semantics embedded in InternVideo2 and improve its user-friendliness in human-computer interaction, we build a VideoLLM by fine-tuning InternVideo2 together with a large language model (LLM) and a video BLIP. We adopt the progressive learning scheme of VideoChat, using InternVideo2 as the video encoder and training a video BLIP module to communicate with an open-sourced LLM; the video encoder keeps being updated during training. See VideoChat for the detailed training recipe.
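As a rough illustration of the composition described above (video encoder → video-BLIP-style bridge → LLM), here is a minimal sketch; all class and argument names are hypothetical placeholders, not the actual InternVideo2-Chat implementation.

```python
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    """Hypothetical sketch: InternVideo2-style encoder + BLIP-style bridge + LLM."""

    def __init__(self, video_encoder: nn.Module, video_blip: nn.Module, llm: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder  # InternVideo2 backbone, kept updated during training
        self.video_blip = video_blip        # bridge that maps vision tokens into the LLM embedding space
        self.llm = llm                      # open-sourced LLM (Mistral-7B for this checkpoint)

    def forward(self, video: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.video_encoder(video)      # (B, N, D_vision)
        llm_tokens = self.video_blip(vision_tokens)    # (B, M, D_llm)
        # Prepend the projected video tokens to the text embeddings and let the LLM decode.
        return self.llm(torch.cat([llm_tokens, text_embeds], dim=1))
```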
The base LLM of this model is Mistral-7B. Before using it, please make sure you have been granted access to Mistral-7B; if not, go to Mistral-7B to request access, and then add your HF_TOKEN to your environment variables.
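If you prefer to authenticate from Python instead of exporting the token in your shell (see the steps below), a minimal sketch using `huggingface_hub` looks like this; it assumes `HF_TOKEN` is already set in your environment:

```python
import os
from huggingface_hub import login

# Log in programmatically so that the gated checkpoints can be downloaded.
# Assumes the HF_TOKEN environment variable holds a valid "hf_..." token.
login(token=os.environ["HF_TOKEN"])
```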
## 📈 Performance

## 🚀 How to use the model
- Apply for access to this project and to the base LLM.
- Put your HF user access token into the environment variable:

```shell
export HF_TOKEN=hf_....
```

If you do not know how to obtain a token starting with "hf_", please refer to: How to get an HF user access token.
- Make sure you have installed transformers >= 4.39.0 and peft == 0.5.0:

```shell
pip install transformers==4.39.1
pip install peft==0.5.0
pip install timm easydict einops
```

Install the other necessary Python packages from pip_requirements.
- Inference with a video input:
```python
import os

import numpy as np
import decord
import torch
from decord import VideoReader, cpu
from torchvision import transforms
from transformers import AutoTokenizer, AutoModel

decord.bridge.set_bridge("torch")

token = os.environ['HF_TOKEN']

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    trust_remote_code=True,
    use_fast=False)
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()


def get_index(num_frames, num_segments):
    # Uniformly sample `num_segments` frame indices, one from the middle of each segment.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    # Decode the video, sample frames uniformly, and normalize them for the model.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
    frames = transform(frames)

    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames


video_path = "yoga.mp4"
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

chat_history = []
response, chat_history = model.chat(tokenizer, '', 'Describe the action step by step.', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)

response, chat_history = model.chat(tokenizer, '', 'What is she wearing?', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
```
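Beyond the `do_sample` flag shown above, you may want sampling-based decoding. The extra keys in the sketch below (`temperature`, `max_new_tokens`) are standard Hugging Face generation arguments, but whether this model's `chat` implementation forwards them is an assumption, so treat this as a starting point rather than a documented interface:

```python
# Hypothetical variant: sampling-based decoding. Only do_sample appears in the example
# above; temperature and max_new_tokens are assumed to be forwarded to generation.
response, chat_history = model.chat(
    tokenizer, '', 'Summarize the video in one sentence.',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': True, 'temperature': 0.7, 'max_new_tokens': 128})
print(response)
```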
## ✏️ Citation

If this work is helpful for your research, please consider citing InternVideo and VideoChat.
```bibtex
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```