---
license: mit
pipeline_tag: video-text-to-text
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
language:
---
# InternVideo2-Chat-8B

[📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]
To further enrich the semantics embedded in InternVideo2 and improve its user-friendliness in human-computer interaction, we build a VideoLLM by fine-tuning InternVideo2 together with a large language model (LLM) and a video BLIP. We adopt the progressive learning scheme of VideoChat, using InternVideo2 as the video encoder and training a video BLIP module to communicate with an open-sourced LLM; the video encoder keeps being updated during training. See VideoChat for the detailed training recipe.
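As a rough illustration of the composition described above (video encoder → video-BLIP-style bridge → LLM), here is a minimal sketch; all class and argument names are hypothetical placeholders, not the actual InternVideo2-Chat implementation.

```python
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    """Hypothetical sketch: InternVideo2-style encoder + BLIP-style bridge + LLM."""

    def __init__(self, video_encoder: nn.Module, video_blip: nn.Module, llm: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder  # InternVideo2 backbone, kept updated during training
        self.video_blip = video_blip        # bridge that maps vision tokens into the LLM embedding space
        self.llm = llm                      # open-sourced LLM (Mistral-7B for this checkpoint)

    def forward(self, video: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.video_encoder(video)      # (B, N, D_vision)
        llm_tokens = self.video_blip(vision_tokens)    # (B, M, D_llm)
        # Prepend the projected video tokens to the text embeddings and let the LLM decode.
        return self.llm(torch.cat([llm_tokens, text_embeds], dim=1))
```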
The base LLM of this model is Mistral-7B. Before using it, please make sure you have been granted access to Mistral-7B; if not, go to Mistral-7B to request access, and then add your HF_TOKEN to your environment variables.
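If you prefer to authenticate from Python instead of exporting the token in your shell (see the steps below), a minimal sketch using `huggingface_hub` looks like this; it assumes `HF_TOKEN` is already set in your environment:

```python
import os
from huggingface_hub import login

# Log in programmatically so that the gated checkpoints can be downloaded.
# Assumes the HF_TOKEN environment variable holds a valid "hf_..." token.
login(token=os.environ["HF_TOKEN"])
```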
## 📈 Performance

## 🚀 How to use the model
- Apply for access to this project and to the base LLM.
- Put your HF user access token into the environment variable:

```shell
export HF_TOKEN=hf_....
```

If you do not know how to obtain a token starting with "hf_", please refer to: How to get an HF user access token.
- Make sure you have installed transformers >= 4.39.0 and peft == 0.5.0:

```shell
pip install transformers==4.39.1
pip install peft==0.5.0
pip install timm easydict einops
```

Install the other necessary Python packages from pip_requirements.
- Inference with a video input:
```python
import os

import numpy as np
import decord
import torch
from decord import VideoReader, cpu
from torchvision import transforms
from transformers import AutoTokenizer, AutoModel

decord.bridge.set_bridge("torch")

token = os.environ['HF_TOKEN']

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    trust_remote_code=True,
    use_fast=False)
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()


def get_index(num_frames, num_segments):
    # Uniformly sample `num_segments` frame indices, one from the middle of each segment.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    # Decode the video, sample frames uniformly, and normalize them for the model.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
    frames = transform(frames)

    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames


video_path = "yoga.mp4"
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

chat_history = []
response, chat_history = model.chat(tokenizer, '', 'Describe the action step by step.', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)

response, chat_history = model.chat(tokenizer, '', 'What is she wearing?', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
```
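Beyond the `do_sample` flag shown above, you may want sampling-based decoding. The extra keys in the sketch below (`temperature`, `max_new_tokens`) are standard Hugging Face generation arguments, but whether this model's `chat` implementation forwards them is an assumption, so treat this as a starting point rather than a documented interface:

```python
# Hypothetical variant: sampling-based decoding. Only do_sample appears in the example
# above; temperature and max_new_tokens are assumed to be forwarded to generation.
response, chat_history = model.chat(
    tokenizer, '', 'Summarize the video in one sentence.',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': True, 'temperature': 0.7, 'max_new_tokens': 128})
print(response)
```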
## ✏️ Citation

If this work is helpful for your research, please consider citing InternVideo and VideoChat.
```bibtex
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```