# 🚀 Video-R1-7B

This repository contains the Video-R1-7B model introduced in *Video-R1: Reinforcing Video Reasoning in MLLMs*. The model targets video reasoning tasks and extends multimodal large language models (MLLMs) to the video domain.
## 🚀 Quick Start
For training and evaluation, see the code at https://github.com/tulerfeng/Video-R1.

For single-example inference, see https://github.com/tulerfeng/Video-R1/blob/main/src/inference_example.py.
## 💻 Usage Example

### Basic Usage

The example below runs single-video inference with vLLM. The prompt template asks the model to reason inside `<think>` tags and to place its final answer inside `<answer>` tags, with a type-specific format hint appended per question type.
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info

model_path = "Video-R1/Video-R1-7B"
video_path = "./src/example_video/video1.mp4"
question = "Which move motion in the video lose the system energy?"
problem_type = 'free-form'

# Initialize the vLLM engine; the long context window leaves room for the video frames.
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    max_model_len=81920,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"video": 1, "image": 1},
)

# Near-greedy decoding for stable, reproducible reasoning traces.
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    max_tokens=1024,
)

processor = AutoProcessor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"  # left padding for generation
processor.tokenizer = tokenizer

# Prompt template that elicits an explicit chain of thought in <think> tags,
# followed by the final answer in <answer> tags.
QUESTION_TEMPLATE = (
    "{Question}\n"
    "Please think about this question as if you were a human pondering deeply. "
    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your detailed reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags."
)

# Answer-format instruction appended depending on the question type.
TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
}

# Build the chat message: 32 frames are sampled from the video, each resized
# to at most 200704 pixels to bound the visual token count.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 200704,
                "nframes": 32
            },
            {
                "type": "text",
                "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]
            },
        ],
    }
]

# Render the chat template and extract the decoded video frames plus
# per-video kwargs (e.g., fps) for the multimodal processor.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

llm_inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"video": video_inputs[0]},
    "mm_processor_kwargs": {key: val[0] for key, val in video_kwargs.items()},
}]

outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
```
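Because the template asks the model to wrap its final answer in `<answer>` tags, the answer can be pulled out of `output_text` with a simple regular expression. Below is a minimal post-processing sketch; the `extract_answer` helper is hypothetical and not part of the repository, and it assumes the model actually followed the tag format:

```python
import re

def extract_answer(output_text: str) -> str:
    """Return the text inside the last <answer>...</answer> pair.

    Hypothetical helper: falls back to the raw output if the model
    did not emit the tags.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    return matches[-1].strip() if matches else output_text.strip()

answer = extract_answer(output_text)
print(answer)
```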
## 📄 License

This project is released under the Apache-2.0 license.
## 📋 Model Information

| Property | Details |
| --- | --- |
| Model type | Video-text-to-text |
| Training dataset | Video-R1/Video-R1-data |
| Evaluation metric | Accuracy |
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
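To inspect the training data listed above, one option is to mirror the dataset repository locally. This is a sketch under the assumption that `Video-R1/Video-R1-data` is hosted as a dataset repo on the Hugging Face Hub and that `huggingface_hub` is installed; adjust `local_dir` as needed:

```python
from huggingface_hub import snapshot_download

# Assumption: the training data lives in a Hugging Face dataset repo of the
# same name; snapshot_download copies the whole repo to a local directory.
snapshot_download(
    repo_id="Video-R1/Video-R1-data",
    repo_type="dataset",
    local_dir="./Video-R1-data",
)
```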