# 🚀 Video-R1-7B

This repository contains the Video-R1-7B model introduced in *Video-R1: Reinforcing Video Reasoning in MLLMs*. The model targets video reasoning tasks and extends multimodal large language models (MLLMs) to the video domain.
## 🚀 Quick Start
For training and evaluation, see the code at https://github.com/tulerfeng/Video-R1.

For single-example inference, see https://github.com/tulerfeng/Video-R1/blob/main/src/inference_example.py.
## 💻 Usage Example

### Basic Usage

The example below runs single-video inference with vLLM. The prompt template asks the model to reason inside `<think>` tags and to place its final answer inside `<answer>` tags, with a type-specific format hint appended per question type.
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info

model_path = "Video-R1/Video-R1-7B"
video_path = "./src/example_video/video1.mp4"
question = "Which move motion in the video lose the system energy?"
problem_type = 'free-form'

# Initialize the vLLM engine; the long context window leaves room for the video frames.
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    max_model_len=81920,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"video": 1, "image": 1},
)

# Near-greedy decoding for stable, reproducible reasoning traces.
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    max_tokens=1024,
)

processor = AutoProcessor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"  # left padding for generation
processor.tokenizer = tokenizer

# Prompt template that elicits an explicit chain of thought in <think> tags,
# followed by the final answer in <answer> tags.
QUESTION_TEMPLATE = (
    "{Question}\n"
    "Please think about this question as if you were a human pondering deeply. "
    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your detailed reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags."
)

# Answer-format instruction appended depending on the question type.
TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
}

# Build the chat message: 32 frames are sampled from the video, each resized
# to at most 200704 pixels to bound the visual token count.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 200704,
                "nframes": 32
            },
            {
                "type": "text",
                "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]
            },
        ],
    }
]

# Render the chat template and extract the decoded video frames plus
# per-video kwargs (e.g., fps) for the multimodal processor.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

llm_inputs = [{
    "prompt": prompt,
    "multi_modal_data": {"video": video_inputs[0]},
    "mm_processor_kwargs": {key: val[0] for key, val in video_kwargs.items()},
}]

outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
```
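Because the template asks the model to wrap its final answer in `<answer>` tags, the answer can be pulled out of `output_text` with a simple regular expression. Below is a minimal post-processing sketch; the `extract_answer` helper is hypothetical and not part of the repository, and it assumes the model actually followed the tag format:

```python
import re

def extract_answer(output_text: str) -> str:
    """Return the text inside the last <answer>...</answer> pair.

    Hypothetical helper: falls back to the raw output if the model
    did not emit the tags.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    return matches[-1].strip() if matches else output_text.strip()

answer = extract_answer(output_text)
print(answer)
```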
## 📄 License

This project is released under the Apache-2.0 license.
## 📋 Model Information

| Property | Details |
| --- | --- |
| Model type | Video-text-to-text |
| Training dataset | Video-R1/Video-R1-data |
| Evaluation metric | Accuracy |
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
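To inspect the training data listed above, one option is to mirror the dataset repository locally. This is a sketch under the assumption that `Video-R1/Video-R1-data` is hosted as a dataset repo on the Hugging Face Hub and that `huggingface_hub` is installed; adjust `local_dir` as needed:

```python
from huggingface_hub import snapshot_download

# Assumption: the training data lives in a Hugging Face dataset repo of the
# same name; snapshot_download copies the whole repo to a local directory.
snapshot_download(
    repo_id="Video-R1/Video-R1-data",
    repo_type="dataset",
    local_dir="./Video-R1-data",
)
```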