slowfast-video-mllm-qwen2: open-source video multimodal model - balances temporal and spatial detail, supports 64-frame video understanding

Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4

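Developed by shi-labs

A video multimodal large language model built on a slow-fast architecture that balances temporal resolution against spatial detail and supports 64-frame video understanding.

Video-to-Text

Transformers

#VideoUnderstanding #MultimodalLLM #SpatiotemporalDualToken

Downloads: 184

Release date: 3/19/2025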

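Model Overview

The model processes video input with a slow-fast dual-token strategy. It combines the Qwen2-7B language model with a ConvNeXt-576 vision encoder to achieve efficient video understanding under a limited compute budget.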

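Model Features

Slow-fast dual-token strategy

Fast tokens give a quick overview of the video content, while slow tokens extract fine-grained visual detail, which together enable efficient video understanding.

High frame count

Accepts 64-frame video input, giving higher temporal resolution than conventional approaches.

Linear-complexity cross-attention

Purpose-built hybrid decoder layers let the text cross-attend to the raw video features with linear complexity.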

Model Capabilities

Video content understanding

Video description generation

Multimodal reasoning

Long-video processing

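Use Cases

Video content analysis

Video description

Generates a detailed description of the input video.

Outperforms a self-attention-only baseline on video understanding benchmarks.

Intelligent surveillance

Surveillance video analysis

Analyzes key events in surveillance footage.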

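🚀 Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 frames)

This repository contains the Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4) model, which was proposed in the paper Slow-Fast Architecture for Video Multi-Modal Large Language Models.

Code repository | HuggingFace collection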

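✨ Key Features

This model introduces a slow-fast architecture to address a core difficulty for video-based multimodal large language models (MLLMs): balancing temporal resolution against spatial detail under a limited compute budget. Existing methods typically compress the video representation irreversibly, which loses detail.

Inspired by the way people skim a video first and then focus on the relevant parts, the slow-fast design uses a dual-token strategy:

"Fast" visual tokens: a compact set of compressed video features, fed into the LLM (Qwen2-7B-Instruct) together with the text embeddings to give a quick overview of the video.
"Slow" visual tokens: uncompressed video features that the text embeddings cross-attend to through specially designed hybrid decoder layers, so that instruction-relevant visual details can be extracted with linear complexity.

This approach makes it possible to process more input frames (this checkpoint handles 64) while preserving spatial detail, and it yields significant gains on video understanding benchmarks over a self-attention-only baseline. This checkpoint uses Qwen2-7B-Instruct as the base LLM and ConvNeXt-576 as the vision tower.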
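
The two token types can be illustrated with a toy sketch in plain Python. It is not the model's hybrid decoder layer: mean pooling stands in for whatever compression produces the fast tokens, the cross-attention has no learned projections, and all names and values (`mean_pool`, `cross_attend`, the toy vectors) are assumptions made for illustration only.

```python
import math

def mean_pool(tokens, group):
    """Compress a token sequence by averaging each run of `group` tokens."""
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values)) / total
            for d in range(dim)
        ])
    return outputs

# Toy data: 8 "slow" tokens of width 2 and 2 text tokens (all values made up).
slow_tokens = [[float(i), float(i % 2)] for i in range(8)]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]

fast_tokens = mean_pool(slow_tokens, group=4)      # 8 tokens -> 2 compact tokens
details = cross_attend(text_tokens, slow_tokens)   # one output row per text token

print(len(fast_tokens), len(details))
```

Each text token computes one score per slow token, so the work in `cross_attend` grows linearly with the number of slow tokens for a fixed prompt length.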
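
To make the complexity argument concrete, the following back-of-the-envelope sketch counts query-key pairs per attention layer for two layouts: one where all visual tokens join the text in a single self-attention sequence, and one where only the fast tokens do and the text cross-attends to the slow tokens. The token counts are hypothetical and chosen only to show the scaling; they are not figures from the paper, and the count ignores everything except attention score computation.

```python
def self_attention_pairs(text_len, visual_len):
    """Query-key pairs when text and visual tokens share one self-attention sequence."""
    total = text_len + visual_len
    return total * total

def slow_fast_pairs(text_len, fast_len, slow_len):
    """Self-attention over text + fast tokens, plus text-to-slow cross-attention."""
    return (text_len + fast_len) ** 2 + text_len * slow_len

# Hypothetical sizes (assumptions, not values reported for this checkpoint).
text_len = 100
tokens_per_frame = 256               # assumed
slow_len = 64 * tokens_per_frame     # 64 frames of uncompressed tokens
fast_len = slow_len // 16            # assumed 16x compression for fast tokens

baseline = self_attention_pairs(text_len, slow_len)
hybrid = slow_fast_pairs(text_len, fast_len, slow_len)
print(f"self-attention only: {baseline:,} pairs")
print(f"slow-fast layout:    {hybrid:,} pairs")
```

In this model of the cost, doubling the number of slow tokens adds only `text_len * slow_len` pairs to the slow-fast layout, whereas the self-attention-only count grows quadratically.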

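📦 Installation

Note: this model relies on custom code (LlavaQwenSlowFastForCausalLM) integrated with the transformers library. Make sure you have installed the required packages from the official repository, or pass trust_remote_code=True when loading the model.

To run locally, first clone the repository and install the dependencies: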

git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Add the cloned repository path to your PYTHONPATH, or install it as a package
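
Before running the usage example, it can help to confirm that the packages it imports are actually visible to the interpreter. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module list is taken from the imports in the usage example on this page.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that the interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported by the usage example.
required = ["torch", "numpy", "decord", "requests", "llava"]
missing = missing_modules(required)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If `llava` is reported missing after cloning, the repository path is most likely not on PYTHONPATH yet.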

💻 Usage Example

Basic usage

import torch
import os
import numpy as np
from decord import VideoReader, cpu
import requests # Required to download video

# Make sure the necessary llava modules are importable
# If not installed from the repo, trust_remote_code=True handles this
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load exactly max_frames_num frames from a video as a (T, H, W, C) array."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=4)
    total_frames = len(vr)
    if total_frames == 0:
        raise ValueError(f"No frames could be read from {video_path}")

    if total_frames >= max_frames_num:
        # Uniformly sample frames across the whole video
        frame_idx = np.linspace(0, total_frames - 1, max_frames_num, dtype=int).tolist()
    else:
        # Video is shorter than max_frames_num: take every frame, then repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Report the failure, then re-raise so the caller sees the original traceback
        print(f"Error loading video frames: {e}")
        raise

    return frames

# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64 # This checkpoint was trained with 64 frames

# Download the video if it doesn't exist
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status() # Raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")


# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# Use trust_remote_code=True to load the custom architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,      # Use Flash Attention if available
    device_map="auto",        # Automatically distribute model across GPUs/CPU
    torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency
    trust_remote_code=True
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + " " + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + " " + question

conv = conv_templates["qwen_1_5"].copy() # Use the appropriate conversation template
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load and process video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames
print("Preprocessing video...")
# Ensure video has shape (T, H, W, C) before preprocessing
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor] # The model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")


# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add batch dimension if necessary (tokenizer_image_token might already return batched)
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")


# Generate response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos, # Pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"User input: {question}")
print(f"Model output: {outputs}")
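
The frame selection inside `load_video` can be checked without a video file or a GPU. The sketch below reimplements only the index selection in plain Python; `sample_frame_indices` is a hypothetical helper name, and it uses integer floor division, which approximates but is not guaranteed to match `np.linspace(..., dtype=int)` index for index.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices, padding short clips with the last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames >= num_samples:
        if num_samples == 1:
            return [0]
        # Evenly spaced indices from the first frame to the last, rounded down
        return [i * (total_frames - 1) // (num_samples - 1)
                for i in range(num_samples)]
    # Fewer frames than requested: use them all, then repeat the final frame
    indices = list(range(total_frames))
    indices.extend([total_frames - 1] * (num_samples - total_frames))
    return indices

print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
print(sample_frame_indices(3, 5))   # [0, 1, 2, 2, 2]
```

Padding with the last frame keeps the frame count fixed at the number the checkpoint was trained with (64), so clips shorter than 64 frames still produce an input of the expected length.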

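📄 License

The model weights are released under the CC-BY-NC-4.0 license. The code is released under the Apache 2.0 license. Users must comply with all terms and conditions of the original licenses, including the license specific to the base language model (the Qwen2 license).

📚 Documentation

Citation

If you find this work useful, please consider citing the paper: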

@misc{zhou2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

(Note: the author list may change as the arXiv paper is updated; if a final published version exists, defer to it.)