---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: other
tags:
- llama-factory
- full
- generated_from_trainer
pipeline_tag: video-text-to-text
model-index:
- name: bal_imb_cap_full_lr2e-4_epoch10.0_freezevisTrue_fps8
  results: []
---
## Model description

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, trained on the highest-quality camera-motion dataset currently available to the public. This preview model is the current state of the art (SOTA) for camera-motion classification and video-text retrieval with VQAScore. Please see our CameraBench project page for more details. We will continue to update the benchmark and the models, so stay tuned!

## Intended uses & limitations

Usage is fully compatible with the original Qwen2.5-VL model. The model is intended primarily for classifying camera motion in videos and for video-text retrieval (it is currently the SOTA model for both tasks).

Two quick demos follow.

**Generative scoring (for classification and retrieval):**

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the fine-tuned checkpoint; device_map="auto" places it on the available GPUs
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilts up"
question = f'Does this video show "{text_description}"?'

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,  # the model was fine-tuned on videos sampled at 8 fps
            },
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,  # only the first generated token's distribution is needed
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )

# Probability that the first generated token is "Yes"
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
```
**Natural language generation:**

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-7b-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # the sampled fps is forwarded here; no separate fps= argument is needed
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
## Training and evaluation data

Please see our code repository for the training and evaluation data.
## Training procedure

We fine-tuned the model with the LLaMA-Factory codebase. To reproduce our experiments, use the data above together with the hyperparameters below (a `TrainingArguments` sketch follows the list).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 256 (4 per device × 8 GPUs × 8 accumulation steps)
- total_eval_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10.0
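For readers who want to sanity-check the schedule outside LLaMA-Factory, the sketch below maps the same hyperparameters onto transformers' `TrainingArguments`. This is only an illustrative equivalent, not the configuration we actually used, and the `output_dir` value is a placeholder:

```python
from transformers import TrainingArguments

# The hyperparameters above, expressed as transformers TrainingArguments.
# With 8 GPUs: 4 per device x 8 devices x 8 accumulation steps = 256,
# matching the total train batch size listed above.
args = TrainingArguments(
    output_dir="qwen2.5-vl-7b-cam-motion",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=10.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```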
## ✏️ Citation

If you find this work helpful for your research, please cite:

```bibtex
@article{lin2025camerabench,
title={Towards Understanding Camera Motions in Any Video},
author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
journal={arXiv preprint arXiv:2504.15376},
year={2025},
}
```