datasets:
- shenxq/OneVision
- shenxq/VideoChat2
base_model:
- Vision-CAIR/LongVU_Qwen2_7B_img
pipeline_tag: video-text-to-text
model-index:
- name: llava-onevision-qwen-7b-ov
  results:
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 67.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 65.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 66.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 60.6
      name: accuracy
      verified: true
license: apache-2.0
LongVU
This repository contains the Qwen2-7B-based model described in the paper LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding.
Play with the model on the HF demo.
Usage
We provide a simple generation pipeline for using the model. For more details, please refer to our GitHub repository.
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the LongVU checkpoint (tokenizer, model, and vision processor).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Sample roughly one frame per second from the video.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the prompt with the image placeholder token and the Qwen chat template.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

# Greedy decoding of the answer.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
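The sampler above takes roughly one frame per second (a stride of round(fps)), so the number of frames grows with video length. If you prefer a fixed frame budget instead, the sketch below is one way to do it; the helper name and the budget of 64 frames are illustrative choices, not part of the LongVU API.

# Illustrative variant, not part of the LongVU codebase: sample a fixed number of
# frames uniformly across the video instead of one frame per second.
def uniform_frame_indices(num_total_frames: int, num_frames: int = 64) -> np.ndarray:
    if num_total_frames <= num_frames:
        return np.arange(num_total_frames)
    return np.linspace(0, num_total_frames - 1, num_frames, dtype=int)

# Drop-in replacement for the 1-fps sampling above (64 is an arbitrary budget).
frame_indices = uniform_frame_indices(len(vr), num_frames=64)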
Citation
@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and J. Kim, Hyunwoo and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv:2410.17434},
  year={2024}
}