License: Apache-2.0
Supported languages:
mPLUG-Owl3 Multimodal Large Language Model
Model Introduction
mPLUG-Owl3 is a state-of-the-art multimodal large language model designed for long image-sequence understanding. Its proposed Hyper Attention mechanism speeds up long visual-sequence processing in multimodal LLMs by a factor of six and supports visual sequences eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.
Project page: mPLUG-Owl GitHub
Quick Start
Load the mPLUG-Owl3 model (currently only the sdpa and flash_attention_2 attention implementations are supported):
import torch
# The custom mPLUG-Owl3 classes live in the model repository and are
# pulled in via trust_remote_code.
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa',
                                  torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
Image chat example:
from PIL import Image
from transformers import AutoTokenizer

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
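In the message format above, each <|image|> placeholder marks where an image attaches; for multi-image chat, the user content should carry one placeholder per image passed to the processor. A small helper that builds such a message list (the helper name and question are ours for illustration, not part of the official API):

```python
def build_image_messages(question, num_images):
    """Build a user/assistant message pair with one <|image|>
    placeholder per image, matching the chat format used above."""
    placeholders = "<|image|>\n" * num_images
    return [
        {"role": "user", "content": placeholders + question},
        {"role": "assistant", "content": ""},  # empty slot for the reply
    ]

messages = build_image_messages("What differs between these two images?", 2)
print(messages[0]["content"].count("<|image|>"))  # → 2
```

The resulting list can be handed to `processor(messages, images=[...], videos=None)` exactly like the single-image example, as long as the number of images matches the placeholder count.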
Video chat example:
from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Take n items spread evenly across l, one from the middle
        # of each of n equal segments.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample 1 frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('Sampled frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
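The frame-selection arithmetic in encode_video can be checked in isolation, without decord or an actual video file. A quick sketch with illustrative numbers (a hypothetical 90-second clip at 30 fps):

```python
MAX_NUM_FRAMES = 16

def uniform_sample(l, n):
    # Same logic as in encode_video: one index from the middle of
    # each of n equal segments of l.
    gap = len(l) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [l[i] for i in idxs]

# 90 s at 30 fps -> 2700 frames; sampling 1 frame/sec gives 90 candidates.
frame_idx = list(range(0, 2700, 30))
assert len(frame_idx) == 90

# 90 candidates exceed the cap, so they are thinned to MAX_NUM_FRAMES.
sampled = uniform_sample(frame_idx, MAX_NUM_FRAMES)
print(len(sampled))  # → 16
```

Because the indices are taken from segment midpoints, the kept frames stay evenly spread over the whole clip rather than clustering at the start.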
Citation
If our work is helpful to you, please consider citing:
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}