License: apache-2.0
mPLUG-Owl3
Introduction
mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenges of long image-sequence understanding. We propose the Hyper Attention mechanism, which speeds up long visual-sequence understanding in multi-modal large language models by a factor of six and can process visual sequences up to eight times longer, while maintaining excellent performance on single-image, multi-image, and video tasks.
GitHub: mPLUG-Owl
Quickstart
Load mPLUG-Owl3. Currently, only attn_implementation values in ['sdpa', 'flash_attention_2'] are supported.
import torch
from transformers import AutoConfig, AutoModel

model_path = 'mPLUG/mPLUG-Owl3-2B-241014'

# Inspect the model configuration.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)

# Load the model in half precision with the SDPA attention backend.
model = AutoModel.from_pretrained(model_path, attn_implementation='sdpa', torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
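
If FlashAttention 2 is installed, the same checkpoint can be loaded with the faster attention backend. A minimal sketch, assuming a CUDA GPU with the flash-attn package available (this variant is not shown in the original card):

# Assumption: requires the flash-attn package to be installed.
model = AutoModel.from_pretrained(
    model_path,
    attn_implementation='flash_attention_2',
    torch_dtype=torch.half,
    trust_remote_code=True,
)
model.eval().cuda()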
Chat with images.
from PIL import Image
from transformers import AutoTokenizer

model_path = 'mPLUG/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# A solid red image stands in for a real input image.
image = Image.new('RGB', (500, 500), color='red')

# <|image|> marks where the image is inserted into the prompt.
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
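
mPLUG-Owl3 also handles multi-image tasks. The snippet below is a sketch, assuming (this is not shown in the original card) that each <|image|> placeholder in the prompt pairs positionally with one entry of the images list:

# Multi-image sketch; assumes placeholders match the images list in order.
image2 = Image.new('RGB', (500, 500), color='blue')
messages = [
    {"role": "user", "content": """<|image|>
<|image|>
What is different between these two images?"""},
    {"role": "assistant", "content": ""}
]
inputs = processor(messages, images=[image, image2], videos=None)
inputs.to('cuda')
inputs.update({'tokenizer': tokenizer, 'max_new_tokens': 100, 'decode_text': True})
print(model.generate(**inputs))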
Chat with videos.
from PIL import Image
from transformers import AutoTokenizer
from decord import VideoReader, cpu

model_path = 'mPLUG/mPLUG-Owl3-2B-241014'
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

# <|video|> marks where the sampled video frames are inserted into the prompt.
messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']  # replace with your own video path

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    # Evenly re-sample a list of frame indices down to n entries.
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # sample roughly one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(path) for path in videos]

inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
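
Since generation needs no gradients, the generate call can be wrapped in torch.inference_mode() to skip autograd bookkeeping and reduce memory use. This is a generic PyTorch pattern, not something the original card prescribes:

import torch

# Generic PyTorch pattern (assumption, not from the original card):
# disable autograd during generation to cut memory overhead.
with torch.inference_mode():
    g = model.generate(**inputs)
print(g)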
Citation
If you find our work helpful, please consider citing it.
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}