DolphinVision 72b Open-Source Multimodal Model - Reasons About and Comments on Images That Popular Models Refuse


Dolphin Vision 72b

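Developed by dphn

DolphinVision 72b is a multimodal model that can reason about and comment on images that other popular models refuse to process.

Image-to-Text

Transformers

License: Other #Multimodal reasoning #Uncensored image analysis #High-accuracy OCR

Downloads: 296

Released: 6/28/2024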

Model Overview

DolphinVision 72b is an uncensored multimodal model curated and trained by Quan Nguyen, Eric Hartford, and Cognitive Computations. It handles multimodal tasks that combine images and text.

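Model Features

Multimodal capability

Processes images and text together to perform complex multimodal reasoning.

Uncensored

Reasons about and comments on images that other popular models refuse to process.

High performance

On the benchmarks reported below, it scores above GPT-4V on MMBench (81.2 vs. 78.1) and within about 1.5 points of it on VQA v2 and RealWorldQA, but trails it on MMMU (45.7 vs. 52.4).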

Model Capabilities

Image description

Text generation

OCR

Multimodal reasoning

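Use Cases

Image analysis

OCR text recognition

Extracts and recognizes text in images.

Accurately recognizes satirical text in an image.

Content generation

Image captioning

Generates detailed descriptive text from an image.

Produces accurate, detailed image descriptions.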

🚀 DolphinVision 72b 🐬

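DolphinVision 72b is a multimodal model curated and trained by Quan Nguyen (qnguyen3/stablequan), Eric Hartford, and Cognitive Computations. The model is uncensored and can reason about and comment on images that other popular models refuse to process.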

Discord: https://discord.gg/h3K4XGj2RH


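Special thanks to the following sponsors

Crusoe Cloud: provided an excellent on-demand 8xH100 node for training.
TensorWave: provided an 8x MI300X node for evaluation and inference.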

🚀 Quick Start

Running the example code

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# Silence some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set the default device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'cognitivecomputations/dolphin-vision-72b'

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# Text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# Image; sample images can be found in the images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# Generate the response
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
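
The prompt-building step in the quick start splits the rendered chat text at `<image>` and joins the two token lists with `-200`, the value the example uses as the image placeholder id. Below is a minimal sketch of that step in isolation. It assumes a toy `encode` function in place of the real tokenizer; the names `build_input_ids`, `toy_encode`, and `IMAGE_TOKEN_INDEX` are hypothetical and not part of the model's API.

```python
from typing import Callable, List

# Placeholder id copied from the quick-start code; the constant name is our own.
IMAGE_TOKEN_INDEX = -200


def build_input_ids(text: str,
                    encode: Callable[[str], List[int]],
                    image_tag: str = "<image>") -> List[int]:
    """Tokenize the text on each side of a single image tag and splice in the placeholder id."""
    chunks = text.split(image_tag)
    if len(chunks) != 2:
        raise ValueError(
            f"expected exactly one {image_tag!r}, found {len(chunks) - 1}")
    before, after = (encode(chunk) for chunk in chunks)
    return before + [IMAGE_TOKEN_INDEX] + after


def toy_encode(chunk: str) -> List[int]:
    """Stand-in tokenizer for illustration only: one id per character."""
    return [ord(ch) for ch in chunk]


ids = build_input_ids("ab<image>\ncd", toy_encode)
print(ids)  # [97, 98, -200, 10, 99, 100]
```

With the real tokenizer, the `encode` argument would be something like `lambda chunk: tokenizer(chunk).input_ids`, matching the list comprehension in the quick start. The explicit check guards against prompts with zero or several `<image>` tags, which the two-chunk indexing in the original would not handle.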

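💻 Usage Examples

Example 1

Example image

<|im_start|>user
: do OCR on all the text
<|im_start|>assistant
: The text in the image reads as follows:

- STOP USING DOCKER
- Applications were not meant to be trapped in virtual containers like digital fireflies
- After years of software development, nobody has questioned why we put programs in imaginary "boats"
- Let's containerize our monolith. I'll quickly spin up 17 microservices. Statements only the utterly deranged could come up with
- Look at the huge lie they have been using to fool us all along
- These are real images used to explain Docker
- "Hey, buddy, I'd like to run my website in a virtual ocean"
- They have played us for fools

This text is a satirical commentary on the use of Docker in software development, using humor and exaggeration to question the practice.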
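
The transcript above uses `<|im_start|>` role markers, which follow the ChatML convention. As a rough sketch of what a ChatML-style template renders, the helper below assumes the common `<|im_start|>role`, content, `<|im_end|>` layout. The model's actual template comes from `tokenizer.apply_chat_template` and may add a system message or differ in detail; `render_chatml` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Render messages in a generic ChatML layout (an assumption, not the model's exact template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


text = render_chatml(
    [{"role": "user", "content": "<image>\ndo OCR on all the text"}])
print(text)
```

In practice, prefer `tokenizer.apply_chat_template` as in the quick start; this sketch is only meant to make the structure of the printed prompt easier to read.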

📊 Benchmarks

Model	VQA v2	MMBench	MMMU (Eval)	MathVista	RealWorldQA
Dolphin-Vision-72B	83.6	81.2	45.7	47.25	66.4
GPT-4V	84.4	78.1	52.4	-	67.9
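
To make the comparison in the table easier to read, the sketch below computes the per-benchmark difference between the two rows. The scores are copied from the table; the `-` for GPT-4V on MathVista is treated as "not reported" and represented as `None`. The names `BENCHMARKS`, `scores`, and `deltas` are illustrative.

```python
from typing import Dict, List, Optional

BENCHMARKS = ["VQA v2", "MMBench", "MMMU (Eval)", "MathVista", "RealWorldQA"]

# Scores copied from the table above; None marks a value that is not reported.
scores: Dict[str, List[Optional[float]]] = {
    "Dolphin-Vision-72B": [83.6, 81.2, 45.7, 47.25, 66.4],
    "GPT-4V": [84.4, 78.1, 52.4, None, 67.9],
}


def deltas(a: List[Optional[float]],
           b: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Return a - b per benchmark, or None where either score is missing."""
    return {
        name: None if x is None or y is None else round(x - y, 2)
        for name, x, y in zip(BENCHMARKS, a, b)
    }


diff = deltas(scores["Dolphin-Vision-72B"], scores["GPT-4V"])
for name, d in diff.items():
    print(f"{name}: {'n/a' if d is None else format(d, '+.2f')}")
```

A positive value means Dolphin-Vision-72B scores higher than GPT-4V on that benchmark. With these numbers, only MMBench comes out positive.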

📄 License

License type: other
License name: tongyi-qianwen

📚 Detailed Documentation

Model information

Property	Details
Base model	Qwen/Qwen2-72B
Model tags	generated_from_trainer, axolotl
Training datasets	cognitivecomputations/Dolphin-2.9, teknium/OpenHermes-2.5, m-a-p/CodeFeedback-Filtered-Instruction, cognitivecomputations/dolphin-coder, cognitivecomputations/samantha-data, microsoft/orca-math-word-problems-200k, Locutusque/function-calling-chatml, internlm/Agent-FLAN
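
Because the base model is a 72B-class model and the quick start loads it in `torch.float16`, a back-of-the-envelope estimate of the weight footprint helps explain why the example relies on `device_map='auto'`. The sketch below assumes a nominal 72 billion parameters (read off the model name, not an exact count) and 2 bytes per parameter. It ignores the vision components, activations, and the KV cache, so treat the result as a lower bound.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: nominal parameter count taken from the "72b" in the model name.
NOMINAL_PARAMS = 72e9
BYTES_PER_FP16 = 2  # float16 stores each parameter in 2 bytes

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, BYTES_PER_FP16)
print(f"float16 weights: about {fp16_gib:.0f} GiB")
```

Under these assumptions the weights alone need well over 100 GiB, so they have to be spread across several accelerators or partly offloaded.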