amoral-gemma3-12B-vision open-source model: a vision-enabled large language model for multimodal tasks

Amoral Gemma3 12B Vision

Developed by gghfez

A vision-enabled version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B large language model with a vision encoder to support multimodal tasks

Image-to-Text

Transformers

English #multimodal visual understanding #high-precision image description #natural language generation

Downloads: 25

Released: 3/21/2025

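Model Overview

This is a multimodal model that accepts image and text inputs and generates detailed image descriptions or answers questions about an image. Compared with the base Gemma3-12B model, it performs better at visual understanding.

Model Features

Multimodal capability

Processes image and text inputs together for cross-modal understanding

Detailed image description

Generates richer and more accurate image descriptions than the base Gemma3-12B model

Efficient inference

Supports automatic device placement (device_map) and bfloat16 precision to make inference more efficient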
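To give a rough sense of why bfloat16 matters for a model of this size, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 12 billion parameters (the "12B" in the name taken literally; the exact count is not stated on this page) and ignores activations, the KV cache, and framework overhead. The function name `weight_memory_gib` is made up for illustration.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: "12B" taken literally as 12 billion parameters.
NOMINAL_PARAMS = 12e9


def weight_memory_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


fp32_gib = weight_memory_gib(NOMINAL_PARAMS, "float32")
bf16_gib = weight_memory_gib(NOMINAL_PARAMS, "bfloat16")

for name, gib in (("float32", fp32_gib), ("bfloat16", bf16_gib)):
    print(f"{name}: ~{gib:.1f} GiB for weights")
```

Under these assumptions bfloat16 halves the weight footprint relative to float32, which is often the difference between fitting on a single accelerator and needing device_map to spread layers across several devices.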

Model Capabilities

Image understanding

Image description generation

Visual question answering

Multimodal dialogue
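Visual question answering and multi-turn dialogue both come down to how the message list is assembled. The sketch below builds such a list in plain Python, following the role/content layout used in the usage example on this page. The helper names (`text_part`, `image_part`, `build_conversation`, `add_turn`) and the sample question and reply are hypothetical, introduced only for illustration.

```python
def text_part(text):
    """A text entry in a message's content list."""
    return {"type": "text", "text": text}


def image_part(url):
    """An image entry, using the same keys as the usage example on this page."""
    return {"type": "image", "image": url}


def build_conversation(image_url, question,
                       system_prompt="You are a helpful assistant."):
    """Start a conversation: a system prompt, then one image plus a question."""
    return [
        {"role": "system", "content": [text_part(system_prompt)]},
        {"role": "user", "content": [image_part(image_url), text_part(question)]},
    ]


def add_turn(messages, assistant_reply, follow_up):
    """Return a new list with the model's reply and a follow-up question appended."""
    return messages + [
        {"role": "assistant", "content": [text_part(assistant_reply)]},
        {"role": "user", "content": [text_part(follow_up)]},
    ]


url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
messages = build_conversation(url, "What insect is in this picture?")
# The reply text here is a placeholder, not real model output.
messages = add_turn(messages, "(model reply goes here)", "What color is the flower?")
print([m["role"] for m in messages])
```

The resulting list can be passed to the processor's chat template in place of the hand-written `messages` from the usage example; the image only needs to appear once, in the first user turn.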

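Use Cases

Content analysis

Image description generation

Generates a detailed text description of an uploaded image

The output is a rich description covering objects, scene, colors, lighting, and similar elements

Assistive tools

Visual assistance

Helps visually impaired users understand image content

Provides accurate, detailed scene descriptions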

🚀 gghfez/amoral-gemma3-12B-vision

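This project reattaches the vision encoder to soob3123/amoral-gemma3-12B so that the model can be used for image-related inference tasks.

🚀 Quick Start

This project is built on the transformers library, uses soob3123/amoral-gemma3-12B as its base model, and is released under the gemma license. The details are summarized in the table below: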

Property	Details
Base model	soob3123/amoral-gemma3-12B
Language	en
Library	transformers
License	gemma
Tags	transformers, gemma3, gemma, google

💻 Usage Example

Basic usage

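from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "gghfez/amoral-gemma3-12B-vision"

# Load the weights in bfloat16 so they match the dtype the inputs are cast to below;
# device_map="auto" places the model on the available device(s).
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

# The processor handles both chat templating and image preprocessing.
processor = AutoProcessor.from_pretrained(model_id)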

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

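# Apply the chat template, tokenize the text, and preprocess the image in one call.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Number of prompt tokens; used below to strip the prompt from the output.
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    # Greedy decoding (do_sample=False), capped at 500 new tokens.
    generation = model.generate(**inputs, max_new_tokens=500, do_sample=False)
    # Keep only the newly generated tokens.
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)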
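The slice `generation[0][input_len:]` in the example above relies on the fact that, for a decoder-only model, the generated sequence starts with the prompt tokens and the new tokens follow. A minimal sketch of that step using plain Python lists is shown below; the token IDs are made up for illustration and the helper name `strip_prompt` is hypothetical.

```python
def strip_prompt(sequence, input_len):
    """Return only the tokens that follow the first input_len prompt tokens."""
    return sequence[input_len:]


# Made-up token IDs standing in for real tokenizer output.
prompt_ids = [2, 105, 2364, 107]
new_ids = [818, 2471, 3768]

# One generated sequence: the prompt tokens followed by the new tokens.
full_sequence = prompt_ids + new_ids

completion = strip_prompt(full_sequence, len(prompt_ids))
print(completion)
```

Without this step, decoding the full sequence would print the system prompt and the user question again ahead of the model's answer.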