Pixtral-12B open-source multimodal model - free deployment for image understanding and description tasks


Pixtral 12b

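Developed by mgoin

Pixtral-12B is a multimodal model compatible with the transformers library. It takes image and text inputs and generates text output, and is suited to image understanding and description tasks.

Image-to-Text

Transformers

#multi-image understanding #image-text dialogue generation #large language model integration

Downloads: 1,943

Released: 10/18/2024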

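Model overview

Pixtral-12B is a multimodal model based on the Mistral architecture. It processes images and text jointly, generating high-quality image descriptions and answering questions about the images.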

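Model features

Multimodal processing

Accepts image and text inputs together and generates coherent text output.

High-quality image descriptions

Generates detailed and accurate image descriptions covering scenes, objects, and sentiment analysis.

Chat template support

Chat history can be formatted with a chat template, which simplifies multi-turn conversations.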

Model capabilities

Image description

Multimodal question answering

Scene analysis

Object recognition

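Use cases

Image understanding

Image description generation

Given one or more images, the model generates a detailed text description.

The description covers scenes, objects, and sentiment analysis.

Multimodal question answering

Given a question that combines images and text, the model generates a relevant answer.

Answers are grounded in the image content and provide context-relevant information.

Natural language processing

Chatbot

Supports multi-turn conversations that combine images and text.

Generates coherent, context-relevant replies.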

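🚀 Image-text-to-text model `pixtral`

pixtral is a model checkpoint compatible with the transformers library. It takes image and text inputs and generates the corresponding text output, which makes it suited to image understanding and description tasks.

🚀 Quick start

Before using the pixtral model, install the transformers library from source, or wait for the v4.45 release.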
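The version requirement above can be checked before loading the model. The following is a minimal sketch, not part of the original model card: the helper names `release_tuple` and `supports_pixtral` are hypothetical, and the comparison looks only at the leading numeric components, so a development build such as `4.45.0.dev0` is treated as meeting the requirement. That is a simplifying assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def pad(parts, width):
    """Right-pad a version tuple with zeros so tuples of unequal length compare fairly."""
    return parts + (0,) * (width - len(parts))


def supports_pixtral(installed, minimum="4.45"):
    """Hypothetical check: is the installed transformers version at least `minimum`?"""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    return pad(have, width) >= pad(need, width)


try:
    installed = version("transformers")
    print(f"transformers {installed}: new enough = {supports_pixtral(installed)}")
except PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports an older version, upgrading or installing from source as described above is the next step.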

Basic usage

from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the checkpoint and its processor (tokenizer plus image preprocessing).
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
# The prompt carries four [IMG] placeholders, matching the four entries in IMG_URLS.
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

# Move the inputs to the device the model is on so both sides agree.
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
# Decode the first (and only) sequence in the batch.
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

After running the code above, you should get output similar to the following:

"""
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""
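In the basic usage example the prompt contains four `[IMG]` placeholders for four image URLs. Assuming that one placeholder per image is what the processor expects (an inference from that example, not something the model card states), the prompt string can be generated from the image list instead of being typed by hand. The helper name `build_pixtral_prompt` is hypothetical.

```python
def build_pixtral_prompt(instruction, num_images):
    """Build a raw prompt with one [IMG] placeholder per image.

    Mirrors the hand-written prompt from the basic usage example; the
    one-placeholder-per-image rule is an assumption drawn from that example.
    """
    if num_images < 1:
        raise ValueError("expected at least one image")
    placeholders = "[IMG]" * num_images
    return f"<s>[INST]{instruction}\n{placeholders}[/INST]"


IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
PROMPT = build_pixtral_prompt("Describe the images.", len(IMG_URLS))
print(PROMPT)
```

Deriving the placeholder count from `len(IMG_URLS)` keeps the prompt and the image list in step when images are added or removed.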

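Advanced usage

You can also use a chat template to format Pixtral's chat history. Make sure the images passed to the processor's images argument are in the same order as they appear in the chat, so the model knows which image belongs at each position.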
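A small guard can catch one common mistake before the processor is called: a chat with a different number of image entries than images supplied. The sketch below is illustrative only; `count_image_slots` and `check_images_match` are hypothetical helpers, and the message layout is the one used in the example that follows. It checks the count only, so keeping the order right is still up to the caller.

```python
def count_image_slots(chat):
    """Count {"type": "image"} entries across all messages in the chat."""
    total = 0
    for message in chat:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-text message, no image entries
        total += sum(1 for part in content if part.get("type") == "image")
    return total


def check_images_match(chat, images):
    """Raise if the chat references a different number of images than were passed."""
    slots = count_image_slots(chat)
    if slots != len(images):
        raise ValueError(
            f"chat has {slots} image slot(s) but {len(images)} image(s) were passed"
        )


url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

chat = [
    {
        "role": "user", "content": [
            {"type": "text", "content": "Can this animal"},
            {"type": "image"},
            {"type": "text", "content": "live here?"},
            {"type": "image"},
        ]
    }
]

# First image entry -> url_dog, second image entry -> url_mountain.
images = [url_dog, url_mountain]
check_images_match(chat, images)
print(count_image_slots(chat))
```

The full example that follows passes `[url_dog, url_mountain]` in exactly the order in which the two image entries appear in the chat.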

from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the checkpoint and its processor.
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

# A single user turn that interleaves text with two image entries.
chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"},
        {"type": "image"},
        {"type": "text", "content": "live here?"},
        {"type": "image"}
      ]
    }
]

# Render the chat into a prompt string, then pass the images in chat order.
prompt = processor.apply_chat_template(chat)
inputs = processor(text=prompt, images=[url_dog, url_mountain], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

After running the code above, you should get output similar to the following:

Can this animallive here?Certainly! Here are some details about the images you provided:

### First Image
- **Description**: The image shows a black dog lying on a wooden surface. The dog has a curious expression with its head tilted slightly to one side.
- **Details**: The dog appears to be a young puppy with soft, shiny fur. Its eyes are wide and alert, and it has a playful demeanor.
- **Context**: This image could be used to illustrate a pet-friendly environment or to showcase the dog's personality.

### Second Image
- **Description**: The image depicts a serene landscape with a snow-covered hill in the foreground. The sky is painted with soft hues of pink, orange, and purple, indicating a sunrise or sunset.
- **Details**: The hill is covered in a blanket of pristine white snow, and the horizon meets the sky in a gentle curve. The scene is calm and peaceful.
- **Context**: This image could be used to represent tranquility, natural beauty, or a winter wonderland.

### Combined Context
If you're asking whether the dog can "live here," referring to the snowy landscape, it would depend on the breed and its tolerance to cold weather. Some breeds, like Huskies or Saint Bernards, are well-adapted to cold environments, while others might struggle. The dog in the first image appears to be a breed that might prefer warmer climates.

Would you like more information on any specific aspect?
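Both sample outputs begin with the prompt text ("Describe the images." and "Can this animallive here?"), which is consistent with the generated sequence containing the prompt tokens followed by the newly generated ones. If only the reply is wanted, one common approach is to drop the first `prompt_length` tokens from each sequence before decoding. The sketch below shows the idea on plain Python lists with made-up token ids; `drop_prompt_tokens` is a hypothetical helper, not part of transformers.

```python
def drop_prompt_tokens(generated_ids, prompt_length):
    """Keep only the tokens that come after the prompt, for each sequence in the batch."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Made-up token ids standing in for real model output.
prompt_ids = [1, 3, 10, 4]
generated = [prompt_ids + [501, 502, 2]]

reply_only = drop_prompt_tokens(generated, len(prompt_ids))
print(reply_only)

# With the tensors from the examples above, the equivalent slice would be:
#   reply_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
# followed by processor.batch_decode(reply_ids, skip_special_tokens=True)
```

Slicing at the token level avoids string matching, which would be fragile here because the decoded prompt does not reproduce the original spacing.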