Open-source clip-vit-base-patch32_lego-brick model - accurately identifies LEGO bricks and their descriptions

Clip Vit Base Patch32 Lego Brick

Developed by armaggheddon97

A LEGO brick image-text matching model fine-tuned from CLIP, designed to identify LEGO bricks and match them to their descriptions.

Zero-Shot Image Classification

Transformers

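Language: English | License: MIT | #LegoBrickRecognition #ZeroShotClassification #HighAccuracyMatching

Downloads: 44

Released: 1/24/2025

Model overview

This is a CLIP model fine-tuned on a dataset of LEGO brick descriptions. It matches images of LEGO bricks to their text descriptions, so users can find a specific brick from a description or a picture.

Model features

High-accuracy matching

After fine-tuning, the model matches LEGO brick images to text descriptions with high confidence.

Zero-shot classification

Supports zero-shot image classification: new categories can be classified without additional training.

Multimodal processing

Handles both image and text inputs and produces an embedding vector for each.

Model capabilities

Image classification

Text-image matching

Image embedding generation

Text embedding generation

Use cases

LEGO brick recognition

Brick search

Find a specific LEGO brick from a text description or an uploaded image.

The model returns the best-matching bricks with high confidence.

Zero-shot classification

Classify new LEGO brick categories without additional training.

Accuracy on the test dataset reaches 99.23%.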

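🚀 clip-vit-base-patch32_lego-brick model

This model is built on the CLIP (Contrastive Language-Image Pretraining) architecture and matches images of LEGO bricks to their text descriptions. It helps LEGO enthusiasts identify bricks more accurately and efficiently.

🚀 Quick start

This model is a fine-tuned version of the openai/clip-vit-base-patch32 CLIP (Contrastive Language-Image Pretraining) model, trained on the lego_brick_captions dataset to match images of LEGO bricks to their text descriptions.

⚠️ Important note

If you are interested in the code used, see the fine-tuning script on my GitHub.

✨ Key features

Ever struggled to figure out the name of an elusive LEGO brick? Or do you only have a vague idea or a picture, but not the exact part number? BricksFinder is here to help!

Just type a description such as "blue curved slope" or upload a picture of a brick, and the model will work its magic to find the closest matches. It shows you a list of images that look just like the brick you have in mind, or maybe even better!

Web UI

This model is a good fit for LEGO enthusiasts, builders, and anyone who enjoys treasure hunting among bricks. A live demo is available on Colab.

📦 Installation guide

Loading the model with 🤗 transformers

Use the following snippet to load the model and processor: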

import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights, then move the model to the selected device.
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor only prepares inputs and has no .to() method, so it is loaded as-is.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Using the Auto classes:

from transformers import AutoModelForZeroShotImageClassification, AutoProcessor

model = AutoModelForZeroShotImageClassification.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
processor = AutoProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Using a pipeline:

from transformers import pipeline

model = "armaggheddon97/clip-vit-base-patch32_lego-brick"
clip_classifier = pipeline("zero-shot-image-classification", model=model)
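
The pipeline returns a list of predictions, each a dictionary with a `score` and a `label`. The following is a minimal sketch of how that output could be post-processed. The helper `top_match` and its `min_score` threshold are hypothetical names introduced for illustration, and a hand-written list stands in for the real pipeline output so the snippet runs without downloading the weights.

```python
from typing import Any, Dict, List, Optional

# One prediction from the zero-shot-image-classification pipeline
# has the shape {"score": float, "label": str}.
Prediction = Dict[str, Any]


def top_match(predictions: List[Prediction], min_score: float = 0.5) -> Optional[str]:
    """Return the best label, or None if no prediction reaches min_score.

    `min_score` is a hypothetical threshold chosen only for illustration.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# With the real pipeline the call would look like this (needs the model weights):
#   predictions = clip_classifier("path_to_image.jpg", candidate_labels=captions)
# Here a hand-written list stands in for the pipeline output.
example_predictions = [
    {"score": 0.97, "label": "a photo of gray minifigure legs"},
    {"score": 0.02, "label": "a photo of a brick with a curved slope"},
    {"score": 0.01, "label": "a photo of a lego brick with a 2x2 plate"},
]
print(top_match(example_predictions))
```

A threshold like this is one way to avoid returning a brick when none of the candidate captions fits the image well.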

Loading the model in float16 precision

The provided model is stored in float32 precision. To load it in float16 for faster inference, use the following snippet:

import torch
from transformers import CLIPProcessor, CLIPModel

# Recent transformers releases accept `dtype`; older ones use `torch_dtype` instead.
model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick", dtype=torch.float16)
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

Or convert the model with torch directly:

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")
model_fp16 = model.to(torch.float16)
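
Half precision also halves the storage needed for the weights, since each parameter takes 2 bytes instead of 4. The arithmetic can be sketched as below. The parameter count is an assumption (an approximate figure for a ViT-B/32 CLIP model), not a value taken from this page.

```python
def weights_size_mib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in MiB, ignoring buffers and activations."""
    return num_params * bytes_per_param / (1024 ** 2)


# Assumption: roughly 151 million parameters for a ViT-B/32 CLIP model.
APPROX_NUM_PARAMS = 151_000_000

fp32_mib = weights_size_mib(APPROX_NUM_PARAMS, 4)  # float32: 4 bytes per parameter
fp16_mib = weights_size_mib(APPROX_NUM_PARAMS, 2)  # float16: 2 bytes per parameter
print(f"float32: {fp32_mib:.0f} MiB, float16: {fp16_mib:.0f} MiB")
```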

💻 Usage examples

Basic usage

Generating embeddings

Text embeddings only:

import torch
from transformers import CLIPTokenizerFast, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
tokenizer = CLIPTokenizerFast.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

text = ["a photo of a lego brick"]
tokens = tokenizer(text, return_tensors="pt", padding=True).to(device)
outputs = model.get_text_features(**tokens)

Image embeddings only:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no .to() method; only its outputs are moved to the device below.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)
outputs = model.get_image_features(**inputs)
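
Text and image embeddings live in the same space, so a brick search can be sketched as ranking stored embeddings by cosine similarity to a query embedding. The example below is a minimal sketch in plain Python: the function names and the toy 3-dimensional vectors are made up for illustration, and real embeddings would come from `get_text_features` or `get_image_features`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_bricks(
    query: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k catalog entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in catalog.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for real model outputs.
catalog = {
    "blue curved slope": [0.9, 0.1, 0.0],
    "gray minifigure legs": [0.0, 1.0, 0.2],
    "white 2x2 brick": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # e.g. the embedding of a text query
ranking = rank_bricks(query, catalog, top_k=2)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```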

Zero-shot image classification

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick").to(device)
# The processor has no .to() method; only its outputs are moved to the device below.
processor = CLIPProcessor.from_pretrained("armaggheddon97/clip-vit-base-patch32_lego-brick")

dataset = load_dataset("armaggheddon97/lego_brick_captions", split="test")

captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
image = dataset[0]["image"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True).to(device)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probabilities = logits_per_image.softmax(dim=1)
max_prob_idx = torch.argmax(logits_per_image, dim=1)
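
The last three lines turn the per-caption logits in `logits_per_image` into probabilities with a softmax and pick the most likely caption. The same computation is sketched below in plain Python. The logit values are hypothetical and chosen only to show how a gap of a few logit units already yields a probability close to 100%.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


captions = [
    "a photo of a lego brick with a 2x2 plate",
    "a photo of gray minifigure legs",
    "a photo of a brick with a curved slope",
]
example_logits = [30.0, 18.0, 12.0]  # hypothetical image-to-caption logits

probabilities = softmax(example_logits)
best_idx = max(range(len(probabilities)), key=lambda i: probabilities[i])
for caption, prob in zip(captions, probabilities):
    print(f"{prob:.2%}: {caption}")
print("best match:", captions[best_idx])
```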

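📚 Detailed documentation

Model description

Developed by: the base model was developed by OpenAI; the fine-tuned model was developed by armaggheddon97.
Model type: a CLIP (Contrastive Language-Image Pretraining) model.
Language: the model expects English text as input.
License: the model is released under the MIT license.
Fine-tuned from clip-vit-base-patch32: this model is a fine-tuned version of openai/clip-vit-base-patch32, trained on the lego_brick_captions dataset. It was fine-tuned for 7 epochs on an 80-20 train-validation split of the dataset. For more details on the fine-tuning script, see the code on my GitHub.

Results

The goal was a model that distinguishes brick images more accurately from their text descriptions. In terms of accuracy, the two models perform similarly. However, when tested on a classification task with the code from the zero-shot image classification section, the fine-tuned model classifies images more accurately and with higher confidence. For example, when the model is tested with the following captions: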

A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes.
A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface.
A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white

and the following image as input:

the fine-tuned model outputs:

100.00%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
0.00%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

while the base model gives, for the same inputs:

98.7%: "A sand green 2x2 minifigure legs piece with two axle holes on top. The legs feature a printed design depicting wrapped fabric, in shades of light grey, orange, and beige. The piece is solid and has no additional connection points besides the axle holes."
1.24%: "A medium-green 1x1 round minifigure head features a printed design: two yellow eyes, pink floral elements, and a toothy grin. It has a standard top stud for attachment, and no other connection points are visible. The printed details are detailed and cover a majority of the surface."
0.00%: "A white 2x2 brick with four studs, each imprinted with the LEGO logo. The brick is a standard 2x2 size, with no additional holes or features. The color is a bright, slightly off-white"

This shows that the fine-tuned model classifies the image correctly from the text descriptions. The base model also classifies the image correctly, only with slightly lower confidence.

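Running the same task over the entire dataset, with one correct caption per sample (always the first) and two randomly sampled captions, gives the following metrics:
[Figure: results]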
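
Because the correct caption always sits at position 0, accuracy on this task is the fraction of samples whose highest logit is at index 0. The following is a minimal sketch of that computation. The function name and the toy logits are illustrative and are not taken from the author's evaluation script.

```python
from typing import Sequence


def top1_accuracy(all_logits: Sequence[Sequence[float]], correct_idx: int = 0) -> float:
    """Fraction of samples whose highest-scoring caption is the correct one."""
    if not all_logits:
        raise ValueError("at least one sample is required")
    hits = 0
    for logits in all_logits:
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        if predicted == correct_idx:
            hits += 1
    return hits / len(all_logits)


# Toy logits: one row per sample, columns = [correct, random 1, random 2].
toy_logits = [
    [25.1, 10.2, 8.7],
    [19.4, 21.0, 5.3],  # misclassified: a random caption scores highest
    [30.0, 2.1, 1.9],
    [22.8, 20.5, 11.0],
]
accuracy = top1_accuracy(toy_logits)
print(f"accuracy: {accuracy:.2%}")
```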

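The figure visualizes the normalized text logits produced by the fine-tuned model and the base model:

Input: for each sample, an image of a LEGO brick and three captions:
- The correct caption, which matches the image (position 0).
- Two randomly sampled incorrect captions (positions 1 and 2).
Output: the model produces a text logit for each caption, reflecting the similarity between the image embedding and each caption embedding. The logits are then normalized for easier visualization.
Heatmap visualization: the normalized logits are displayed as a heatmap, where:
- Each row represents a different input sample.
- Each column represents one of the three captions: the correct one (0, first row) and the two random ones (1 and 2, second and third rows).
- Color intensity represents the normalized logit score assigned to each caption; darker colors mean higher scores and therefore higher confidence (the greater the contrast between the first row and the second and third rows, the better the result).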
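
The text does not say which normalization was applied before plotting. One plausible choice, shown here purely as an assumption, is a global min-max rescaling of all logits to the range 0 to 1, which keeps the contrast between samples intact. The function name and the toy values are illustrative.

```python
from typing import List, Sequence


def min_max_normalize(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rescale every value to [0, 1] using the global minimum and maximum.

    Assumption: the original figure may have used a different normalization.
    """
    flat = [value for row in matrix for value in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0.0:
        # All logits are identical, so there is no contrast to show.
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - low) / span for value in row] for row in matrix]


# Toy logits: one row per sample, columns = [correct, random 1, random 2].
toy_logits = [
    [30.0, 10.0, 20.0],
    [25.0, 15.0, 10.0],
]
heatmap_values = min_max_normalize(toy_logits)
for row in heatmap_values:
    print(row)
```

The resulting matrix could then be passed to any heatmap plotting routine.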

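The base model (right), as expected, does not show high confidence for any class. It separates image-text pairs poorly, and the differences between label scores are much smaller. In terms of accuracy, however, it still assigns the correct caption to 97.46% of the samples.

The fine-tuned model (left) shows much higher confidence in the correct caption and clearly separates correct from incorrect captions: it assigns higher scores to the correct caption and lower scores to the incorrect ones. In terms of accuracy, the fine-tuned model is similar to the base model but slightly higher, at 99.23%.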

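Fine-tuning on `short_caption`

As an experiment, the model was also fine-tuned on the dataset's short_caption column and compared, using the same method as before, with the model fine-tuned on the caption column and with the base model. Using the same sample image and the labels from short_caption, the results are:

Model fine-tuned on short_caption:

100.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (2.32e-21): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (5.91e-18): "Brick 2 x 2 without Inside Ridges"

Model fine-tuned on caption:

100.00% (1): " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
0.00% (3.38e-14): " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
0.00% (2.9e-8): "Brick 2 x 2 without Inside Ridges"

Base model:

0.00%: " Hips and Dark Tan Legs with Robe and Dark Orange Strap Print"
22.07%: " Minifig Head Slizer, Yellow Eyes, Pointed Teeth and Bubbles Print [Blocked Open Stud]"
77.79%: "Brick 2 x 2 without Inside Ridges"

Despite being fine-tuned on the short_caption column, that model produces results very similar to those of the model fine-tuned on the caption column. The only difference between the two is a larger spread between the values for the correct and incorrect captions. The base model, in this case, performs clearly worse than it did when classifying with the caption column, and it also assigns the wrong caption.

Running the same task over the entire dataset, with one correct caption and two random captions per sample, gives the following results:

Comparing the models fine-tuned on short_caption and on caption gives the following results:
[Figure: results]
The model fine-tuned on the short_caption column reaches an accuracy of 99.99%, while the model fine-tuned on the caption column reaches 98.48%.

Although the model fine-tuned on short_caption is more accurate, the trade-off between the two lies in the confidence assigned to the correct caption. Because the model fine-tuned on the caption column is more flexible for text search, it is the one uploaded here.

Over the whole dataset the base model performs as before, with an overall accuracy still around 97%. This also suggests that the chosen sample may be an outlier for the base model, since it classifies most other image-text pairs correctly.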

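🔧 Technical details

The model is based on the CLIP architecture and, by fine-tuning on the lego_brick_captions dataset, learns the association between LEGO brick images and text descriptions. Fine-tuning ran for 7 epochs on an 80-20 train-validation split of the dataset to improve classification of LEGO brick images. At inference time, the model computes the similarity between the input image and each text description and returns the best match.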
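
An 80-20 train-validation split like the one described can be sketched as a seeded shuffle followed by a cut at 80% of the samples. This is only an illustration: the author's fine-tuning script may split the data differently, and the function name, the seed, and the sample identifiers below are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(
    items: Sequence[T],
    train_fraction: float = 0.8,
    seed: int = 42,  # hypothetical seed, used only to make the split reproducible
) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed and cut them into train and validation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(len(items) * train_fraction)
    train = [items[i] for i in indices[:cut]]
    val = [items[i] for i in indices[cut:]]
    return train, val


# Placeholder identifiers standing in for dataset rows.
samples = [f"brick_{i:03d}" for i in range(100)]
train_set, val_set = train_val_split(samples)
print(len(train_set), len(val_set))
```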