sam-hq-vit-large open-source model: generating high-quality object masks from point and box prompts

Sam Hq Vit Large

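Developed by syscv-community

SAM-HQ is an enhanced version of the Segment Anything Model (SAM) that generates higher-quality object masks from input prompts such as points or boxes.

Image segmentation

Transformers

License: Apache-2.0 #high-quality image segmentation #zero-shot generalization #complex boundary handling

Downloads: 60

Released: 5/5/2025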

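Model overview

SAM-HQ introduces a high-quality output token and a global-local feature fusion component. Together they markedly improve segmentation mask quality, especially for objects with intricate boundaries and thin structures.

Model features

High-quality output token

A learnable HQ output token dedicated to predicting high-quality masks, which markedly improves segmentation accuracy.

Global-local feature fusion

Combines early and final ViT features, fusing high-level semantic context with low-level boundary information to improve mask detail.

Efficient training

Training takes only 4 hours on 8 GPUs and adds less than 0.5% more parameters than the original SAM.

Zero-shot generalization

Retains SAM's original zero-shot generalization while performing better across 10 datasets.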

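Model capabilities

High-quality image segmentation

Prompt-based mask generation

Automatic mask generation

Complex boundary handling

Thin-structure recognition

Use cases

Image editing

Precise object segmentation

Separates objects precisely in image-editing software

Produces finer mask boundaries than the original SAM

Automated annotation

Annotation assistance

Automatically generates segmentation annotations for training data

Reduces manual annotation effort and improves annotation quality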

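🚀 Segment Anything in High Quality (SAM-HQ)

SAM-HQ is an enhanced version of the Segment Anything Model (SAM) that generates higher-quality object masks from input prompts such as points or boxes. It markedly improves mask quality on objects with intricate structures while keeping SAM's original promptable design, efficiency, and zero-shot generalization.

🚀 Quick start

Environment setup

Make sure the required libraries are installed: transformers, Pillow, requests, matplotlib, and torch. They can be installed with:

pip install transformers pillow requests matplotlib torch

Running the example

The following minimal example shows how to generate a mask with SAM-HQ from a box prompt: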

import requests
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

# The model and its inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-large").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-large")

img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_boxes = [[[306, 132, 925, 893]]]  # One [x0, y0, x1, y1] box for the single image

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():  # Inference only, so no gradients are needed
    outputs = model(**inputs)
# Resize the predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores  # Predicted IoU score for each candidate mask
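
The model returns several candidate masks per prompt, each with a predicted IoU score. A minimal sketch of picking the highest-scoring candidate follows. The helper name `best_mask_index` and the example scores are hypothetical, and the scores are assumed to have already been converted to plain Python floats.

```python
from typing import Sequence


def best_mask_index(iou_scores: Sequence[float]) -> int:
    """Return the index of the candidate mask with the highest predicted IoU."""
    if len(iou_scores) == 0:
        raise ValueError("iou_scores must not be empty")
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


# Hypothetical scores for three candidate masks of one box prompt.
example_scores = [0.91, 0.97, 0.88]
best = best_mask_index(example_scores)
print(best)
```

With the variables from the example above, the scores for the first prompt could probably be obtained with `scores[0, 0].tolist()`, assuming `iou_scores` is laid out as (batch, prompts, candidates); check the tensor shape before relying on this.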

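✨ Key features

High-quality output: produces high-quality segmentation masks even for objects with intricate boundaries and thin structures, which the original SAM often struggles with.
Novel architecture: adds two key innovations to the original SAM architecture, the high-quality output token and global-local feature fusion, while keeping SAM's pretrained weights.
Efficient training: training takes only 4 hours on 8 GPUs and adds less than 0.5% more parameters than the original SAM.
Zero-shot generalization: keeps SAM's original promptable design, efficiency, and zero-shot generalization.

📚 Documentation

Model details

SAM-HQ keeps SAM's pretrained weights and makes two key changes to the original SAM architecture:

High-quality output token: a learnable token is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Unlike SAM's original output tokens, this token and its associated MLP layers are trained specifically to produce highly accurate segmentation masks.
Global-local feature fusion: rather than applying the high-quality output token to the mask decoder features alone, SAM-HQ first fuses those features with early and final ViT features to improve mask detail. This combines high-level semantic context with low-level boundary information for more accurate segmentation.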
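
To make the fusion idea concrete, here is a deliberately simplified NumPy sketch. It is not the model's implementation: it assumes all three feature maps already have the same number of channels, replaces the learned upsampling layers with nearest-neighbour upsampling, and combines the maps with a plain element-wise sum. The function names and tensor sizes are hypothetical.

```python
import numpy as np


def upsample_nearest(feat: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_features(decoder_feat, early_vit_feat, final_vit_feat):
    """Sum decoder features with upsampled early and final ViT features."""
    # Assumption: both ViT maps share one resolution that divides the decoder's.
    factor = decoder_feat.shape[1] // early_vit_feat.shape[1]
    return (
        decoder_feat
        + upsample_nearest(early_vit_feat, factor)   # low-level boundary detail
        + upsample_nearest(final_vit_feat, factor)   # high-level semantic context
    )


rng = np.random.default_rng(0)
decoder_feat = rng.normal(size=(8, 16, 16))  # toy mask-decoder features
early_feat = rng.normal(size=(8, 4, 4))      # toy early-layer ViT features
final_feat = rng.normal(size=(8, 4, 4))      # toy final-layer ViT features
hq_feat = fuse_features(decoder_feat, early_feat, final_feat)
print(hq_feat.shape)  # (8, 16, 16)
```

The fused map keeps the decoder's spatial resolution, which is where the high-quality output token would then be applied.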

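SAM-HQ is trained on HQSeg-44K, a carefully curated dataset of 44K fine-grained masks compiled from several sources with extremely accurate annotations.

Evaluation results

The model was evaluated on 10 diverse segmentation datasets covering different downstream tasks; 8 of them were evaluated under a zero-shot transfer protocol. The results show that SAM-HQ produces significantly better masks than the original SAM while keeping its zero-shot generalization.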
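
Intersection over union (IoU) is a common way to compare a predicted mask with a reference mask. The sketch below is a generic illustration, not the evaluation code behind the results above; the function name `mask_iou` and the toy masks are hypothetical.

```python
import numpy as np


def mask_iou(pred, target) -> float:
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # Both masks are empty, so they agree perfectly.
    return float(np.logical_and(pred, target).sum() / union)


# Toy masks: the prediction covers 4 pixels, the reference covers 6.
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :2] = True
target = np.zeros((4, 4), dtype=bool)
target[:2, :3] = True
iou = mask_iou(pred, target)
print(round(iou, 3))
```

Here the masks share 4 pixels and together cover 6, so the IoU is 4/6.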

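Problems addressed

SAM-HQ addresses two key problems of the original SAM:

Coarse mask boundaries that often miss thin object structures.
Incorrect predictions, broken masks, or large errors in challenging cases.

These improvements make SAM-HQ particularly valuable for applications that need highly accurate image masks, such as automated annotation and image or video editing.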

💻 Usage examples

Basic usage

Prompted mask generation

import requests
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

# The model and its inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-large").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-large")

img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_boxes = [[[306, 132, 925, 893]]]  # One [x0, y0, x1, y1] box for the single image

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():  # Inference only, so no gradients are needed
    outputs = model(**inputs)
# Resize the predicted masks back to the original image size.
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores  # Predicted IoU score for each candidate mask
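
The box prompt is a nested list: the outer level is the batch of images, the next level holds the boxes for one image, and each box is `[x0, y0, x1, y1]` in pixels. Annotations are often stored as `[x, y, width, height]` instead, so a small conversion helper can be handy. The sketch below is illustrative only; `xywh_to_xyxy` and `as_box_prompt` are hypothetical names, not part of the transformers API.

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x0, y0, x1, y1]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def as_box_prompt(boxes_xyxy):
    """Wrap the boxes of ONE image in the extra batch dimension."""
    return [[[float(v) for v in box] for box in boxes_xyxy]]


# The same box as in the example above, given in [x, y, width, height] form.
box_xyxy = xywh_to_xyxy([306, 132, 619, 761])
prompt = as_box_prompt([box_xyxy])
print(prompt)  # [[[306.0, 132.0, 925.0, 893.0]]]
```

The resulting list has the same nesting as `input_boxes` above.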

Automatic mask generation

from transformers import pipeline

# device=0 selects the first CUDA GPU.
generator = pipeline("mask-generation", model="syscv-community/sam-hq-vit-large", device=0)
image_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
# points_per_batch sets how many point prompts are processed per forward pass.
outputs = generator(image_url, points_per_batch=256)

import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

def show_mask(mask, ax, random_color=False):
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
    
plt.imshow(np.array(raw_image))
ax = plt.gca()
for mask in outputs["masks"]:
    show_mask(mask, ax=ax, random_color=True)
plt.axis("off")
plt.show()
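
When many automatically generated masks overlap, drawing the largest regions first keeps the smaller ones visible on top. The sketch below shows one way to order masks before the plotting loop. It assumes each mask is a boolean array, and `sort_masks_by_area` is a hypothetical helper rather than a library function.

```python
import numpy as np


def sort_masks_by_area(masks):
    """Return the masks ordered from largest to smallest pixel area."""
    return sorted(masks, key=lambda m: int(np.asarray(m, dtype=bool).sum()), reverse=True)


# Toy stand-ins for generated masks with 1, 9, and 4 foreground pixels.
small = np.zeros((4, 4), dtype=bool)
small[0, 0] = True
large = np.zeros((4, 4), dtype=bool)
large[:3, :3] = True
medium = np.zeros((4, 4), dtype=bool)
medium[:2, :2] = True

ordered = sort_masks_by_area([small, large, medium])
areas = [int(m.sum()) for m in ordered]
print(areas)  # [9, 4, 1]
```

In the plotting loop above, iterating over `sort_masks_by_area(outputs["masks"])` instead of `outputs["masks"]` would apply this ordering.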

Advanced usage

Full example with visualization

import numpy as np
import matplotlib.pyplot as plt


def show_mask(mask, ax, random_color=False):
    # Overlay one mask on the axes as a semi-transparent RGBA layer.
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


def show_box(box, ax):
    # box is [x0, y0, x1, y1] in pixel coordinates.
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2))


def show_points(coords, labels, ax, marker_size=375):
    # Points labeled 1 are drawn in green, points labeled 0 in red.
    pos_points = coords[labels == 1]
    neg_points = coords[labels == 0]
    ax.scatter(pos_points[:, 0], pos_points[:, 1], color="green", marker="*", s=marker_size, edgecolor="white", linewidth=1.25)
    ax.scatter(neg_points[:, 0], neg_points[:, 1], color="red", marker="*", s=marker_size, edgecolor="white", linewidth=1.25)


def show_boxes_on_image(raw_image, boxes):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    for box in boxes:
        show_box(box, plt.gca())
    plt.axis("on")
    plt.show()


def show_points_on_image(raw_image, input_points, input_labels=None):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    # Without explicit labels, every point gets label 1.
    if input_labels is None:
        labels = np.ones_like(input_points[:, 0])
    else:
        labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    plt.axis("on")
    plt.show()


def show_points_and_boxes_on_image(raw_image, boxes, input_points, input_labels=None):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    input_points = np.array(input_points)
    if input_labels is None:
        labels = np.ones_like(input_points[:, 0])
    else:
        labels = np.array(input_labels)
    show_points(input_points, labels, plt.gca())
    for box in boxes:
        show_box(box, plt.gca())
    plt.axis("on")
    plt.show()


def show_masks_on_image(raw_image, masks, scores):
    # Draw each candidate mask in its own subplot, titled with its score.
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()
    nb_predictions = scores.shape[-1]
    fig, axes = plt.subplots(1, nb_predictions, figsize=(15, 15))
    for i, (mask, score) in enumerate(zip(masks, scores)):
        mask = mask.cpu().detach()
        axes[i].imshow(np.array(raw_image))
        show_mask(mask, axes[i])
        axes[i].title.set_text(f"Mask {i+1}, Score: {score.item():.3f}")
        axes[i].axis("off")
    plt.show()


def show_masks_on_single_image(raw_image, masks, scores):
    # Overlay all candidate masks on one copy of the image.
    if len(masks.shape) == 4:
        masks = masks.squeeze()
    if scores.shape[0] == 1:
        scores = scores.squeeze()
    image_np = np.array(raw_image)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(image_np)
    for mask, score in zip(masks, scores):
        mask = mask.cpu().detach().numpy()
        show_mask(mask, ax)
    ax.set_title("Overlaid masks")
    ax.axis("off")
    plt.show()

import torch
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-large").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-large")

from PIL import Image
import requests
img_url = "https://raw.githubusercontent.com/SysCV/sam-hq/refs/heads/main/demo/input_imgs/example1.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
plt.imshow(raw_image)

# Compute the image embeddings once so they can be reused across prompts.
inputs = processor(raw_image, return_tensors="pt").to(device)
image_embeddings, intermediate_embeddings = model.get_image_embeddings(inputs["pixel_values"])

input_boxes = [[[306, 132, 925, 893]]]
show_boxes_on_image(raw_image, input_boxes[0])

# Encode the box prompt, then swap the pixel values for the cached embeddings.
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
inputs.pop("pixel_values", None)
inputs.update({"image_embeddings": image_embeddings})
inputs.update({"intermediate_embeddings": intermediate_embeddings})
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
scores = outputs.iou_scores

show_masks_on_single_image(raw_image, masks[0], scores)

show_masks_on_image(raw_image, masks[0], scores)

📄 License

This project is released under the Apache-2.0 license.

📜 Citation

If you use this model in your research, please cite it with the following BibTeX entry:

@misc{ke2023segmenthighquality,
      title={Segment Anything in High Quality}, 
      author={Lei Ke and Mingqiao Ye and Martin Danelljan and Yifan Liu and Yu-Wing Tai and Chi-Keung Tang and Fisher Yu},
      year={2023},
      eprint={2306.01567},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2306.01567}, 
}