license: apache-2.0
tags:
- object-detection
- vision
datasets:
- coco
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
YOLOS (tiny-sized) model
YOLOS model fine-tuned on COCO 2017 object detection (118k annotated images). It was introduced in the paper "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection" by Fang et al. and first released in this repository.
Disclaimer: The team releasing YOLOS did not write a model card for this model, so this model card has been written by the Hugging Face team.
Model description
YOLOS is a Vision Transformer (ViT) trained using the DETR loss. Despite its simplicity, a base-sized YOLOS model is able to achieve 42 AP on COCO 2017 validation (comparable to DETR and to more complex frameworks such as Faster R-CNN).
The model is trained using a "bipartite matching loss": the predicted classes + bounding boxes of each of the N = 100 object queries are compared to the ground-truth annotations, padded up to the same length N (so if an image only contains 4 objects, the remaining 96 annotations simply have "no object" as class and "no bounding box" as bounding box). The Hungarian matching algorithm is used to create an optimal one-to-one mapping between the N queries and the N annotations. Next, standard cross-entropy (for the classes) and a linear combination of the L1 and generalized IoU losses (for the bounding boxes) are used to optimize the model parameters. A minimal sketch of this matching-loss computation is shown below.
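The following sketch illustrates, under simplifying assumptions, how such a bipartite matching loss can be computed for a single image. It is not the actual YOLOS/DETR implementation: the function name matching_loss and the box-cost weight of 5.0 are chosen for illustration, and the generalized IoU term is omitted for brevity.

# Illustrative sketch of a DETR-style bipartite matching loss for a single image.
# Not the actual YOLOS implementation; names and weights here are assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matching_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, no_object_id):
    # pred_logits: (N, num_classes + 1), pred_boxes: (N, 4); gt_labels: (M,), gt_boxes: (M, 4), with M <= N
    probs = pred_logits.softmax(-1)
    # Cost matrix: class cost (negative probability of the true class) plus an L1 box cost
    # (the generalized IoU term used in practice is omitted here for brevity).
    cost_class = -probs[:, gt_labels]                   # (N, M)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M)
    cost = cost_class + 5.0 * cost_bbox
    # Hungarian algorithm: optimal one-to-one assignment of queries to ground-truth objects.
    query_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    query_idx, gt_idx = torch.as_tensor(query_idx), torch.as_tensor(gt_idx)
    # Queries left unmatched are assigned the "no object" class (the padding described above).
    target_classes = torch.full((pred_logits.shape[0],), no_object_id, dtype=torch.long)
    target_classes[query_idx] = gt_labels[gt_idx]
    loss_ce = F.cross_entropy(pred_logits, target_classes)
    loss_bbox = F.l1_loss(pred_boxes[query_idx], gt_boxes[gt_idx])
    return loss_ce + 5.0 * loss_bbox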
Intended uses & limitations
You can use the raw model for object detection. See the model hub to look for all available YOLOS models.
How to use
from transformers import YolosImageProcessor, YolosForObjectDetection
from PIL import Image
import torch
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
model = YolosForObjectDetection.from_pretrained('hustvl/yolos-tiny')
image_processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")
inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# the model predicts bounding boxes and corresponding COCO class logits
logits = outputs.logits
bboxes = outputs.pred_boxes
# convert outputs (bounding boxes and class logits) to COCO API format, keeping only detections with score > 0.9
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
Currently, both the feature extractor and the model support PyTorch only.
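As a small follow-up usage example, the detections from the snippet above can be drawn onto the image with PIL's ImageDraw. This is an illustrative sketch that reuses the image, results and model variables defined above; the output filename is arbitrary.

# Draw the detected boxes and labels onto the image (reuses variables from the example above).
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    # post_process_object_detection returns boxes as (x0, y0, x1, y1) in absolute pixel coordinates
    x0, y0, x1, y1 = box.tolist()
    draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
    draw.text((x0, y0), model.config.id2label[label.item()], fill="red")
image.save("detections.jpg")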
Training data
The YOLOS model was pre-trained on ImageNet-1k and fine-tuned on the COCO 2017 object detection dataset, which consists of 118k/5k annotated images for training/validation respectively.
Training procedure
The model was pre-trained for 300 epochs on ImageNet-1k and fine-tuned for 300 epochs on COCO.
Evaluation results
This model achieves an AP (average precision) of 28.7 on COCO 2017 validation. For more details on the evaluation results, we refer to the original paper.
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-2106-00666,
  author     = {Yuxin Fang and
                Bencheng Liao and
                Xinggang Wang and
                Jiemin Fang and
                Jiyang Qi and
                Rui Wu and
                Jianwei Niu and
                Wenyu Liu},
  title      = {You Only Look at One Sequence: Rethinking Transformer in Vision through
                Object Detection},
  journal    = {CoRR},
  volume     = {abs/2106.00666},
  year       = {2021},
  url        = {https://arxiv.org/abs/2106.00666},
  eprinttype = {arXiv},
  eprint     = {2106.00666},
  timestamp  = {Fri, 29 Apr 2022 19:49:16 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2106-00666.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}