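AM-RADIO: Reduce All Domains Into One
Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov
NVIDIA Research
[AM-RADIO Paper]
[PHI-S Paper]
[BibTex][GitHub examples]
[Tech report on v2.5]
HuggingFace Hub
You can load the model from a Python script: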
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
hf_repo = "nvidia/RADIO-B"
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()
summary, features = model(pixel_values)
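Usage
RADIO returns a tuple of two tensors:
summary: similar to the cls_token in ViT; it represents the general concept of the whole image. Its shape is $(B,C)$, where $B$ is the batch dimension and $C$ is the number of channels.
spatial_features: represents more localized content (this is the tensor named features in the loading snippet above), suited to dense tasks such as semantic segmentation or to integration into an LLM. Its shape is $(B,T,D)$, where $T$ is the number of flattened spatial tokens and $D$ is the number of spatial feature channels (note that $C \neq D$ in general).
To convert the spatial features to a spatial tensor format, use the model's downsampling size together with the shape of the input tensor. For 'radio_v1', the patch size is 14. In the snippet, x is the input image tensor and patch_size is the model's patch size: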
from einops import rearrange
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=x.shape[-2] // patch_size, w=x.shape[-1] // patch_size)
The resulting tensor has shape $(B,D,H,W)$, as is typical for computer vision models.
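The reshape can be checked without loading the model. The following is a minimal sketch, not part of the RADIO API: `to_spatial` is a hypothetical helper, the dummy array stands in for the model's spatial features, and NumPy is used in place of einops. It assumes the tokens are laid out row by row, which is what the `'b (h w) d -> b d h w'` pattern implies, and it uses the patch size of 14 stated for 'radio_v1'; other model versions may use a different value.

```python
import numpy as np

def to_spatial(features, height, width, patch_size):
    """Reshape (B, T, D) tokens into a (B, D, H', W') feature map.

    Hypothetical helper for illustration; mirrors the einops pattern
    'b (h w) d -> b d h w' with h = height // patch_size and
    w = width // patch_size.
    """
    h, w = height // patch_size, width // patch_size
    b, t, d = features.shape
    if t != h * w:
        raise ValueError(f"expected {h * w} tokens, got {t}")
    # Tokens are assumed row-major: token index = row * w + col.
    return features.reshape(b, h, w, d).transpose(0, 3, 1, 2)

# Dummy stand-in for the model output: a 224x224 input with patch size 14
# gives a 16x16 grid, i.e. T = 256 tokens. B = 2 and D = 8 are arbitrary.
dummy = np.arange(2 * 256 * 8, dtype=np.float32).reshape(2, 256, 8)
spatial = to_spatial(dummy, height=224, width=224, patch_size=14)
print(spatial.shape)  # (2, 8, 16, 16)
```

The token-count check is a design choice of this sketch: it fails early when the input size and patch size do not match the number of tokens actually returned, rather than producing a silently misaligned feature map.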
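RADIOv2.5 Notes
See the RADIOv2.5 technical report.
License
RADIO code and weights are released under the NSCLv1 License.
Citing RADIO
If you find this repository useful, please consider giving it a star and citing: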
@InProceedings{Ranzinger_2024_CVPR,
author = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
title = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {12490-12500}
}
@misc{ranzinger2024phisdistributionbalancinglabelfree,
title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation},
author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
year={2024},
eprint={2410.01680},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.01680},
}