---
tags:
- image-feature-extraction
- bird-recognition
- pytorch
library_name: birder
license: apache-2.0
base_model:
- birder-project/rope_vit_reg4_b14_capi
datasets:
- timm/imagenet-w21-webp-wds
---
# Model Card for rope_vit_reg4_b14_capi-imagenet21k

A RoPE ViT image classification model. The model follows a two-stage training process: CAPI pretraining, followed by fine-tuning on the ImageNet-21K dataset.
## A Note on RoPE Configuration

This model implements EVA-style Rotary Position Embeddings (RoPE). When processing inputs at a resolution different from the training resolution (224x224), the `pt_grid_size` parameter can be configured to optimize model behavior:

- For high-resolution inference or "shallow" fine-tuning, it is recommended to explicitly set `pt_grid_size=(16, 16)` (the default grid size during pretraining); a plain-PyTorch sketch of the underlying idea follows this list.
- For aggressive fine-tuning at high resolution, leave `pt_grid_size` as `None` to let the model adapt to the new resolution.
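The sketch below illustrates the coordinate-rescaling idea behind `pt_grid_size` in plain PyTorch. It is a minimal sketch of EVA-style RoPE interpolation, not birder's actual implementation; the function name, `theta` value, and frequency layout are all assumptions.

```python
import torch


def rope_angles(grid_size, dim, pt_grid_size=None, theta=100.0):
    # Patch coordinates of an (h, w) token grid
    h, w = grid_size
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    if pt_grid_size is not None:
        # Rescale coordinates so the new grid spans the pretraining grid,
        # keeping rotary angles in the range seen during pretraining
        ys = ys * pt_grid_size[0] / h
        xs = xs * pt_grid_size[1] / w

    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # Half of the rotary dimensions track y, the other half track x
    return torch.cat([ys.flatten()[:, None] * freqs, xs.flatten()[:, None] * freqs], dim=-1)


angles_train = rope_angles((16, 16), dim=16)                       # 224px pretraining grid
angles_336 = rope_angles((24, 24), dim=16, pt_grid_size=(16, 16))  # 336px inference, rescaled
print(angles_train.max().item(), angles_336.max().item())          # comparable angle ranges
```

With `pt_grid_size=None`, the 24x24 grid would instead produce angles well beyond the range seen on the 16x16 pretraining grid, which is why that setting is better suited to fine-tuning that lets the model adapt.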
Setting `pt_grid_size` during inference:

```sh
python predict.py --network rope_vit_reg4_b14 -t capi-imagenet21k --model-config '{"pt_grid_size":[16, 16]}' --size 336 ...
```
Setting the RoPE configuration explicitly when converting the model:

```sh
python tool.py convert-model --network rope_vit_reg4_b14 -t capi-imagenet21k --add-config '{"pt_grid_size":[16, 16]}'
```
## Model Details

## Model Usage

### Image Classification
```python
import birder
from birder.inference.classification import infer_image

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"
(out, _) = infer_image(net, image, transform)
# out is a NumPy array with shape of (1, num_classes)
```
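A quick way to inspect the strongest predictions, assuming `out` is a `(1, num_classes)` NumPy array of class scores as above (the mapping from index to class label depends on your class list and is not shown here):

```python
import numpy as np

# Indices of the top-5 scoring classes, highest first
top5 = np.argsort(out[0])[::-1][:5]
print(list(zip(top5.tolist(), out[0][top5].tolist())))
```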
### Image Embeddings
```python
import birder
from birder.inference.classification import infer_image

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = "path/to/image.jpeg"
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape of (1, embedding_size)
```
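A common use of these embeddings is image similarity or retrieval. A minimal sketch, assuming `emb_a` and `emb_b` are `(1, embedding_size)` arrays obtained as above from two different images (the variable names are placeholders):

```python
import numpy as np

# Cosine similarity between two L2-normalized embeddings
a = emb_a[0] / np.linalg.norm(emb_a[0])
b = emb_b[0] / np.linalg.norm(emb_b[0])
print(float(a @ b))  # closer to 1.0 means more similar images
```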
### Detection Feature Map
```python
from PIL import Image

import birder

(net, model_info) = birder.load_pretrained_model("rope_vit_reg4_b14_capi-imagenet21k", inference=True)

size = birder.get_size_from_signature(model_info.signature)
transform = birder.classification_transform(size, model_info.rgb_stats)

image = Image.open("path/to/image.jpeg")
features = net.detection_features(transform(image).unsqueeze(0))
# Each entry is a feature map tensor of shape (1, C, H, W)
print([(k, v.size()) for k, v in features.items()])
```
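These stage-wise maps can be wired into a detection neck. A minimal sketch, assuming torchvision is available and `features` is the dict returned above; this is an illustration, not part of the birder API, and the channel counts are read from the tensors rather than hard-coded:

```python
from collections import OrderedDict

from torchvision.ops import FeaturePyramidNetwork

# Build an FPN whose input channels match whatever the backbone returned
fpn = FeaturePyramidNetwork([v.size(1) for v in features.values()], out_channels=256)
pyramid = fpn(OrderedDict(features))
print([(k, v.size()) for k, v in pyramid.items()])
```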
## Citation
```bibtex
@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{heo2024rotarypositionembeddingvision,
      title={Rotary Position Embedding for Vision Transformer},
      author={Byeongho Heo and Song Park and Dongyoon Han and Sangdoo Yun},
      year={2024},
      eprint={2403.13298},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.13298},
}

@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers},
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588},
}

@misc{darcet2025clusterpredictlatentpatches,
      title={Cluster and Predict Latent Patches for Improved Masked Image Modeling},
      author={Timothée Darcet and Federico Baldassarre and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2025},
      eprint={2502.08769},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.08769},
}
```