Library name: transformers
License: apple-amlr
Metrics:
- accuracy
Pipeline tag: image-feature-extraction
Tags:
- vision
- image-feature-extraction
- mlx
- pytorch
Introduction
[AIMv2 Paper] [BibTeX]
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and to scale effectively. Some AIMv2 highlights include:
- Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks
- Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension
- Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% accuracy on ImageNet using a frozen trunk
Usage
PyTorch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
)
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
    trust_remote_code=True,
)

# Preprocess the image and run a forward pass to extract features
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
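The forward pass returns patch-level features rather than a single image vector. The snippet below is a minimal post-processing sketch, assuming the remote-code model exposes a standard last_hidden_state field of shape (batch, num_patches, hidden_dim); verify the output fields against the checkpoint before relying on them.

# Sketch (assumption): pool patch features into one image embedding
features = outputs.last_hidden_state    # assumed shape: (1, num_patches, hidden_dim)
image_embedding = features.mean(dim=1)  # simple mean pooling over patch tokens
print(image_embedding.shape)            # e.g. (1, hidden_dim)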
JAX
import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
)
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
    trust_remote_code=True,
)

# Preprocess the image and run a forward pass to extract features
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)
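The Flax outputs can be post-processed in the same way. The sketch below again assumes a last_hidden_state field and additionally L2-normalizes the pooled embedding so images can be compared by cosine similarity.

import jax.numpy as jnp

# Sketch (assumption): mean-pool patch tokens, then L2-normalize the embedding
features = outputs.last_hidden_state
embedding = features.mean(axis=1)
embedding = embedding / jnp.linalg.norm(embedding, axis=-1, keepdims=True)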
Citation
If you find our work useful, please consider citing:
@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}