---
tags:
- timm
- image-feature-extraction
- transformers
library_name: timm
license: mit
datasets:
- laion-en
- laion-zh
- coyo
- grit
- coco
- textcaps
- objects365
- openimages
- all-seeing
- wukong-ocr
- laioncoco-ocr
- other-ocr
---
# Model card for vit_intern300m_patch14_448.ogvl_dist

An InternViT image feature model. Pretrained by the paper authors with distillation from InternViT-6B using a wide variety of image-text data. Model weights have been converted from the original release at OpenGVLab/InternViT-300M-448px to timm's ViT format. NOTE: this ViT has no final norm layer before the features / head.
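Because the missing final norm can surprise code written against standard timm ViTs, here is a quick sanity-check sketch. It assumes timm exposes the layer as `model.norm` and substitutes `nn.Identity` when the final norm is disabled (true of timm's `VisionTransformer`, but verify against your installed version):

```python
import timm
import torch.nn as nn

# Instantiate the architecture only; no weights needed for this check.
model = timm.create_model('vit_intern300m_patch14_448', pretrained=False)

# Assumption: a disabled final norm appears as nn.Identity on model.norm.
print(isinstance(model.norm, nn.Identity))  # expected: True for this variant
```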
## Model Details
- **Model Type:** Image classification / feature backbone
- **Model Stats:**
  - Params (M): 304.0
  - GMACs: 362.0
  - Activations (M): 656.4
  - Image size: 448 x 448
- **Papers:**
  - InternVL2: Better than the Best: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/
  - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks: https://arxiv.org/abs/2312.14238
- **Original:** https://github.com/OpenGVLab/InternVL
- **Dataset:**
  - LAION-en
  - LAION-zh
  - COYO
  - GRIT
  - COCO
  - TextCaps
  - Objects365
  - OpenImages
  - All-Seeing
  - Wukong-OCR
  - LaionCOCO-OCR
  - other-OCR
## Model Usage

### Image Classification

```python
from urllib.request import urlopen

import timm
import torch
from PIL import Image

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_intern300m_patch14_448.ogvl_dist', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
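Note that this checkpoint ships as a feature backbone without a trained classifier head, so the classification template above may not yield real class logits. A small hedged check, assuming timm's standard `num_classes` attribute (present on `VisionTransformer`):

```python
# If num_classes is 0 (expected for a pure feature backbone), the "logits"
# above are actually pooled features, and softmax/topk over them is not
# a class prediction.
print(model.num_classes)  # expected: 0 for this checkpoint (assumption)
print(output.shape)       # (1, num_classes) or (1, num_features)
```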
### Feature Map Extraction

```python
from urllib.request import urlopen

import timm
from PIL import Image

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    print(o.shape)
```
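If only some of the feature maps are needed, recent timm versions let you restrict the `features_only` wrapper with `out_indices`. A minimal sketch continuing from the variables above; negative indices and the exact output shapes are assumptions (448/14 = 32 patches per side with a 1024-dim embedding would give 1024 x 32 x 32 maps):

```python
# Keep only the last two feature maps (assumes out_indices accepts negative
# indices, as in recent timm releases).
model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
    out_indices=(-2, -1),
).eval()

output = model(transforms(img).unsqueeze(0))
for o in output:
    print(o.shape)  # expected: torch.Size([1, 1024, 32, 32]) each (assumption)
```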
### Image Embeddings

```python
from urllib.request import urlopen

import timm
from PIL import Image

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1025, 1024) shaped tensor
output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
```
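A common downstream use of these embeddings is image-to-image similarity. A minimal sketch continuing from the model above; it compares an image with itself, so the expected similarity is 1.0 (substitute a second image in practice):

```python
import torch.nn.functional as F

# Embed the image twice and compare with cosine similarity.
x = transforms(img).unsqueeze(0)
emb1 = model.forward_head(model.forward_features(x), pre_logits=True)
emb2 = model.forward_head(model.forward_features(x), pre_logits=True)

print(F.cosine_similarity(emb1, emb2).item())  # ~1.0 for identical inputs
```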
## Citation

```bibtex
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
```