PE-Core-G14-448: Open-Source Image and Video Understanding Encoder

PE Core G14 448

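Developed by facebook (Meta)

Perception Encoder (PE) is an image and video understanding encoder trained with simple vision-language learning. It reaches state-of-the-art performance on a wide range of vision tasks.

Listed category: Text-to-Image · License: Apache-2.0 · Tags: #zero-shot visual understanding #multimodal contrastive learning #high-accuracy image classification

Downloads: 22.83k

Release date: 4/11/2025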

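Model Overview

Perception Encoder (PE) is a family of large-scale vision encoder models trained with a robust contrastive pretraining recipe and then fine-tuned on synthetically aligned videos. It is reported to outperform all existing models on classification and retrieval, and its internal layers also produce strong, general features that suit downstream tasks.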

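Model Features

Strong zero-shot capability

Very strong results on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval

General-purpose internal features

The model's internal layers produce strong, general features suited to many downstream tasks

Standout results on hard benchmarks

Performs especially well on hard benchmarks such as ObjectNet and ImageNet-A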

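Model Capabilities

Zero-shot image classification

Zero-shot image retrieval

Zero-shot video classification

Zero-shot video retrieval

Visual feature extraction

Text feature extraction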

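Use Cases

Image understanding

Image classification

Classify new images without fine-tuning

Reaches 85.4% zero-shot accuracy on ImageNet-1k

Image retrieval

Retrieve relevant images from a text query

Scores 58.1% on COCO text-to-image retrieval

Video understanding

Video classification

Classify new videos without fine-tuning

Reaches 76.9% zero-shot accuracy on Kinetics-400

Video retrieval

Retrieve relevant video clips from a text query

Scores 51.2% on VTT text-to-video retrieval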

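🚀 Perception Encoder

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding, trained with simple vision-language learning. It performs strongly across a variety of vision tasks and provides strong, general features for downstream tasks.

🚀 Quick Start

Installing the codebase

The pretraining code is available on GitHub. To install it: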

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec to decode videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .

This installs an editable version of the repository, so you can modify the code without reinstalling the package each time.
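
After installation it can help to confirm that the environment actually picked up the pinned versions. The following is a minimal sketch using only the Python standard library; the `check_pins` helper and the `PINS` dictionary are hypothetical names introduced here, not part of the repository. torchcodec is left out because its pin (`0.1`) may not match the installed version string exactly.

```python
from importlib import metadata

# Pins copied from the pip command above; the helper itself is hypothetical.
PINS = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        # Wheels from the cu124 index report versions like "2.5.1+cu124",
        # so compare only the part before the local-version "+" suffix.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is present at the expected version; anything printed points to a package that is missing or at a different version.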

Image and text feature extraction

The example below uses a trained model to extract image and text features:

import torch
from PIL import Image
import core.vision_encoder.pe as pe
import core.vision_encoder.transforms as transforms

print("CLIP configs:", pe.CLIP.available_configs())
# CLIP configs: ['PE-Core-G14-448', 'PE-Core-L14-336', 'PE-Core-B16-224']

model = pe.CLIP.from_config("PE-Core-G14-448", pretrained=True)  # downloads from Hugging Face
model = model.cuda()

preprocess = transforms.get_image_transform(model.image_size)
tokenizer = transforms.get_text_tokenizer(model.context_length)

image = preprocess(Image.open("docs/assets/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features, text_features, logit_scale = model(image, text)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]

You can find more details in the GitHub repository.
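
The scoring step in the example (`logit_scale * image_features @ text_features.T` followed by a softmax) can be reproduced without the model to see what it computes. The sketch below is a pure-Python illustration on toy 2-D vectors. It assumes the features are compared by cosine similarity, so it L2-normalizes them explicitly; the function names, the toy vectors, and the scale value of 100 are all made up for illustration and are not taken from the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_feat, text_feats, logit_scale):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy stand-ins for the embeddings of one image and three prompts.
toy_image = [0.1, 0.9]
toy_texts = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
probs = zero_shot_probs(toy_image, toy_texts, logit_scale=100.0)
best = probs.index(max(probs))
print("predicted prompt index:", best)
```

A large scale sharpens the softmax, which is why the probabilities in the real example collapse almost entirely onto one label.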

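✨ Key Features

State-of-the-art performance: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide range of vision tasks.
Strong features: with a robust contrastive pretraining recipe and fine-tuning on synthetically aligned videos, PE is reported to outperform all existing models on classification and retrieval, and it also produces strong, general internal features for downstream tasks.
Broad applicability: it achieves excellent results on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval.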

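📚 Detailed Documentation

Model Details

[📃 Tech Report] [📂 Github]

Perception Encoder (PE) was introduced in the paper "Perception Encoder: The best visual embeddings are not at the output of the network".

Model developer: Meta

Model overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide range of vision tasks. With a robust contrastive pretraining recipe and fine-tuning on synthetically aligned videos, PE is reported to outperform all existing models on classification and retrieval, and it also produces strong, general internal features for downstream tasks. PE makes it possible for large-scale contrastive pretraining to transfer to downstream tasks, using alignment tuning to capitalize on those general features.

Perception Encoder: Core

PE core is our base model. It is trained with our robust image pretraining schedule and fine-tuned on data generated by our synthetic video data engine.

Model Configurations

PE core currently comes in 3 sizes. PE core G is the main checkpoint; the L and B models are distilled from it.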

Scale	Tower	Params	Width	Depth	MLP	Heads	CLIP Dim	Resolution / Context Length
B/16	Vision	0.09B	768	12	3072	12	1024	224px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
L/14	Vision	0.32B	1024	24	4096	16	1024	336px
	Text	0.31B	1024	24	4096	16	1024	32 tokens
G/14	Vision	1.88B	1536	50	8960	16	1280	448px
	Text	0.47B	1280	24	5120	20	1280	72 tokens

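All PE core models use an attention pooling block with 8 heads on top of the vision tower. The L and B models additionally have a class token for global aggregation. See the paper for more details.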
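
As a rough sanity check on the configuration table, the parameter counts and the number of vision tokens can be estimated from the listed widths, depths and MLP sizes. The sketch below assumes standard transformer blocks (four width-by-width attention projections plus two MLP matrices) and ignores biases, norms, embeddings, the pooling block and projection heads, so it lands slightly below the table's figures. It also assumes that the number after the slash in "B/16", "L/14" and "G/14" is the patch size. Both function names are hypothetical.

```python
def approx_block_params(width, depth, mlp):
    """Weight-matrix parameters in `depth` standard transformer blocks.

    Per block: 4 * width^2 for the Q, K, V and output projections,
    plus 2 * width * mlp for the two MLP matrices.
    """
    return depth * (4 * width * width + 2 * width * mlp)


def vision_tokens(resolution, patch, class_token):
    """Patch tokens for a square image, plus one if a class token is used."""
    per_side = resolution // patch
    return per_side * per_side + (1 if class_token else 0)


# Vision-tower values copied from the table; patch sizes are assumed from
# the scale names, and class_token follows the note about the L and B models.
VISION_TOWERS = {
    "B/16": dict(width=768, depth=12, mlp=3072, resolution=224, patch=16, class_token=True),
    "L/14": dict(width=1024, depth=24, mlp=4096, resolution=336, patch=14, class_token=True),
    "G/14": dict(width=1536, depth=50, mlp=8960, resolution=448, patch=14, class_token=False),
}

for scale, cfg in VISION_TOWERS.items():
    params = approx_block_params(cfg["width"], cfg["depth"], cfg["mlp"])
    tokens = vision_tokens(cfg["resolution"], cfg["patch"], cfg["class_token"])
    print(f"{scale}: ~{params / 1e9:.2f}B block params, {tokens} tokens per image")
```

For G/14 the estimate is about 1.85B against the table's 1.88B, which suggests the transformer blocks account for nearly all of the vision tower's parameters.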

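Model Performance

PE core obtains very strong results on zero-shot image classification and retrieval as well as zero-shot video classification and retrieval. A selection of its results in those domains:

Model	Checkpoint	IN-1k	IN-v2	IN-A	ObjectNet	COCO-T2I	Kinetics-400	VTT-T2I
B/16 224px	PE-Core-B16-224	78.4	71.7	62.4	71.9	50.9	65.6	47.6
L/14 336px	PE-Core-L14-336	83.5	77.9	89.0	84.7	57.1	73.4	50.3
G/14 448px	PE-Core-G14-448	85.4	80.2	92.6	88.2	58.1	76.9	51.2

PE core performs particularly well on "hard" benchmarks such as ObjectNet and ImageNet-A.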
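
The table does not spell out how the retrieval columns (COCO-T2I, VTT-T2I) are scored. Text-to-image and text-to-video retrieval are commonly reported as Recall@K, so the sketch below shows that metric on a toy similarity matrix. It is an illustration under that assumption, simplified to one correct item per query; the function name and the toy scores are made up and do not reproduce the evaluation behind the numbers above.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity[q][i] is the score of text query q against item i;
    ground_truth[q] is the index of the single correct item for query q.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank item indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Three toy text queries scored against three toy items.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
toy_truth = [0, 1, 2]

r1 = recall_at_k(toy_similarity, toy_truth, k=1)
r2 = recall_at_k(toy_similarity, toy_truth, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In practice the similarity matrix would come from the dot products of text and image (or video) embeddings, as in the feature extraction example.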

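📄 License

This project is released under the Apache-2.0 license.

📖 Citation

If you find our code useful for your research, please consider citing the following papers: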

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}