---
pipeline_tag: text-to-image
---
# Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
[Paper] [GitHub]
This repository contains the model implementation of the paper *SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer*.
Project page: https://hanlab.mit.edu/projects/sana/

Figure 1: We address the reconstruction accuracy degradation of autoencoders with high spatial compression ratios.

Figure 2: DC-AE delivers significant training and inference speedups without performance loss.

Figure 3: DC-AE enables efficient text-to-image generation on a laptop.
## Abstract
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoders for accelerating high-resolution diffusion models. Existing autoencoders achieve impressive results at moderate spatial compression ratios (e.g., 8x) but fail to maintain satisfactory reconstruction accuracy at high compression ratios (e.g., 64x). We address this challenge with two key techniques: (1) Residual Autoencoding, where the model learns residuals on top of space-to-channel transformed features, easing the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy that mitigates the generalization penalty of high spatial-compression autoencoders. With these designs, we push the autoencoder's spatial compression ratio up to 128 while maintaining reconstruction quality. Applied to latent diffusion models, DC-AE delivers significant speedups without accuracy loss: on ImageNet 512x512, for example, it provides a 19.1x inference speedup and a 17.9x training speedup for the UViT-H model while achieving a better FID than the widely used SD-VAE-f8 autoencoder.
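To make these ratios concrete: an autoencoder with spatial compression ratio f maps an H x W image to an (H/f) x (W/f) latent grid, so moving from f8 to f64 cuts the number of latent positions the diffusion model processes by 64x at the same patch size. Below is a minimal back-of-the-envelope sketch; the `latent_shape` helper is illustrative, not part of the repository:

```python
# Hypothetical helper: latent tensor shape for spatial compression ratio f and c latent channels
def latent_shape(height: int, width: int, f: int, c: int) -> tuple[int, int, int]:
    assert height % f == 0 and width % f == 0
    return (c, height // f, width // f)

print(latent_shape(512, 512, f=8, c=4))     # SD-VAE-f8:     (4, 64, 64) -> 4096 latent positions
print(latent_shape(512, 512, f=64, c=128))  # DC-AE f64c128: (128, 8, 8) ->   64 latent positions
```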
## Usage
### Deep Compression Autoencoder
```python
from efficientvit.ae_model_zoo import DCAE_HF

dc_ae = DCAE_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0")

from PIL import Image
import torch
import torchvision.transforms as transforms
from torchvision.utils import save_image
from efficientvit.apps.utils.image import DMCrop

device = torch.device("cuda")
dc_ae = dc_ae.to(device).eval()

# Crop to the target resolution and normalize pixels to [-1, 1]
transform = transforms.Compose([
    DMCrop(512),  # resolution
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
image = Image.open("assets/fig/girl.png")
x = transform(image)[None].to(device)

# Encode: a 512x512 image becomes an 8x8 latent with 128 channels (f64c128)
latent = dc_ae.encode(x)
print(latent.shape)

# Decode and undo the normalization before saving
y = dc_ae.decode(latent)
save_image(y * 0.5 + 0.5, "demo_dc_ae.png")
```
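For a 512x512 input, the printed latent shape is torch.Size([1, 128, 8, 8]): 512/64 = 8 per spatial dimension, with 128 latent channels. As an optional sanity check, you can score the reconstruction against the input; the PSNR snippet below is a sketch that reuses `x` and `y` from the example above and is not part of the official usage:

```python
import torch.nn.functional as F

# Compare reconstruction to input in [0, 1] space after undoing the [-1, 1] normalization
mse = F.mse_loss(y * 0.5 + 0.5, x * 0.5 + 0.5)
psnr = 10 * torch.log10(1.0 / mse)
print(f"reconstruction PSNR: {psnr.item():.2f} dB")
```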
### Efficient Diffusion Models with DC-AE
```python
from efficientvit.diffusion_model_zoo import DCAE_Diffusion_HF

dc_ae_diffusion = DCAE_Diffusion_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px-train2000k")

import torch
import numpy as np
from torchvision.utils import save_image

torch.set_grad_enabled(False)
device = torch.device("cuda")
dc_ae_diffusion = dc_ae_diffusion.to(device).eval()

# Seed everything for reproducible sampling
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
eval_generator = torch.Generator(device=device)
eval_generator.manual_seed(seed)

# Class-conditional generation: each entry is an ImageNet-1k class index
prompts = torch.tensor(
    [279, 333, 979, 936, 933, 145, 497, 1, 248, 360, 793, 12, 387, 437, 938, 978], dtype=torch.int, device=device
)
num_samples = prompts.shape[0]
# Index 1000 is the null (unconditional) class used for classifier-free guidance
prompts_null = 1000 * torch.ones((num_samples,), dtype=torch.int, device=device)

# Sample latents with guidance scale 6.0, rescale, then decode to pixels
latent_samples = dc_ae_diffusion.diffusion_model.generate(prompts, prompts_null, 6.0, eval_generator)
latent_samples = latent_samples / dc_ae_diffusion.scaling_factor
image_samples = dc_ae_diffusion.autoencoder.decode(latent_samples)
save_image(image_samples * 0.5 + 0.5, "demo_dc_ae_diffusion.png", nrow=int(np.sqrt(num_samples)))
```
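Here `prompts` holds ImageNet-1k class indices (the model is class-conditional, trained on ImageNet 512x512), index 1000 acts as the null class for classifier-free guidance, and 6.0 is the guidance scale. To sample other classes, swap in any indices in [0, 999]; the snippet below is a sketch reusing the objects defined above, with arbitrarily chosen class indices and guidance scale:

```python
# Sketch: sample four arbitrary ImageNet-1k classes at a lower guidance scale
custom_classes = torch.tensor([7, 88, 417, 933], dtype=torch.int, device=device)
null_classes = 1000 * torch.ones_like(custom_classes)  # null class for classifier-free guidance
latents = dc_ae_diffusion.diffusion_model.generate(custom_classes, null_classes, 4.0, eval_generator)
images = dc_ae_diffusion.autoencoder.decode(latents / dc_ae_diffusion.scaling_factor)
save_image(images * 0.5 + 0.5, "demo_custom_classes.png", nrow=2)
```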
## Citation
If DC-AE is useful or relevant to your research, please kindly acknowledge our contributions by citing our paper:
```bibtex
@article{chen2024deep,
  title={Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models},
  author={Chen, Junyu and Cai, Han and Chen, Junsong and Xie, Enze and Yang, Shang and Tang, Haotian and Li, Muyang and Lu, Yao and Han, Song},
  journal={arXiv preprint arXiv:2410.10733},
  year={2024}
}
```
For the related SANA 1.5 work, you can cite:
```bibtex
@misc{xie2025sana,
  title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
  author={Enze Xie and Junyu Chen and Han Cai and Junsong Chen and Haotian Tang and Yao Lu and Song Han},
  year={2025},
  eprint={2501.18427},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.18427},
}
```