Wan2.1-I2V-14B-720P-Diffusers open-source video model: runs on consumer-grade GPUs, supports visual text generation

Wan2.1 I2V 14B 720P Diffusers

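Developed by grnr9730

Wan2.1 is a comprehensive, open suite of video foundation models offering state-of-the-art performance, consumer-grade GPU support, multi-task support, visual text generation, and an efficient video VAE.

Video processing · Multilingual support · License: Apache-2.0
Tags: #HD video generation, #multilingual text support, #low VRAM requirement

Downloads: 96

Released: 4/2/2025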

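Model Overview

Wan2.1 is an open and advanced large-scale video generative model. It supports image-to-video and several other tasks and performs strongly across multiple benchmarks.

Model Features

State-of-the-art performance

Consistently outperforms existing open-source models and commercial solutions across multiple benchmarks.

Supports consumer-grade GPUs

The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs.

Multi-task support

Performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks.

Visual text generation

The first video model able to generate both Chinese and English text, with robust text generation capability.

Efficient video VAE

Wan-VAE preserves temporal information while encoding and decoding 1080P videos of any length.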

Model Capabilities

Image-to-video

Text-to-video

Video editing

Text-to-image

Video-to-audio

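Use Cases

Creative content generation

Advertising video generation

Generate dynamic advertising videos from a still image and a text description.

Produces high-quality, engaging advertising content.

Social media content

Convert user-uploaded images into short videos.

Increases user engagement and content variety.

Education and training

Teaching video generation

Convert static charts and diagrams from teaching materials into animated demonstration videos.

Makes teaching materials more interactive and easier to understand.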

🚀 Wan2.1

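Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. It offers strong performance, supports consumer-grade GPUs, handles multiple tasks, can generate visual text, and includes a powerful video VAE.

Wan: Open and Advanced Large-Scale Video Generative Models

In this repository, we present Wan2.1, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. Wan2.1 offers these key features:

👍 SOTA performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
👍 Supports consumer-grade GPUs: The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques such as quantization). Its performance is even comparable to some closed-source models.
👍 Multiple tasks: Wan2.1 performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks, advancing the field of video generation.
👍 Visual text generation: Wan2.1 is the first video model capable of generating both Chinese and English text, and this robust text generation capability increases its practical value.
👍 Powerful video VAE: Wan-VAE delivers exceptional efficiency and performance. It encodes and decodes 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

This repository contains our I2V-14B model, which can generate 720P high-definition videos. After thousands of rounds of human evaluation, this model outperformed both closed-source and open-source alternatives, achieving state-of-the-art performance.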

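🎥 Video Demos

🔥 Latest News!

Feb 25, 2025: 👋 We have released the inference code and weights of Wan2.1.

📑 Todo List

Wan2.1 Text-to-Video
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints of the 14B and 1.3B models
- [x] Gradio demo
- [x] Diffusers integration
- [ ] ComfyUI integration
Wan2.1 Image-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints of the 14B model
- [x] Gradio demo
- [x] Diffusers integration
- [ ] ComfyUI integration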

🚀 Quick Start

📦 Installation

Clone the repository:

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

Install the dependencies:

# Ensure torch >= 2.4.0
pip install -r requirements.txt

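📥 Model Download

Model	Download Link	Notes
T2V-14B	🤗 Huggingface 🤖 ModelScope	Supports both 480P and 720P
I2V-14B-720P	🤗 Huggingface 🤖 ModelScope	Supports 720P
I2V-14B-480P	🤗 Huggingface 🤖 ModelScope	Supports 480P
T2V-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P

💡 Note: The 1.3B model is capable of generating videos at 720P resolution. However, because it received limited training at this resolution, the results are generally less stable than at 480P. For best results, we recommend using 480P resolution.

Download models using 🤗 huggingface-cli: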

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./Wan2.1-I2V-14B-720P-Diffusers

Download models using 🤖 modelscope-cli:

pip install modelscope
modelscope download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local_dir ./Wan2.1-I2V-14B-720P-Diffusers

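💻 Run Image-to-Video Generation

As with text-to-video, image-to-video can be run with or without the prompt extension step. The specific parameters and their corresponding settings are as follows: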

Task	480P	720P	Model
i2v-14B	❌	✔️	Wan2.1-I2V-14B-720P
i2v-14B	✔️	❌	Wan2.1-I2V-14B-480P
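As a rough illustration of how the resolution choice maps to a checkpoint, the sketch below assembles the argument list for a single-GPU `generate.py` run. The helper name `build_i2v_command`, the `ckpt_root` parameter, and the dictionary are hypothetical conveniences, not part of the Wan2.1 code base; the checkpoint directory names follow the model download table, and only flags that appear in this document are used.

```python
import shlex

# Resolution label -> checkpoint directory name (names follow the download table).
I2V_CHECKPOINTS = {
    "720P": "Wan2.1-I2V-14B-720P",
    "480P": "Wan2.1-I2V-14B-480P",
}


def build_i2v_command(resolution, size, image, prompt, ckpt_root="."):
    """Return the argument list for a single-GPU generate.py run.

    `resolution` selects the checkpoint; `size` is passed through unchanged
    (for example "1280*720" with the 720P checkpoint).
    """
    try:
        ckpt_name = I2V_CHECKPOINTS[resolution]
    except KeyError:
        raise ValueError(f"unsupported resolution: {resolution!r}") from None
    return [
        "python", "generate.py",
        "--task", "i2v-14B",
        "--size", size,
        "--ckpt_dir", f"{ckpt_root}/{ckpt_name}",
        "--image", image,
        "--prompt", prompt,
    ]


cmd = build_i2v_command(
    "720P", "1280*720", "examples/i2v_input.JPG", "A white cat on a surfboard."
)
# shlex.join quotes arguments such as the size and the prompt for a POSIX shell.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting mistakes if it is later passed to `subprocess.run`.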

(1) Without Prompt Extension

Single-GPU inference

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

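💡 For the image-to-video task, the size parameter specifies the area of the generated video; the aspect ratio follows that of the original input image.

Multi-GPU inference using FSDP + xDiT USP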

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Wan can also be run directly using 🤗 Diffusers!

import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = (
    "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
    "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
)
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image, prompt=prompt, negative_prompt=negative_prompt, height=height, width=width, num_frames=81, guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=16)
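The resizing arithmetic in the Diffusers example can be isolated into a small function, which makes it easier to see what output size a given input image will produce. This is a sketch only: the function name `fit_to_area` is hypothetical, and the default `mod_value=16` is an assumption (in the pipeline it is read from `pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]`).

```python
import math


def fit_to_area(image_width, image_height, max_area=720 * 1280, mod_value=16):
    """Pick a (width, height) close to max_area that keeps the image's aspect ratio.

    Both dimensions are rounded down to a multiple of mod_value, mirroring the
    rounding in the Diffusers example. mod_value=16 is an assumed default.
    """
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


# Landscape 16:9, square, and 4:3 inputs under the same 720 * 1280 area budget.
for w, h in [(1920, 1080), (1024, 1024), (1024, 768)]:
    print((w, h), "->", fit_to_area(w, h))
```

Because the rounding is always downward, the resulting area never exceeds `max_area`, at the cost of a small deviation from the exact input aspect ratio.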

(2) With Prompt Extension

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct:

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
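The two prompt-extension variants differ only in a few flags and in whether an API key is needed. The sketch below captures that difference; the function names and the `"local"` label are hypothetical, and the flags and the `DASH_API_KEY` variable are taken from the commands above rather than from the `generate.py` source.

```python
import os


def prompt_extend_args(method, model=None):
    """Return (extra CLI flags, required environment variable names).

    "local" and "dashscope" are labels used by this sketch only.
    """
    if method == "local":
        flags = ["--use_prompt_extend"]
        if model is not None:
            flags += ["--prompt_extend_model", model]
        return flags, []
    if method == "dashscope":
        flags = ["--use_prompt_extend", "--prompt_extend_method", "dashscope"]
        return flags, ["DASH_API_KEY"]
    raise ValueError(f"unknown prompt extension method: {method!r}")


def missing_env(required, environ=None):
    """List the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


flags, required = prompt_extend_args("dashscope")
print(flags, "missing:", missing_env(required))
```

Checking for the key before launching avoids loading a 14B checkpoint only to fail at the prompt-extension step.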

(3) Run a Local Gradio Demo

cd gradio
# If only the 480P model is used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P

# If only the 720P model is used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

# If both the 480P and 720P models are used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

👨‍⚖️ Human Evaluation

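We conducted extensive human evaluation to assess the performance of the image-to-video model; the results are presented in the table below. They show that Wan2.1 outperforms both closed-source and open-source models.

💪 Computational Efficiency on Different GPUs

We tested the computational efficiency of different Wan2.1 models on different GPUs; the results are presented in the table below in the format: total time (s) / peak GPU memory (GB).

The parameter settings for the tests in this table are as follows: (1) for the 1.3B model on 8 GPUs, set --ring_size 8 and --ulysses_size 1; (2) for the 14B model on 1 GPU, use --offload_model True; (3) for the 1.3B model on a single 4090 GPU, set --offload_model True --t5_cpu; (4) for all tests, no prompt extension was applied, that is, --use_prompt_extend was not enabled.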
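The settings listed above can be read as a small lookup from model size and GPU count to extra command-line flags. The following sketch encodes exactly those three cases and nothing more; the function name `benchmark_flags` and its parameters are hypothetical, and combinations the text does not mention return an empty list rather than a guess.

```python
def benchmark_flags(model_size, num_gpus, gpu_name=""):
    """Return the extra generate.py flags described for the efficiency tests.

    Only the three documented cases are covered; prompt extension is never
    enabled, matching setting (4).
    """
    if model_size == "1.3B" and num_gpus == 8:
        return ["--ring_size", "8", "--ulysses_size", "1"]
    if model_size == "14B" and num_gpus == 1:
        return ["--offload_model", "True"]
    if model_size == "1.3B" and num_gpus == 1 and "4090" in gpu_name:
        return ["--offload_model", "True", "--t5_cpu"]
    return []  # not specified in the text


for case in [("1.3B", 8, ""), ("14B", 1, ""), ("1.3B", 1, "RTX 4090")]:
    print(case, "->", " ".join(benchmark_flags(*case)))
```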

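📚 Introduction to Wan2.1

Wan2.1 is built on the mainstream diffusion transformer paradigm and achieves significant advances in generative capability through a series of innovations: our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Together, these contributions improve the model's performance and versatility.

(1) 3D Variational Autoencoder

We propose a novel 3D causal VAE architecture, termed Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Compared with other open-source VAEs, Wan-VAE shows significant advantages in performance efficiency. In addition, Wan-VAE can encode and decode 1080P videos of unlimited length without losing historical temporal information, which makes it particularly well suited to video generation tasks.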
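To give a feel for what spatio-temporal compression means in practice, the sketch below computes the latent shape a causal video VAE would produce for a clip. The compression factors (4 in time, 8 in space) and the rule that the first frame is encoded on its own are assumptions that this document does not state; the 16 latent channels match the input dimension in the architecture table. The function name `latent_shape` is hypothetical.

```python
def latent_shape(num_frames, height, width,
                 temporal_factor=4, spatial_factor=8, latent_channels=16):
    """Latent (channels, frames, height, width) for a causal video VAE.

    Assumed scheme: the first frame maps to one latent frame, and every
    following group of `temporal_factor` frames maps to one more.
    """
    if num_frames < 1 or (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames must be of the form temporal_factor * k + 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be multiples of spatial_factor")
    latent_frames = 1 + (num_frames - 1) // temporal_factor
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


# An 81-frame 720P clip, as used in the Diffusers example (num_frames=81).
print(latent_shape(81, 720, 1280))
```

Under these assumed factors, an 81-frame clip becomes 21 latent frames, which is why frame counts of the form 4k + 1 are convenient.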

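(2) Video Diffusion DiT

Wan2.1 is designed using the Flow Matching framework within the mainstream diffusion transformer paradigm. The architecture uses a T5 encoder to encode multilingual text input, and cross-attention in each transformer block embeds the text into the model structure. In addition, an MLP with a Linear layer and a SiLU layer processes the input time embeddings and predicts six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning a distinct set of biases. Our experiments show that, at the same parameter scale, this approach significantly improves performance.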
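The shared-MLP idea can be sketched in a few lines of plain Python. This is a toy illustration, not the Wan2.1 implementation: the class and function names are hypothetical, the weights are random, the ordering (SiLU first, then one linear projection to six times the model dimension) is an assumption, and so is modelling the per-block bias as an additive offset on the shared output.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


class SharedTimeModulation:
    """One SiLU + linear projection (dim -> 6 * dim) shared by every block."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Toy random weights; a trained model would learn these.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(6 * dim)]
        self.bias = [0.0] * (6 * dim)

    def __call__(self, time_embedding):
        hidden = [silu(v) for v in time_embedding]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.weight, self.bias)]


def block_modulation(shared_output, block_bias, dim):
    """Add one block's own bias, then split into six dim-sized vectors."""
    shifted = [s + b for s, b in zip(shared_output, block_bias)]
    return [shifted[i * dim:(i + 1) * dim] for i in range(6)]


dim = 4
shared = SharedTimeModulation(dim)
shared_output = shared([0.1, -0.2, 0.3, 0.0])  # computed once per timestep

# Two blocks reuse the same shared output but apply different biases.
block_biases = [[0.0] * (6 * dim), [0.5] * (6 * dim)]
for index, bias in enumerate(block_biases):
    params = block_modulation(shared_output, bias, dim)
    print(index, len(params), [round(v, 4) for v in params[0]])
```

The point of the design is that the projection, which holds most of the modulation parameters, exists once, while each block adds only a bias of size 6 × dim.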

Model	Dimension	Input Dimension	Output Dimension	Feedforward Dimension	Frequency Dimension	Number of Heads	Number of Layers
1.3B	1536	16	16	8960	256	12	30
14B	5120	16	16	13824	256	40	40
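A quick sanity check on the table is to derive the per-head dimension and a rough weight count from the listed values. The estimate below is a back-of-the-envelope sketch under stated assumptions: each block is taken to contain one self-attention and one cross-attention layer (four dim × dim projections each) plus a two-matrix feed-forward layer, and biases, norms, embeddings, and the modulation weights are ignored. The class name `WanDiTConfig` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanDiTConfig:
    name: str
    dim: int
    ffn_dim: int
    num_heads: int
    num_layers: int

    @property
    def head_dim(self):
        return self.dim // self.num_heads

    def approx_block_params(self):
        """Rough weight count of all transformer blocks (assumptions above)."""
        attention = 2 * 4 * self.dim * self.dim    # self-attn + cross-attn: q, k, v, out
        feed_forward = 2 * self.dim * self.ffn_dim  # up and down projections
        return self.num_layers * (attention + feed_forward)


CONFIGS = [
    WanDiTConfig("1.3B", dim=1536, ffn_dim=8960, num_heads=12, num_layers=30),
    WanDiTConfig("14B", dim=5120, ffn_dim=13824, num_heads=40, num_layers=40),
]

for cfg in CONFIGS:
    billions = cfg.approx_block_params() / 1e9
    print(f"{cfg.name}: head_dim={cfg.head_dim}, ~{billions:.2f}B block weights")
```

Under these assumptions both configurations have a head dimension of 128, and the estimates come out at roughly 1.39B and 14.05B, in line with the model names.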

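Data

We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During curation, we designed a four-step data cleaning process focusing on fundamental dimensions, visual quality, and motion quality. With this robust data processing pipeline, we can readily obtain high-quality, diverse, and large-scale training sets of images and videos.

Comparisons to SOTA

We compared Wan2.1 with leading open-source and closed-source models to evaluate its performance. Using a carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed the total score as a weighted combination of the per-dimension scores, with weights derived from human preferences in the matching process. The detailed results are shown in the table below. They indicate that our model outperforms both open-source and closed-source models.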
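The weighted total can be illustrated with a short function. This is a sketch under assumptions: the dimension names and numbers are made up for illustration, and normalising by the sum of the weights is one plausible reading of "weighted calculation", not a formula given in the text.

```python
def weighted_total(scores, weights):
    """Weighted average of per-dimension scores.

    `scores` and `weights` map dimension name -> value and must share keys.
    Dividing by the weight sum is an assumption of this sketch.
    """
    mismatched = set(scores) ^ set(weights)
    if mismatched:
        raise ValueError(f"dimensions without a score/weight pair: {sorted(mismatched)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical dimensions and values, for illustration only.
example_scores = {"motion_quality": 0.8, "visual_quality": 0.6}
example_weights = {"motion_quality": 3.0, "visual_quality": 1.0}
print(round(weighted_total(example_scores, example_weights), 4))
```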

📝 Citation

If you find our work helpful, please cite us:

@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {},
    year    = {2025}
}

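📄 License

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the content you generate and grant you the freedom to use it, provided that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the license.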