pseudo-flex-base: an open-source photography model fine-tuned from SD 2.1, with multi-aspect-ratio image generation

Pseudo Flex Base

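Developed by bghira

A multi-aspect-ratio photography model fine-tuned from Stable Diffusion 2.1, with support for dynamic-resolution image generation

Image generation · License: OpenRAIL · #multi-aspect-photography #high-resolution-generation #photorealistic

Downloads: 70

Released: 6/25/2023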

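Model overview

This is a multi-aspect-ratio photography model fine-tuned from stable-diffusion-2-1. It is tuned specifically for non-square aspect ratios and addresses the poor results that conventional models produce at wide and tall aspect ratios.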

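Model features

Multi-aspect support

Aspect-ratio bucketing improves generation quality at non-square ratios such as 16:9 and 4:3

High-resolution generation

The base resolution is 1024x1024, and larger images are also supported

Contrast improvements

Offset noise and SNR gamma were applied during training to address contrast problems

Diverse dataset

Combines several high-quality sources, including Kodachrome slides, Midjourney images, and National Geographic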

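Model capabilities

Text-to-image generation

High-resolution image generation

Multi-aspect-ratio image generation

Photorealistic image generation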

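Use cases

Photography

Portrait photography

Generate high-quality portraits at a range of aspect ratios

Produces natural-looking portraits at ratios such as 1:1, 4:3, and 16:9

Landscape photography

Generate wide-format natural scenery

Well suited to landscape photos at wide ratios such as 16:9

Creative design

Advertising assets

Generate images that fit a variety of advertising layouts

Supports advertising assets at different aspect ratios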

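🚀 Pseudo Flex Base (1024x1024 base resolution)

This is a photography model fine-tuned from stable-diffusion-2-1 to support a range of aspect ratios. It addresses the cropped look of generated images and the poor quality of non-square output.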

🚀 Quick start

Use the following code to get started with the model:

# Use PyTorch 2!
import os

import torch
from diffusers import DiffusionPipeline, DDPMScheduler

# Any model currently on the Hugging Face Hub.
model_id = 'ptx0/pseudo-flex-base'
pipeline = DiffusionPipeline.from_pretrained(model_id)

# Optimize!
pipeline.unet = torch.compile(pipeline.unet)
# Loaded here but not attached; assign it to pipeline.scheduler to use it.
scheduler = DDPMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler"
)

# Remove this line if it raises an error.
torch.set_float32_matmul_precision('high')

pipeline.to('cuda')
prompts = {
    "woman": "a woman, hanging out on the beach",
    "man": "a man playing guitar in a park",
    "lion": "Explore the ++majestic beauty++ of untamed ++lion prides++ as they roam the African plains --captivating expressions-- in the wildest national geographic adventure",
    "child": "a child flying a kite on a sunny day",
    "bear": "best quality ((bear)) in the swiss alps cinematic 8k highly detailed sharp focus intricate fur",
    "alien": "an alien exploring the Mars surface",
    "robot": "a robot serving coffee in a cafe",
    "knight": "a knight protecting a castle",
    "menn": "a group of smiling and happy men",
    "bicycle": "a bicycle, on a mountainside, on a sunny day",
    "cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
    "wizard": "a mage wizard, bearded and gray hair, blue  star hat with wand and mystical haze",
    "wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
    "macro": "a dramatic city-scape at sunset or sunrise",
    "micro": "RNA and other molecular machinery of life",
    "gecko": "a leopard gecko stalking a cricket"
}
# Make sure the output directory exists before saving.
os.makedirs('test', exist_ok=True)
for shortname, prompt in prompts.items():
    # Old prompt: ''
    image = pipeline(prompt=prompt,
        negative_prompt='malformed, disgusting, overexposed, washed-out',
        # num_inference_steps may be passed only once (25 here; 32 is another reasonable value).
        num_inference_steps=25, generator=torch.Generator(device='cuda').manual_seed(1641421826),
        width=1368, height=720, guidance_scale=7.5, guidance_rescale=0.3).images[0]
    image.save(f'test/{shortname}_nobetas.png', format="PNG")
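
The call above generates at width=1368 and height=720. Stable Diffusion pipelines in diffusers reject sizes that are not divisible by 8, so when trying other aspect ratios it can help to compute a valid size first. The helper below is a hypothetical sketch: the name `size_for_aspect`, the default short side, and the choice to round down are assumptions for illustration, not part of the model card.

```python
def size_for_aspect(aspect_w, aspect_h, short_side=720, multiple=8):
    """Return (width, height) for an aspect ratio of aspect_w:aspect_h.

    The shorter side is set to `short_side` and the longer side is scaled to
    match the ratio. Both values are rounded down to a multiple of `multiple`.
    """
    def snap(value):
        # Round down to the nearest multiple, but never below one multiple.
        return max(multiple, int(value) // multiple * multiple)

    if aspect_w >= aspect_h:
        height = snap(short_side)
        width = snap(short_side * aspect_w / aspect_h)
    else:
        width = snap(short_side)
        height = snap(short_side * aspect_h / aspect_w)
    return width, height


print(size_for_aspect(16, 9))                   # landscape
print(size_for_aspect(9, 16))                   # portrait
print(size_for_aspect(3, 2, short_side=1024))   # one of the fine-tuning ratios
```

The resulting pair can be passed as `width=` and `height=` in the pipeline call.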

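✨ Key features

Fine-tuned from stable-diffusion-2-1, supports a range of aspect ratios, and generates photographic images.
Addresses the cropped look of generated images and the poor quality of non-square output.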

📦 Installation

All preprocessing is done with the scripts in bghira/SimpleTuner on GitHub.

📚 Documentation

Model details

Model description

stable-diffusion-2-1 was fine-tuned to support dynamic aspect ratios. The fine-tuning resolutions were:

	Width	Height	Aspect ratio	Images
0	1024	1024	1:1	90561
1	1536	1024	3:2	8716
2	1365	1024	4:3	6933
3	1468	1024	~3:2	113
4	1778	1024	~5:3	6315
5	1200	1024	~5:4	6376
6	1333	1024	~4:3	2814
7	1281	1024	~5:4	52
8	1504	1024	~3:2	139
9	1479	1024	~3:2	25
10	1384	1024	~4:3	1676
11	1370	1024	~4:3	63
12	1499	1024	~3:2	436
13	1376	1024	~4:3	68

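Other aspect ratios have fewer images. The data handling could have been cleaner or more careful, but that is part of the experimental setup.

Developed by: pseudoterminal
Model type: diffusion-based text-to-image generation model
Language: English
License: creativeml-openrail-m
Parent model: https://huggingface.co/ptx0/pseudo-real-beta
Resources for more information: more information needed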

Uses

See https://huggingface.co/stabilityai/stable-diffusion-2-1 for details.

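Training details

Training data

A subset of the LAION HD dataset
- https://huggingface.co/datasets/laion/laion-high-resolution Only a small portion of it was used; see Preprocessing.

Preprocessing

All preprocessing is done with the scripts in bghira/SimpleTuner on GitHub.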

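Speeds, sizes, times

Dataset size: 100k image-text pairs after filtering.
Hardware: 1 A100 80G GPU
Optimizer: 8-bit Adam
Batch size: 150
- Actual batch size: 15
- Gradient accumulation steps: 10
- Effective batch size: 150
Learning rate: constant 4e-8, adjusted over time by reducing the batch size.
Training steps: ongoing (continuously updated)
Training time: about 4 days so far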

Model card authors

pseudoterminal

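🔧 Technical details

Background

The ptx0/pseudo-real-beta pretrained checkpoint was trained on a diverse dataset: 4,200 steps for the UNet and 15,600 steps for the text encoder, at a batch size of 15 with 10 gradient accumulation steps. The dataset included:

cushman (8,000 Kodachrome slides from 1939 to 1969)
midjourney v5.1, filtered (about 22,000 upscaled v5.1 images)
National Geographic (about 3,000 to 4,000 images larger than 1024x768: animals, wildlife, landscapes, history)
a small set of stock images of people smoking or vaping

That model produces realistic photographic and adventure-style images with strong prompt adherence, but it cannot handle multiple aspect ratios.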

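Training code

Full aspect-ratio bucketing support was added to the training loop's data loader. All images smaller than 1024x1024 are discarded, and the rest are resized so that the shorter side is 1024; the other dimension follows from the image's aspect ratio. Every image in a batch has the same resolution, and images that share an aspect ratio but differ in resolution are all resized to 1024x... or ...x1024. For example, a 1920x1080 image becomes roughly 1820x1024.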
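
The resize-and-group rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SimpleTuner implementation: the function names `bucket_size` and `group_into_buckets` are hypothetical, and a real data loader may quantize near-identical ratios into shared buckets, which this sketch does not do.

```python
from collections import defaultdict


def bucket_size(width, height, short_side=1024):
    """Return the resized (width, height), or None if the image is discarded.

    Images with either side below `short_side` are dropped. Otherwise the
    shorter side becomes `short_side` and the longer side keeps the ratio.
    """
    if width < short_side or height < short_side:
        return None
    if width >= height:
        return round(width * short_side / height), short_side
    return short_side, round(height * short_side / width)


def group_into_buckets(sizes):
    """Map each target resolution to the indices of the images that share it."""
    buckets = defaultdict(list)
    for index, (width, height) in enumerate(sizes):
        target = bucket_size(width, height)
        if target is not None:
            buckets[target].append(index)
    return dict(buckets)


# Hypothetical input sizes, chosen only to exercise each branch.
sizes = [(1920, 1080), (3840, 2160), (2048, 2048), (800, 600), (1080, 1920)]
for resolution, members in group_into_buckets(sizes).items():
    print(resolution, members)
```

Each batch can then be drawn from a single bucket, so every image in it has the same resolution.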

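Starting checkpoint

pseudo-flex-base was created by fine-tuning the stabilityai/stable-diffusion-2-1 768 base model with its text encoder frozen, for 1,000 steps on 148,000 images from LAION HD, using the TEXT field as each image's caption. The batch size was again effectively 150 (batch size 15 with 10 gradient accumulation steps). Training is very slow at very high resolutions: at aspect ratios of 1.5 to 1.7, each iteration takes about 700 seconds on an A100 80G. The whole run took two days.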

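Text encoder swap

At step 1000, the text encoder from ptx0/pseudo-real-beta was paired with this model's UNet as an experiment, to address residual image noise such as pixelation. It worked. Training was restarted from checkpoint 1000 with the new text encoder.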
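
For readers who want to try a similar pairing at inference time, the sketch below shows one way to combine a text encoder from one repository with the remaining components of another using diffusers. It is an illustration, not the author's training procedure. It assumes both repositories use the standard diffusers layout with a `text_encoder` subfolder, and the function name `build_swapped_pipeline` is hypothetical.

```python
def build_swapped_pipeline(unet_repo="ptx0/pseudo-flex-base",
                           text_encoder_repo="ptx0/pseudo-real-beta"):
    """Load a pipeline from `unet_repo` with the text encoder of another repo.

    Assumes `text_encoder_repo` stores its encoder under "text_encoder".
    """
    # Imported lazily so that defining the function needs no heavy dependencies.
    from diffusers import DiffusionPipeline
    from transformers import CLIPTextModel

    text_encoder = CLIPTextModel.from_pretrained(
        text_encoder_repo, subfolder="text_encoder"
    )
    # Components passed as keyword arguments replace the ones stored in the repo.
    return DiffusionPipeline.from_pretrained(unet_repo, text_encoder=text_encoder)
```

Calling `build_swapped_pipeline()` downloads both repositories. The two encoders must share the same architecture and tokenizer for the result to be meaningful.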

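The emergence of wide and portrait aspect ratios

Between steps 1300 and 2950, the validation prompts began to "come together". Some checkpoints regressed, but the regressions were usually resolved within about 100 steps. Despite them, the overall trend was improvement.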

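Degradation and dataset swap

After 3,000 steps at a batch size of 150 on 148,000 images, the images began to degrade. A likely cause is that every image in the dataset had been reused 3 times by then; and since some image filters discarded about 50,000 images, each image had effectively been used 9 times at the very low learning rate. This led to the following problems:

Images began to show static noise.
Training took too long, with little improvement per checkpoint.
The model overfit to prompt vocabulary and generalized poorly.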
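
The "reused 3 times" estimate follows directly from the step count, batch size, and dataset size. The short calculation below reproduces only that figure; the 9-repeat figure depends on how the filters interacted with the dataset, which the text does not spell out, so the sketch does not attempt it. The function name is hypothetical.

```python
def passes_over_dataset(steps, batch_size, grad_accum, dataset_size):
    """Average number of times each image is seen during training."""
    samples_seen = steps * batch_size * grad_accum
    return samples_seen / dataset_size


passes = passes_over_dataset(
    steps=3000, batch_size=15, grad_accum=10, dataset_size=148_000
)
print(f"{passes:.2f}")  # prints 3.04
```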

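So at step 1300, the decision was made to stop training on the original LAION HD dataset and to train instead on a newly retrieved high-resolution subset of Midjourney v5.1 data. The subset contains 17,800 images at a base resolution of 1024x1024, of which about 700 are portrait and 700 are landscape.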

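Contrast issues

When checkpoint 3275 was tested, darker images came out hazy and brighter images looked poor. Various CFG scale and guidance rescale values were tested; the best dark images came at guidance_scale = 9.2 and guidance_rescale = 0.0, but they were still "hazy".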

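Second dataset change

A new LAION subset was prepared, containing only unique images, no square images, and a limited set of aspect ratios:

16:9
9:16
2:3
3:2

The aim was to speed up learning and prevent overfitting to the captions. This LAION subset contains 17,800 images, evenly distributed across the aspect ratios. The images were then captioned with T5 Flan and BLIP2 to obtain highly accurate captions.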
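
A filter that keeps only the four listed ratios could look like the sketch below. It is an assumption-laden illustration: the 2% tolerance and the name `match_ratio` are invented for the example, and deduplication (the "unique images" requirement) is not shown.

```python
# The four aspect ratios (width:height) named in the list above.
ALLOWED_RATIOS = {"16:9": 16 / 9, "9:16": 9 / 16, "2:3": 2 / 3, "3:2": 3 / 2}


def match_ratio(width, height, tolerance=0.02):
    """Return the name of the allowed ratio the image matches, or None.

    `tolerance` is the accepted relative deviation from the target ratio;
    the 2% default is an assumption, not a value from the model card.
    """
    ratio = width / height
    for name, target in ALLOWED_RATIOS.items():
        if abs(ratio - target) / target <= tolerance:
            return name
    return None


# Hypothetical image sizes: square and 4:3 images are rejected.
for size in [(1920, 1080), (1080, 1920), (1500, 1000), (1024, 1024), (1365, 1024)]:
    print(size, match_ratio(*size))
```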

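Contrast fix: offset noise / SNR gamma to the rescue?

Offset noise and SNR gamma were applied experimentally to checkpoint 4250:

snr_gamma = 5.0
noise_offset = 0.2
noise_pertubation = 0.1

Within 25 training steps, contrast returned and the prompt a solid black square once again produced reasonable results. After 50 steps of offset-noise training, results had clearly improved, and a solid black square showed the least deformation. The step-75 checkpoint was broken: the SNR gamma calculation caused numerical instability, so that parameter was disabled while the offset noise parameters were left unchanged.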
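
To make the two settings concrete, here is a toy, dependency-free sketch of what they usually mean. It follows the commonly used formulations (Min-SNR loss weighting, and offset noise as one extra Gaussian draw per channel), which is an assumption about the training code rather than something the text confirms. The function names are hypothetical, and input perturbation (noise_pertubation) is not shown.

```python
import random


def min_snr_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Per-timestep loss weight under Min-SNR weighting.

    The SNR is clipped at `gamma`, then divided by the SNR (epsilon
    prediction) or by SNR + 1 (v-prediction).
    """
    clipped = min(snr, gamma)
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    raise ValueError(f"unknown prediction_type: {prediction_type}")


def add_offset_noise(noise, noise_offset=0.2, rng=random):
    """Shift each channel of `noise` by its own scaled Gaussian draw.

    `noise` is a list of channels, each a flat list of Gaussian samples.
    Every value in a channel receives the same offset, which lets the model
    learn to change the overall brightness of an image.
    """
    shifted = []
    for channel in noise:
        offset = noise_offset * rng.gauss(0.0, 1.0)
        shifted.append([value + offset for value in channel])
    return shifted


print(min_snr_weight(10.0))  # prints 0.5: high-SNR timesteps are down-weighted
print(min_snr_weight(2.0))   # prints 1.0: SNR below gamma is left alone
```

Note the division by `snr` in the epsilon branch: an SNR at or near zero makes that term ill-defined. That is one plausible, unconfirmed explanation for the kind of instability described above.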