sotediffusion-wuerstchen3 open-source model: free generation of high-quality anime-style images


Sotediffusion Wuerstchen3

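Developed by Disty0

An anime-style fine-tune of Würstchen V3, focused on generating high-quality anime-style images.

Image generation | English | License: other | #anime-style-generation #high-resolution-images #text-to-image

Downloads: 467

Released: 6/10/2024

Model overview

This is an anime-style text-to-image model based on the Würstchen V3 architecture. It was fine-tuned on 6 million images and generates high-quality anime-style images.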

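Model features

High-quality anime style

Focused on generating high-quality anime-style images

Large-scale training

Trained on 6 million images using 8 A100 80G GPUs

API support

Available through the Fal.AI API

Model capabilities

Text-to-image generation

Anime-style image generation

High-resolution image generation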

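Use cases

Creative art

Anime character design

Generate anime character concept art from text descriptions

High-quality anime-style character images

Anime scene generation

Generate anime-style scenes from text descriptions

Scene images at 1024x1536 or higher resolution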

🚀 SoteDiffusion Wuerstchen3

SoteDiffusion Wuerstchen3 is an anime fine-tune of Würstchen V3 that turns text prompts into anime-style images.

New version

A newer version has been released: https://huggingface.co/Disty0/sotediffusion-v2

🚀 Quick start

The model is available through the Fal.AI API. For details, see: https://fal.ai/models/fal-ai/stable-cascade/sote-diffusion

✨ Key features

This release is sponsored by fal.ai/grants.
Trained for 3 epochs on 6 million images using 8 A100 80G GPUs.

📦 Installation

SD.Next

Visit: https://github.com/vladmandic/automatic/
Go to Models -> Huggingface, enter Disty0/sotediffusion-wuerstchen3-decoder as the model name, and click download.
Once the download finishes, load Disty0/sotediffusion-wuerstchen3-decoder.

ComfyUI

See CivitAI: https://civitai.com/models/353284

💻 Usage example

Basic usage

import torch
from diffusers import StableCascadeCombinedPipeline

device = "cuda"
dtype = torch.bfloat16 # or torch.float16
model = "Disty0/sotediffusion-wuerstchen3-decoder"

pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()

prompt = "newest, extremely aesthetic, best quality, 1girl, solo, cat ears, pink hair, orange eyes, long hair, bare shoulders, looking at viewer, smile, indoors, casual, living room, playing guitar,"
negative_prompt = "very displeasing, worst quality, monochrome, realistic, oldest, loli,"
output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=2.0,
    prior_guidance_scale=7.0,
    prior_num_inference_steps=30,
    output_type="pil",
    num_inference_steps=10
).images[0]

# do something with the output image, e.g. output.save("output.png")

📚 Detailed documentation

Model parameters

Base training parameters

Parameter	Value
amp	bf16
weights	fp32
save weights	fp16
resolution	1024x1024
effective batch size	128
unet learning rate	1e-5
te learning rate	4e-6
optimizer	Adafactor
images	6M
epochs	3

Final training parameters

Parameter	Value
amp	bf16
weights	fp32
save weights	fp16
resolution	1024x1024
effective batch size	128
unet learning rate	4e-6
te learning rate	none
optimizer	Adafactor
images	120K
epochs	16
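As a rough illustration of what the two parameter tables imply, the sketch below estimates the number of optimizer steps in each phase. It assumes every image is seen once per epoch and that the last partial batch is kept. The table figures (6M, 120K) are rounded, so the results are ballpark numbers rather than values reported by the author.

```python
import math


def optimizer_steps(num_images: int, epochs: int, effective_batch_size: int) -> int:
    """Estimate optimizer steps, assuming one pass over all images per epoch."""
    steps_per_epoch = math.ceil(num_images / effective_batch_size)
    return steps_per_epoch * epochs


# Rounded figures taken from the tables above.
base_steps = optimizer_steps(num_images=6_000_000, epochs=3, effective_batch_size=128)
final_steps = optimizer_steps(num_images=120_000, epochs=16, effective_batch_size=128)

print(base_steps, final_steps)
```

Under these assumptions the final phase runs roughly a tenth as many steps as the base phase, at a lower UNet learning rate and with the text encoder frozen.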

Dataset information

Dataset size

Dataset	Total images
newest	1,848,331
recent	1,380,630
mid	993,227
early	566,152
oldest	160,397
pixiv	343,614
visual novel cg	231,358
anime wallpaper	104,790
Total	5,628,499

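Dataset notes

Minimum size is 1280x600 (768,000 pixels).
Deduplicated by image similarity using czkawka-cli.
About 120K high-quality images were intentionally repeated 5 times, bringing the total to 6.2 million images.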
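The figures above can be cross-checked with a few lines of arithmetic. This sketch sums the per-dataset counts from the table and adds the repeated high-quality images. Reading "repeated 5 times" as 5 extra copies is an assumption, and 120K is itself an approximate count.

```python
# Per-dataset image counts copied from the dataset size table.
dataset_sizes = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}

total_unique = sum(dataset_sizes.values())

HIGH_QUALITY_IMAGES = 120_000  # approximate, stated as "about 120K"
EXTRA_COPIES = 5               # assumption: 5 additional copies of each image

total_with_repeats = total_unique + HIGH_QUALITY_IMAGES * EXTRA_COPIES
min_pixels = 1280 * 600

print(total_unique, total_with_repeats, min_pixels)
```

The unique total matches the table's Total row, and the total with repeats lands at about 6.2 million, consistent with the stated figure.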

Tag information

Tag order

The model was trained with tags in random order, but the tags in the dataset are ordered as follows:

aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
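A minimal sketch of assembling a prompt in the dataset's tag order. The function name `build_prompt` and its parameters are hypothetical helpers, not part of any library. Since training used random tag order, following this order is a convenience rather than a requirement.

```python
from typing import Iterable


def build_prompt(
    aesthetic: str = "",
    quality: str = "",
    date: str = "",
    custom: Iterable[str] = (),
    rating: str = "",
    characters: Iterable[str] = (),
    series: Iterable[str] = (),
    tags: Iterable[str] = (),
) -> str:
    """Join tags in the order: aesthetic, quality, date, custom, rating,
    character, series, rest of the tags. Empty entries are skipped."""
    parts = [aesthetic, quality, date, *custom, rating, *characters, *series, *tags]
    # The trailing comma mirrors the style of the usage example's prompt.
    return ", ".join(part for part in parts if part) + ","


prompt = build_prompt(
    aesthetic="extremely aesthetic",
    quality="best quality",
    date="newest",
    tags=["1girl", "solo", "cat ears"],
)
print(prompt)
```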

Date tags

Tag	Date
newest	2022 to 2024
recent	2019 to 2021
mid	2015 to 2018
early	2011 to 2014
oldest	2005 to 2010
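The date table can be expressed as a small lookup. This is a sketch with a hypothetical helper name `date_tag`. Years outside the listed ranges return `None` because the table does not say how they were handled.

```python
from typing import Optional

# (first year, last year, tag), inclusive ranges from the date tag table.
DATE_RANGES = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the year is not covered."""
    for first_year, last_year, tag in DATE_RANGES:
        if first_year <= year <= last_year:
            return tag
    return None


print(date_tag(2023), date_tag(2012), date_tag(2004))
```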

Aesthetic tags

Score greater than	Tag	Count
0.90	extremely aesthetic	125,451
0.80	very aesthetic	887,382
0.70	aesthetic	1,049,857
0.50	slightly aesthetic	1,643,091
0.40	not displeasing	569,543
0.30	not aesthetic	445,188
0.20	slightly displeasing	341,424
0.10	displeasing	237,660
rest of them	very displeasing	328,712

Quality tags

Score greater than	Tag	Count
0.980	best quality	1,270,447
0.900	high quality	498,244
0.750	great quality	351,006
0.500	medium quality	366,448
0.250	normal quality	368,380
0.125	bad quality	279,050
0.025	low quality	538,958
rest of them	worst quality	1,955,966
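Both score tables follow the same pattern: a tag applies when the score exceeds its threshold, and the last row catches everything else. The sketch below encodes that reading. The strict `>` comparison is taken from the "score greater than" column header, the helper name `score_to_tag` is hypothetical, and the models that produced the scores are not named in this section.

```python
from typing import List, Tuple

# Thresholds in descending order, copied from the two tables above.
AESTHETIC_THRESHOLDS: List[Tuple[float, str]] = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_THRESHOLDS: List[Tuple[float, str]] = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score: float, thresholds: List[Tuple[float, str]], fallback: str) -> str:
    """Return the first tag whose threshold the score strictly exceeds."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback


print(score_to_tag(0.95, AESTHETIC_THRESHOLDS, "very displeasing"))
print(score_to_tag(0.01, QUALITY_THRESHOLDS, "worst quality"))
```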

Rating tags

Tag	Count
general	1,416,451
sensitive	3,447,664
nsfw	427,459
explicit nsfw	336,925

Custom tags

Dataset	Custom tags
image boards	date,
text	The text says "text",
characters	character, series
pixiv	art by Display_Name,
visual novel cg	Full_VN_Name (short_3_letter_name), visual novel cg,
anime wallpaper	date, anime wallpaper,
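The custom tag templates can be filled in with simple string formatting. The helper names and the example values below are hypothetical. Whether names keep their underscores or use spaces is not stated in the table, so the values are passed through unchanged.

```python
def pixiv_tag(display_name: str) -> str:
    """Template from the pixiv row: art by Display_Name,"""
    return f"art by {display_name},"


def visual_novel_tag(full_name: str, short_name: str) -> str:
    """Template from the visual novel cg row."""
    return f"{full_name} ({short_name}), visual novel cg,"


def text_tag(text: str) -> str:
    """Template from the text row: The text says "text","""
    return f'The text says "{text}",'


def wallpaper_tag(date: str) -> str:
    """Template from the anime wallpaper row: date, anime wallpaper,"""
    return f"{date}, anime wallpaper,"


# Example values are placeholders for illustration only.
print(pixiv_tag("Example Artist"))
print(visual_novel_tag("Example Visual Novel", "EVN"))
print(text_tag("hello"))
print(wallpaper_tag("newest"))
```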

🔧 Technical details

Training

Software: Kohya SD-Scripts, Stable Cascade branch: https://github.com/kohya-ss/sd-scripts/tree/stable-cascade
GPUs: 8x Nvidia A100 80GB
GPU hours: 220

Captioning

GPU used for captioning: 1x Intel ARC A770 16GB
GPU hours: 350
Tagging model: SmilingWolf/wd-swinv2-tagger-v3
Text model: llava-hf/llava-1.5-7b-hf
Tagging command:

python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
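To illustrate what the threshold and formatting flags in the command above do conceptually, here is a simplified sketch. It is not the tagger script's actual code: the function `build_caption` is hypothetical, the `>=` comparison is an assumption, and rating tags, character tag expansion, and appending to existing caption files are left out.

```python
from typing import Dict


def build_caption(
    general: Dict[str, float],
    character: Dict[str, float],
    general_threshold: float = 0.35,    # mirrors --general_threshold 0.35
    character_threshold: float = 0.50,  # mirrors --character_threshold 0.50
    separator: str = ", ",              # mirrors --caption_separator ", "
) -> str:
    """Filter predicted tags by confidence and join them into a caption."""
    kept_characters = [tag for tag, prob in character.items() if prob >= character_threshold]
    kept_general = [tag for tag, prob in general.items() if prob >= general_threshold]
    # Character tags go first, as with --character_tags_first.
    tags = kept_characters + kept_general
    # Replace underscores with spaces, as with --remove_underscore.
    return separator.join(tag.replace("_", " ") for tag in tags)


caption = build_caption(
    general={"1girl": 0.99, "cat_ears": 0.40, "hat": 0.20},
    character={"example_character": 0.80, "other_character": 0.30},
)
print(caption)
```

A lower general threshold keeps more descriptive tags at the cost of more false positives, while the higher character threshold reduces the chance of mislabeling a character.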