AltDiffusion-m9: Open-Source Multilingual Image Generation Model with Text-to-Image Support for 9 Languages

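AltDiffusion-m9

Developed by BAAI

AltDiffusion-m9 is a multilingual text-to-image generation model built on the Stable Diffusion framework. It supports 9 languages and is powered by the multilingual CLIP model AltCLIP-m9.

Text-to-Image | Multilingual | License: OpenRAIL | #multilingual-text-to-image #cross-lingual-alignment #high-fidelity-generation

Downloads: 46

Released: 11/18/2022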

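Model Overview

AltDiffusion-m9 is a multilingual text-to-image generation model based on the Stable Diffusion framework. It uses the multilingual CLIP model AltCLIP-m9 as its text encoder and was trained on the WuDao dataset and LAION data. The model performs strongly on multilingual alignment and is one of the strongest open-source multilingual text-to-image models currently available.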

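Model Features

Multilingual support

Supports text-to-image generation in 9 languages, including English, Chinese, and Spanish.

High-quality image generation

Performs strongly on multilingual alignment and, in some cases, produces better results than the original Stable Diffusion.

Commercial-use friendly

Commercial use and redistribution of the model weights are permitted, provided that the same use restrictions are included and a copy of the license is given to all users.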

Model Capabilities

Text-to-image generation

Multilingual text understanding

High-quality image synthesis

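Use Cases

Creative design

Character design

Generate character images from multilingual text descriptions, such as "dark elf princess".

Produces character images in a detailed fantasy style.

Scene design

Generate images of a specific scene from a text description.

Produces detailed scene images that match the description.

Artistic creation

Digital painting

Generate digital paintings from an artist's description.

Produces digital paintings with artistic value.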

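🚀 AltDiffusion

AltDiffusion is a multilingual text-to-image diffusion model. It generates high-quality images from prompts in multiple languages, performs well on multilingual alignment, and retains most of the capabilities of the original Stable Diffusion.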

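🚀 Quick Start

Using Gradio

AltDiffusion-m9 can be run through a Gradio Web UI.

Model weight download

The first time you run AltDiffusion-m9, the following weights are downloaded automatically from Hugging Face:

Model name	Size	Description
StableDiffusionSafetyChecker	1.13G	Safety checker for generated images
AltDiffusion-m9	8.0G	Supports English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It)
AltCLIP-m9	3.22G	Supports English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It)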
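
As a rough planning aid, the sizes in the table above can be added up to estimate how much disk space a first run may need. The sketch below is not part of the project: the names `WEIGHT_SIZES_GB` and `total_download_gb` are illustrative, and it assumes the three entries are separate downloads.

```python
# Sizes in GB, copied from the weights table above.
# Assumption: the three entries are downloaded separately and do not overlap.
WEIGHT_SIZES_GB = {
    "StableDiffusionSafetyChecker": 1.13,
    "AltDiffusion-m9": 8.0,
    "AltCLIP-m9": 3.22,
}


def total_download_gb(sizes):
    """Return the combined size of all listed weights, rounded to 2 decimals."""
    return round(sum(sizes.values()), 2)


print(f"Approximate download size: {total_download_gb(WEIGHT_SIZES_GB)} GB")
```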

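Running the Example Code

🧨 Diffusers Example

AltDiffusion-m9 has been added to 🧨 Diffusers! Our code example is available on Colab, and more details can be found on the Diffusers documentation page.

The following example generates an image with the fast DPM scheduler, which takes about 2 seconds on a V100.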

from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "黑暗精灵公主，非常详细，幻想，非常详细，数字绘画，概念艺术，敏锐的焦点，插图"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"

image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")

Transformers Example

from typing import Optional

import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline
from transformers import (
    BertPreTrainedModel,
    XLMRobertaConfig,
    XLMRobertaModel,
    XLMRobertaTokenizer,
)


class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2,project_dim=768,pooler_fn='cls',learn_encoder=False, **kwargs):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
        # self.learn_encoder = learn_encoder

class RobertaSeriesModelWithTransformation(BertPreTrainedModel):
    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
    base_model_prefix = 'roberta'
    config_class= XLMRobertaConfig
    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config)
        self.transformation = nn.Linear(config.hidden_size, config.project_dim)
        self.post_init()
        
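    def get_text_embeds(self,bert_embeds,clip_embeds):
        # NOTE: `merge_head` is not defined on this class, so calling this method
        # raises AttributeError; it is not used by the example below.
        return self.merge_head(torch.cat((bert_embeds,clip_embeds)))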

    def set_tokenizer(self, tokenizer):
        self.tokenizer = tokenizer

    def forward(self, input_ids: Optional[torch.Tensor] = None) :
        attention_mask = (input_ids != self.tokenizer.pad_token_id).to(torch.int64)
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        
        projection_state = self.transformation(outputs.last_hidden_state)
        
        return (projection_state,)

model_path_encoder = "BAAI/RobertaSeriesModelWithTransformation"
model_path_diffusion = "BAAI/AltDiffusion-m9"
device = "cuda"

seed = 12345
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path_encoder, use_auth_token=True)
tokenizer.model_max_length = 77

text_encoder = RobertaSeriesModelWithTransformation.from_pretrained(model_path_encoder, use_auth_token=True)
text_encoder.set_tokenizer(tokenizer)
print("text encode loaded")
pipe = StableDiffusionPipeline.from_pretrained(model_path_diffusion,
                                               tokenizer=tokenizer,
                                               text_encoder=text_encoder,
                                               use_auth_token=True,
                                               )
print("diffusion pipeline loaded")
pipe = pipe.to(device)

prompt = "Thirty years old lee evans as a sad 19th century postman. detailed, soft focus, candle light, interesting lights, realistic, oil canvas, character concept art by munkácsy mihály, csók istván, john everett millais, henry meynell rheam, and da vinci"
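# Use the seed defined above so that repeated runs produce the same image.
generator = torch.Generator(device=device).manual_seed(seed)
with torch.no_grad():
    image = pipe(prompt, guidance_scale=7.5, generator=generator).images[0]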
    
image.save("3.png")
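
The `guidance_scale=7.5` argument in the call above controls classifier-free guidance. As a minimal sketch of the standard formula (not code taken from this project), the guided noise prediction extrapolates from the unconditional prediction toward the text-conditioned one. Plain Python lists stand in for tensors here, and the function name `apply_guidance` is illustrative.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    guided = uncond + guidance_scale * (text - uncond)
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# With a scale of 1.0 the result equals the text-conditioned prediction;
# larger values push the result further in the direction of the prompt.
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 1.0))
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 7.5))
```

Higher scales generally follow the prompt more closely, usually at some cost in image diversity.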

You can adjust the settings by changing the parameters of the predict_generate_images function, as described below:

Parameter	Type	Description
prompt	str	The prompt text
out_path	str	The output path where images are saved
n_samples	int	Number of images to generate
skip_grid	bool	If True, the step that stitches all images into a single grid image is skipped
ddim_step	int	Number of DDIM sampling steps
plms	bool	If True, the PLMS sampler is used instead of the DDIM sampler
scale	float	How strongly the prompt influences the generated image; larger values give the prompt more influence
H	int	Height of the image
W	int	Width of the image
C	int	Number of channels of the generated images
seed	int	Random seed
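
To make the table concrete, the sketch below checks a dictionary of settings against the names and types listed above before they are passed on. It is a hypothetical helper, not part of the project: `PARAMETER_TYPES` and `check_settings` are illustrative names, the example values are arbitrary, and the exact signature of `predict_generate_images` should be confirmed in the project code.

```python
# Parameter names and types, copied from the table above.
PARAMETER_TYPES = {
    "prompt": str, "out_path": str, "n_samples": int, "skip_grid": bool,
    "ddim_step": int, "plms": bool, "scale": float,
    "H": int, "W": int, "C": int, "seed": int,
}


def check_settings(settings):
    """Raise if a name is not in the table or a value has the wrong type."""
    for name, value in settings.items():
        if name not in PARAMETER_TYPES:
            raise KeyError(f"unknown parameter: {name}")
        expected = PARAMETER_TYPES[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # for every field that is not declared as bool.
        if expected is not bool and isinstance(value, bool):
            raise TypeError(f"{name} expects {expected.__name__}, got bool")
        # Allow an int where a float is expected (for example scale=7).
        accepted = (int, float) if expected is float else expected
        if not isinstance(value, accepted):
            raise TypeError(
                f"{name} expects {expected.__name__}, got {type(value).__name__}"
            )
    return settings


# Illustrative values only; the source does not state any defaults.
settings = check_settings(
    {"prompt": "一只带着帽子的小狗", "n_samples": 2, "scale": 7.5, "seed": 12345}
)
print(settings)
```

The validated dictionary could then be unpacked as keyword arguments, assuming the function accepts these names.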

⚠️ Important Note

Model inference requires a GPU with at least 10 GB of memory.

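✨ Key Features

Multilingual support: supports English (En), Chinese (Zh), Spanish (Es), French (Fr), Russian (Ru), Japanese (Ja), Korean (Ko), Arabic (Ar), and Italian (It).
Multimodal tasks: can be used for multimodal tasks and performs well on multilingual alignment.
Strong performance: retains most of the capabilities of the original Stable Diffusion and outperforms the original model on some examples.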
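
For applications that accept prompts in arbitrary languages, it can help to check a language code against the nine listed above before generating. The following is a minimal sketch with illustrative names (`SUPPORTED_LANGUAGES`, `is_supported`); the model itself does not require or expose such a check.

```python
# Language codes and names as listed in the feature description above.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "zh": "Chinese",
    "es": "Spanish",
    "fr": "French",
    "ru": "Russian",
    "ja": "Japanese",
    "ko": "Korean",
    "ar": "Arabic",
    "it": "Italian",
}


def is_supported(lang_code):
    """Return True if the (case-insensitive) code is one of the nine listed languages."""
    return lang_code.lower() in SUPPORTED_LANGUAGES


print(is_supported("Zh"), is_supported("de"))
```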

📦 Installation Guide

First, install the diffusers main branch and the remaining dependencies:

pip install git+https://github.com/huggingface/diffusers.git torch transformers accelerate sentencepiece
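
After installation, a quick way to confirm that the packages named in the command above are present is to query their installed versions. This is a small sketch using only the standard library; `REQUIRED_PACKAGES` and `missing_packages` are illustrative names.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED_PACKAGES = ["diffusers", "torch", "transformers", "accelerate", "sentencepiece"]


def missing_packages(names):
    """Return the subset of distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Missing packages:", missing_packages(REQUIRED_PACKAGES) or "none")
```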

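📚 Detailed Documentation

Model information

Using AltCLIP-m9, we trained a multilingual diffusion model based on Stable Diffusion, with training data drawn from the WuDao dataset and LAION.

Our version performs very well on multilingual alignment and is, to our knowledge, the strongest open-source multilingual version currently available. It retains most of the capabilities of the original Stable Diffusion and outperforms the original model on some examples.

AltDiffusion-m9 is backed by a multilingual CLIP model named AltCLIP-m9, which is also available in this project. See the accompanying tutorial for more information.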

Model parameter counts

Module name	Number of parameters
AutoEncoder	83.7M
Unet	865M
AltCLIP-m9 TextEncoder	859M
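
The figures in the table can be turned into a back-of-the-envelope estimate of how much memory the weights alone occupy at a given precision. The sketch below is only an illustration with assumed names (`PARAMS_MILLIONS`, `weight_memory_gb`); it counts weights only, and activations and framework overhead during sampling come on top.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {
    "AutoEncoder": 83.7,
    "Unet": 865.0,
    "AltCLIP-m9 TextEncoder": 859.0,
}


def total_params_millions(params):
    """Total number of parameters, in millions."""
    return round(sum(params.values()), 1)


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return round(sum(params.values()) * 1e6 * bytes_per_param / 1e9, 2)


print("Total parameters (M):", total_params_millions(PARAMS_MILLIONS))
print("fp16 weights (GB):", weight_memory_gb(PARAMS_MILLIONS, 2))
print("fp32 weights (GB):", weight_memory_gb(PARAMS_MILLIONS, 4))
```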

🔧 Technical Details

We have published a report on AltCLIP-m9 that covers the details. If it is helpful to your work, please consider citing it.

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Please cite our paper if you find it helpful :)

@misc{ye2023altdiffusion,
      title={AltDiffusion: A Multilingual Text-to-Image Diffusion Model}, 
      author={Fulong Ye and Guang Liu and Xinya Wu and Ledell Wu},
      year={2023},
      eprint={2308.09991},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

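📄 License

This model is licensed under the CreativeML Open RAIL-M license. The authors claim no rights to the outputs you generate; you are free to use them and are accountable for their use, which must not violate the provisions of the license. The license prohibits sharing any content that violates any law, causes harm to a person, disseminates personal information intended to cause harm, spreads misinformation, or targets vulnerable groups. You may modify the model and use it commercially, but you must include a copy of the same use restrictions. For the full list of restrictions, please read the license.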

Basic model information

Name	Task	Language(s)	Model	Github
AltDiffusion-m9	Multimodal	Multilingual	Stable Diffusion	FlagAI

More Generation Results

Multilingual examples

Chinese-English alignment

prompt: dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap

Generated results from English prompts

prompt: 黑暗精灵公主，非常详细，幻想，非常详细，数字绘画，概念艺术，敏锐的焦点，插图

Generated results from Chinese prompts

Chinese-language performance

prompt: 带墨镜的男孩肖像，充满细节，8K高清 (portrait of a boy wearing sunglasses, full of detail, 8K HD)

prompt: 带墨镜的中国男孩肖像，充满细节，8K高清 (portrait of a Chinese boy wearing sunglasses, full of detail, 8K HD)

Long-image generation

prompt: 一只带着帽子的小狗 (a puppy wearing a hat)

Original Stable Diffusion:

Ours: