Llasa-3B Open-Source Text-to-Speech Model - Free to Use, Supports Chinese and English Speech Generation

Llasa 3B

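Developed by unsloth

Llasa is a LLaMA-based text-to-speech (TTS) system that extends the language model with speech tokens and supports Chinese and English speech generation.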

Speech Synthesis

Safetensors

Multilingual #Multilingual speech synthesis #Speech-prompted generation #LLM extension

Downloads: 55

Released: 5/15/2025

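Model Overview

Llasa is a text-to-speech (TTS) system that extends a text-based LLaMA language model with 65,536 speech tokens from the XCodec2 codebook. It can generate speech from input text alone or conditioned on a given speech prompt.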
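
To make the idea of speech tokens concrete, here is a minimal sketch of converting between integer codebook indices and the `<|s_N|>` token strings used in the usage examples on this page. The names `id_to_token` and `token_to_id` are hypothetical, and the 0-based index range (0 to 65,535) is an assumption drawn from the stated codebook size rather than something this page confirms.

```python
CODEBOOK_SIZE = 65_536  # number of XCodec2 speech tokens stated on this page


def id_to_token(speech_id: int) -> str:
    """Render a codebook index as a speech token string, e.g. 23456 -> <|s_23456|>."""
    # Assumption: valid indices run from 0 to CODEBOOK_SIZE - 1.
    if not 0 <= speech_id < CODEBOOK_SIZE:
        raise ValueError(f"speech id out of range: {speech_id}")
    return f"<|s_{speech_id}|>"


def token_to_id(token: str) -> int:
    """Parse a speech token string back into its integer codebook index."""
    if not (token.startswith("<|s_") and token.endswith("|>")):
        raise ValueError(f"not a speech token: {token!r}")
    # Strip the 4-character prefix "<|s_" and the 2-character suffix "|>".
    return int(token[4:-2])


print(id_to_token(23456))
print(token_to_id("<|s_23456|>"))
```

Raising on malformed input, instead of printing and continuing, is a design choice for this sketch: it surfaces unexpected tokens early instead of silently shortening the audio.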

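Model Features

Train-time and inference-time compute scaling: compute can be scaled up during both training and inference to improve model performance.
Multilingual support: generates speech in Chinese and English.
Speech-prompted generation: generates speech conditioned on a given speech prompt.
Efficient training: training the TTS model resembles training an LLM, so existing LLM compression, acceleration, and fine-tuning methods can be applied.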

Model Capabilities

Text-to-speech
Speech-prompted generation
Chinese and English speech synthesis

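Use Cases

Speech synthesis

Voice assistants: generate natural-sounding, high-quality speech for virtual assistants.
Audiobooks: convert written text into natural, fluent speech.

Speech-prompt applications

Voice style transfer: generate speech in a style similar to a given speech prompt while keeping that style consistent.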

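🚀 Speech Synthesis Model Project

This project focuses on text-to-speech (TTS). Built on a strong base model, it converts text into natural, fluent speech, and its training method and architecture are designed to deliver high-quality speech synthesis.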

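🚀 Quick Start

Browse the model collection: see all the TTS models we have uploaded.
Learn to fine-tune TTS models: read the guide on how to fine-tune TTS models.
Learn about Unsloth Dynamic 2.0: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.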

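✨ Key Features

Multiple base models: built on base models such as meta-llama/Llama-3.2-3B-Instruct and HKUSTAudio/Llasa-3B.
High performance: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.
Multilingual support: supports Chinese and English speech synthesis.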

📦 Installation

Install XCodec2.
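
Before running the examples, it can help to confirm that the required packages are importable. The sketch below is one way to check, assuming the import names that appear in the usage examples on this page (`xcodec2`, `transformers`, `torch`, `soundfile`); the helper `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names taken from the usage examples on this page.
for name in ("xcodec2", "transformers", "torch", "soundfile"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```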

💻 Usage Examples

Basic Usage

Speech synthesis from input text only:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b ='HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval() 
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model
 
model_path = "HKUSTAudio/xcodec2"  
 
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()   

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# input_text = '突然，身边一阵笑声。我看着他们，意气风发地挺直了胸膛，甩了甩那稍显肉感的双臂，轻笑道："我身上的肉，是为了掩饰我爆棚的魅力，否则，岂不吓坏了你们呢？"'
def ids_to_speech_tokens(speech_ids):
 
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
 
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]

            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

#TTS start!
with torch.no_grad():
 
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat, 
        tokenize=True, 
        return_tensors='pt', 
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id= speech_end_id ,
        do_sample=True,    
        top_p=1,           #  Adjusts the diversity of generated content
        temperature=0.8,   #  Controls randomness in output
    )
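    # Extract the speech tokens: skip the prompt and drop the final token,
    # which is <|SPEECH_GENERATION_END|> when generation stops normally
    generated_ids = outputs[0][input_ids.shape[1]:-1]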

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)   

    # Convert  token <|s_23456|> to int 23456 
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens) 
 

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
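
The example above wraps the input text in marker tokens and builds a two-message chat; the speech-prompt example does the same but appends prompt speech tokens to the assistant message. The sketch below factors that step into one function so the message layout can be inspected without loading a model. `build_chat` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List, Sequence


def build_chat(text: str, speech_prefix: Sequence[str] = ()) -> List[Dict[str, str]]:
    """Build the chat messages used in the usage examples.

    speech_prefix is an optional sequence of speech token strings such as
    "<|s_12345|>"; leave it empty for text-only synthesis.
    """
    formatted = f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
    return [
        {"role": "user", "content": "Convert the text to speech:" + formatted},
        {
            "role": "assistant",
            "content": "<|SPEECH_GENERATION_START|>" + "".join(speech_prefix),
        },
    ]


for message in build_chat("Hello there.", ["<|s_1|>", "<|s_2|>"]):
    print(message["role"], "->", message["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template(..., continue_final_message=True)` exactly as in the examples, so that the model continues the assistant message instead of starting a new turn.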

Advanced Usage

Speech synthesis conditioned on a given speech prompt:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b ='HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval() 
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model
 
model_path = "HKUSTAudio/xcodec2"  
 
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()   
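# Only 16 kHz speech is supported.
prompt_wav, sr = sf.read("太乙真人.wav")   # the WAV file is available under "Files"
#prompt_wav, sr = sf.read("Anna.wav") # English prompt
assert sr == 16000, f"Expected a 16 kHz prompt, got {sr} Hz"
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)  # shape: (1, num_samples)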

prompt_text ="对，这就是我万人敬仰的太乙真人，虽然有点婴儿肥，但也掩不住我逼人的帅气。"
#prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."
target_text = '突然，身边一阵笑声。我看着他们，意气风发地挺直了胸膛，甩了甩那稍显肉感的双臂，轻笑道："我身上的肉，是为了掩饰我爆棚的魅力，否则，岂不吓坏了你们呢？"'
#target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."
input_text = prompt_text   + target_text

def ids_to_speech_tokens(speech_ids):
 
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
 
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]

            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

#TTS start!
with torch.no_grad():
    # Encode the prompt wav
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt Vq Code Shape:", vq_code_prompt.shape )   

    vq_code_prompt = vq_code_prompt[0,0,:]
    # Convert int 12345 to token <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat, 
        tokenize=True, 
        return_tensors='pt', 
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id= speech_end_id ,
        do_sample=True,
        top_p=1,           
        temperature=0.8,
    )
    # Extract the speech tokens
    generated_ids = outputs[0][input_ids.shape[1]-len(speech_ids_prefix):-1]

    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)   

    # Convert  token <|s_23456|> to int 23456 
    speech_tokens = extract_speech_ids(speech_tokens)

    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens) 

    # To keep only the newly generated part, trim the prompt samples:
    # gen_wav = gen_wav[:,:,prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
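
The slicing in the example above is easy to get wrong, so here is a small sketch of the same index arithmetic on plain Python lists. It assumes, as the example does, that each prompt speech token string maps to exactly one token ID. Unlike the example, it drops the last token only when it actually is the end-of-speech ID, which matters if generation stops at `max_length` instead. `split_output` and the toy IDs are hypothetical.

```python
from typing import List, Tuple


def split_output(
    output_ids: List[int], input_len: int, prefix_len: int, eos_id: int
) -> Tuple[List[int], List[int]]:
    """Split generated output into (prompt speech tokens, newly generated speech tokens).

    input_len  : number of tokens fed to the model (text plus speech prefix)
    prefix_len : number of prompt speech tokens at the end of the input
    """
    # Keep the speech prefix from the input plus everything generated after it.
    speech_ids = output_ids[input_len - prefix_len:]
    # Drop the end-of-speech token only if generation actually produced it.
    if speech_ids and speech_ids[-1] == eos_id:
        speech_ids = speech_ids[:-1]
    return speech_ids[:prefix_len], speech_ids[prefix_len:]


# Toy IDs: three text tokens, two prompt speech tokens, three generated tokens, EOS = 9.
prompt_part, generated_part = split_output(
    [1, 2, 3, 100, 101, 200, 201, 202, 9], input_len=5, prefix_len=2, eos_id=9
)
print(prompt_part, generated_part)
```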

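📚 Documentation

Model Information

Llasa is a text-to-speech (TTS) system that extends text-based LLaMA language models (1B, 3B, and 8B) with speech tokens from the XCodec2 codebook, which contains 65,536 tokens. We trained Llasa on a dataset of 250,000 hours of Chinese and English speech. The model can generate speech from input text alone or conditioned on a given speech prompt.

The approach is fully compatible with the Llama framework, so training TTS resembles training a large language model (LLM): audio is converted into single-codebook tokens and treated as a special language. This opens the door to applying existing LLM compression, acceleration, and fine-tuning methods to TTS.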
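
As a rough illustration of treating audio as a special language, the sketch below builds the list of extra vocabulary entries implied by this page: the four marker tokens seen in the usage examples plus one token string per codebook entry. This is an assumption about how such a vocabulary could be assembled, not the authors' confirmed procedure, and `build_speech_vocab` is a hypothetical helper.

```python
from typing import List

CODEBOOK_SIZE = 65_536  # XCodec2 codebook size stated on this page

# Marker tokens that appear in the usage examples.
MARKERS = [
    "<|TEXT_UNDERSTANDING_START|>",
    "<|TEXT_UNDERSTANDING_END|>",
    "<|SPEECH_GENERATION_START|>",
    "<|SPEECH_GENERATION_END|>",
]


def build_speech_vocab(codebook_size: int = CODEBOOK_SIZE) -> List[str]:
    """Return one token string per codebook index: <|s_0|>, <|s_1|>, ..."""
    return [f"<|s_{i}|>" for i in range(codebook_size)]


new_tokens = MARKERS + build_speech_vocab()
print(len(new_tokens))  # 4 markers + 65,536 speech tokens = 65540

# With Hugging Face Transformers, a list like this could then be registered via
# tokenizer.add_tokens(new_tokens) followed by
# model.resize_token_embeddings(len(tokenizer)). Whether Llasa was built this
# way is an assumption; the released checkpoints already include these tokens.
```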

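Model Training and Testing

Training from scratch: to train the model from scratch, use the LLaSA Training Repository.
Test-time compute scaling: to experiment with compute scaling at test time, use the LLaSA Testing Repository.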

Model Support

Attribute	Details
Model type	Text-to-speech (TTS)
Training data	250,000 hours of Chinese and English speech
Supported models	Llasa-3B, Whisper Large V3, Qwen3 (14B), Llama 3.2 Vision (11B)

Model Performance

Model	Free notebook	Performance	Memory use
Llasa-3B	▶️ Start on Colab	1.5x faster	58% less
Whisper Large V3	▶️ Start on Colab	1.5x faster	50% less
Qwen3 (14B)	▶️ Start on Colab	2x faster	70% less
Llama 3.2 Vision (11B)	▶️ Start on Colab	1.8x faster	50% less

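Changelog

2025-05-10: We have found that top_p=0.95 and temperature=0.9 sometimes produce more stable results.
2025-02-13: Added Llasa fine-tuning instructions.
2025-02-07: Our paper has been released! LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis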
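
The 2025-05-10 entry suggests alternative sampling settings. One way to switch between them and the values used in the usage examples is to keep both as named presets, as in the sketch below. The preset names and the helper `generation_kwargs` are hypothetical; only the numeric values come from this page.

```python
from typing import Any, Dict

# Values used in the usage examples on this page.
DEFAULT_SAMPLING: Dict[str, Any] = {"do_sample": True, "top_p": 1.0, "temperature": 0.8}
# Values reported in the 2025-05-10 changelog entry as sometimes more stable.
STABLE_SAMPLING: Dict[str, Any] = {"do_sample": True, "top_p": 0.95, "temperature": 0.9}


def generation_kwargs(stable: bool = False, max_length: int = 2048) -> Dict[str, Any]:
    """Return keyword arguments for model.generate using one of the two presets."""
    preset = STABLE_SAMPLING if stable else DEFAULT_SAMPLING
    return {**preset, "max_length": max_length}


print(generation_kwargs(stable=True))
```

In the usage examples this would be applied as `model.generate(input_ids, eos_token_id=speech_end_id, **generation_kwargs(stable=True))`.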