CSM-1b-safetensors-quants: an open-source speech model that generates RVQ audio codes from text and audio input


Csm 1b Safetensors Quants

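Developed by lunahr

CSM (Conversational Speech Model) is a 1-billion-parameter speech generation model developed by Sesame. It generates RVQ audio codes from text and audio input.

Speech synthesis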

Transformers

Language: English | License: Apache-2.0 | #conversational-speech-generation #multi-speaker-support #context-aware-synthesis

Downloads: 37

Released: 3/15/2025

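Model overview

A speech generation model built on a Llama backbone and a lightweight audio decoder. It supports text-to-speech and outputs Mimi audio codes.

Model features

Multi-speaker support

The speaker parameter selects which speaker's voice is generated.

Context-aware generation

Context audio segments can be supplied to improve the generated speech.

Safetensors formats

The weights are provided in several Safetensors formats, and downloads are tracked.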

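Model capabilities

Text-to-speech

Multi-speaker speech generation

Context-aware speech synthesis

Use cases

Voice interaction

Speech output for dialogue systems

Combine with an LLM to build a complete conversational system.

An interactive voice demo is showcased in the blog post.

Content creation

Audio content generation

Automatically generate spoken content such as podcasts and audiobooks.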

🚀 CSM 1B (Safetensors)

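CSM 1B (Safetensors) is a conversion of the original release into several Safetensors formats. This repository also tracks downloads.

2025/03/13 - We are releasing the 1B variant of CSM. The code is available on GitHub: SesameAILabs/csm.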

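🚀 Quick start

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The architecture uses a Llama backbone and a smaller audio decoder that produces Mimi audio codes.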
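For readers unfamiliar with the term, RVQ (residual vector quantization) represents a vector as one index per codebook, where each codebook quantizes whatever error the previous one left behind. The following is a toy sketch of that idea in plain Python. The two tiny codebooks and the function names are made up for illustration; they are not the Mimi codec or any part of the CSM code.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, vec: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector by summing the selected entry of every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level setup: a coarse codebook, then a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
codes = rvq_encode([1.2, 1.0], codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

Each additional codebook level refines the reconstruction, which is why an audio frame is described by a small stack of codes rather than a single index.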

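A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.

A hosted HuggingFace Space is also available for testing audio generation.

📦 Installation

Set up the repository: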

python -m venv .venv
source .venv/bin/activate
curl -s -L https://raw.githubusercontent.com/SesameAILabs/csm/refs/heads/main/requirements.txt | pip install -r /dev/stdin

# You will need access to sesame/csm-1b and meta-llama/Llama-3.2-1B
huggingface-cli login

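💻 Usage examples

Basic usage

Generate a sentence (the generator module comes from the SesameAILabs/csm repository, so run this from a checkout of it):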

from generator import load_csm_1b
import torchaudio

generator = load_csm_1b(device="cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
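To get a feel for what max_audio_length_ms=10_000 allows, the arithmetic below converts a millisecond budget into waveform samples and codec frames. It is a rough sketch: the helper name audio_budget is hypothetical, and the defaults of 24 kHz and 12.5 frames per second are assumptions taken from the published figures for the Mimi codec. In real code, read generator.sample_rate rather than relying on these defaults.

```python
def audio_budget(
    max_audio_length_ms: int,
    sample_rate: int = 24_000,    # assumed; use generator.sample_rate in practice
    frame_rate_hz: float = 12.5,  # assumed Mimi frame rate
) -> tuple:
    """Return (max waveform samples, max codec frames) for a duration cap."""
    max_samples = max_audio_length_ms * sample_rate // 1000
    max_frames = int(max_audio_length_ms / 1000 * frame_rate_hz)
    return max_samples, max_frames


samples, frames = audio_budget(10_000)
```

Under these assumptions a 10-second cap corresponds to at most 240,000 samples in the saved WAV file.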

Advanced usage

CSM sounds best when it is given context. You can prompt the model, or give it context, by passing one Segment per speaker utterance:

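# Segment is defined alongside load_csm_1b in the repository's generator module.
from generator import Segment

speakers = [0, 1, 0, 0]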
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
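The example above builds its context with zip over three parallel lists, and zip silently drops trailing items when the lists differ in length. One way to guard against a missing transcript or audio file is to validate the lists before building segments. The sketch below uses a stand-in TurnSpec dataclass and a pair_turns helper, both hypothetical names that are not part of the CSM code; each TurnSpec would then be turned into a Segment as in the example above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TurnSpec:
    """Hypothetical record for one context utterance (not a CSM class)."""
    speaker: int
    text: str
    audio_path: str


def pair_turns(
    speakers: Sequence[int],
    transcripts: Sequence[str],
    audio_paths: Sequence[str],
) -> List[TurnSpec]:
    """Pair the parallel lists, failing loudly if their lengths differ."""
    if not (len(speakers) == len(transcripts) == len(audio_paths)):
        raise ValueError(
            "speakers, transcripts, and audio_paths must have the same length: "
            f"{len(speakers)}, {len(transcripts)}, {len(audio_paths)}"
        )
    return [
        TurnSpec(speaker=s, text=t, audio_path=p)
        for s, t, p in zip(speakers, transcripts, audio_paths)
    ]


turns = pair_turns(
    [0, 1],
    ["Hey how are you doing.", "Pretty good, pretty good."],
    ["utterance_0.wav", "utterance_1.wav"],
)
```

Failing early here is cheaper than discovering, after generation, that the last utterance never reached the model.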