CrisperWhisper open-source speech recognition model: free to deploy, fast and precise verbatim transcription

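CrisperWhisper

Uploaded by unsloth (original model by nyrahealth)

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, verbatim speech recognition with accurate (crisp) word-level timestamps.

Speech recognition

Transformers

Supports multiple languages #verbatim speech transcription #precise timestamps #filler word detection

Downloads: 50

Released: 5/14/2025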

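Model overview

CrisperWhisper is an improved version of Whisper that aims to transcribe every spoken word exactly as uttered, including fillers, pauses, stutters, and false starts, and to provide more accurate word-level timestamps.

Model features

Precise word-level timestamps

An adjusted tokenizer and a custom attention loss during training yield precise timestamps, even around disfluencies and pauses.

Verbatim transcription

Transcribes every spoken word as uttered and distinguishes fillers such as 'um' and 'uh'.

Filler word detection

Detects and accurately transcribes fillers.

Reduced hallucinations

Minimizes transcription hallucinations to improve accuracy.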

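Model capabilities

Speech recognition

Word-level timestamp generation

Filler word detection

Multilingual support

Use cases

Speech transcription

Meeting minutes

Records every word spoken in a meeting, including pauses and fillers.

Yields more complete meeting records for later analysis.

Interview transcription

Transcribes interviews while preserving all features of spontaneous speech.

Yields more faithful interview records for studying spoken expression.

Speech analysis

Spoken-language analysis

Analyzes filler and pause patterns in speech.

Helps linguists study the characteristics of spoken language.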

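🚀 CrisperWhisper

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, verbatim speech recognition with accurate word-level timestamps.

🔗 Related links

See our collection for all of our TTS model uploads.
Learn to fine-tune TTS models: read our guide.
Learn about Unsloth Dynamic 2.0.

🌟 Models supported by Unsloth

Unsloth supports	Free notebook	Performance	Memory use
Orpheus-TTS	▶️ Start on Colab	1.5x faster	58% less
Whisper Large V3	▶️ Start on Colab	1.5x faster	50% less
Qwen3 (14B)	▶️ Start on Colab	2x faster	70% less
Llama 3.2 Vision (11B)	▶️ Start on Colab	1.8x faster	50% less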

🔍 Model information

Property	Details
License	cc-by-nc-4.0
Base model	openai/whisper-large-v3, nyrahealth/CrisperWhisper
Metrics	cer, wer
Task type	automatic-speech-recognition
Library	transformers

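🚀 Quick start

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, verbatim speech recognition with accurate (crisp) word-level timestamps. The original Whisper tends to omit disfluencies and follows more of an intended transcription style. CrisperWhisper instead aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts. For more details, see our repository: https://github.com/nyrahealth/CrisperWhisper

✨ Key features

🎯 Accurate word-level timestamps: an adjusted tokenizer and a custom attention loss during training provide precise timestamps, even around disfluencies and pauses.
📝 Verbatim transcription: transcribes every spoken word exactly as it is, including and differentiating fillers such as "um" and "uh".
🔍 Filler detection: detects and accurately transcribes fillers.
🛡️ Hallucination mitigation: minimizes transcription hallucinations to improve accuracy.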

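📚 Detailed documentation

🏆 Highlights

First place on the OpenASR Leaderboard for verbatim datasets (TED, AMI).
Accepted at INTERSPEECH 2024.
Paper released: see our paper for details on how and why we adjusted the tokenizer.
✨ New feature: not mentioned in the paper is an added attention loss that further improves timestamp accuracy. By adding a loss that trains the attention scores used for DTW alignment on timestamped data, we significantly improved alignment performance.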

1️⃣ Performance overview

1.1 Qualitative performance overview

Audio	Whisper Large V3	CrisperWhisper
Demo de 1	Er war kein Genie, aber doch ein fähiger Ingenieur.	Es ist zwar kein. Er ist zwar kein Genie, aber doch ein fähiger Ingenieur.
Demo de 2	Leider müssen wir in diesen schweren Zeiten auch unserem Tagesgeschäft nachgehen. Der hier vorgelegte Kulturhaushalt der Ampelregierung strebt an, den Erfolgskurs der Union zumindest fiskalisch fortzuführen.	Leider [UH] müssen wir in diesen [UH] schweren Zeiten auch [UH] unserem [UH] Tagesgeschäft nachgehen. Der hier [UH] vorgelegte [UH] Kulturhaushalt der [UH] Ampelregierung strebt an, den [UH] Erfolgskurs der Union [UH] zumindest [UH] fiskalisch fortzuführen. Es.
Demo de 3	die über alle FRA-Fraktionen hinweg gut im Blick behalten sollten, auch weil sie teilweise sehr teeteuer sind. Aber nicht nur, weil sie teeteuer sind. Wir steigen mit diesem Endentwurf ein in die sogenannten Pandemie-Bereitschaftsverträge.	Die über alle Fr Fraktionen hinweg gut im [UH] Blick behalten sollten, auch weil sie teil teilweise sehr te teuer sind. Aber nicht nur, weil sie te teuer sind. Wir [UH] steigen mit diesem Ent Entwurf ein in die sogenannten Pand Pandemiebereitschaftsverträge.
Demo en 1	alternative is you can get like, you have those Dr. Bronner's	Alternative is you can get like [UH] you have those, you know, those doctor Brahmer's.
Demo en 2	influence our natural surrounding? How does it influence our ecosystem?	Influence our [UM] our [UH] our natural surrounding. How does it influence our ecosystem?
Demo en 3	and always find a place on the street to park and it was easy and you weren't a long distance away from wherever it was that you were trying to go. So I remember that being a lot of fun and easy to do and there were nice places to go and good events to attend. Come downtown and you had the Warner Theater and	And always find a place on the street to park. And and it was it was easy and you weren't a long distance away from wherever it was that you were trying to go. So, I I I remember that being a lot of fun and easy to do and there were nice places to go and, [UM] i good events to attend. Come downtown and you had the Warner Theater and, [UM]
Demo en 4	you know, more masculine, who were rough, and that definitely wasn't me. Then, you know, I was very smart because my father made sure I was smart, you know. So, you know, I hung around those people, you know. And then you had the ones that were just out doing things that they shouldn't have been doing also. So, yeah, I was in the little geek squad. You were in the little geek squad. Yeah.	you know, more masculine, who were rough, and that definitely wasn't me. Then, you know, I was very smart because my father made sure I was smart. You know, so, [UM] you know, I I hung around those people, you know. And then you had the ones that were just just out doing things that they shouldn't have been doing also. So yeah, I was the l I was in the little geek squad. Do you
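
The CrisperWhisper transcripts in the table above mark fillers with the tokens [UH] and [UM]. As a minimal sketch of how such output might be analyzed, the snippet below counts those markers in a transcript string. The helper name `count_fillers` and the regular expression are illustrative assumptions, not part of the CrisperWhisper codebase.

```python
import re
from collections import Counter

# Assumption: fillers appear only as the bracketed tokens [UH] and [UM],
# as in the qualitative examples above.
FILLER_PATTERN = re.compile(r"\[(UH|UM)\]")


def count_fillers(transcript: str) -> Counter:
    """Count how often each filler token occurs in a transcript."""
    return Counter(FILLER_PATTERN.findall(transcript))


# Transcript taken from the "Demo en 2" row.
transcript = (
    "Influence our [UM] our [UH] our natural surrounding. "
    "How does it influence our ecosystem?"
)
counts = count_fillers(transcript)
print(dict(counts))
```

A standard Whisper transcript of the same audio contains no such markers, so the same call would return an empty counter.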

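1.2 Quantitative performance overview

Transcription performance

CrisperWhisper significantly outperforms Whisper Large v3 in transcription performance, especially on datasets whose ground truth follows a verbatim transcription style, such as AMI and TED-LIUM. The table reports word error rate (WER), where lower is better. Whisper Large v3 remains slightly ahead on Earnings22, GigaSpeech, and LibriSpeech other.

Dataset	CrisperWhisper	Whisper Large v3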
AMI	8.72	16.01
Earnings22	12.37	11.3
GigaSpeech	10.27	10.02
LibriSpeech clean	1.74	2.03
LibriSpeech other	3.97	3.91
SPGISpeech	2.71	2.95
TED-LIUM	3.35	3.9
VoxPopuli	8.61	9.52
CommonVoice	8.19	9.67
Average WER	6.66	7.7
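
To make the metric concrete, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. It assumes simple whitespace tokenization and no text normalization, which real evaluation pipelines usually add; the function name `word_error_rate` is illustrative. The example shows why a verbatim reference penalizes a model that drops a repeated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Verbatim reference with a repeated "i"; the hypothesis omits the repetition.
reference = "so i i remember that being a lot of fun"
hypothesis = "so i remember that being a lot of fun"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

One deleted word out of ten reference words gives a WER of 0.10, that is 10 when expressed as a percentage.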

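Segmentation performance

CrisperWhisper achieves superior segmentation performance. The gap is especially pronounced around disfluencies and pauses.

Dataset	Metric	CrisperWhisper	Whisper Large v2	Whisper Large v3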
AMI IHM	F1 Score	0.79	0.63	0.66
	Avg IOU	0.67	0.54	0.53
Common Voice	F1 Score	0.80	0.42	0.48
	Avg IOU	0.70	0.32	0.43
TIMIT	F1 Score	0.69	0.40	0.54
	Avg IOU	0.56	0.32	0.43
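
As an illustration of the Avg IOU rows, the sketch below computes intersection over union between predicted and reference word intervals. It assumes words are already paired one to one and that timestamps are in seconds; the exact matching procedure used for the reported numbers is described in the paper and is not reproduced here. The name `interval_iou` and the toy intervals are illustrative.

```python
def interval_iou(pred, ref):
    """Intersection over union of two (start, end) intervals."""
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


# Toy data: each predicted word ends 0.1 s too early.
predicted = [(0.00, 0.40), (0.50, 0.90)]
reference = [(0.00, 0.50), (0.50, 1.00)]

avg_iou = sum(interval_iou(p, r) for p, r in zip(predicted, reference)) / len(reference)
print(f"Avg IOU: {avg_iou:.2f}")
```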

💻 Usage examples

Basic usage

First install our custom transformers fork to get the most accurate timestamps:

pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper

Advanced usage

import torch

from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


def adjust_pauses_for_hf_pipeline_output(pipeline_output, split_threshold=0.12):
    """
    Adjust pause timings by distributing pauses up to the threshold evenly between adjacent words.

    Pauses longer than split_threshold (in seconds) give up only split_threshold in total;
    the remainder stays as a gap between the two words.
    """
    # Copy each chunk dict so the original chunk objects are not mutated in place.
    adjusted_chunks = [dict(chunk) for chunk in pipeline_output["chunks"]]

    for i in range(len(adjusted_chunks) - 1):
        current_start, current_end = adjusted_chunks[i]["timestamp"]
        next_start, next_end = adjusted_chunks[i + 1]["timestamp"]

        # Skip pairs with a missing boundary (a timestamp may be None).
        if current_end is None or next_start is None:
            continue

        pause_duration = next_start - current_end
        if pause_duration > 0:
            # Each neighbouring word receives half of the pause, capped at half the threshold.
            distribute = min(pause_duration, split_threshold) / 2
            adjusted_chunks[i]["timestamp"] = (current_start, current_end + distribute)
            adjusted_chunks[i + 1]["timestamp"] = (next_start - distribute, next_end)

    pipeline_output["chunks"] = adjusted_chunks
    return pipeline_output


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
hf_pipeline_output = pipe(sample)
crisper_whisper_result = adjust_pauses_for_hf_pipeline_output(hf_pipeline_output)
print(crisper_whisper_result)

For the reasoning behind the pause distribution logic, see our paper.
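
To see what the pause adjustment does numerically, here is a small worked sketch of the same rule applied to a single word boundary. The helper `split_pause` is a hypothetical stand-alone restatement of the logic in `adjust_pauses_for_hf_pipeline_output`, using the same default threshold of 0.12 s; the timestamps are made up.

```python
def split_pause(current_end, next_start, split_threshold=0.12):
    """Give each neighbouring word half of the pause, capped at split_threshold / 2."""
    pause = next_start - current_end
    if pause <= 0:
        return current_end, next_start
    share = min(pause, split_threshold) / 2
    return current_end + share, next_start - share


# A 0.10 s pause is below the threshold, so it is absorbed completely:
# both boundaries move to roughly 1.05 s.
short_end, short_start = split_pause(1.00, 1.10)

# A 0.50 s pause exceeds the threshold, so each word gains only 0.06 s
# and a gap of roughly 0.38 s remains.
long_end, long_start = split_pause(2.00, 2.50)

print(round(short_end, 3), round(short_start, 3))
print(round(long_end, 3), round(long_start, 3))
```

Short gaps are therefore treated as part of the surrounding words, while longer gaps survive as explicit pauses.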

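🔧 Technical details

Implementation

To derive word-level timestamps, we apply the popular Dynamic Time Warping (DTW) method to Whisper's cross-attention scores, as detailed in our paper. Combined with our retokenization process, this approach lets us detect pauses consistently. Because timestamp accuracy depends heavily on the DTW cost matrix, and therefore on the quality of the cross-attention, we developed a specialized loss function for the selected alignment heads to improve precision.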
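
For readers unfamiliar with DTW, the following is a minimal pure-Python sketch of the idea, not the authors' implementation. It assumes a token-by-frame cost matrix defined as 1 minus the attention weight, and finds the cheapest monotonic path through it; the frames assigned to each token then give that token's span. The function name `dtw_path` and the toy attention values are illustrative.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n_tokens, n_frames = len(cost), len(cost[0])
    inf = float("inf")

    # acc[i][j] = cheapest cost of aligning the first i tokens with the first j frames.
    acc = [[inf] * (n_frames + 1) for _ in range(n_tokens + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            best_previous = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_previous

    # Backtrack from the last token and frame to the first.
    path = []
    i, j = n_tokens, n_frames
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


# Toy cross-attention weights: 3 tokens over 6 encoder frames.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.9, 0.9],
]
cost = [[1.0 - weight for weight in row] for row in attention]
path = dtw_path(cost)

# Collapse the path into a (first_frame, last_frame) span per token.
spans = {}
for token, frame in path:
    first, _ = spans.get(token, (frame, frame))
    spans[token] = (first, frame)
print(spans)
```

The sharper the attention of the alignment heads, the more clearly one path dominates, which is why the attention loss described below targets those heads.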

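This loss function was not included in the original paper because we could not finish the experiments and training before the submission deadline, but it was used to train our publicly available models. The key features of this loss are as follows: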

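Data preparation
- We used datasets with word-level timestamp annotations, such as AMI IHM and TIMIT, but required additional timestamped data.
- To obtain it, we validated the alignment accuracy of several forced alignment tools on a small hand-labeled dataset.
- Based on this validation, we chose the PyTorch CTC aligner to generate more time-aligned data from the CommonVoice dataset.
- Because the PyTorch CTC aligner tends to overestimate pause durations, we applied the same pause-splitting method detailed in our paper to correct these errors. We confirmed the effectiveness of this correction on our hand-labeled dataset.
Token-word alignment
- Due to the retokenization detailed in our paper, each token is either part of a word or a pause/space, but never both.
- Therefore, each token can be cleanly aligned to a word or to a space/pause.
Ground-truth cross-attention
- We define the cross-attention ground truth for a token as an L2-normalized vector, where:
  - A value of 1 indicates that the word is active according to the word-level ground-truth timestamp.
  - A value of 0 indicates that no attention should be paid.
- To account for small inaccuracies in the ground-truth timestamps, we apply a linear interpolation of 4 steps (8 milliseconds) on both sides of the ground-truth vector, transitioning smoothly from 0 to 1.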
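
The construction of the ground-truth vector can be sketched as follows. This is an illustration under assumptions: the word span is given as frame indices with an exclusive end, and the 4-step ramp uses evenly spaced weights (0.8, 0.6, 0.4, 0.2), since the exact ramp shape is not specified here. The function name `ground_truth_attention` is hypothetical.

```python
import math


def ground_truth_attention(n_frames, start, end, ramp_steps=4):
    """Build an L2-normalized target vector that is active on frames [start, end)."""
    target = [0.0] * n_frames

    # Frames inside the word are fully active.
    for frame in range(start, end):
        target[frame] = 1.0

    # Linear ramp on both sides of the word (assumed evenly spaced weights).
    for step in range(1, ramp_steps + 1):
        weight = 1.0 - step / (ramp_steps + 1)
        left, right = start - step, end - 1 + step
        if left >= 0:
            target[left] = max(target[left], weight)
        if right < n_frames:
            target[right] = max(target[right], weight)

    # L2-normalize so the vector has unit length.
    norm = math.sqrt(sum(value * value for value in target))
    return [value / norm for value in target] if norm > 0 else target


# A word active on frames 8 to 11 of a 20-frame window.
target = ground_truth_attention(n_frames=20, start=8, end=12)
print([round(value, 2) for value in target])
```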
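Loss calculation
- The loss is defined as 1 minus the cosine similarity between the predicted cross-attention vector (when predicting a token) and the ground-truth cross-attention vector.
- This loss is averaged across all predicted tokens and alignment heads.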
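
A minimal pure-Python sketch of this loss is shown below, assuming the predicted attention is indexed as `predicted[head][token]` (a vector over encoder frames) and the target as `target[token]`. The function names and toy values are illustrative; a real training setup would compute this on tensors inside the training framework.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attention_loss(predicted, target):
    """Mean of (1 - cosine similarity) over all alignment heads and tokens."""
    losses = [
        1.0 - cosine_similarity(head[token_index], target[token_index])
        for head in predicted
        for token_index in range(len(target))
    ]
    return sum(losses) / len(losses)


# Two tokens over four frames.
target = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
good_head = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # attends to the right frames
bad_head = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 0.0, 0.0]]   # attends to the wrong frames

loss = attention_loss([good_head, bad_head], target)
print(round(loss, 3))
```

Because cosine similarity ignores vector length, the head that attends to the right frames contributes no loss even though its weights are 0.5 rather than 1.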
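Alignment head selection
- To choose the alignment heads, we evaluated the alignment performance of each individual decoder attention head on the timestamped TIMIT dataset.
- We selected the 15 best-performing heads and fine-tuned them with our attention loss.
Training details
- Because most of our training samples were shorter than 30 seconds, we shift the audio sample and the corresponding timestamp ground truth with 50% probability, to keep the cross-attention from "overfitting" to early positions of the encoder output.
- If there is more than 40 ms of silence (before or after shifting), we prepend a space to the ground-truth transcript (and to the corresponding cross-attention ground truth), so that the model has to predict the start time of the first word accurately.
- We use WavLM augmentations during training, adding random speech samples or noise to the audio waveform, to improve the overall robustness of the transcription and the stability of the alignment heads.
- We clip to 0 the "predicted" values in the cross-attention vectors that lie more than 4 seconds before or 4 seconds after the ground-truth word they belong to. This reduces the dimensionality of the cross-attention vector and thereby emphasizes, in the loss, the attention that ultimately matters for alignment.
- With 1% probability, we use samples containing only noise, for which the model must return an empty prediction, to reduce hallucinations.
- The model was trained on a mixture of English and German datasets, so we only guarantee good performance on these languages.
- The model was trained in three stages. In the first stage, we used around 10,000 hours of audio to adapt Whisper to the new tokenizer. In the second stage, we used only high-quality datasets with verbatim transcriptions. Finally, we continued training on this verbatim mixture and added the attention loss for another 6,000 steps.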
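
The shifting augmentation from the first training detail can be sketched as below. Only the 50% probability and the 30-second window come from the text; everything else is an assumption made for illustration: the shift is drawn uniformly from the free room in the window, the audio is padded at the front with zeros, and the sample rate is 16 kHz. The function name `shift_sample` is hypothetical.

```python
import random


def shift_sample(samples, word_timestamps, sample_rate=16000,
                 window_s=30.0, shift_prob=0.5, rng=random):
    """With probability shift_prob, delay the audio and its word timestamps together."""
    free_room_s = window_s - len(samples) / sample_rate
    if free_room_s <= 0 or rng.random() >= shift_prob:
        return samples, word_timestamps

    # Assumption: uniform random shift, realized as leading silence (zeros).
    shift_s = rng.uniform(0.0, free_room_s)
    padding = [0.0] * int(round(shift_s * sample_rate))
    actual_shift_s = len(padding) / sample_rate

    shifted_timestamps = [
        (start + actual_shift_s, end + actual_shift_s)
        for start, end in word_timestamps
    ]
    return padding + samples, shifted_timestamps


# One second of dummy audio with a single word from 0.2 s to 0.5 s.
audio = [0.1] * 16000
timestamps = [(0.2, 0.5)]

# shift_prob=1.0 forces the shift so the effect is visible.
shifted_audio, shifted_timestamps = shift_sample(
    audio, timestamps, shift_prob=1.0, rng=random.Random(0)
)
added_seconds = (len(shifted_audio) - len(audio)) / 16000
print(round(added_seconds, 3), shifted_timestamps)
```

Shifting the words away from the start of the window means the alignment heads see speech at varied encoder positions rather than always near the beginning.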