CrisperWhisper Open-Source Speech Recognition Model: Free to Deploy for Fast, Precise Verbatim Transcription

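CrisperWhisper

Developed by nyrahealth

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps.

Speech recognition

Transformers

Multilingual support #verbatim speech transcription #word-level timestamps #filler word detection

Downloads: 10.23k

Released: 8/29/2024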

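Model Introduction

CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts.

Model Features

Precise word-level timestamps

Provides precise timestamps, even around disfluencies and pauses, by using an adjusted tokenizer and a custom attention loss during training.

Verbatim transcription

Transcribes every spoken word exactly as it is, including and differentiating fillers such as "um" and "uh".

Filler detection

Detects and accurately transcribes fillers.

Hallucination mitigation

Minimizes transcription hallucinations to improve accuracy.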

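Model Capabilities

Speech recognition

Word-level timestamp generation

Filler word detection

Multilingual support

Use Cases

Speech Transcription

Meeting minutes

Records meeting content precisely, including all disfluencies and filler words.

Provides verbatim transcripts with precise timestamps.

Academic research

Transcribes interviews and research data so that every spoken detail is captured accurately.

Highly accurate verbatim transcription.

Speech Analysis

Speech behavior analysis

Analyzes speakers' disfluency patterns and filler word usage.

Provides detailed speech behavior data.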

🚀 CrisperWhisper

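CrisperWhisper is an advanced variant of OpenAI's Whisper, designed for fast, precise, and verbatim speech recognition with accurate (crisp) word-level timestamps. Unlike the original Whisper, which tends to omit disfluencies and follows more of an intended transcription style, CrisperWhisper aims to transcribe every spoken word exactly as it is, including fillers, pauses, stutters, and false starts. See our repository for more details: https://github.com/nyrahealth/CrisperWhisper

✨ Key Features

🎯 Accurate word-level timestamps: provides precise timestamps, even around disfluencies and pauses, by using an adjusted tokenizer and a custom attention loss during training.
📝 Verbatim transcription: transcribes every spoken word exactly as it is, including and differentiating fillers such as "um" and "uh".
🔍 Filler detection: detects and accurately transcribes fillers.
🛡️ Hallucination mitigation: minimizes transcription hallucinations to improve accuracy.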

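📚 Detailed Documentation

Highlights

🏆 1st place on the OpenASR Leaderboard on the verbatim datasets (TED, AMI).
🎓 Accepted at INTERSPEECH 2024.
📄 Paper released: see our paper for the details of, and the reasoning behind, our tokenizer adjustment.
✨ New feature: not mentioned in the paper is an added attention loss that further improves timestamp accuracy. By adding a loss that specifically trains the attention scores used for the DTW alignment on timestamped data, we significantly improved alignment performance.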

Performance Overview

Qualitative Performance Overview

Audio	Whisper Large V3	Crisper Whisper
Demo de 1	Er war kein Genie, aber doch ein fähiger Ingenieur.	Es ist zwar kein. Er ist zwar kein Genie, aber doch ein fähiger Ingenieur.
Demo de 2	Leider müssen wir in diesen schweren Zeiten auch unserem Tagesgeschäft nachgehen. Der hier vorgelegte Kulturhaushalt der Ampelregierung strebt an, den Erfolgskurs der Union zumindest fiskalisch fortzuführen.	Leider [UH] müssen wir in diesen [UH] schweren Zeiten auch [UH] unserem [UH] Tagesgeschäft nachgehen. Der hier [UH] vorgelegte [UH] Kulturhaushalt der [UH] Ampelregierung strebt an, den [UH] Erfolgskurs der Union [UH] zumindest [UH] fiskalisch fortzuführen. Es.
Demo de 3	die über alle FRA-Fraktionen hinweg gut im Blick behalten sollten, auch weil sie teilweise sehr teeteuer sind. Aber nicht nur, weil sie teeteuer sind. Wir steigen mit diesem Endentwurf ein in die sogenannten Pandemie-Bereitschaftsverträge.	Die über alle Fr Fraktionen hinweg gut im [UH] Blick behalten sollten, auch weil sie teil teilweise sehr te teuer sind. Aber nicht nur, weil sie te teuer sind. Wir [UH] steigen mit diesem Ent Entwurf ein in die sogenannten Pand Pandemiebereitschaftsverträge.
Demo en 1	alternative is you can get like, you have those Dr. Bronner's	Alternative is you can get like [UH] you have those, you know, those doctor Brahmer's.
Demo en 2	influence our natural surrounding? How does it influence our ecosystem?	Influence our [UM] our [UH] our natural surrounding. How does it influence our ecosystem?
Demo en 3	and always find a place on the street to park and it was easy and you weren't a long distance away from wherever it was that you were trying to go. So I remember that being a lot of fun and easy to do and there were nice places to go and good events to attend. Come downtown and you had the Warner Theater and	And always find a place on the street to park. And and it was it was easy and you weren't a long distance away from wherever it was that you were trying to go. So, I I I remember that being a lot of fun and easy to do and there were nice places to go and, [UM] i good events to attend. Come downtown and you had the Warner Theater and, [UM]
Demo en 4	you know, more masculine, who were rough, and that definitely wasn't me. Then, you know, I was very smart because my father made sure I was smart, you know. So, you know, I hung around those people, you know. And then you had the ones that were just out doing things that they shouldn't have been doing also. So, yeah, I was in the little geek squad. You were in the little geek squad. Yeah.	you know, more masculine, who were rough, and that definitely wasn't me. Then, you know, I was very smart because my father made sure I was smart. You know, so, [UM] you know, I I hung around those people, you know. And then you had the ones that were just just out doing things that they shouldn't have been doing also. So yeah, I was the l I was in the little geek squad. Do you

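Quantitative Performance Overview

Transcription Performance

CrisperWhisper significantly outperforms Whisper Large v3, especially on datasets whose ground truth follows a more verbatim transcription style, such as AMI and TED-LIUM.

Dataset	CrisperWhisper (WER %)	Whisper Large v3 (WER %)
AMI	8.72	16.01
Earnings22	12.37	11.30
GigaSpeech	10.27	10.02
LibriSpeech clean	1.74	2.03
LibriSpeech other	3.97	3.91
SPGISpeech	2.71	2.95
TED-LIUM	3.35	3.90
VoxPopuli	8.61	9.52
CommonVoice	8.19	9.67
Average WER	6.66	7.70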
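
As a quick arithmetic check, the averages in the last row can be reproduced from the per-dataset values. The sketch below simply copies the numbers from the table; the variable names are illustrative.

```python
# WER values (%) copied from the table: (CrisperWhisper, Whisper Large v3)
wer = {
    "AMI": (8.72, 16.01),
    "Earnings22": (12.37, 11.30),
    "GigaSpeech": (10.27, 10.02),
    "LibriSpeech clean": (1.74, 2.03),
    "LibriSpeech other": (3.97, 3.91),
    "SPGISpeech": (2.71, 2.95),
    "TED-LIUM": (3.35, 3.90),
    "VoxPopuli": (8.61, 9.52),
    "CommonVoice": (8.19, 9.67),
}

# Unweighted mean over the nine datasets
crisper_avg = sum(c for c, _ in wer.values()) / len(wer)
whisper_avg = sum(w for _, w in wer.values()) / len(wer)

# Datasets where CrisperWhisper has the lower (better) WER
crisper_wins = [name for name, (c, w) in wer.items() if c < w]

print(f"CrisperWhisper average WER: {crisper_avg:.2f}")
print(f"Whisper Large v3 average WER: {whisper_avg:.2f}")
print("CrisperWhisper is better on:", ", ".join(crisper_wins))
```

The gap is largest on AMI; on Earnings22, GigaSpeech, and LibriSpeech other, Whisper Large v3 is slightly ahead.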

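Segmentation Performance

CrisperWhisper demonstrates superior segmentation performance. The gap is especially pronounced around disfluencies and pauses. The following table uses the metrics as defined in the paper, with a tolerance (collar) of 50 ms. The heads for each model were selected using the method described in the How? section, and for each model we report the result with the highest F1 score across different numbers of heads.

Dataset	Metric	CrisperWhisper	Whisper Large v2	Whisper Large v3
AMI IHM	F1 score	0.79	0.63	0.66
AMI IHM	Avg. IOU	0.67	0.54	0.53
Common Voice	F1 score	0.80	0.42	0.48
Common Voice	Avg. IOU	0.70	0.32	0.43
TIMIT	F1 score	0.69	0.40	0.54
TIMIT	Avg. IOU	0.56	0.32	0.43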
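
The exact metric definitions are given in the paper and are not reproduced here. As a rough illustration only, the sketch below shows one common way to compute the intersection over union (IoU) of two word intervals, and an F1 score in which a predicted word counts as correct when both of its boundaries fall within a 50 ms tolerance of a matching reference word. The function names and the matching rule are assumptions, not the paper's evaluation code.

```python
def interval_iou(pred, ref):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_with_collar(pred_words, ref_words, collar=0.05):
    """F1 over (text, start, end) words; assumed rule: same text and both
    boundaries within `collar` seconds of an unused reference word."""
    unmatched = list(ref_words)
    true_positives = 0
    for text, start, end in pred_words:
        for candidate in unmatched:
            r_text, r_start, r_end = candidate
            if (text == r_text
                    and abs(start - r_start) <= collar
                    and abs(end - r_end) <= collar):
                true_positives += 1
                unmatched.remove(candidate)  # each reference word matches once
                break
    precision = true_positives / len(pred_words) if pred_words else 0.0
    recall = true_positives / len(ref_words) if ref_words else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the filler is predicted too long, so it misses the collar
ref = [("so", 0.00, 0.20), ("[UH]", 0.35, 0.60), ("yes", 0.80, 1.10)]
pred = [("so", 0.02, 0.22), ("[UH]", 0.30, 0.70), ("yes", 0.81, 1.08)]

iou_filler = interval_iou(pred[1][1:], ref[1][1:])
f1 = f1_with_collar(pred, ref)
print(iou_filler, f1)
```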

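How?

We apply the popular dynamic time warping (DTW) method to Whisper's cross-attention scores, as detailed in our paper, to derive word-level timestamps. Combined with our retokenization process, this method lets us detect pauses consistently. Because timestamp accuracy depends heavily on the DTW cost matrix and, in turn, on the quality of the cross-attentions, we developed a specialized loss function for the selected alignment heads to improve precision.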
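
To make the alignment step concrete, here is a minimal NumPy sketch of DTW over a token-by-frame cost matrix. It illustrates the general technique and is not the code used in CrisperWhisper or in the transformers library. The function name `dtw_path`, the toy attention matrix, the choice of `1 - attention` as the cost, and the 20 ms frame duration are all assumptions made for this example.

```python
import numpy as np


def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the path
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda step: step[0])
    return path[::-1]


# Toy cross-attention: 3 tokens (rows) attending over 6 audio frames (columns)
attention = np.array([
    [0.8, 0.7, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.9, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.1, 0.9, 1.0],
])
path = dtw_path(1.0 - attention)  # assumed cost: high attention = low cost

frame_s = 0.02  # assumed frame duration in seconds
spans = {}
for token in range(attention.shape[0]):
    frames = [j for i, j in path if i == token]
    spans[token] = (frames[0] * frame_s, (frames[-1] + 1) * frame_s)
    print(token, spans[token])
```

Each token's timestamp is read off as the first and last frame the path assigns to it, which is why sharper cross-attention translates directly into sharper word boundaries.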

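The attention loss was not included in the original paper because the experiments and training could not be completed before the submission deadline, but it was used to train our publicly available models. Its key features are as follows: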

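Data Preparation
- We used datasets with word-level timestamp annotations, such as AMI IHM and TIMIT, but needed additional timestamped data.
- To obtain it, we validated the alignment accuracy of several forced-alignment tools on a small hand-labeled dataset.
- Based on this validation, we chose the PyTorch CTC aligner to generate more time-aligned data from the CommonVoice dataset.
- Because the PyTorch CTC aligner tends to overestimate pause durations, we applied the same pause-splitting method detailed in our paper to correct these errors. The effectiveness of this correction was confirmed on our hand-labeled dataset.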
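Token-Word Alignment
- Because of the retokenization detailed in our paper, each token is either part of a word or a pause/space, but never both.
- Each token can therefore be cleanly aligned to either a word or a space/pause.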
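Ground Truth Cross-Attention
- We define the cross-attention ground truth for a token as an L2-normalized vector in which, before normalization:
  - A value of 1 indicates that the word is active according to the word-level ground truth timestamp.
  - A value of 0 indicates that no attention should be paid.
- To account for small inaccuracies in the ground truth timestamps, we apply a linear interpolation of 4 steps (8 milliseconds) on both sides of the ground truth vector, transitioning smoothly from 0 to 1.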
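Loss Calculation
- The loss is defined as 1 - cosine similarity between the predicted cross-attention vector (when predicting a token) and the ground truth cross-attention vector.
- This loss is averaged over all predicted tokens and alignment heads.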
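
The ground-truth construction and the loss described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the training code: the helper names are hypothetical, the frame duration is a parameter because the text only specifies the number of ramp steps, and the exact shape of the ramp is a guess.

```python
import numpy as np


def ground_truth_attention(n_frames, start, end, frame_s=0.02, ramp_steps=4):
    """Target vector: 1 while the word is active, linear ramps of
    `ramp_steps` frames on both sides, then L2-normalized."""
    first = int(round(start / frame_s))
    last = int(round(end / frame_s))  # exclusive end frame
    target = np.zeros(n_frames)
    target[first:last] = 1.0
    # Assumed ramp shape: evenly spaced values strictly between 0 and 1
    ramp = np.linspace(0.0, 1.0, ramp_steps + 2)[1:-1]
    for k, value in enumerate(ramp):
        left = first - ramp_steps + k    # rises toward the word start
        right = last + ramp_steps - 1 - k  # falls away from the word end
        if 0 <= left < n_frames:
            target[left] = max(target[left], value)
        if 0 <= right < n_frames:
            target[right] = max(target[right], value)
    norm = np.linalg.norm(target)
    return target / norm if norm > 0 else target


def attention_loss(predicted, target, eps=1e-8):
    """1 - cosine similarity between predicted and target attention vectors."""
    denom = np.linalg.norm(predicted) * np.linalg.norm(target) + eps
    return 1.0 - float(np.dot(predicted, target)) / denom


target = ground_truth_attention(n_frames=50, start=0.40, end=0.60)

good = target.copy()   # attention exactly on the word
bad = np.zeros(50)
bad[0:5] = 1.0         # attention stuck at the start of the audio

losses = [attention_loss(p, target) for p in (good, bad)]
mean_loss = float(np.mean(losses))  # averaged as over tokens and heads
print(losses, mean_loss)
```

Because cosine similarity ignores scale, the loss rewards attention mass in the right frames regardless of its absolute magnitude.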
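Alignment Head Selection
- To choose the heads for alignment, we evaluated the alignment performance of each individual decoder attention head on the timestamped TIMIT dataset.
- We chose the 15 best-performing heads and fine-tuned them with our attention loss.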
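Training Details
- Because most of our training samples were shorter than 30 seconds, we shift the audio sample and the corresponding timestamp ground truth with a probability of 50% to keep the cross-attentions from "overfitting" to early positions of the encoder output.
- If there is more than 40 ms of silence (before or after shifting), we prepend a space to the ground truth transcript (and to the corresponding cross-attention ground truth) so that the model has to predict the start time of the first word accurately.
- We use WavLM augmentations during training, adding random speech samples or noise to the audio waveform, to improve the overall robustness of the transcription and the stability of the alignment heads.
- We clip "predicted" values in the cross-attention vectors to 0 when they lie more than 4 seconds before or after the ground truth word they belong to. This reduces the effective dimensionality of the cross-attention vector, so the loss, and ultimately the alignment, emphasizes attention where it counts.
- With a probability of 1%, we use samples containing only noise, for which the model has to return an empty prediction, to reduce hallucinations.
- The model is trained on a mixture of English and German datasets, so we only guarantee good performance on these languages.
- The model is trained in three stages. In the first stage, we use around 10,000 hours of audio to adapt Whisper to the new tokenizer. In the second stage, we use only high-quality datasets that are transcribed in a verbatim fashion. Finally, we continue training on this verbatim mixture and add the attention loss for another 6,000 steps.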
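
The shifting augmentation in the first bullet can be pictured as below. The text does not say how the shift is implemented; this sketch assumes silence is prepended to the waveform and the same offset is added to every word timestamp. The function `random_shift`, the 16 kHz default sampling rate, and the uniform choice of offset are illustrative assumptions.

```python
import random

import numpy as np


def random_shift(audio, words, sample_rate=16000, max_len_s=30.0, p=0.5, rng=random):
    """With probability p, prepend silence to `audio` and move every
    (text, start, end) word by the same offset, staying within max_len_s."""
    duration_s = len(audio) / sample_rate
    room_s = max_len_s - duration_s
    if room_s <= 0 or rng.random() >= p:
        return audio, words  # sample left unchanged
    offset_samples = int(rng.uniform(0.0, room_s) * sample_rate)
    offset_s = offset_samples / sample_rate
    silence = np.zeros(offset_samples, dtype=audio.dtype)
    shifted_audio = np.concatenate([silence, audio])
    shifted_words = [(text, start + offset_s, end + offset_s) for text, start, end in words]
    return shifted_audio, shifted_words


audio = np.ones(16000 * 5, dtype=np.float32)  # 5 s dummy waveform
words = [("hello", 0.5, 0.9), ("[UH]", 1.2, 1.4)]

shifted_audio, shifted_words = random_shift(audio, words, p=1.0, rng=random.Random(0))
offset = shifted_words[0][1] - words[0][1]
print(len(shifted_audio) / 16000, offset)
```

Shifting the waveform and the labels together keeps word durations intact while moving the speech to later encoder positions.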

📦 Installation

First install our custom transformers fork to get the most accurate timestamps:

pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper

💻 Usage Examples

Basic Usage

import torch

from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

def adjust_pauses_for_hf_pipeline_output(pipeline_output, split_threshold=0.12):
    """
    Adjust pause timings by distributing pauses up to the threshold evenly between adjacent words.
    """

    adjusted_chunks = pipeline_output["chunks"].copy()

    for i in range(len(adjusted_chunks) - 1):
        current_chunk = adjusted_chunks[i]
        next_chunk = adjusted_chunks[i + 1]

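        current_start, current_end = current_chunk["timestamp"]
        next_start, next_end = next_chunk["timestamp"]
        # Skip pairs with a missing boundary (timestamps may be None)
        if current_end is None or next_start is None:
            continue
        pause_duration = next_start - current_end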

        if pause_duration > 0:
            if pause_duration > split_threshold:
                distribute = split_threshold / 2
            else:
                distribute = pause_duration / 2

            # Adjust current chunk end time
            adjusted_chunks[i]["timestamp"] = (current_start, current_end + distribute)

            # Adjust next chunk start time
            adjusted_chunks[i + 1]["timestamp"] = (next_start - distribute, next_end)
    pipeline_output["chunks"] = adjusted_chunks

    return pipeline_output


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nyrahealth/CrisperWhisper"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
hf_pipeline_output = pipe(sample)
crisper_whisper_result = adjust_pauses_for_hf_pipeline_output(hf_pipeline_output)
print(crisper_whisper_result)
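
Once the pipeline has returned word-level chunks, downstream analysis such as counting fillers or measuring pauses is plain Python. The sketch below runs on a small hand-written dictionary that mimics the `{"text": ..., "chunks": [{"text": ..., "timestamp": (start, end)}]}` shape consumed by `adjust_pauses_for_hf_pipeline_output`. The sample values are made up, the helper `disfluency_stats` and its `min_pause_s` threshold are hypothetical, and the filler labels `[UH]` and `[UM]` are taken from the qualitative examples.

```python
from collections import Counter

FILLERS = {"[UH]", "[UM]"}


def disfluency_stats(pipeline_output, min_pause_s=0.2):
    """Count filler tokens and collect pauses of at least min_pause_s between words."""
    chunks = pipeline_output["chunks"]
    fillers = Counter(
        chunk["text"].strip() for chunk in chunks if chunk["text"].strip() in FILLERS
    )
    pauses = []
    for current, following in zip(chunks, chunks[1:]):
        end = current["timestamp"][1]
        start = following["timestamp"][0]
        if end is None or start is None:
            continue  # boundary missing, no pause can be measured
        gap = start - end
        if gap >= min_pause_s:
            pauses.append((current["text"].strip(), following["text"].strip(), round(gap, 3)))
    return fillers, pauses


# Hand-written stand-in for real pipeline output
mock_output = {
    "text": "So [UM] I I think",
    "chunks": [
        {"text": "So", "timestamp": (0.00, 0.25)},
        {"text": "[UM]", "timestamp": (0.60, 0.95)},
        {"text": "I", "timestamp": (1.00, 1.10)},
        {"text": "I", "timestamp": (1.50, 1.60)},
        {"text": "think", "timestamp": (1.62, 2.00)},
    ],
}

fillers, pauses = disfluency_stats(mock_output)
print(dict(fillers))
print(pauses)
```

If pauses are to be measured, run this on the raw pipeline output: `adjust_pauses_for_hf_pipeline_output` redistributes part of each gap into the neighboring words, which shortens the measured pauses.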