---
language:
- en
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---
This is a workspace for fine-tuning Distil-Whisper-Large for medical speech recognition. The model is updated frequently, so if you find it useful for your needs, please duplicate this space to preserve the current version.
# Distil-Whisper: distil-large-v3
Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).
This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of OpenAI's latest and best-performing Whisper model, Whisper large-v3.

Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 was adapted to give superior long-form transcription accuracy with OpenAI's sequential long-form algorithm.

The result is a distilled model that performs to within 1% WER (word error rate) of large-v3 on long-form audio with both the sequential and chunked algorithms, and outperforms distil-large-v2 by 4.8% when using the sequential algorithm. The model is also faster than previous Distil-Whisper models: 6.3 times faster than large-v3, and 1.1 times faster than distil-large-v2.
| Model | Params / M | Rel. Latency | Short-Form WER | Sequential Long-Form WER | Chunked Long-Form WER |
|---|---|---|---|---|---|
| large-v3 | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
| distil-large-v3 | 756 | 6.3 | 9.7 | 10.8 | 10.9 |
| distil-large-v2 | 756 | 5.8 | 10.1 | 15.6 | 11.6 |
Since the sequential algorithm is the de facto transcription algorithm across the most popular Whisper libraries (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries. You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3 when using these libraries. For convenience, the weights for the most popular libraries are already converted, with getting started guides below.
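As an illustration of that compatibility, here is a minimal hedged sketch for Faster-Whisper, assuming the `faster-whisper` package is installed and resolves the `distil-large-v3` name to the converted weights (an assumption, since the conversion details are not covered in this section):

```python
from faster_whisper import WhisperModel

# Assumption: faster-whisper resolves "distil-large-v3" to the converted CTranslate2 checkpoint
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# transcribe() runs sequential long-form decoding and yields timestamped segments
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```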
## Table of Contents

1. [Transformers Usage](#transformers-usage)
    * [Short-Form Transcription](#short-form-transcription)
    * [Sequential Long-Form](#sequential-long-form)
    * [Chunked Long-Form](#chunked-long-form)
## Transformers Usage
distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```
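To confirm the installed version meets the 4.39 floor mentioned above, a quick hedged check (`packaging` ships as a dependency of Transformers):

```python
from packaging import version
import transformers

# distil-large-v3 requires transformers >= 4.39 (see the note above)
assert version.parse(transformers.__version__) >= version.parse("4.39.0")
```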
### Short-Form Transcription
The model can be used with the `pipeline` class to transcribe short-form audio files (< 30 seconds) as follows:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
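The pipeline also accepts raw waveforms; a minimal sketch, assuming `audio_array` is a hypothetical 1-D numpy array already sampled at 16 kHz:

```python
# Assumption: audio_array is a 1-D float numpy array at 16 kHz
result = pipe({"array": audio_array, "sampling_rate": 16000})
print(result["text"])
```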
For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```
For more control over the generation parameters, use the model + processor API directly:

Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps` for segment-level timestamps, and `prompt_ids` for prompting. Refer to the docstrings for more details.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

input_features = input_features.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}

pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])

print(pred_text)
```
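As an example of one of those ad-hoc arguments, here is a hedged sketch of prompting via `prompt_ids`, reusing the `model`, `processor`, `device`, and `input_features` defined above (the prompt text itself is a made-up example):

```python
# Encode a free-text prompt to bias the transcription vocabulary (hypothetical prompt)
prompt_ids = processor.get_prompt_ids("medical dictation", return_tensors="pt").to(device)

pred_ids = model.generate(input_features, prompt_ids=prompt_ids, max_new_tokens=128)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```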
### Sequential Long-Form
Unlike previous Distil-Whisper releases, distil-large-v3 is specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30 seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and latency is less of a consideration
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Section 5 of the Distil-Whisper paper.
The `pipeline` class can be used to transcribe long audio files with the sequential algorithm as follows:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
For more control over the generation parameters, use the model + processor API directly:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures, tried in increasing order
    "logprob_threshold": -1.0,  # retry at a higher temperature if mean log-probability falls below this
    "no_speech_threshold": 0.6,  # segments with no-speech probability above this are skipped as silence
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```
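The same model + processor path extends to batches of long files, which is the second scenario listed above. A minimal sketch, assuming `audios` is a hypothetical list of 1-D numpy arrays already sampled at 16 kHz, and reusing `processor`, `model`, `device`, `torch_dtype`, and `gen_kwargs` from the example above:

```python
# Assumption: audios is a list of 1-D float numpy arrays at 16 kHz
inputs = processor(
    audios,
    sampling_rate=16000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,  # the attention mask distinguishes real frames from padding
)
inputs = inputs.to(device, dtype=torch_dtype)

pred_ids = model.generate(**inputs, **gen_kwargs)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```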
### Chunked Long-Form
distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In this scenario, the chunked algorithm is up to 9 times faster than OpenAI's sequential long-form implementation (see Table 7 of the Distil-Whisper paper).
To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25 seconds is optimal. To activate batching over long audio files, pass the argument `batch_size`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
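Chunking combines with the timestamp options shown earlier; a brief hedged sketch, reusing the `pipe` and `sample` defined just above:

```python
# Segment-level timestamps work with the chunked algorithm as well
result = pipe(sample, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```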



