language:
- English
- French
- Spanish
- German
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
Whisper-Large-V3-Distil-Multi4-v0.2
This is a multilingual distilled Whisper model with 2 decoder layers. It supports 4 European languages: English, French, Spanish, and German.
The model was trained during my work on Distil-Large-v3.5.
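A notable feature is its native support for code-switching. The model can switch languages within a single transcribed segment, automatically generating a new language token when it detects a language change (as shown in the example below).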
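During training, the <|yue|> language token was repurposed as an automatic language detection token, which enables code-switching at inference time. To use this feature, set the language parameter to cantonese (this is the default).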
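The setting described above can be made explicit at generation time. The following is a minimal sketch, not the author's reference code: `transcribe_with_auto_language` is a hypothetical helper, and it assumes that `model` and `processor` are loaded as in the Usage section and that the Whisper `generate` method in transformers accepts `language="cantonese"` for this checkpoint.

```python
def transcribe_with_auto_language(
    model, processor, audio_array, sampling_rate,
    device="cpu", dtype=None, max_new_tokens=128,
):
    """Transcribe one clip with the auto-detection token (hypothetical helper)."""
    # Convert the raw waveform into model input features.
    features = processor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features
    # Move the features to the target device, and cast them if a dtype is given.
    features = features.to(device)
    if dtype is not None:
        features = features.to(dtype)
    # "cantonese" selects the repurposed <|yue|> token, which this card
    # describes as the automatic language detection token (assumed accepted here).
    predicted_ids = model.generate(
        features, language="cantonese", max_new_tokens=max_new_tokens
    )
    # Keep special tokens so that the generated language tokens stay visible.
    return processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
```

Since the card states that cantonese is already the default, passing it explicitly mainly documents the intent and guards against a changed generation config.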
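The model performs worse than both the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should investigate better training procedures, and possibly incorporate more data, to compress multilingual capabilities into a single model effectively.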
Usage
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
# Use the GPU with half precision when available, otherwise the CPU with full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Load the processor and the model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi4-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)
# Load an example audio clip together with its reference text
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]
# Print the reference text
print(text)
# Extract audio features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
# Generate token ids
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)
# Decode to plain text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
# Decode again, keeping special tokens so that the language tokens are visible
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
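Because the transcription decoded with special tokens keeps the language tokens, it can be split into per-language segments. The sketch below is illustrative only: `split_by_language` is a hypothetical helper, it assumes the language tokens follow Whisper's `<|xx|>` format for the four supported languages, and the sample string is made up for the example, not real model output.

```python
import re

# Language tokens for the four supported languages (assumed <|xx|> format).
LANG_TOKEN = re.compile(r"<\|(en|fr|es|de)\|>")
# Any remaining special token, such as <|startoftranscript|> or <|endoftext|>.
OTHER_TOKEN = re.compile(r"<\|[^|]*\|>")


def split_by_language(decoded: str) -> list[tuple[str, str]]:
    """Split a decoded transcription into (language code, text) segments."""
    # With one capturing group, re.split returns
    # [prefix, lang_1, text_1, lang_2, text_2, ...].
    parts = LANG_TOKEN.split(decoded)
    segments = []
    for lang, text in zip(parts[1::2], parts[2::2]):
        # Drop the other special tokens and surrounding whitespace.
        text = OTHER_TOKEN.sub("", text).strip()
        if text:
            segments.append((lang, text))
    return segments


# Made-up example string, for illustration only.
example = (
    "<|startoftranscript|><|de|><|transcribe|><|notimestamps|>"
    " Guten Morgen, <|en|> how are you?<|endoftext|>"
)
for lang, text in split_by_language(example):
    print(lang, text)
```

Text that appears before the first language token is ignored, which also discards the leading `<|startoftranscript|>` marker.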
Evaluation
English
| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |
French
| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |
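To compare the French results at a glance, the per-dataset scores can be averaged per model. This is a rough sketch under two assumptions that this card does not state: the scores are error rates where lower is better, and an unweighted mean across datasets is a fair summary. `rank_by_mean` is a hypothetical helper; the numbers are copied from the French table.

```python
from statistics import mean

# Scores copied from the French table, in column order: mcv17, mls, voxpopuli,
# mtedx, af_accented, fleurs, hf_dev_data_chunk30, hf_dev_data_sequential,
# mtedx_chunk30, mtedx_sequential.
french_scores = {
    "openai/whisper-large-v3":
        [10.98, 4.69, 11.15, 8.67, 7.51, 5.4, 9.87, 8.97, 9, 8.01],
    "openai/whisper-large-v3-turbo":
        [12.41, 5.1, 12.21, 9.87, 8.37, 5.48, 10.12, 9, 8.49, 8.39],
    "bofenghuang/whisper_large_v3_distil_fr_v0.2":
        [11.1, 5, 10.68, 8.75, 7.09, 6.35, 9.44, 9.84, 8.94, 8.93],
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2":
        [11.96, 6.04, 11.07, 9.16, 7.99, 7.10, 10.42, 12.61, 9.06, 11.75],
    "bofenghuang/whisper-large-v3-distil-multi7-v0.2":
        [12.19, 6.2, 11.29, 9.13, 8.26, 7.17, 10.04, 12.26, 8.93, 11.56],
}


def rank_by_mean(scores):
    """Return (model, mean score) pairs sorted from lowest to highest mean."""
    return sorted(
        ((name, mean(values)) for name, values in scores.items()),
        key=lambda pair: pair[1],
    )


for name, average in rank_by_mean(french_scores):
    print(f"{name}: {average:.2f}")
```

On these numbers the multi4 model averages about 9.72, against about 8.61 for the monolingual French distilled model and about 8.94 for Whisper-Large-v3-Turbo, which is consistent with the performance note earlier in this card.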
Spanish
| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |
German
| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |