language:
- ru
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
datasets:
- bond005/sberdevices_golos_10h_crowd
model-index:
- name: ru_whisper_small - Val123val
  results: []
ru_whisper_small - Val123val
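This model is a fine-tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset.
Model description
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision, of which Russian accounts for only 5k hours. ru_whisper_small is a fine-tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset. As an ASR solution for Russian speech recognition, ru-whisper is especially useful for developers. The model may exhibit additional capabilities if fine-tuned for a specific business task.
Intended uses & limitations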
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
# Load the fine-tuned processor and model
processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
model.config.forced_decoder_ids = None
# Load one validation sample (token=True uses the locally stored Hugging Face token)
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
# Convert the raw waveform into model input features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
# Generate token ids
predicted_ids = model.generate(input_features)
# Decode with special tokens kept
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# Decode to plain text, with special tokens removed
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
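Long-Form Transcription
The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can run batched inference. It can also predict sequence-level timestamps by passing return_timestamps=True: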
import torch
from transformers import pipeline
from datasets import load_dataset
# Use the first GPU when available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# chunk_length_s=30 enables chunked long-form transcription
pipe = pipeline(
    "automatic-speech-recognition",
    model="Val123val/ru_whisper_small",
    chunk_length_s=30,
    device=device,
)
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
# Transcribe to plain text
prediction = pipe(sample.copy(), batch_size=8)["text"]
# Transcribe again, this time returning timestamped chunks
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
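To make the chunking idea concrete, here is a minimal sketch of how a long recording could be split into overlapping windows. It is a simplified illustration, not the pipeline's actual implementation: the function name `chunk_windows` is hypothetical, and the 5 s stride on each side is an assumption based on the pipeline's documented default of `chunk_length_s / 6`.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds that cover [0, total_s].

    Consecutive windows overlap by 2 * stride_s so that words cut at a
    window boundary are fully contained in a neighbouring window.
    """
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must exceed twice the stride")
    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


# A 70 s recording is covered by three overlapping 30 s windows
windows = chunk_windows(70.0)
print(windows)
```

Each window can be transcribed independently, which is why chunking makes batched inference possible.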
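Faster inference with speculative decoding
Speculative decoding was proposed in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan et al. from Google. It works on the premise that a faster assistant model very often generates the same tokens as the larger main model.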
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Half precision on GPU, full precision on CPU
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
# Main model: the fine-tuned Whisper small checkpoint
model_id = "Val123val/ru_whisper_small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
# Assistant model: the smaller, faster Whisper tiny checkpoint that drafts tokens
assistant_model_id = "openai/whisper-tiny"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
assistant_model.to(device)
# Note (assumption): some transformers releases only support assisted generation
# with a batch size of 1; lower batch_size if generation raises an error
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=4,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
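The premise behind speculative decoding, that a fast assistant usually agrees with the main model, can be illustrated with a toy greedy sketch. Everything here is hypothetical: `main_next` and `assistant_next` are stand-in functions rather than Whisper models, and the verification loop calls the main model once per token for readability, whereas a real implementation checks the whole draft in a single forward pass.

```python
def speculative_greedy(main_next, assistant_next, prompt, max_new_tokens, k=4):
    """Greedy decoding where an assistant drafts tokens and the main model verifies them."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The assistant drafts up to k candidate tokens
        draft = []
        for _ in range(min(k, target_len - len(tokens))):
            draft.append(assistant_next(tokens + draft))
        # 2) The main model verifies the draft; at the first disagreement,
        #    its own token is kept and the rest of the draft is discarded
        accepted = []
        for tok in draft:
            expected = main_next(tokens + accepted)
            accepted.append(expected)
            if expected != tok:
                break
        tokens.extend(accepted)
    return tokens


def plain_greedy(main_next, prompt, max_new_tokens):
    """Reference decoding that uses only the main model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(main_next(tokens))
    return tokens


# Toy "models": the main model counts upward modulo 10;
# the assistant makes a mistake whenever the last token is 4
def main_next(seq):
    return (seq[-1] + 1) % 10


def assistant_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


spec = speculative_greedy(main_next, assistant_next, [0], max_new_tokens=8)
ref = plain_greedy(main_next, [0], max_new_tokens=8)
print(spec == ref)
```

Because every accepted token is one the main model would have chosen itself, the output matches plain greedy decoding with the main model; the speed-up comes from accepting several drafted tokens per verification step.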
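Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000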
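As a rough illustration of what these scheduler settings imply, the sketch below computes the learning rate at a given step. It assumes the linear scheduler behaves like the usual Transformers linear schedule with warmup (linear ramp from 0 to the peak rate over the warmup steps, then linear decay to 0 at the final step); the function name `linear_lr` is hypothetical.

```python
def linear_lr(step, base_lr=1e-4, warmup_steps=500, training_steps=5000):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at the last training step
    remaining = max(0, training_steps - step)
    return base_lr * remaining / max(1, training_steps - warmup_steps)


# Learning rate at a few points of the 5000-step run
schedule = {s: linear_lr(s) for s in (0, 250, 500, 2750, 5000)}
for step, lr in schedule.items():
    print(step, lr)
```

Under these assumptions the rate peaks at 0.0001 at step 500 and falls back to half of that by step 2750.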
Framework versions
- Transformers 4.36.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.0
- Tokenizers 0.15.0