library_name: transformers
license: apache-2.0
language:
Monsoon-Whisper-Medium-Gigaspeech2
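Monsoon-Whisper-Medium-GigaSpeech2 is a 🇹🇭 Thai automatic speech recognition (ASR) model. It is based on Whisper-Medium and fine-tuned on the GigaSpeech2 dataset.
It was originally developed as a scaling experiment to study emergent abilities in ASR tasks. The model performs well on real-world audio, including speech from YouTube and recordings made in noisy environments.
For more details, see our Typhoon-Audio release blog.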
Model Description
- Model type: Whisper Medium
- Requirements: transformers 4.38.0 or later
- Primary language: Thai 🇹🇭
- License: Apache 2.0
Usage Example
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
import torch

model_path = "scb10x/monsoon-whisper-medium-gigaspeech2"
device = "cuda"
filepath = "audio.wav"

processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)
model.to(device)
model.eval()

# Whisper expects 16 kHz mono audio: downmix all channels, then resample if needed
array, sr = torchaudio.load(filepath)  # shape: (channels, samples)
array = array.mean(dim=0)
if sr != 16000:
    array = torchaudio.functional.resample(array, orig_freq=sr, new_freq=16000)

# The feature extractor pads or truncates the input to 30 seconds of audio
input_features = processor(
    array.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features.to(device, dtype=torch.bfloat16)

# Force Thai transcription instead of language detection or translation
with torch.inference_mode():
    predicted_ids = model.generate(input_features, language="th", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
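The Whisper feature extractor pads or truncates every input to 30 seconds (480,000 samples at 16 kHz). As a result, the example above transcribes only the first 30 seconds of a longer recording. One simple workaround is to split the waveform into 30-second windows and transcribe each window separately.

The sketch below shows that split. It is a minimal example and makes some assumptions:
- `chunk_waveform` is a hypothetical helper, not part of transformers.
- The input is a 1-D mono waveform that has already been resampled to 16 kHz.
- The split is naive and non-overlapping, so a word spoken across a boundary may be cut.

```python
from typing import List, Sequence


# Hypothetical helper (not part of transformers): naive, non-overlapping split
def chunk_waveform(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_seconds: float = 30.0,
) -> List[Sequence[float]]:
    """Split a 1-D mono waveform into consecutive windows of chunk_seconds.

    Works with anything that supports len() and slicing (lists, NumPy arrays,
    1-D torch tensors). The last chunk may be shorter than the others.
    """
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_len = int(sample_rate * chunk_seconds)  # 480_000 samples for 30 s at 16 kHz
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]


if __name__ == "__main__":
    # 70 seconds of silence at 16 kHz splits into 30 s, 30 s and 10 s chunks
    silence = [0.0] * (16000 * 70)
    print([len(c) / 16000 for c in chunk_waveform(silence)])
```

Each chunk can then be passed through `processor(...)` and `model.generate(...)` like the single input in the usage example, and the decoded strings joined. The transformers `pipeline("automatic-speech-recognition", ...)` with `chunk_length_s` uses overlapping strides and is likely the more robust choice for long audio.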
Evaluation Results
| Model | WER (GS2) | WER (CV17) | CER (GS2) | CER (CV17) |
|---|---|---|---|---|
| whisper-large-v3 | 37.02 | 22.63 | 24.03 | 8.49 |
| whisper-medium | 55.64 | 43.01 | 37.55 | 16.41 |
| biodatlab-whisper-th-medium-combined | 31.00 | 14.25 | 21.20 | 5.69 |
| biodatlab-whisper-th-large-v3-combined | 29.02 | 15.72 | 19.96 | 6.32 |
| monsoon-whisper-medium-gigaspeech2 | 22.74 | 20.79 | 14.15 | 6.92 |
GS2 = GigaSpeech2, CV17 = Common Voice 17. Lower WER/CER is better.
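WER and CER in the table are edit-distance rates. Each counts the substitutions, deletions and insertions needed to turn the model output into the reference. WER divides that count by the number of reference words, and CER divides it by the number of reference characters.

The sketch below computes corpus-level WER and CER in percent. The helper names are hypothetical, and the sketch will not necessarily reproduce the numbers above, for two reasons:
- WER needs pre-segmented words. Thai is written without spaces between words, so a word segmenter (for example PyThaiNLP's `word_tokenize`) must be applied first.
- The source does not state which segmenter or text normalization produced the table.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of reference token r
                curr[j - 1] + 1,         # insertion of hypothesis token h
                prev[j - 1] + (r != h),  # substitution, free if tokens match
            )
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Corpus-level rate in percent: total edits / total reference tokens."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("references contain no tokens")
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * edits / total_ref


def wer(ref_words: List[List[str]], hyp_words: List[List[str]]) -> float:
    # Hypothetical helper: expects already-segmented words (Thai needs a segmenter first)
    return error_rate(ref_words, hyp_words)


def cer(refs: List[str], hyps: List[str]) -> float:
    # Hypothetical helper: characters as given, no normalization (an assumption)
    return error_rate([list(r) for r in refs], [list(h) for h in hyps])


if __name__ == "__main__":
    print(wer([["a", "b", "c", "d"]], [["a", "x", "c"]]))  # 1 sub + 1 del over 4 words: 50.0
    print(cer(["kitten"], ["sitting"]))  # 3 edits over 6 characters: 50.0
```

Because insertions are counted, WER and CER can exceed 100% when the model produces many extra tokens.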
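Intended Uses & Limitations
This model is experimental, and its transcriptions may not always be accurate. Developers should carefully assess the potential risks for their specific use case.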
Follow Us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/us5gAYmrxw
Typhoon Team
Kunat Pipatanakul, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai