# Whisper Small Egyptian Arabic
🚀 A Whisper Small model fine-tuned for Egyptian Arabic automatic speech recognition

This repository contains an `openai/whisper-small` model fine-tuned specifically for automatic speech recognition (ASR) in the Egyptian Arabic dialect. The model was fine-tuned with the SpeechBrain toolkit on the `MAdel121/arabic-egy-cleaned` dataset.
## 🚀 Quick Start

You can use this model directly with the `transformers` `pipeline` for automatic speech recognition. Make sure `transformers` and `torch` are installed (`pip install transformers torch`).
```python
from transformers import pipeline
import torch

# Make sure ffmpeg is available for audio decoding
# pip install -U ffmpeg-python  # or install the ffmpeg binary via your system package manager

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Replace "your-username/whisper-small-egyptian-arabic" with the actual model ID on the Hub
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",  # <<< replace this
    device=device,
)

# Load your audio file (requires ffmpeg)
# For a local file:
audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)  # tune batch_size to your GPU memory

# For audio loaded with the datasets library:
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test")  # example
# sample = ds[0]["audio"]
# result = pipe(sample.copy())  # pass a copy to avoid mutating the original data

print(result["text"])
```
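If you also need timestamps, the ASR pipeline can return them for Whisper models; a brief optional addition to the example above:

```python
# Optional: segment-level timestamps from the same pipeline
result = pipe(audio_file, chunk_length_s=30, return_timestamps=True)
print(result["chunks"])  # list of {"timestamp": (start, end), "text": ...}
```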
```python
# --- Using the processor and model classes directly ---
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# Load the processor and model (replace with your model ID)
model_id = "your-username/whisper-small-egyptian-arabic"  # <<< replace with your model ID on Hugging Face
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load and preprocess the audio
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)

# Generate the transcription
# Force Arabic transcription via the decoder prompt
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
## ⚠️ Important Note

The original checkpoints were saved with SpeechBrain. This README assumes the model has been converted to the standard Hugging Face Transformers format so that it can be hosted and used with the `pipeline` or `AutoModel` classes. If you are working with the original `.ckpt` files, see the project's main `README.md` and the `infer_whisper_local.py` script for loading instructions.
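For reference, a minimal conversion sketch is shown below. It assumes the SpeechBrain checkpoint stores the wrapped Hugging Face weights under a `model.` key prefix; the checkpoint path and prefix are hypothetical and depend on your training run, so inspect the `.ckpt` contents first.

```python
# Minimal sketch: convert a SpeechBrain-saved Whisper checkpoint to HF format.
# ASSUMPTIONS: the checkpoint path and the "model." key prefix are hypothetical;
# inspect your own .ckpt to confirm how the keys are laid out.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

hf_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
state = torch.load("save/CKPT+latest/model.ckpt", map_location="cpu")  # hypothetical path

# Strip the wrapper prefix so keys match the plain Transformers model
state = {k.removeprefix("model."): v for k, v in state.items()}
missing, unexpected = hf_model.load_state_dict(state, strict=False)
print("missing:", missing)        # should be (nearly) empty if the prefix was right
print("unexpected:", unexpected)

out_dir = "whisper-small-egyptian-arabic-hf"
hf_model.save_pretrained(out_dir)
WhisperProcessor.from_pretrained("openai/whisper-small").save_pretrained(out_dir)
```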
## ✨ Key Features

- Fine-tuned specifically for the Egyptian Arabic dialect, improving recognition quality on that variety.
- Fine-tuned with the SpeechBrain toolkit in combination with Hugging Face Transformers and Accelerate.
## 📦 Installation

Make sure the following dependencies are installed:

```bash
pip install transformers torch
pip install -U ffmpeg-python
```
## 💻 Usage Examples

### Basic Usage
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",
    device=device,
)

audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)
print(result["text"])
```
### Advanced Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# `device` and `audio_file` are defined as in the basic example above
model_id = "your-username/whisper-small-egyptian-arabic"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
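On recent `transformers` releases, Whisper's `generate` also accepts `language` and `task` arguments directly, which supersede `forced_decoder_ids` (exact behavior varies by version):

```python
# Equivalent on newer transformers versions, where forced_decoder_ids is deprecated
predicted_ids = model.generate(input_features, language="ar", task="transcribe")
```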
## 📚 Documentation

### Model Description

| Property | Details |
|---|---|
| Base model | `openai/whisper-small` |
| Language | Arabic (`ar`) |
| Task | Transcription |
| Fine-tuning framework | SpeechBrain |
| Dataset | `MAdel121/arabic-egy-cleaned` |
### Intended Use and Limitations

This model is intended for transcribing speech in the Egyptian Arabic dialect.

Limitations:

- Performance on other Arabic dialects is likely to degrade significantly.
- Performance on noisy audio may vary, since only specific augmentations (DropChunk, DropFreq, DropBitResolution) were applied during training.
- The model may underperform on highly specialized domains or on topics not represented in the fine-tuning dataset.
### Training Data

The model was fine-tuned on the **`MAdel121/arabic-egy-cleaned`** dataset from the Hugging Face Hub, which contains cleaned Egyptian Arabic audio samples with corresponding transcriptions.
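To get a feel for the data, you can load and inspect it with the `datasets` library; the split and config names below follow the commented example in the Quick Start, and the transcription column name is an assumption:

```python
from datasets import load_dataset

# Load the fine-tuning dataset from the Hub
ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test")
print(ds)  # number of rows and column names

sample = ds[0]
print(sample["audio"]["sampling_rate"])  # audio decodes to {"array", "sampling_rate", ...}
# The transcription column name is an assumption; check ds.column_names to confirm.
```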
### Training Procedure

- Framework: SpeechBrain (`speechbrain==1.0.3`) with Hugging Face Transformers (`transformers==4.51.3`) and Accelerate (`accelerate==0.25.0`)
- Base model: `openai/whisper-small`
- Dataset: `MAdel121/arabic-egy-cleaned`
- Epochs: 10
- Optimizer: AdamW (`lr=1e-5`, `weight_decay=0.05`)
- LR scheduler: NewBob (`improvement_threshold=0.0025`, `annealing_factor=0.9`, `patient=0`)
- Warmup steps: 1000
- Batch size: 8 (fixed, no dynamic batching)
- Gradient accumulation: 2 steps (effective batch size: 16)
- Gradient clipping: max norm 5.0
- Mixed precision: not explicitly specified; assumed FP32 or handled by Accelerate/Trainer
- Data augmentation: enabled (`augment_prob_master=0.5`, `min_augmentations=1`, `max_augmentations=3`), randomly applying the following techniques (see the sketch after this list):
  - DropChunk (length: 1600-4800 samples, count: 1-5)
  - DropFreq (count: 1-3)
  - DropBitResolution
- Training environment: Modal Labs (GPU: A100 40GB)
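A hedged sketch of that augmentation setup with SpeechBrain 1.x is shown below; the class and argument names follow `speechbrain.augment`, but the original recipe's exact wiring may differ.

```python
import torch
from speechbrain.augment.augmenter import Augmenter
from speechbrain.augment.time_domain import DropBitResolution, DropChunk, DropFreq

# Approximation of the training-time augmentation described above
augmenter = Augmenter(
    augment_prob=0.5,        # augment_prob_master
    min_augmentations=1,
    max_augmentations=3,
    augmentations=[
        DropChunk(drop_length_low=1600, drop_length_high=4800,
                  drop_count_low=1, drop_count_high=5),
        DropFreq(drop_freq_count_low=1, drop_freq_count_high=3),
        DropBitResolution(),
    ],
)

wavs = torch.randn(4, 16000)   # a toy batch of 1 s waveforms at 16 kHz
lengths = torch.ones(4)        # relative lengths in [0, 1]
aug_wavs, aug_lengths = augmenter(wavs, lengths)
```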
### Evaluation Results

The model was evaluated on the test split of the `MAdel121/arabic-egy-cleaned` dataset.

| Metric | Value (%) |
|---|---|
| Word Error Rate (WER) | 22.69 |
| Character Error Rate (CER) | 16.70 |

Lower WER and CER are better.

Validation metrics at the end of training (epoch 10):

- Validation WER: 22.79%
- Validation CER: 16.76%
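For reference, WER/CER can be computed along these lines with the `evaluate` library (an illustrative sketch, not the project's own evaluation script):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Toy example; in practice, collect pipe(...)["text"] outputs over the test split
predictions = ["السلام عليكم"]
references = ["السلام عليكم ورحمة الله"]

print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
```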
### Citation

If you use this model, please consider citing the original Whisper paper and the dataset used:
```bibtex
@article{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2023}
}

@misc{adel_mohamed_2024_12860997,
  author    = {Adel Mohamed},
  title     = {MAdel121/arabic-egy-cleaned},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.12860997},
  url       = {https://doi.org/10.5281/zenodo.12860997}
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```
## 🔧 Technical Details

This model is fine-tuned from `openai/whisper-small` using the SpeechBrain toolkit together with Hugging Face Transformers and Accelerate. Training used the AdamW optimizer with a NewBob learning-rate scheduler, and data augmentation was applied to improve robustness.
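As an illustration of the NewBob schedule mentioned above, SpeechBrain exposes it as `NewBobScheduler`; a minimal sketch with the hyperparameters reported in this card:

```python
from speechbrain.nnet.schedulers import NewBobScheduler

# LR anneals by `annealing_factor` whenever validation improvement
# falls below `improvement_threshold` for more than `patient` epochs
scheduler = NewBobScheduler(
    initial_value=1e-5,
    improvement_threshold=0.0025,
    annealing_factor=0.9,
    patient=0,
)

# Called once per epoch with the validation metric (e.g., WER as a fraction)
old_lr, new_lr = scheduler(metric_value=0.2279)
print(old_lr, new_lr)
```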
## 📄 License

This project is released under the MIT License.

### Model Card Authors

[Your Name/Organization]

(Based on training run `ceeu3g6c`)