# Whisper Small Egyptian Arabic
🚀 A Whisper Small model fine-tuned for Egyptian Arabic automatic speech recognition

This repository contains an `openai/whisper-small` model fine-tuned specifically for automatic speech recognition (ASR) in the Egyptian Arabic dialect. The model was fine-tuned with the SpeechBrain toolkit on the `MAdel121/arabic-egy-cleaned` dataset.
## 🚀 Quick Start

You can use this model directly with the `transformers` `pipeline` for automatic speech recognition. Make sure `transformers` and `torch` are installed (`pip install transformers torch`).
```python
from transformers import pipeline
import torch

# Make sure ffmpeg is available for audio decoding
# pip install -U ffmpeg-python  # or install the ffmpeg binary via your system package manager

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Replace "your-username/whisper-small-egyptian-arabic" with the actual model ID on the Hub
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",  # <<< replace this
    device=device,
)

# Load your audio file (requires ffmpeg)
# For a local file:
audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)  # tune batch_size to your GPU memory

# For audio loaded with the datasets library:
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test")  # example
# sample = ds[0]["audio"]
# result = pipe(sample.copy())  # pass a copy to avoid mutating the original data

print(result["text"])
```
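If you also need timestamps, the ASR pipeline can return them for Whisper models; a brief optional addition to the example above:

```python
# Optional: segment-level timestamps from the same pipeline
result = pipe(audio_file, chunk_length_s=30, return_timestamps=True)
print(result["chunks"])  # list of {"timestamp": (start, end), "text": ...}
```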
```python
# --- Using the processor and model classes directly ---
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# Load the processor and model (replace with your model ID)
model_id = "your-username/whisper-small-egyptian-arabic"  # <<< replace with your model ID on Hugging Face
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load and preprocess the audio
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)

# Generate the transcription
# Force Arabic transcription via the decoder prompt
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
## ⚠️ Important Note

The original checkpoints were saved with SpeechBrain. This README assumes the model has been converted to the standard Hugging Face Transformers format so that it can be hosted and used with the `pipeline` or `AutoModel` classes. If you are working with the original `.ckpt` files, see the project's main `README.md` and the `infer_whisper_local.py` script for loading instructions.
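For reference, a minimal conversion sketch is shown below. It assumes the SpeechBrain checkpoint stores the wrapped Hugging Face weights under a `model.` key prefix; the checkpoint path and prefix are hypothetical and depend on your training run, so inspect the `.ckpt` contents first.

```python
# Minimal sketch: convert a SpeechBrain-saved Whisper checkpoint to HF format.
# ASSUMPTIONS: the checkpoint path and the "model." key prefix are hypothetical;
# inspect your own .ckpt to confirm how the keys are laid out.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

hf_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
state = torch.load("save/CKPT+latest/model.ckpt", map_location="cpu")  # hypothetical path

# Strip the wrapper prefix so keys match the plain Transformers model
state = {k.removeprefix("model."): v for k, v in state.items()}
missing, unexpected = hf_model.load_state_dict(state, strict=False)
print("missing:", missing)        # should be (nearly) empty if the prefix was right
print("unexpected:", unexpected)

out_dir = "whisper-small-egyptian-arabic-hf"
hf_model.save_pretrained(out_dir)
WhisperProcessor.from_pretrained("openai/whisper-small").save_pretrained(out_dir)
```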
## ✨ Key Features

- Fine-tuned specifically for the Egyptian Arabic dialect, improving recognition quality on that variety.
- Fine-tuned with the SpeechBrain toolkit in combination with Hugging Face Transformers and Accelerate.
## 📦 Installation

Make sure the following dependencies are installed:

```bash
pip install transformers torch
pip install -U ffmpeg-python
```
## 💻 Usage Examples

### Basic Usage
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",
    device=device,
)

audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)
print(result["text"])
```
### Advanced Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# `device` and `audio_file` are defined as in the basic example above
model_id = "your-username/whisper-small-egyptian-arabic"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
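On recent `transformers` releases, Whisper's `generate` also accepts `language` and `task` arguments directly, which supersede `forced_decoder_ids` (exact behavior varies by version):

```python
# Equivalent on newer transformers versions, where forced_decoder_ids is deprecated
predicted_ids = model.generate(input_features, language="ar", task="transcribe")
```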
## 📚 Documentation

### Model Description

| Property | Details |
|---|---|
| Base model | `openai/whisper-small` |
| Language | Arabic (`ar`) |
| Task | Transcription |
| Fine-tuning framework | SpeechBrain |
| Dataset | `MAdel121/arabic-egy-cleaned` |
### Intended Use and Limitations

This model is intended for transcribing speech in the Egyptian Arabic dialect.

Limitations:

- Performance on other Arabic dialects is likely to degrade significantly.
- Performance on noisy audio may vary, since only specific augmentations (DropChunk, DropFreq, DropBitResolution) were applied during training.
- The model may underperform on highly specialized domains or on topics not represented in the fine-tuning dataset.
### Training Data

The model was fine-tuned on the **`MAdel121/arabic-egy-cleaned`** dataset from the Hugging Face Hub, which contains cleaned Egyptian Arabic audio samples with corresponding transcriptions.
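To get a feel for the data, you can load and inspect it with the `datasets` library; the split and config names below follow the commented example in the Quick Start, and the transcription column name is an assumption:

```python
from datasets import load_dataset

# Load the fine-tuning dataset from the Hub
ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test")
print(ds)  # number of rows and column names

sample = ds[0]
print(sample["audio"]["sampling_rate"])  # audio decodes to {"array", "sampling_rate", ...}
# The transcription column name is an assumption; check ds.column_names to confirm.
```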
### Training Procedure

- Framework: SpeechBrain (`speechbrain==1.0.3`) with Hugging Face Transformers (`transformers==4.51.3`) and Accelerate (`accelerate==0.25.0`)
- Base model: `openai/whisper-small`
- Dataset: `MAdel121/arabic-egy-cleaned`
- Epochs: 10
- Optimizer: AdamW (`lr=1e-5`, `weight_decay=0.05`)
- LR scheduler: NewBob (`improvement_threshold=0.0025`, `annealing_factor=0.9`, `patient=0`)
- Warmup steps: 1000
- Batch size: 8 (fixed, no dynamic batching)
- Gradient accumulation: 2 steps (effective batch size: 16)
- Gradient clipping: max norm 5.0
- Mixed precision: not explicitly specified; assumed FP32 or handled by Accelerate/Trainer
- Data augmentation: enabled (`augment_prob_master=0.5`, `min_augmentations=1`, `max_augmentations=3`), randomly applying the following techniques (see the sketch after this list):
  - DropChunk (length: 1600-4800 samples, count: 1-5)
  - DropFreq (count: 1-3)
  - DropBitResolution
- Training environment: Modal Labs (GPU: A100 40GB)
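A hedged sketch of that augmentation setup with SpeechBrain 1.x is shown below; the class and argument names follow `speechbrain.augment`, but the original recipe's exact wiring may differ.

```python
import torch
from speechbrain.augment.augmenter import Augmenter
from speechbrain.augment.time_domain import DropBitResolution, DropChunk, DropFreq

# Approximation of the training-time augmentation described above
augmenter = Augmenter(
    augment_prob=0.5,        # augment_prob_master
    min_augmentations=1,
    max_augmentations=3,
    augmentations=[
        DropChunk(drop_length_low=1600, drop_length_high=4800,
                  drop_count_low=1, drop_count_high=5),
        DropFreq(drop_freq_count_low=1, drop_freq_count_high=3),
        DropBitResolution(),
    ],
)

wavs = torch.randn(4, 16000)   # a toy batch of 1 s waveforms at 16 kHz
lengths = torch.ones(4)        # relative lengths in [0, 1]
aug_wavs, aug_lengths = augmenter(wavs, lengths)
```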
### Evaluation Results

The model was evaluated on the test split of the `MAdel121/arabic-egy-cleaned` dataset.

| Metric | Value (%) |
|---|---|
| Word Error Rate (WER) | 22.69 |
| Character Error Rate (CER) | 16.70 |

Lower WER and CER are better.

Validation metrics at the end of training (epoch 10):

- Validation WER: 22.79%
- Validation CER: 16.76%
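For reference, WER/CER can be computed along these lines with the `evaluate` library (an illustrative sketch, not the project's own evaluation script):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Toy example; in practice, collect pipe(...)["text"] outputs over the test split
predictions = ["السلام عليكم"]
references = ["السلام عليكم ورحمة الله"]

print("WER (%):", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER (%):", 100 * cer_metric.compute(predictions=predictions, references=references))
```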
### Citation

If you use this model, please consider citing the original Whisper paper and the dataset used:
```bibtex
@article{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2023}
}

@misc{adel_mohamed_2024_12860997,
  author    = {Adel Mohamed},
  title     = {MAdel121/arabic-egy-cleaned},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.12860997},
  url       = {https://doi.org/10.5281/zenodo.12860997}
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```
## 🔧 Technical Details

This model is fine-tuned from `openai/whisper-small` using the SpeechBrain toolkit together with Hugging Face Transformers and Accelerate. Training used the AdamW optimizer with a NewBob learning-rate scheduler, and data augmentation was applied to improve robustness.
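As an illustration of the NewBob schedule mentioned above, SpeechBrain exposes it as `NewBobScheduler`; a minimal sketch with the hyperparameters reported in this card:

```python
from speechbrain.nnet.schedulers import NewBobScheduler

# LR anneals by `annealing_factor` whenever validation improvement
# falls below `improvement_threshold` for more than `patient` epochs
scheduler = NewBobScheduler(
    initial_value=1e-5,
    improvement_threshold=0.0025,
    annealing_factor=0.9,
    patient=0,
)

# Called once per epoch with the validation metric (e.g., WER as a fraction)
old_lr, new_lr = scheduler(metric_value=0.2279)
print(old_lr, new_lr)
```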
## 📄 License

This project is released under the MIT License.

### Model Card Authors

[Your Name/Organization]

(Based on training run `ceeu3g6c`)