whisper-large-v2-french: open-source French speech recognition model trained on more than 2,200 hours of audio

Whisper Large V2 French

Developed by bofenghuang

A French speech recognition model fine-tuned from openai/whisper-large-v2 on more than 2,200 hours of French speech audio.

Speech recognition

Transformers

Language: French · License: Apache-2.0 · #French speech recognition #low word error rate #multi-dataset training

Downloads: 103

Released: 1/11/2023

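Model overview

This model is optimized for French automatic speech recognition (ASR) and performs well on several French speech datasets. It does not predict casing or punctuation.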

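Model features

Multi-dataset training

Combines several high-quality French speech datasets, including Common Voice 11.0, Multilingual LibriSpeech, and VoxPopuli

High performance

Substantially lower word error rate (WER) than the base model on several test sets

Broad applicability

Supports recognition of both standard French and African-accented French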

Model capabilities

French speech-to-text

High-accuracy speech recognition

Handling of different French accents

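Use cases

Speech transcription

French meeting minutes

Transcribe recordings of French-language meetings into written records

Word error rate below 9%

Subtitle generation for French media content

Automatically generate subtitles for French-language video

Word error rate of about 5% on standard French content

Voice assistants

French voice command recognition

Recognize voice commands for French voice assistants or smart-home systems

Performs well across a range of accents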

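🚀 Fine-tuned whisper-large-v2 model for French automatic speech recognition

This model is a fine-tuned version of openai/whisper-large-v2, trained on a composite dataset of more than 2,200 hours of French speech audio. The data comes from the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Fleurs, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure the speech input is sampled at 16 kHz. The model does not predict casing or punctuation.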

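🚀 Quick start

This model can be used for automatic speech recognition in French. Before use, make sure the speech input is sampled at 16 kHz, and keep in mind that the model does not predict casing or punctuation.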
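
The 16 kHz requirement can be checked before any audio reaches the model. The sketch below uses only Python's standard `wave` module and assumes the input is an uncompressed PCM WAV file; the helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of the model card. It only detects a mismatch; the resampling itself is left to an audio library such as the torchaudio resampler used in the advanced usage example.

```python
import io
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, the rate the model card asks for


def wav_sample_rate(source):
    """Return the sampling rate (Hz) of a PCM WAV file path or file-like object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected=EXPECTED_SAMPLE_RATE):
    """Return True when the WAV file is not already at the expected rate."""
    return wav_sample_rate(source) != expected


def make_silent_wav(sample_rate, n_frames=160):
    """Build a tiny in-memory mono 16-bit WAV file (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for rate in (16_000, 44_100):
        print(rate, "needs resampling:", needs_resampling(make_silent_wav(rate)))
```

In practice `source` would be the path of a recording on disk rather than an in-memory buffer.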

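✨ Main features

Fine-tuned from openai/whisper-large-v2 on multiple datasets to improve French speech recognition.
Inference is available through either the 🤗 pipeline or the lower-level 🤗 APIs.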

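📦 Installation

The original documentation does not describe installation steps; see the 🤗 Transformers installation instructions. The usage examples additionally import torch, torchaudio, and datasets.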

💻 Usage examples

Basic usage

Inference with the 🤗 pipeline:

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-large-v2-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
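
The example ends with a reminder to normalise predictions if necessary, which matters when comparing against references because the model does not predict casing or punctuation. Below is a minimal sketch of such a normaliser, loosely mirroring the evaluation preprocessing described in the performance section (punctuation removed except apostrophes). The function name `normalise_french` and the exact set of characters kept are assumptions for illustration, not the author's actual normaliser; in particular, replacing hyphens with spaces is a choice made here.

```python
import re
import unicodedata

# Assumption: keep letters (including accented ones), digits, apostrophes, and spaces.
_NOT_KEPT = re.compile(r"[^\w' ]+")
_MULTI_SPACE = re.compile(r"\s+")


def normalise_french(text):
    """Lowercase text and strip punctuation except apostrophes."""
    text = unicodedata.normalize("NFKC", text)  # compose accents, unify variants
    text = text.replace("\u2019", "'")          # typographic apostrophe -> ASCII
    text = text.lower()
    text = _NOT_KEPT.sub(" ", text)             # punctuation and hyphens -> space
    text = text.replace("_", " ")               # \w also matches underscores
    return _MULTI_SPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalise_french("Bonjour, l\u2019école est-elle ouverte ?"))
```

Applying the same function to both references and predictions keeps the comparison consistent.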

Advanced usage

Inference with the lower-level 🤗 APIs:

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-large-v2-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-large-v2-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary

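📚 Detailed documentation

Performance

WER of the pretrained models

The table below lists the word error rate (WER, in %) of the pretrained models on Common Voice 9.0, Multilingual LibriSpeech, VoxPopuli, and Fleurs. These results are taken from the original paper.

Model	Common Voice 9.0	MLS	VoxPopuli	Fleurs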
openai/whisper-small	22.7	16.2	15.7	15.0
openai/whisper-medium	16.0	8.9	12.2	8.7
openai/whisper-large	14.7	8.9	11.0	7.7
openai/whisper-large-v2	13.9	7.3	11.4	8.3
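
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance and returns a percentage. The function name `word_error_rate` is illustrative, and this is not the evaluation script behind the tables; it assumes both strings are already normalised and split on whitespace.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate, in percent, of hypothesis against reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("bonjour tout le monde", "bonjour tous le monde"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.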

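WER of the fine-tuned models

The table below lists the WER (in %) of the fine-tuned models on Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, and Fleurs. Note that these evaluation sets were filtered and preprocessed to contain only French characters, with all punctuation removed except apostrophes. Results are given as WER (greedy search) / WER (beam search with beam width 5).

Model	Common Voice 11.0	MLS	VoxPopuli	Fleurs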
bofenghuang/whisper-small-cv11-french	11.76 / 10.99	9.65 / 8.91	14.45 / 13.66	10.76 / 9.83
bofenghuang/whisper-medium-cv11-french	9.03 / 8.54	6.34 / 5.86	11.64 / 11.35	7.13 / 6.85
bofenghuang/whisper-medium-french	9.03 / 8.73	4.60 / 4.44	9.53 / 9.46	6.33 / 5.94
bofenghuang/whisper-large-v2-cv11-french	8.05 / 7.67	5.56 / 5.28	11.50 / 10.69	5.42 / 5.05
bofenghuang/whisper-large-v2-french	8.15 / 7.83	4.20 / 4.03	9.10 / 8.66	5.22 / 4.98
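
To put the greedy and beam-search columns in perspective, the short sketch below computes the relative WER reduction from beam search for bofenghuang/whisper-large-v2-french, using only the values in the last row of the table. The dictionary and function names are illustrative.

```python
# (greedy WER, beam-search WER) for bofenghuang/whisper-large-v2-french,
# copied from the last row of the table.
RESULTS = {
    "Common Voice 11.0": (8.15, 7.83),
    "MLS": (4.20, 4.03),
    "VoxPopuli": (9.10, 8.66),
    "Fleurs": (5.22, 4.98),
}


def relative_reduction(greedy, beam):
    """Relative WER reduction, in percent, when moving from greedy to beam search."""
    return 100.0 * (greedy - beam) / greedy


if __name__ == "__main__":
    for dataset, (greedy, beam) in RESULTS.items():
        gain = relative_reduction(greedy, beam)
        print(f"{dataset}: {greedy:.2f} -> {beam:.2f} ({gain:.1f}% relative reduction)")
```

On these four test sets the relative reduction comes out at roughly 4 to 5%, which can be weighed against the extra decoding cost of a beam width of 5.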