wav2vec2-base-10k-voxpopuli-ft-en open-source model - free deployment for accurate English speech recognition

Wav2vec2 Base 10k Voxpopuli Ft En

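Developed by Facebook

A Wav2Vec2 base model pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on English transcribed data, suited to English speech recognition.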

Speech recognition

Transformers

English #English speech recognition #VoxPopuli fine-tuning #Unsupervised pretraining

Downloads: 40

Released: 3/2/2022

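Model overview

This is Facebook's Wav2Vec2 base model, pretrained on the VoxPopuli corpus and fine-tuned on English transcribed data. It is intended primarily for English automatic speech recognition (ASR).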

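Model features

VoxPopuli pretraining

Pretrained on the 10K unlabeled subset of VoxPopuli, a large-scale multilingual speech corpus

Fine-tuning on English transcriptions

Fine-tuned on English transcribed data to improve English speech recognition performance

End-to-end speech recognition

Produces text directly from raw audio input, with no separate hand-crafted feature extraction step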

Model capabilities

English speech recognition

Audio transcription

Automatic speech-to-text

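Use cases

Speech transcription

Meeting notes

Automatically transcribe recordings of English-language meetings into written records

Podcast transcription

Convert English podcast content into searchable text

Assistive technology

Speech-to-text tools

Provide real-time speech-to-text services for people with hearing impairments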

🚀 Wav2Vec2-Base-VoxPopuli-Finetuned

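This project is based on Facebook's Wav2Vec2 base model, which was pretrained on the 10K unlabeled subset of the VoxPopuli corpus and fine-tuned on English transcribed data (see Table 1 of the paper for more information).

✨ Key features

A pretrained Wav2Vec2 base model fine-tuned on English transcribed data, suited to English speech recognition tasks.
Can be used to run inference on the Common Voice dataset.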

📚 Documentation

Paper

Title: VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Authors: Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux (Facebook AI)

💻 Usage example

Basic usage

#!/usr/bin/env python3
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torchaudio
import torch

# load the fine-tuned model and its processor (feature extractor + tokenizer)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-en")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-en")

# load dataset
ds = load_dataset("common_voice", "en", split="validation[:1%]")

# Common Voice audio is sampled at 48 kHz, but the model expects 16 kHz input
common_voice_sample_rate = 48000
target_sample_rate = 16000

# build the resampler once and reuse it for every clip
resampler = torchaudio.transforms.Resample(common_voice_sample_rate, target_sample_rate)


# mapping function: read one sound file, resample it, and keep the first channel
def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])  # shape: (channels, samples)
    speech = resampler(speech)
    batch["speech"] = speech[0]
    return batch


# load all audio files
ds = ds.map(map_to_array)

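# prepare the first 5 samples as a padded batch of 16 kHz waveforms
inputs = processor(ds[:5]["speech"], sampling_rate=target_sample_rate, return_tensors="pt", padding=True)

# inference: no gradients are needed, so skip building the autograd graph
with torch.no_grad():
    logits = model(**inputs).logits

# pick the most likely token for each audio frame
predicted_ids = torch.argmax(logits, dim=-1)

# decode the per-frame token ids into text
print(processor.batch_decode(predicted_ids))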
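
The last two steps of the example (an argmax over the logits followed by `batch_decode`) amount to greedy CTC decoding. The sketch below illustrates the idea in plain Python and is not the library's implementation. The toy vocabulary, the blank id of 0, the `|` word delimiter, and the helper name `greedy_ctc_decode` are all assumptions made for illustration. The real vocabulary and special tokens ship with the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary (assumption); the real one comes from the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumption: this token marks word boundaries


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Turn per-frame argmax ids into text using the greedy CTC rule."""
    # 1. collapse runs of identical consecutive ids into a single id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. drop blank tokens (a blank between two equal ids keeps both)
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. join the characters and turn word delimiters into spaces
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# one id per audio frame, as produced by an argmax over the vocabulary axis
frame_ids = [2, 2, 0, 3, 3, 4, 0, 4, 4, 5, 0, 1, 1, 2, 0, 3]
print(greedy_ctc_decode(frame_ids))  # prints "hello he"
```

The blank between the two runs of `4` is what lets the double "l" survive step 1. Without it, the repeated frames would collapse into a single character.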