bp400-xlsr: open-source speech recognition model for Brazilian Portuguese, free to deploy

Bp400 Xlsr

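Developed by lgris

A Wav2vec 2.0 speech recognition model fine-tuned on Brazilian Portuguese datasets for automatic speech recognition in Brazilian Portuguese.

Speech recognition

Transformers

License: Apache-2.0 #Brazilian Portuguese speech recognition #multi-dataset training #low WER

Downloads: 55

Released: 3/2/2022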

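Model overview

An automatic speech recognition (ASR) system for Brazilian Portuguese, built on the Wav2vec 2.0 architecture and fine-tuned on several Brazilian Portuguese datasets.

Model features

Multi-dataset training

The training data combines 7 Brazilian Portuguese datasets, including CETUC and Common Voice, for more than 400 hours in total.

Language model support

Decoding with a 4-gram language model lowers the average WER from 12.4% to 10.5%.

High accuracy

WER is 3.0% on the CETUC test set and 9.6% on the Common Voice test set, both with the 4-gram language model.

Model capabilities

Brazilian Portuguese speech recognition

Audio transcription

Speech-to-text

Use cases

Speech transcription

Brazilian Portuguese speech transcription

Converts Brazilian Portuguese speech into text

Reaches 3.0% WER on the CETUC dataset

Voice assistants

Brazilian Portuguese voice command recognition

Recognizes spoken commands in Brazilian Portuguese voice assistant systems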

🚀 bp400-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) datasets

This is a demonstration of a Wav2vec model fine-tuned for Brazilian Portuguese using the following datasets:

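CETUC: contains approximately 145 hours of Brazilian Portuguese speech from 50 male and 50 female speakers, each reading approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
Common Voice 7.0: a project started by the Mozilla Foundation to create open datasets in many languages. Volunteers donate and validate speech through the official site.
[Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil-UFPA" is a dataset used by the Fala Brasil group to benchmark automatic speech recognition (ASR) systems in Brazilian Portuguese. It contains 35 speakers (10 of them female), each reading 20 unique sentences, for 700 utterances in total. The audio was recorded at 22.05 kHz without environment control.
Multilingual Librispeech (MLS): a large-scale multilingual dataset based on public-domain audiobook recordings such as LibriVox. It contains a total of 6,000 hours of transcribed data across many languages. The Portuguese subset used here (mostly the Brazilian variant) has approximately 284 hours of speech from 55 audiobooks read by 62 speakers.
Multilingual TEDx: audio recordings of TEDx talks in 8 source languages. The Portuguese subset (mostly the Brazilian variant) contains 164 hours of transcribed speech.
Sidney (SID): 5,777 utterances recorded by 72 speakers (20 of them female) aged 17 to 59. The dataset also records each speaker's place of birth, age, gender, education, and occupation.
VoxForge: a project that builds open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sampling rates ranging from 16 kHz to 44.1 kHz.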

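These datasets were combined into a larger Brazilian Portuguese dataset. All data was used for training except the Common Voice dev and test sets, which were used for validation and testing respectively. A test set was also created for every dataset gathered.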

Dataset	Train	Valid	Test
CETUC	93.9h	--	5.4h
Common Voice	37.6h	8.9h	9.5h
LaPS BM	0.8h	--	0.1h
MLS	161.0h	--	3.7h
Multilingual TEDx (Portuguese)	144.2h	--	1.8h
SID	5.0h	--	1.0h
VoxForge	2.8h	--	0.1h
Total	437.2h	8.9h	21.6h

The original model was fine-tuned using fairseq. This project uses a converted version of that model; the original fairseq model is available [here](https://drive.google.com/drive/folders/1eRUExXRF2XK8JxUjIzbLBkLa5wuR3nig?usp=sharing).

Summary of results (WER)

	CETUC	CV	LaPS	MLS	SID	TEDx	VF	AVG
bp_400 (demonstration below)	0.052	0.140	0.074	0.117	0.121	0.245	0.118	0.124
bp_400 + 3-gram	0.033	0.095	0.046	0.123	0.112	0.212	0.123	0.106
bp_400 + 4-gram (demonstration below)	0.030	0.096	0.043	0.106	0.118	0.229	0.117	0.105
bp_400 + 5-gram	0.033	0.094	0.043	0.123	0.111	0.210	0.123	0.105
bp_400 + Transf.	0.032	0.092	0.036	0.130	0.115	0.215	0.125	0.106
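
To make the effect of the language model easier to read, the relative WER reduction per dataset can be computed from the two rows marked as demonstrated. This is a minimal sketch; the values are copied from the table above and `relative_reduction` is a helper name introduced here for illustration.

```python
# WER values copied from the summary table (baseline vs. 4-gram rows).
datasets = ["CETUC", "CV", "LaPS", "MLS", "SID", "TEDx", "VF"]
baseline = [0.052, 0.140, 0.074, 0.117, 0.121, 0.245, 0.118]
four_gram = [0.030, 0.096, 0.043, 0.106, 0.118, 0.229, 0.117]


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed: (before - after) / before."""
    return (before - after) / before


reductions = {
    name: relative_reduction(b, a)
    for name, b, a in zip(datasets, baseline, four_gram)
}

for name, value in reductions.items():
    print(f"{name:6s} {value:6.1%}")
```

With these figures the 4-gram model removes roughly 42% of the errors on CETUC but less than 1% on VoxForge, so the benefit depends heavily on the test set.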

Transcription examples

Text	Transcription
alguém sabe a que horas começa o jantar	alguém sabe a que horas começo jantar
lila covas ainda não sabe o que vai fazer no fundo	lilacovas ainda não sabe o que vai fazer no fundo
que tal um pouco desse bom spaghetti	quetá um pouco deste bom ispaguete
hong kong em cantonês significa porto perfumado	rongkong en cantones significa porto perfumado
vamos hackear esse problema	vamos rackar esse problema
apenas a poucos metros há uma estação de ônibus	apenas ha poucos metros á uma estação de ônibus
relâmpago e trovão sempre andam juntos	relampagotrevão sempre andam juntos
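
To show how pairs like these translate into a WER figure, here is a minimal sketch of word error rate as a word-level edit distance divided by the number of reference words. It is a hand-written illustration, not the jiwer implementation used later on this page, and `word_error_rate` is a name chosen here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


pairs = [
    ("alguém sabe a que horas começa o jantar",
     "alguém sabe a que horas começo jantar"),
    ("vamos hackear esse problema",
     "vamos rackar esse problema"),
]
for ref, hyp in pairs:
    print(f"{word_error_rate(ref, hyp):.2f}  {hyp}")
```

The first pair has one substitution and one deletion over 8 reference words, the second one substitution over 4 words, so both score 0.25.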

🚀 Quick start

Usage example

MODEL_NAME = "lgris/bp400-xlsr"

Imports and dependencies

%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip

import jiwer
import torchaudio
from datasets import load_dataset
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from pyctcdecode import build_ctcdecoder
import torch
import re
import sys

Helper functions

chars_to_ignore_regex = r'[\,\?\.\!\;\:\"]'  # punctuation stripped from the reference text

def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()  # assumes single-channel audio
    batch["sampling_rate"] = 16_000  # assumes the file is already at 16 kHz; no resampling is done
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch

def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # e.g. empty reference string; skip the pair
            pass
    wer = sum(wers)/len(wers)
    mer = sum(mers)/len(mers)
    wil = sum(wils)/len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
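
`calc_metrics` averages the per-utterance scores, so a two-word utterance weighs as much as a long one. The sketch below contrasts that average with a corpus-level WER (total errors over total reference words). It uses a hand-written word edit distance instead of jiwer, and the sentence pairs are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two sentences."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical data: one long utterance recognized perfectly, one short one with an error.
truths = ["apenas a poucos metros há uma estação de ônibus", "bom dia"]
hypos = ["apenas a poucos metros há uma estação de ônibus", "bom tia"]

errors = [word_errors(t, h) for t, h in zip(truths, hypos)]
lengths = [len(t.split()) for t in truths]

# Same aggregation as calc_metrics: mean of the per-utterance rates.
per_utterance_mean = sum(e / n for e, n in zip(errors, lengths)) / len(truths)
# Alternative aggregation: pool all words before dividing.
corpus_level = sum(errors) / sum(lengths)

print(f"mean of per-utterance WER: {per_utterance_mean:.3f}")
print(f"corpus-level WER:          {corpus_level:.3f}")
```

Here the per-utterance mean is 0.25 while the corpus-level figure is 1/11, so the two aggregations should not be compared directly.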

Model definition

class STT:

    def __init__(self, 
                 model_name, 
                 device='cuda' if torch.cuda.is_available() else 'cpu', 
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(), 
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:            
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"], 
                                  sampling_rate=batch["sampling_rate"][0], 
                                  padding=True, 
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
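
Without a language model, `batch_predict` takes the most likely token per frame and lets the processor turn the ids into text. The sketch below illustrates the usual greedy CTC decoding rule behind that step: merge repeated ids, drop the blank, and map the word delimiter to a space. The vocabulary, the blank id, and the function name are assumptions made for this illustration; the real mapping comes from the model's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one comes from processor.tokenizer.get_vocab().
VOCAB = {0: "<pad>", 1: "|", 2: "o", 3: "s"}
BLANK_ID = 0          # assumed: the padding token acts as the CTC blank
WORD_DELIMITER = "|"  # assumed: word separator token


def greedy_ctc_decode(frame_ids):
    """Collapse per-frame argmax ids into a string."""
    # 1) merge runs of identical ids
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) drop blanks, 3) map the remaining ids to characters
    chars = [VOCAB[t] for t in collapsed if t != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# The blank between the two runs of "s" keeps the double letter in "osso".
frame_ids = [2, 1, 2, 3, 3, 0, 3, 2, 2]
print(greedy_ctc_decode(frame_ids))
```

The blank is what lets CTC emit a doubled letter: without the `0` between the runs of `3`, the two "s" frames would merge into one.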

Download the datasets

%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
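
The `load_data` helper reads `{dataset}/test.csv`, and `map_to_array` accesses the `path` and `sentence` fields of each row. The sketch below writes and reads back a file with that assumed layout using only the standard library. The directory name, audio paths, and rows are hypothetical, and the files in the real archive may carry additional columns.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows; the real files ship inside the downloaded archive.
rows = [
    {"path": "audio/0001.wav", "sentence": "vamos hackear esse problema"},
    {"path": "audio/0002.wav", "sentence": "relâmpago e trovão sempre andam juntos"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example_dataset" / "test.csv"
    csv_path.parent.mkdir(parents=True)

    # Write a header plus one line per utterance.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "sentence"])
        writer.writeheader()
        writer.writerows(rows)

    # Read it back the way a CSV loader would see it.
    with csv_path.open(newline="", encoding="utf-8") as f:
        loaded = [dict(row) for row in csv.DictReader(f)]

print(loaded[0]["path"], "->", loaded[0]["sentence"])
```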

Tests

Baseline test

stt = STT(MODEL_NAME)

Test on the CETUC dataset

ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

Output:

CETUC WER: 0.05159104708285062

Test on the Common Voice dataset

ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

Output:

CV WER: 0.14031426198658084

Test on the LaPS dataset

ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

Output:

Laps WER: 0.07432133838383838

Test on the MLS dataset

ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

Output:

MLS WER: 0.11678793514817509

Test on the SID dataset

ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)

Output:

Sid WER: 0.12152357273433984

Test on the TEDx dataset

ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)

Output:

TEDx WER: 0.24666815906766504

Test on the VoxForge dataset

ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)

Output:

VoxForge WER: 0.11873106060606062

Tests with a language model (LM)

!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # model trained on Wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # model trained on Brazilian Portuguese data
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
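
As a rough intuition for why an n-gram model helps, LM-aided CTC decoders typically rank hypotheses by the acoustic score plus a weighted language-model score and a per-word bonus. The sketch below applies that scoring to two candidates. All scores and weights are invented for illustration, `fused_score` is a name introduced here, and the code does not reproduce the beam search performed by pyctcdecode.

```python
# Hypothetical candidates for one utterance:
# (text, acoustic log-probability, language-model log-probability)
candidates = [
    ("quetá um pouco", -4.0, -14.0),
    ("que tal um pouco", -4.5, -6.0),
]

ALPHA = 0.5  # assumed weight of the language-model score
BETA = 1.5   # assumed bonus per word, offsetting the LM's preference for short outputs


def fused_score(text, acoustic_logp, lm_logp, alpha=ALPHA, beta=BETA):
    """Acoustic score + weighted LM score + word-count bonus."""
    return acoustic_logp + alpha * lm_logp + beta * len(text.split())


best_acoustic = max(candidates, key=lambda c: c[1])[0]
best_fused = max(candidates, key=lambda c: fused_score(*c))[0]

print("acoustic only:", best_acoustic)
print("with LM      :", best_fused)
```

In this toy case the acoustic score alone prefers the run-together spelling, while the combined score prefers the hypothesis made of words the language model has seen.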

Test on the CETUC dataset with LM

ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

Output:

CETUC WER: 0.030266462438593742

Test on the Common Voice dataset with LM

ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

Output:

CV WER: 0.09577710237417715

Test on the LaPS dataset with LM

ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

Output:

Laps WER: 0.043617424242424235

Test on the MLS dataset with LM

ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

Output:

MLS WER: 0.10642133314350002

Test on the SID dataset with LM

ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)

Output:

Sid WER: 0.11839021001747055

Test on the TEDx dataset with LM

ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)

Output:

TEDx WER: 0.22929952467810416

Test on the VoxForge dataset with LM

ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)

Output:

VoxForge WER: 0.11716314935064935

Model information

Property	Details
Model type	bp400-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) datasets
Training data	CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech (MLS), Multilingual TEDx, Sidney (SID), VoxForge
License	apache-2.0