语言: 英文
数据集:
- librispeech_asr
标签:
- 音频
- 自动语音识别
许可证: apache-2.0
小部件:
- 示例标题: Librispeech 样本1
来源: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- 示例标题: Librispeech 样本2
来源: https://cdn-media.huggingface.co/speech_samples/sample2.flac
Wav2Vec2-Base-960h
此仓库是对官方Facebook的wav2vec的重新实现。
关于如何将wav2vec预训练模型转换为pytorch.bin文件的描述尚未提供。
我们正在从预训练模型重建pytorch.bin文件。
以下是转换方法。
pip install transformers[sentencepiece]
pip install fairseq -U
git clone https://github.com/huggingface/transformers.git
cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt
mkdir dict
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt
mkdir outputs
python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict
使用方法
要将音频文件转录,可以按以下方式使用该模型作为独立的声学模型:
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
def map_to_array(batch):
speech, _ = sf.read(batch["file"])
batch["speech"] = speech
return batch
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)
input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
评估
此代码片段展示了如何在LibriSpeech的“clean”和“other”测试数据上评估facebook/wav2vec2-base-960h。
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
from jiwer import wer
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
def map_to_array(batch):
speech, _ = sf.read(batch["file"])
batch["speech"] = speech
return batch
librispeech_eval = librispeech_eval.map(map_to_array)
def map_to_pred(batch):
input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
with torch.no_grad():
logits = model(input_values.to("cuda")).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
batch["transcription"] = transcription
return batch
result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
print("WER:", wer(result["text"], result["transcription"]))
结果(WER):
参考
Facebook的Wav2Vec2
Facebook的huggingface Wav2Vec2
论文