---
language: el
datasets:
- common_voice
- CSS10 Greek single-speaker speech dataset
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: V XLSR Wav2Vec2 Large 53 - Greek
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice el
      type: common_voice
      args: el
    metrics:
    - name: Test WER
      type: wer
      value: 18.996669
    - name: Test CER
      type: cer
      value: 5.781874
---
# Wav2Vec2-Large-XLSR-53-Greek
Fine-tuned facebook/wav2vec2-large-xlsr-53 on Greek using Common Voice and the CSS10 Greek single-speaker speech dataset. When using this model, make sure that your speech input is sampled at 16kHz.
## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "el", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: load each audio file and resample it to 16kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation
The model can be evaluated as follows on the Greek test data of Common Voice:
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'

# ς sounds identical to σ and only appears at the end of words.
normalize_greek_letters = {"ς": "σ"}
# Strip Latin letters, stray punctuation, and accent marks from transcripts.
remove_chars_greek = {"a": "", "h": "", "n": "", "g": "", "o": "", "v": "", "e": "", "r": "", "t": "", "«": "", "»": "", "m": "", '́': '', "·": "", "’": "", '´': ""}
replacements = {**normalize_greek_letters, **remove_chars_greek}

# Common Voice clips come in several sampling rates; map each one to 16kHz.
resampler = {
    48_000: torchaudio.transforms.Resample(48_000, 16_000),
    44100: torchaudio.transforms.Resample(44100, 16_000),
    32000: torchaudio.transforms.Resample(32000, 16_000),
}

# Preprocessing: normalize the transcript and resample the audio to 16kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    for key, value in replacements.items():
        batch["sentence"] = batch["sentence"].replace(key, value)
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler[sampling_rate](speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
# CER: reuse the WER metric by separating every character with a space.
print("CER: {:.2f}".format(100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result["pred_strings"]], references=[" ".join(list(entry)) for entry in result["sentence"]])))
```
**Test Result**: 18.996669 %
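The evaluation script above computes CER by inserting a space between every character and reusing the WER metric, which makes each character count as a "word". Conceptually this is just edit distance over characters; a minimal self-contained sketch (the helper names `levenshtein` and `cer` are illustrative, not part of the model's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # Minimum of deletion, insertion, and (mis)match.
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(prediction, reference) / len(reference)
```

This gives the same ranking as the space-joined WER trick, without depending on the `datasets` metric loader.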
## Training
The Common Voice train dataset was used for training, together with all of CSS10 Greek after text normalization. During text preprocessing, the letter ς was normalized to σ, since the two letters sound identical and ς only appears as the final character of a word; the substitution can therefore be mapped back to the correct orthography unambiguously. Removing the accent marks from all letters also improved the WER noticeably. The model easily reached 17% WER without having fully converged; however, the resulting transcriptions then need more elaborate text post-processing to be corrected, and a language model should fix these errors easily. Another idea worth trying is to map all of ι, η, etc. to a single character, since they are pronounced identically; similarly, merging ο and ω should also help the acoustic part of the model considerably, as these characters map to the same sound. This would, however, require further text normalization.
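The normalization steps described above (ς → σ, accent stripping, and the proposed same-sound vowel merge) can be sketched as follows. This is a hypothetical illustration of the idea, not the exact preprocessing code used for training; the function name and the `merge_vowels` flag are assumptions:

```python
import unicodedata

# Proposed merge of same-sounding vowels: η, υ -> ι and ω -> ο.
VOWEL_MERGE = str.maketrans({"η": "ι", "υ": "ι", "ω": "ο"})

def normalize_greek(text: str, merge_vowels: bool = False) -> str:
    # Lowercase and map word-final ς to σ (identical pronunciation).
    text = text.lower().replace("ς", "σ")
    # NFD splits base letters from combining accent marks (category "Mn"),
    # so dropping those marks strips the accents from all letters.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    if merge_vowels:
        text = text.translate(VOWEL_MERGE)
    return text
```

Note that the vowel merge is lossy: it helps the acoustic model but requires a post-processing step (e.g. a language model) to restore the original spelling.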