chunkformer-large-vie: open-source Vietnamese speech recognition model fine-tuned on about 3,000 hours of speech data

Chunkformer Large Vie

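Developed by khanhld

A large-scale Vietnamese automatic speech recognition model based on the ChunkFormer architecture, fine-tuned on about 3,000 hours of public Vietnamese speech data. It reports a WER of 4.18 on VIVOS and 6.66 on Common Voice.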

Speech recognition

PyTorch

Other: #Vietnamese speech recognition #long-audio processing #low word error rate

Downloads: 1,765

Released: 2/1/2025

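Model Overview

ChunkFormer-Large-Vie is an automatic speech recognition model optimized for Vietnamese. It uses the ChunkFormer architecture and has the lowest average WER (8.31) among the public models in its benchmark.

Model Features

High-accuracy Vietnamese recognition

Achieves state-of-the-art results on the Common Voice Vi and VIVOS datasets, with WERs of 6.66 and 4.18, respectively.

Long-audio processing

Supports long-form transcription; chunked processing keeps memory use and compute cost under control.

Multi-dataset training

Fine-tuned on about 3,000 hours of diverse Vietnamese speech drawn from several datasets, covering a range of scenarios and accents.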

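Model Capabilities

Vietnamese speech recognition

Long-form audio transcription

Real-time speech-to-text

Use Cases

Speech transcription

Meeting notes: automatically transcribe recordings of Vietnamese meetings into text, with high-accuracy transcripts.

Voice assistants: provide speech recognition for Vietnamese voice assistants, with low latency and high accuracy.

Education

Language learning: help learners practice Vietnamese pronunciation and listening, with accurate pronunciation assessment.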

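🚀 ChunkFormer-Large-Vie: a large-scale pretrained ChunkFormer model for Vietnamese automatic speech recognition

ChunkFormer-Large-Vie is a large-scale Vietnamese automatic speech recognition (ASR) model built on the ChunkFormer architecture, which was introduced at ICASSP 2025. It targets both accuracy and efficiency in Vietnamese speech recognition.

🚀 Quick Start

To run Vietnamese automatic speech recognition with the ChunkFormer model, follow these steps:

1. Clone the ChunkFormer repository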

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

2. Download the model checkpoint from Hugging Face

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"

or

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie

Either command places the model checkpoint in a chunkformer-large-vie folder inside the current directory.
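The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package installed in this step and uses its `snapshot_download` function; the helper name `download_checkpoint` and its default target folder are illustrative choices, not part of the ChunkFormer repository.

```python
from pathlib import Path

# Repository id taken from the CLI command shown in this step.
REPO_ID = "khanhld/chunkformer-large-vie"


def download_checkpoint(local_dir="./chunkformer-large-vie"):
    """Download the model repository into local_dir and return that path.

    Hypothetical helper: only snapshot_download comes from huggingface_hub.
    """
    # Imported lazily so this module loads even before huggingface_hub
    # has been installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_checkpoint()` from the chunkformer directory should leave the files in the same folder that the `huggingface-cli` command writes to.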

3. Run the model

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
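If the transcript is needed in structured form, lines in the format shown above can be parsed with a small script. This is a minimal sketch that assumes the output has been captured as text in exactly this `[start] - [end]: text` layout; the `Segment` class and `parse_transcript` function are hypothetical names, not part of the ChunkFormer code.

```python
import re
from dataclasses import dataclass

# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text".
LINE_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] - "
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]: (.*)$"
)


@dataclass
class Segment:
    start_ms: int  # segment start, in milliseconds
    end_ms: int    # segment end, in milliseconds
    text: str


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields to integer milliseconds."""
    total_seconds = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_transcript(output):
    """Return a list of Segment objects; lines that do not match are skipped."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        fields = match.groups()
        segments.append(
            Segment(_to_ms(*fields[0:4]), _to_ms(*fields[4:8]), fields[8])
        )
    return segments
```

Keeping the times as integer milliseconds avoids floating-point rounding when the segments are later converted to another subtitle or caption format.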

For advanced usage, see the ChunkFormer repository.

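✨ Key Features

ChunkFormer architecture: ChunkFormer-Large-Vie is built on the ChunkFormer architecture, which was introduced at ICASSP 2025.
Large-scale pretraining: the model was fine-tuned on about 3,000 hours of public Vietnamese speech data drawn from several datasets.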
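To give a feel for what chunk-based processing with bounded context means, the sketch below splits a sequence of frames into fixed-size chunks and computes, for each chunk, the span it would be allowed to see given a left and right context. It is a generic illustration only: the parameter names mirror the `--chunk_size`, `--left_context_size`, and `--right_context_size` options of `decode.py`, but treating them as frame counts is an assumption, and the actual masked-batch implementation is the one in the ChunkFormer repository and paper.

```python
def chunk_windows(num_frames, chunk_size, left_context, right_context):
    """Return (chunk_start, chunk_end, ctx_start, ctx_end) for every chunk.

    All ranges are half-open [start, end) frame indices. The context span
    is the chunk itself plus up to left_context frames before it and up to
    right_context frames after it, clipped to the sequence boundaries.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    if left_context < 0 or right_context < 0:
        raise ValueError("context sizes must be >= 0")

    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - left_context)
        ctx_end = min(num_frames, end + right_context)
        windows.append((start, end, ctx_start, ctx_end))
    return windows
```

With 300 frames, a chunk size of 64, and 128 frames of context on each side, this yields five chunks, and no chunk ever needs more than 64 + 128 + 128 frames at once, regardless of how long the recording is.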

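📚 Documentation

The ChunkFormer documentation and implementation are publicly available.

🔧 Technical Details

Model Description

ChunkFormer-Large-Vie is a large-scale Vietnamese automatic speech recognition (ASR) model built on the ChunkFormer architecture, which was introduced at ICASSP 2025. The model was fine-tuned on about 3,000 hours of public Vietnamese speech data drawn from several datasets. The authors publish the list of datasets separately.

!!! Note that only the [train-subset] was used to tune the model.

Benchmark Results

We evaluate the model with word error rate (WER). To keep the comparison consistent and fair, we manually applied text normalization, covering numbers, capitalization, and punctuation.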
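As a rough illustration of the metric, WER is the word-level edit distance between the normalized reference and the normalized hypothesis, divided by the number of reference words. The sketch below is not the authors' evaluation script: its normalization only lowercases and strips punctuation, and it leaves out number handling, which is language specific. The function names are illustrative.

```python
import unicodedata


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = unicodedata.normalize("NFC", text).lower()
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return cleaned.split()


def word_error_rate(reference, hypothesis):
    """Return WER as a fraction (multiply by 100 for a percentage)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "xin chào bạn" against the reference "Xin chào, các bạn!" gives one deletion over four reference words, a WER of 0.25. The benchmark figures in this section appear to be reported on the percentage scale.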

Public Models

No.	Model	Parameters	VIVOS	Common Voice	VLSP - Task 1	Average
1	ChunkFormer	110M	4.18	6.66	14.09	8.31
2	vinai/PhoWhisper-large	1.55B	4.67	8.14	13.75	8.85
3	nguyenvulebinh/wav2vec2-base-vietnamese-250h	95M	10.77	18.34	13.33	14.15
4	openai/whisper-large-v3	1.55B	8.81	15.45	20.41	14.89
5	khanhld/wav2vec2-base-vietnamese-160h	95M	15.05	10.78	31.62	19.16
6	homebrewltd/Ichigo-whisper-v0.1	22M	13.46	23.52	21.64	19.54
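The Average column is consistent with the unweighted mean of the three test-set columns. The short check below recomputes it from the values in the table; the data is copied from the rows above, and the helper names are illustrative.

```python
# (VIVOS, Common Voice, VLSP - Task 1, listed average), copied from the table.
ROWS = {
    "ChunkFormer": (4.18, 6.66, 14.09, 8.31),
    "vinai/PhoWhisper-large": (4.67, 8.14, 13.75, 8.85),
    "nguyenvulebinh/wav2vec2-base-vietnamese-250h": (10.77, 18.34, 13.33, 14.15),
    "openai/whisper-large-v3": (8.81, 15.45, 20.41, 14.89),
    "khanhld/wav2vec2-base-vietnamese-160h": (15.05, 10.78, 31.62, 19.16),
    "homebrewltd/Ichigo-whisper-v0.1": (13.46, 23.52, 21.64, 19.54),
}


def mean_wer(vivos, common_voice, vlsp):
    """Unweighted mean of the three per-dataset WER values."""
    return (vivos + common_voice + vlsp) / 3


def max_deviation(rows):
    """Largest gap between a recomputed mean and the listed average."""
    return max(abs(mean_wer(*row[:3]) - row[3]) for row in rows.values())


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(f"{name}: recomputed {mean_wer(*row[:3]):.2f}, listed {row[3]:.2f}")
```

Every recomputed mean is within about 0.01 of the listed average, which is the size of difference one would expect if the published per-dataset figures were rounded before averaging.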

Private Models (API)

No.	Model	VLSP - Task 1
1	ChunkFormer	14.1
2	Viettel	14.5
3	Google	19.5
4	FPT	28.8

📄 License

This model is released under the CC BY-NC 4.0 license.

📖 Citation

If you use this work in your research, please cite:

@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}