Nue ASR: an open-source Japanese speech recognition model that integrates two pre-trained models for fast, accurate transcription

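Nue ASR

Developed by rinna

Nue ASR is an end-to-end Japanese speech recognition model that integrates pre-trained speech and language models, offering high recognition accuracy and fast inference.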

Speech recognition

Transformers

License: Apache-2.0 #JapaneseSpeechRecognition #EndToEndASR #PretrainedModelIntegration

Release date: December 7, 2023

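Model overview

The model provides end-to-end Japanese speech recognition with accuracy comparable to recent ASR models. On a GPU, it can transcribe speech faster than real time.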

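Model features

End-to-end speech recognition

Integrates pre-trained speech and language models into a single end-to-end model.

High performance

Recognition accuracy is comparable to recent ASR models, and inference runs faster than real time on a GPU.

Pre-trained model integration

Initialized with the pre-trained weights of japanese-hubert-base and japanese-gpt-neox-3.6b.

Large-scale training data

Trained on ReazonSpeech v1, a Japanese speech corpus of approximately 19,000 hours.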

Model capabilities

Japanese speech recognition

End-to-end speech-to-text

Real-time speech processing

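Use cases

Speech transcription

Meeting notes

Transcribe recordings of Japanese meetings into text in real time

High-accuracy meeting transcripts

Subtitle generation

Automatically generate subtitles for Japanese video content

Synchronized subtitle files

Voice assistants

Japanese voice command recognition

Recognize and interpret spoken Japanese commands

Accurate command recognition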

🚀 `rinna/nue-asr`

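rinna/nue-asr is an end-to-end speech recognition model that integrates pre-trained speech and language models. It provides Japanese speech recognition with accuracy comparable to recent ASR models and runs faster than real time on a GPU.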

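🚀 Quick start

The code for this model has been tested with Python 3.8.10 and 3.10.12, together with PyTorch 2.1.1 and Transformers 4.35.2. It is expected to be compatible with Python 3.8 or later and with recent PyTorch versions. Transformers 4.33.0 or later is required.

First, install the code needed for inference with this model: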

pip install git+https://github.com/rinnakk/nue-asr.git
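
The version requirements stated above can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical and not part of nue-asr, and the version parsing is deliberately simple (it ignores pre-release ordering).

```python
import itertools
import sys
from importlib import metadata


def parse_version(text):
    """Turn '4.35.2' or '2.1.1+cu121' into a tuple of ints such as (4, 35, 2)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding both to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or later is expected")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if not meets_minimum(installed, "4.33.0"):
            problems.append(f"transformers {installed} is older than 4.33.0")
    return problems
```

Calling `check_environment()` after installation and printing the returned list is one way to catch an outdated Transformers release early.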

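The model provides both a command-line interface and a Python interface.

Command-line usage

The following command transcribes an audio file with the command-line interface. Audio files are automatically downsampled to 16 kHz: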

nue-asr audio1.wav

You can specify multiple audio files:

nue-asr audio1.wav audio2.flac audio3.mp3

DeepSpeed-Inference can be used to speed up inference of the GPT-NeoX module. To use DeepSpeed-Inference, first install DeepSpeed:

pip install deepspeed

Then enable DeepSpeed-Inference as follows:

nue-asr --use-deepspeed audio1.wav

Run nue-asr --help for more information.
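
When the command-line interface needs to be driven from a script, the invocations shown above can be assembled programmatically. This is a sketch that uses only the `nue-asr` command and the `--use-deepspeed` flag documented above; `build_command` and `run_cli` are hypothetical helper names, and the format of the text the CLI prints is not specified here, so the output is returned unparsed.

```python
import subprocess


def build_command(audio_paths, use_deepspeed=False):
    """Build the argument list for one nue-asr invocation."""
    paths = [str(p) for p in audio_paths]
    if not paths:
        raise ValueError("at least one audio file is required")
    command = ["nue-asr"]
    if use_deepspeed:
        command.append("--use-deepspeed")
    command.extend(paths)
    return command


def run_cli(audio_paths, use_deepspeed=False):
    """Run nue-asr and return whatever it writes to standard output."""
    completed = subprocess.run(
        build_command(audio_paths, use_deepspeed),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.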

Python usage

An example of the Python interface:

import nue_asr

model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)

In addition to an audio file path, the nue_asr.transcribe function also accepts audio data as a numpy.array or a torch.Tensor.
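
As an illustration of the array input path, the sketch below reads a WAV file with the standard library and hands the samples to `nue_asr.transcribe` as a NumPy array. It rests on assumptions that this page does not state: that array input should be a mono float waveform sampled at 16 kHz, and that no resampling is applied to arrays. The helpers `read_pcm16_mono` and `transcribe_samples` are hypothetical names.

```python
import array
import sys
import wave


def read_pcm16_mono(path):
    """Read a 16-bit mono PCM WAV file; return (float samples, sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    # Scale 16-bit integers to floats in [-1.0, 1.0).
    return [sample / 32768.0 for sample in pcm], rate


def transcribe_samples(model, tokenizer, samples, rate):
    """Transcribe in-memory samples; assumes 16 kHz mono input (see lead-in)."""
    if rate != 16000:
        raise ValueError("this sketch assumes audio already sampled at 16 kHz")
    # Imported lazily so the WAV helper works without the heavy dependencies.
    import numpy as np
    import nue_asr

    waveform = np.asarray(samples, dtype=np.float32)
    return nue_asr.transcribe(model, tokenizer, waveform).text
```

If the audio is at another sample rate, passing the file path instead lets the library handle the downsampling described above.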

DeepSpeed-Inference can also be used to speed up inference from the Python interface:

import nue_asr

model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
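
Because DeepSpeed is an optional dependency, a script may want to enable it only when the package is present. The following sketch detects the package with the standard library and passes the result to the `use_deepspeed` argument shown above; `deepspeed_available` and `load_asr` are hypothetical helper names, and finding the package does not guarantee that DeepSpeed-Inference will work on a given machine.

```python
import importlib.util


def deepspeed_available():
    """Return True if the deepspeed package can be imported in this environment."""
    return importlib.util.find_spec("deepspeed") is not None


def load_asr(model_name="rinna/nue-asr"):
    """Load model and tokenizer, enabling DeepSpeed-Inference only if installed."""
    import nue_asr  # imported lazily; requires the nue-asr package

    model = nue_asr.load_model(model_name, use_deepspeed=deepspeed_available())
    tokenizer = nue_asr.load_tokenizer(model_name)
    return model, tokenizer
```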

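✨ Key features

Proposes Nue ASR, a novel end-to-end speech recognition model that integrates pre-trained speech and language models.
The name Nue comes from the Japanese word 鵺 (ぬえ, Nue), a creature of Japanese legend and one of the yōkai (妖怪, ようかい).
Provides end-to-end Japanese speech recognition with accuracy comparable to recent ASR models.
Achieves faster-than-real-time speech recognition when run on a GPU.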
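
"Faster than real time" is commonly quantified with the real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the audio is transcribed faster than it plays. The sketch below shows one way to measure it; `real_time_factor` and `timed` are hypothetical helpers, and no measured figures for this model are implied.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, wrapping a transcription call in `timed` and dividing the elapsed time by the clip length gives the RTF for that clip; a 10-second clip processed in 2 seconds has an RTF of 0.2.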

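📚 Detailed documentation

Model architecture

The model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder. The HuBERT and GPT-NeoX components are initialized with pre-trained HuBERT and GPT-NeoX weights, respectively.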
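
To get a feel for the sequence lengths that flow from the audio encoder into the rest of the model, the sketch below counts the frames a HuBERT-style convolutional front end emits for a given number of samples. The kernel and stride values are those of the original HuBERT base configuration and are an assumption here; this page does not state the encoder's layer layout, and the bridge network and decoder are not modeled.

```python
# (kernel_size, stride) pairs of the original HuBERT base feature extractor.
# Assumed for illustration; not stated on this page.
HUBERT_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def hubert_frames(num_samples, layers=HUBERT_CONV_LAYERS):
    """Number of encoder frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short to produce any frame
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length
```

Under these assumptions, one second of 16 kHz audio (16,000 samples) yields 49 frames and a 16-second clip yields 799 frames.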

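Training

The model was trained on the following Japanese speech corpus, ReazonSpeech v1, which contains approximately 19,000 hours of speech. Note that speech samples longer than 16 seconds were excluded before training.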

ReazonSpeech
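
The length-based exclusion described above amounts to a simple duration filter. The sketch below illustrates the idea on a list of (utterance ID, duration) pairs; the data layout and the name `drop_long_samples` are hypothetical, and this is not the authors' preprocessing code.

```python
MAX_SECONDS = 16.0  # threshold stated in the training description


def drop_long_samples(samples, max_seconds=MAX_SECONDS):
    """Keep only (utterance_id, duration_seconds) pairs no longer than max_seconds."""
    return [(uid, duration) for uid, duration in samples if duration <= max_seconds]
```

A clip of exactly 16 seconds is kept here, since the text excludes only samples longer than 16 seconds.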

Contributors

Release date

December 7, 2023

Tokenization

The model uses the same SentencePiece-based tokenizer as japanese-gpt-neox-3.6b.

How to cite

@inproceedings{hono2024integrating,
    title = {Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
    month = {8},
    year = {2024},
    pages = {13289--13305},
    url = {https://aclanthology.org/2024.findings-acl.787}
}

@misc{rinna-nue-asr,
    title = {rinna/nue-asr},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/nue-asr}
}

References

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

@article{hsu2021hubert,
    title = {{HuBERT}: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author = {Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    month = {10},
    year = {2021},
    volume = {29},
    pages = {3451--3460},
    doi = {10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
    author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    month = {8},
    year = {2021},
    version = {0.0.1},
    doi = {10.5281/zenodo.5879544},
    url = {https://www.github.com/eleutherai/gpt-neox}
}

@inproceedings{aminabadi2022deepspeed,
    title = {{DeepSpeed-Inference}: enabling efficient inference of transformer models at unprecedented scale},
    author = {Aminabadi, Reza Yazdani and Rajbhandari, Samyam and Awan, Ammar Ahmad and Li, Cheng and Li, Du and Zheng, Elton and Ruwase, Olatunji and Smith, Shaden and Zhang, Minjia and Rasley, Jeff and others},
    booktitle = {SC22: International Conference for High Performance Computing, Networking, Storage and Analysis},
    year = {2022},
    pages = {1--15},
    doi = {10.1109/SC41404.2022.00051}
}