thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
datasets:
- reazon-research/reazonspeech
tags:
- automatic-speech-recognition
- speech
- audio
- hubert
- gpt_neox
- asr
- nlp
license: apache-2.0
inference: false
rinna/nue-asr

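Overview
[Paper]
[GitHub]
We propose Nue ASR, a novel end-to-end speech recognition model that integrates pre-trained speech and language models.
The name Nue comes from the Japanese word 鵺 (ぬえ, Nue), a creature (妖怪/ようかい/Yōkai) from Japanese legend.
The model performs end-to-end Japanese speech recognition with accuracy comparable to recent ASR models. With a GPU, it can transcribe speech faster than real time.
Benchmark scores, including those of our model, are available at https://rinnakk.github.io/research/benchmarks/asr/.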
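The "faster than real time" statement above is commonly quantified with the real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means transcription finishes before the audio would have finished playing. The following is a minimal sketch of that arithmetic; the helper names and the example numbers are illustrative assumptions, not measurements of this model.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Made-up numbers for illustration: 10 s of audio transcribed in 2.5 s.
rtf = real_time_factor(2.5, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```

To measure a real run, one could wrap the transcription call with `timed(...)` and divide the elapsed time by the duration of the input audio.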
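- Model architecture
  The model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder. The HuBERT and GPT-NeoX components are initialized from pre-trained HuBERT and GPT-NeoX weights, respectively.
- Training
  The model was trained on ReazonSpeech v1, a Japanese speech corpus of approximately 19,000 hours. Note that speech samples longer than 16 seconds were excluded before training.
- Contributors
- Release date
  December 7, 2023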
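The 16-second cutoff mentioned under Training can be pictured as a simple duration filter over the corpus. The sketch below is not the authors' preprocessing code; the function name, the utterance records, and the choice to keep samples of exactly 16 seconds are assumptions made for illustration.

```python
MAX_DURATION_SECONDS = 16.0  # cutoff stated in the Training note


def keep_for_training(num_samples: int, sample_rate: int,
                      max_seconds: float = MAX_DURATION_SECONDS) -> bool:
    """Return True if the utterance is no longer than max_seconds."""
    duration = num_samples / sample_rate
    return duration <= max_seconds


# Hypothetical utterance metadata (not taken from ReazonSpeech).
utterances = [
    {"id": "utt1", "num_samples": 160_000, "sample_rate": 16_000},  # 10 s
    {"id": "utt2", "num_samples": 320_000, "sample_rate": 16_000},  # 20 s
]

kept = [u["id"] for u in utterances
        if keep_for_training(u["num_samples"], u["sample_rate"])]
print(kept)  # ['utt1']
```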
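How to use the model
We tested our code with Python 3.8.10 and 3.10.12, PyTorch 2.1.1, and Transformers 4.35.2. The codebase is expected to be compatible with Python 3.8 or later and with recent versions of PyTorch. Transformers must be version 4.33.0 or later.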
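The version requirements above can be checked before installation with a short script. This is a minimal sketch using only the standard library; the helper names `version_tuple` and `check_environment` are made up for this example, and the version parsing is deliberately simple (it reads only the leading numeric components).

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)               # "Python 3.8 or later"
MIN_TRANSFORMERS = (4, 33, 0)     # "Transformers 4.33.0 or later"


def version_tuple(version: str) -> tuple:
    """Parse leading numeric components, e.g. '4.35.2' -> (4, 35, 2)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so '4.33' compares equal to '4.33.0'
        parts.append(0)
    return tuple(parts)


def check_environment() -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or later is required")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if version_tuple(installed) < MIN_TRANSFORMERS:
            problems.append(f"transformers {installed} is older than 4.33.0")
    return problems


print(check_environment() or "environment looks compatible")
```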
First, install the inference code for this model.
pip install git+https://github.com/rinnakk/nue-asr.git
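Both a command-line interface and a Python interface are provided.
Command-line usage
The following command transcribes an audio file using the command-line interface. Audio files are automatically downsampled to 16 kHz.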
nue-asr audio1.wav
You can specify multiple audio files.
nue-asr audio1.wav audio2.flac audio3.mp3
DeepSpeed-Inference can be used to speed up inference of the GPT-NeoX module. To use it, install DeepSpeed first.
pip install deepspeed
DeepSpeed-Inference can then be enabled as follows:
nue-asr --use-deepspeed audio1.wav
Run nue-asr --help for more information.
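When the command-line interface is driven from a script, the invocations shown above can be assembled programmatically. The sketch below uses only the `nue-asr` executable and the `--use-deepspeed` flag that appear in this section; the helper names are hypothetical, and nothing is executed unless `run_nue_asr` is called explicitly.

```python
import shutil
import subprocess


def build_nue_asr_command(audio_files, use_deepspeed=False):
    """Build the argument list for the nue-asr command-line interface."""
    if not audio_files:
        raise ValueError("at least one audio file is required")
    command = ["nue-asr"]
    if use_deepspeed:
        command.append("--use-deepspeed")
    command.extend(str(path) for path in audio_files)
    return command


def run_nue_asr(audio_files, use_deepspeed=False):
    """Run nue-asr if it is on PATH; raise a clear error otherwise."""
    if shutil.which("nue-asr") is None:
        raise RuntimeError("nue-asr is not installed or not on PATH")
    return subprocess.run(
        build_nue_asr_command(audio_files, use_deepspeed), check=True
    )


command = build_nue_asr_command(["audio1.wav", "audio2.flac"], use_deepspeed=True)
print(command)  # ['nue-asr', '--use-deepspeed', 'audio1.wav', 'audio2.flac']
```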
Python usage
An example of the Python interface:
import nue_asr
model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
In addition to an audio file path, the nue_asr.transcribe function accepts audio data as a numpy.array or torch.Tensor.
DeepSpeed-Inference can also be used to speed up inference through the Python interface.
import nue_asr
model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
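Because DeepSpeed is an optional dependency, a script may want to enable it only when the package is installed. The following is one possible sketch of that pattern; `deepspeed_available` and `load_nue_asr` are hypothetical helpers, and passing `use_deepspeed=False` explicitly is assumed to be equivalent to omitting the argument.

```python
import importlib.util


def deepspeed_available() -> bool:
    """Return True if the deepspeed package can be imported."""
    return importlib.util.find_spec("deepspeed") is not None


def load_nue_asr(model_name: str = "rinna/nue-asr"):
    """Load model and tokenizer, enabling DeepSpeed only if installed."""
    import nue_asr  # imported here so this file loads without nue_asr

    # Assumption: use_deepspeed=False behaves like leaving the argument out.
    model = nue_asr.load_model(model_name, use_deepspeed=deepspeed_available())
    tokenizer = nue_asr.load_tokenizer(model_name)
    return model, tokenizer


print("DeepSpeed available:", deepspeed_available())
```

The model and tokenizer returned by `load_nue_asr()` can then be passed to `nue_asr.transcribe` exactly as in the examples above.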
Tokenization
The model uses the same sentencepiece-based tokenizer as japanese-gpt-neox-3.6b.
How to cite
@inproceedings{hono2024integrating,
title = {Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
month = {8},
year = {2024},
pages = {13289--13305},
url = {https://aclanthology.org/2024.findings-acl.787}
}
@misc{rinna-nue-asr,
title = {rinna/nue-asr},
author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
url = {https://huggingface.co/rinna/nue-asr}
}
References
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
@article{hsu2021hubert,
title = {{HuBERT}: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
author = {Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
month = {10},
year = {2021},
volume = {29},
pages = {3451--3460},
doi = {10.1109/TASLP.2021.3122291}
}
@software{andoniangpt2021gpt,
title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
month = {8},
year = {2021},
version = {0.0.1},
doi = {10.5281/zenodo.5879544},
url = {https://www.github.com/eleutherai/gpt-neox}
}
@inproceedings{aminabadi2022deepspeed,
title = {{DeepSpeed-Inference}: enabling efficient inference of transformer models at unprecedented scale},
author = {Aminabadi, Reza Yazdani and Rajbhandari, Samyam and Awan, Ammar Ahmad and Li, Cheng and Li, Du and Zheng, Elton and Ruwase, Olatunji and Smith, Shaden and Zhang, Minjia and Rasley, Jeff and others},
booktitle = {SC22: International Conference for High Performance Computing, Networking, Storage and Analysis},
year = {2022},
pages = {1--15},
doi = {10.1109/SC41404.2022.00051}
}
License
The Apache 2.0 license