tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
base_model:
- espnet/owsm_ctc_v3.2_ft_1B
license: cc-by-4.0
OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM) project.
This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.
To use the pre-trained model, please install espnet and espnet_model_zoo. The required dependencies are:
librosa
torch
espnet
espnet_model_zoo
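These can typically be installed with pip, for example (exact versions are not pinned by this card):
pip install librosa torch espnet espnet_model_zoo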
The full recipe can be found in the ESPnet repository: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
Example script for batched inference
Speech2TextGreedySearch now provides a unified batched inference method, batch_decode. It performs CTC greedy decoding on a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (using the same procedure as the "long-form ASR/ST" method below).
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

# a single audio file as input; res is a single str, i.e., the predicted text without special tokens
res = s2t.batch_decode(
    "audio.wav",
    batch_size=16,
    context_len_in_secs=4,
)

# a list of audio files as input; res is a list of str
res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],
    batch_size=16,
    context_len_in_secs=4,
)
Example script for short-form ASR/ST/LID
Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed duration of 30s; pad or truncate the input accordingly
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
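Speech translation uses the same interface with different symbols. The sketch below assumes OWSM's token convention, where lang_sym is the source speech language and a task token of the form <st_xxx> selects translation into a target language (e.g. <st_zho> for Chinese); please verify the exact tokens against the model's vocabulary.

# Sketch of short-form speech translation (assumed tokens; verify against the model's token list)
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',      # source speech is English
    task_sym='<st_zho>',   # translate into Chinese (assumed token)
)
res = s2t_st(speech)[0]    # `speech` prepared exactly as in the ASR example above
print(res)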
Example script for long-form ASR/ST
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32           # depends on the GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
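Note that sf.read returns the file's native sampling rate, while the model expects 16kHz audio (see the short-form section above). If the rates differ, resample before calling decode_long_batched_buffered; a minimal sketch using librosa:

import librosa

# resample to the 16kHz rate the model was trained on (sketch)
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000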
Example of CTC forced alignment using ctc-segmentation
CTC segmentation can be efficiently applied to audio of an arbitrary length.
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# download the model and build the aligner
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")
aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"音频时长: {len(speech) / rate : .2f}秒")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
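Each printed line is a Kaldi-style segment (roughly: utterance name, timestamps, a confidence score, and the text). If the alignment is needed on disk, the result object can simply be serialized via str(); a minimal sketch (the output file name is arbitrary):

# write the Kaldi-style alignment result to a file (sketch)
with open("segments.txt", "w") as f:
    f.write(str(segments))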
Citations
OWSM-CTC
@inproceedings{owsm-ctc,
title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
author = "Peng, Yifan and
Sudo, Yui and
Shakeel, Muhammad and
Watanabe, Shinji",
booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
year = "2024",
    month = {8},
url = "https://aclanthology.org/2024.acl-long.549",
}
OWSM v3.1 and v3.2
@inproceedings{owsm-v32,
title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
year={2024},
month={9},
pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
year={2024},
month={9},
pdf="https://arxiv.org/pdf/2401.16658",
}
Initial OWSM (v1, v2, v3)
@inproceedings{owsm,
title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year={2023},
month={12},
pdf="https://arxiv.org/pdf/2309.13876",
}