---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
datasets:
- ami
- dihard
- voxconverse
- aishell
- repere
- voxceleb
license: mit
---
# 🎹 Speaker diarization

Relies on pyannote.audio 2.0: see installation instructions.
## Quick start

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
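The RTTM file written above is plain text, with one `SPEAKER` line per speech turn. As a purely illustrative sketch (not part of the pipeline's API), one such line can be parsed like this, assuming the standard 10-field RTTM layout:

```python
# Parse one SPEAKER line of an RTTM file (standard 10-field layout).
def parse_rttm_line(line):
    fields = line.split()
    return {
        "file": fields[1],              # recording identifier
        "onset": float(fields[3]),      # turn start time, in seconds
        "duration": float(fields[4]),   # turn duration, in seconds
        "speaker": fields[7],           # speaker label assigned by the pipeline
    }

turn = parse_rttm_line("SPEAKER audio 1 6.730 2.250 <NA> <NA> SPEAKER_01 <NA> <NA>")
# turn["speaker"] == "SPEAKER_01", turn["onset"] == 6.73
```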
## Advanced usage

In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```
One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
If you feel adventurous, you can play with the various pipeline hyper-parameters. For instance, one can use a more aggressive voice activity detection by increasing the value of the `segmentation_onset` threshold:

```python
hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)
```
## Benchmark

### Real-time factor

The real-time factor is around 5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part). In other words, it takes approximately 3 minutes to process a one-hour conversation.
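The 3-minute figure follows directly from the real-time factor; a quick back-of-the-envelope check, purely illustrative:

```python
# Estimate processing time from the real-time factor (RTF) reported above.
rtf = 0.05                 # ~5% real-time factor
audio_seconds = 60 * 60    # a one-hour conversation

processing_seconds = rtf * audio_seconds
processing_minutes = processing_seconds / 60  # 3.0 minutes
```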
### Accuracy

This pipeline is benchmarked on a collection of datasets. Processing is fully automatic:

- no manual voice activity detection (as is sometimes the case in the literature)
- no manual number of speakers (though it is possible to provide it to the pipeline)
- no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

...with the least forgiving diarization error rate (DER) setup (named *"Full"* in this paper):
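For reference, the diarization error rate combines three kinds of errors relative to the total duration of reference speech. A minimal sketch of the standard definition (not the benchmark's actual scoring code):

```python
# Standard diarization error rate: the sum of the three error durations
# divided by the total duration of reference speech.
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """All arguments are durations in seconds."""
    return (false_alarm + missed_detection + confusion) / total_speech

# e.g. 10 s false alarm + 20 s missed speech + 30 s speaker confusion
# over 600 s of reference speech -> 10% DER
der = diarization_error_rate(10.0, 20.0, 30.0, 600.0)  # 0.1
```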
## Support

For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check the pyannote.audio GitHub repository.
## Citations

```bibtex
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}
```

```bibtex
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
```