whisper-large-v3-msp-podcast-emotion: an open-source speech emotion recognition model supporting 9 emotion categories

Whisper Large V3 MSP Podcast Emotion

Developed by tiantiaf

A speech emotion recognition model based on Whisper-Large V3, optimized for the MSP-Podcast dataset and supporting 9 emotion categories

Audio classification

Safetensors

English #Speech emotion recognition #Speech-only system #Optimized for short audio

Downloads: 282

Released: 5/22/2025

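Model overview

This model performs speech emotion recognition. It is trained on the MSP-Podcast dataset and is particularly suited to classifying emotion in online content.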

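Model features

Efficient speech-only system

Uses no text transcripts, giving a simple and efficient speech-only emotion recognition system

Diverse emotion categories

Recognizes 9 emotion categories, including anger, happiness, and sadness

Optimized for online content

Particularly suited to classifying emotion in online audio content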

Model capabilities

Speech emotion recognition

Audio classification

Speech feature extraction

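Use cases

Content analysis

Podcast emotion analysis

Analyzes the emotional tendencies in podcast content

Can identify 9 distinct emotional states

Social media monitoring

Monitors the emotional tendencies of audio content on social media

Helps identify content that may carry negative emotions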

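🚀 Whisper-Large V3 for Categorical Emotion Classification

This model performs categorical emotion classification on top of Whisper-Large V3. It identifies a range of emotions directly from speech and can serve as a building block for speech emotion analysis.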

🚀 Quick start

This model can be used for speech emotion classification. The sections below describe how to install and use it.

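✨ Key features

Model implementation: This model implements the categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
Training pipeline: The training pipeline is SAILER, the top-performing solution in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
Training data: The model is trained on MSP-Podcast data. Its predictions may be sensitive to the spoken content, which can be a useful property when classifying emotion in online content.
Supported emotion categories: Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other.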

📦 Installation

Download the repository

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install the package

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

💻 Usage examples

Basic usage

# Load libraries
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper

# Select the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained model from Hugging Face and switch to inference mode
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()

Advanced usage

# Continues from the basic usage above (reuses torch, F, device and model)
# Label list, in the order of the model's output logits
emotion_label_list = [
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]

# Load data; here one second of zeros stands in for real audio
# The training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
# So prepare your audio as 16 kHz, mono, and at most 15 seconds long
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)

# Convert logits to probabilities and print the most likely emotion
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
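The comments in the example mention a 3 to 15 second range at 16 kHz, but do not say how to handle longer recordings. One possible approach, sketched below, is to cut a long recording into consecutive windows of at most 15 seconds and skip any leftover window shorter than 3 seconds. This is an assumption made for illustration, not the authors' documented preprocessing, and `split_into_windows` is a hypothetical helper that is not part of the vox-profile-release package.

```python
SAMPLE_RATE = 16000   # Hz, as stated in the usage comments
MAX_SECONDS = 15      # upper bound mentioned for the training clips
MIN_SECONDS = 3       # shorter clips are described as unreliable


def split_into_windows(num_samples, sample_rate=SAMPLE_RATE,
                       max_seconds=MAX_SECONDS, min_seconds=MIN_SECONDS):
    """Return (start, end) sample ranges of at most max_seconds each.

    Hypothetical helper: windows shorter than min_seconds are dropped
    rather than padded, which is an assumption of this sketch.
    """
    max_len = max_seconds * sample_rate
    min_len = min_seconds * sample_rate
    windows = []
    for start in range(0, num_samples, max_len):
        end = min(start + max_len, num_samples)
        if end - start >= min_len:
            windows.append((start, end))
    return windows


# A 40-second recording gives two 15 s windows and one 10 s window.
windows = split_into_windows(40 * SAMPLE_RATE)
print(windows)
```

Each `(start, end)` pair can then be used to slice a `[1, num_samples]` waveform tensor, for example `data[:, start:end]`, before passing it to the model.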

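📚 Detailed documentation

Model description

This model implements the categorical emotion classification described in Vox-Profile, using the training pipeline of SAILER, the top-performing solution in the INTERSPEECH 2025 Speech Emotion Challenge. Compared with the official challenge submission, this model does not use all of the augmentation methods and does not use transcripts. It is a speech-only system, which keeps the model simple yet still effective.

Supported emotion categories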

[
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]
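To make the mapping from model outputs to these labels concrete, here is a minimal sketch that turns a list of 9 logits into a ranked list of labels with probabilities. It uses plain Python so it runs without the model; the logits are made-up numbers for illustration, and `rank_emotions` is a hypothetical helper. It assumes the output order matches the list above, as in the usage example.

```python
import math

EMOTION_LABELS = [
    "Anger", "Contempt", "Disgust", "Fear", "Happiness",
    "Neutral", "Sadness", "Surprise", "Other",
]


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_emotions(logits, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    if len(logits) != len(EMOTION_LABELS):
        raise ValueError("expected one logit per emotion label")
    probs = softmax(logits)
    ranked = sorted(zip(EMOTION_LABELS, probs),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits for illustration only; real values come from the model.
example_logits = [0.2, -1.0, -0.5, -2.0, 3.1, 1.4, 0.0, -0.3, -1.2]
for label, prob in rank_emotions(example_logits):
    print(f"{label}: {prob:.3f}")
```

With a real prediction for a single clip, the input could be obtained as `logits[0].tolist()`. Looking at the top few labels rather than only the argmax can help when two categories receive similar probabilities.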

📄 License

This model is released under the BSD 2-Clause license.

Citation

If you use this model or find it useful in your work, please cite our paper:

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}