---
license: apache-2.0
datasets:
- KELONMYOSA/dusha_emotion_audio
language:
- ru
pipeline_tag: audio-classification
tags:
- audio
- audio-classification
metrics:
- accuracy
widget:
- example_title: Emotion - "neutral"
  src: https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru/resolve/main/neutral.mp3
- example_title: Emotion - "positive"
  src: https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru/resolve/main/positive.mp3
- example_title: Emotion - "angry"
  src: https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru/resolve/main/angry.mp3
- example_title: Emotion - "sad"
  src: https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru/resolve/main/sad.mp3
- example_title: Emotion - "other"
  src: https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru/resolve/main/other.mp3
---
# Speech Emotion Recognition

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) for a Speech Emotion Recognition (SER) task.

The dataset used to fine-tune the original pre-trained model is the [DUSHA dataset](https://huggingface.co/datasets/KELONMYOSA/dusha_emotion_audio). It consists of about 125,000 audio recordings in Russian covering the four basic emotions that typically appear in dialogue with a virtual assistant: happiness (positive), sadness, anger, and neutral. The model predicts five labels, the four emotions plus a catch-all class:

```python
emotions = ['neutral', 'positive', 'angry', 'sad', 'other']
```
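The dataset is hosted on the Hub, so it can be inspected directly with the `datasets` library. A minimal sketch; the split name and the example fields it prints are assumptions about the dataset layout, so check the dataset card:

```python
from datasets import load_dataset

# Load the DUSHA emotion dataset from the Hugging Face Hub
# (the "train" split name is an assumption; see the dataset card)
ds = load_dataset("KELONMYOSA/dusha_emotion_audio", split="train")

print(ds)     # dataset size and column names
print(ds[0])  # one example: audio data plus its emotion label
```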
## Usage

### Pipeline

```python
from transformers.pipelines import pipeline

pipe = pipeline(model="KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru", trust_remote_code=True)
result = pipe("speech.wav")
print(result)
```

```
[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
```
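The pipeline returns a score for every one of the five labels. If only the most likely emotion is needed, the top-scoring entry can be picked out of the returned list:

```python
# `result` is the list of {label, score} dicts returned by the pipeline above;
# take the entry with the highest score
top = max(result, key=lambda d: d["score"])
print(top["label"], top["score"])  # here: angry 0.98984
```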
### AutoModel

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoConfig, Wav2Vec2Processor, AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru"

config = AutoConfig.from_pretrained(model_name_or_path)
processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
sampling_rate = processor.feature_extractor.sampling_rate
model = AutoModelForAudioClassification.from_pretrained(model_name_or_path, trust_remote_code=True).to(device)


def predict(path):
    # Load the audio and resample it to the rate the model expects
    speech, sr = librosa.load(path, sr=sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    # Forward pass without gradient tracking, then turn logits into probabilities
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]

    outputs = [{"label": config.id2label[i], "score": round(score, 5)} for i, score in enumerate(scores)]
    return outputs


print(predict("speech.wav"))
```

```
[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
```
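The same helper can be mapped over several recordings at once. A small sketch; the file names below are placeholders:

```python
# Apply the predict() helper defined above to several files and
# report only the top label for each (file names are placeholders)
for path in ["speech_01.wav", "speech_02.wav"]:
    scores = predict(path)
    top = max(scores, key=lambda d: d["score"])
    print(f"{path}: {top['label']} ({top['score']})")
```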
## Evaluation

The model achieves the following results:

- Training Loss: 0.528700
- Validation Loss: 0.349617
- Accuracy: 0.901369
| Emotion      | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| neutral      | 0.92      | 0.94   | 0.93     | 15886   |
| positive     | 0.85      | 0.79   | 0.82     | 2481    |
| sad          | 0.77      | 0.82   | 0.79     | 2506    |
| angry        | 0.89      | 0.83   | 0.86     | 3072    |
| other        | 0.99      | 0.74   | 0.85     | 226     |
| accuracy     |           |        | 0.90     | 24171   |
| macro avg    | 0.89      | 0.82   | 0.85     | 24171   |
| weighted avg | 0.90      | 0.90   | 0.90     | 24171   |
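As a quick sanity check, the macro and weighted averages can be reproduced from the per-class rows above. A minimal sketch; note that because the table entries are themselves rounded to two decimals, a recomputed figure can differ from the reported one in the last digit:

```python
# Per-class (precision, recall, f1, support) rows from the table above
rows = {
    "neutral":  (0.92, 0.94, 0.93, 15886),
    "positive": (0.85, 0.79, 0.82, 2481),
    "sad":      (0.77, 0.82, 0.79, 2506),
    "angry":    (0.89, 0.83, 0.86, 3072),
    "other":    (0.99, 0.74, 0.85, 226),
}

total = sum(s for *_, s in rows.values())  # 24171 examples
macro_f1 = sum(f for _, _, f, _ in rows.values()) / len(rows)
weighted_recall = sum(r * s for _, r, _, s in rows.values()) / total

print(round(macro_f1, 2))         # 0.85, matches the macro avg row
print(round(weighted_recall, 2))  # 0.9, matches the weighted avg row (0.90)
```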