Open-source speech emotion recognition system, fine-tuned from Wav2Vec2, that recognizes 7 common emotions

Speech Emotion Recognition With Facebook Wav2vec2 Large Xlsr 53

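Developed by firdhokk

A speech emotion recognition system fine-tuned from the Wav2Vec2 Large XLSR-53 model that recognizes 7 common emotions

Audio Classification

Transformers

License: Apache-2.0 #SpeechEmotionAnalysis #Multilingual #HighAccuracy

Downloads: 66

Released: 9/20/2024

Model Overview

The model fine-tunes Wav2Vec2 Large XLSR-53 for speech emotion classification and recognizes 7 emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise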

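Model Features

High-accuracy emotion recognition

Reaches 91.68% accuracy and an F1 score of 91.66% on the test set

Multi-dataset training

Trained on a combination of the RAVDESS, SAVEE, TESS, and URDU datasets

Efficient feature extraction

Uses the Wav2Vec2 feature extractor to turn audio into normalized model inputs

Model Capabilities

Speech emotion recognition

Audio classification

Multi-class emotion classification

Use Cases

Human-computer interaction

Emotion analysis for intelligent customer service

Analyzes the emotional state in a customer's voice

Helps improve response quality and user experience

Mental health

Emotional state monitoring

Tracks changes in a user's emotions through voice analysis

Supports mental health assessment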

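🎧 Speech Emotion Recognition with Wav2Vec2

This project uses the Wav2Vec2 model for speech emotion recognition. The goal is to classify audio recordings into emotion categories such as happy, sad, and surprised.

🚀 Quick Start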

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

model_id = "firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
model = AutoModelForAudioClassification.from_pretrained(model_id)

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True, return_attention_mask=True)
id2label = model.config.id2label

def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the file and resample it to the rate the feature extractor expects.
    audio_array, _ = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

    # Truncate or zero-pad so every clip is exactly max_duration seconds long.
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

    # Convert the waveform into normalized PyTorch tensors for the model.
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    return inputs

def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)

    # Run on the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the class with the highest logit and map its ID back to a label.
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]

    return predicted_label

audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"

predicted_emotion = predict_emotion(audio_path, model, feature_extractor, id2label)
print(f"Predicted Emotion: {predicted_emotion}")
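
The quick-start function returns only the top label. If a ranked list with probabilities is more useful, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python; `rank_emotions`, the sample logits, and the sample label names are hypothetical and only illustrate the idea. With the real model, the logits would come from `outputs.logits[0].tolist()` and the mapping from `model.config.id2label`.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_emotions(logits, id2label):
    # Return (label, probability) pairs sorted from most to least likely.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]

# Hypothetical label names and logits, for illustration only.
sample_id2label = dict(enumerate(
    ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
))
sample_logits = [0.2, -1.0, 0.1, 2.5, 0.3, -0.4, 0.0]

ranked = rank_emotions(sample_logits, sample_id2label)
for label, prob in ranked[:3]:
    print(f"{label}: {prob:.3f}")
```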

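✨ Key Features

Uses the Wav2Vec2 model for speech emotion recognition.
Classifies audio into multiple emotion categories.
Trained and evaluated on several public datasets.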

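📚 Documentation

🗂 Dataset

The training and evaluation data comes from several public datasets, including RAVDESS, SAVEE, TESS, and URDU.

The datasets contain recordings labeled with various emotions. The distribution of emotions is as follows:

Emotion	Count
Sad	752
Happy	752
Angry	752
Neutral	716
Disgust	652
Fear	652
Surprise	652
Calm	192

Seven of the classes are roughly balanced, with between 652 and 752 samples each. The calm class has only 192 samples, so it was excluded from training.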
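
As a quick check of the numbers above, the following sketch computes the size and class shares of the data that remains once calm is dropped. The counts are copied from the table; the English label strings and variable names are illustrative assumptions.

```python
# Sample counts copied from the distribution table.
counts = {
    "sad": 752, "happy": 752, "angry": 752, "neutral": 716,
    "disgust": 652, "fear": 652, "surprise": 652, "calm": 192,
}

# Drop the class that was excluded from training.
kept = {label: n for label, n in counts.items() if label != "calm"}

total_all = sum(counts.values())
total_kept = sum(kept.values())
shares = {label: n / total_kept for label, n in kept.items()}

# Ratio between the largest and smallest remaining class.
imbalance = max(kept.values()) / min(kept.values())

print(f"samples with calm: {total_all}, without calm: {total_kept}")
print(f"largest/smallest class ratio: {imbalance:.3f}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```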

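🎤 Preprocessing

Audio loading: audio files are loaded with Librosa and converted to numpy arrays.
Feature extraction: the Wav2Vec2 feature extractor processes the audio and normalizes it so that it can be fed to the model.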
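
To make the normalization step concrete: with `do_normalize=True`, the Wav2Vec2 feature extractor is understood to rescale each waveform to zero mean and unit variance. The sketch below reproduces that idea with numpy only. The function name, the `eps` value, and the synthetic one-second 16 kHz signal are assumptions for illustration, not the library's code.

```python
import numpy as np

def zero_mean_unit_var(waveform, eps=1e-7):
    # Shift the signal to zero mean and scale it to unit variance.
    # eps guards against division by zero on silent clips.
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)

# Synthetic stand-in for one second of audio at 16 kHz (assumed rate).
rng = np.random.default_rng(0)
raw = 0.3 * rng.standard_normal(16000) + 0.05

normalized = zero_mean_unit_var(raw)
print(f"before: mean={raw.mean():.4f}, std={raw.std():.4f}")
print(f"after:  mean={normalized.mean():.4f}, std={normalized.std():.4f}")
```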

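🔧 Model

The base model is Wav2Vec2 Large XLSR-53, fine-tuned for audio classification:

Model: facebook/wav2vec2-large-xlsr-53
Output: emotion labels ('angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise'). The labels are mapped to integer IDs, which are used for training and evaluation.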
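
A label-to-ID mapping of this kind can be sketched as follows. The English label strings and their order are assumptions made for illustration; the mapping the model actually uses is stored in `model.config.id2label` and should be read from there.

```python
# Assumed label names and order, for illustration only.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

# Forward mapping used when encoding training targets.
label2id = {label: idx for idx, label in enumerate(EMOTIONS)}

# Reverse mapping used when decoding model predictions.
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)
print(id2label[label2id["happy"]])
```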

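⚙️ Training

The model was trained with the following hyperparameters:

Learning rate: 5e-05
Training batch size: 2
Evaluation batch size: 2
Random seed: 42
Gradient accumulation steps: 5
Total training batch size: 10 (effective batch size after gradient accumulation)
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler: linear
Learning rate scheduler warmup ratio: 0.1
Number of epochs: 25
Mixed-precision training: native AMP (automatic mixed precision)

These settings are intended to keep training efficient and stable for a large, deep model such as Wav2Vec2. Experiments were tracked and monitored with Wandb.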
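
The sketch below works through how these settings combine: the effective batch size, the number of warmup steps, and the learning rate under a linear schedule with warmup. The total of 9850 optimizer steps is taken from the last row of the results table. The function name and the use of `round` for the warmup step count are assumptions, so this approximates the schedule rather than reproducing the training code.

```python
LEARNING_RATE = 5e-5
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 5
WARMUP_RATIO = 0.1
TOTAL_STEPS = 9850  # final optimizer step reported in the results table

# Gradients from 5 batches of 2 are accumulated before each update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Number of steps over which the learning rate ramps up from zero.
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)

def linear_lr(step):
    # Linear ramp from 0 to LEARNING_RATE, then linear decay back to 0.
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)

print(f"effective batch size: {effective_batch}")
print(f"warmup steps: {warmup_steps}")
for step in (0, 500, warmup_steps, 5000, TOTAL_STEPS):
    print(f"step {step}: lr = {linear_lr(step):.2e}")
```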

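📊 Metrics

The evaluation metrics obtained after training are:

Loss: 0.4989
Accuracy: 0.9168
Precision: 0.9209
Recall: 0.9168
F1 score: 0.9166

These values match the evaluation at step 9460 in the results table, which has the highest accuracy of all logged evaluations. Accuracy, precision, recall, and F1 all above 0.91 indicate that the model identifies emotional states from speech reliably on this test set.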
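
Accuracy and recall are identical here (0.9168), which is what support-weighted averaging produces. The source does not state the averaging mode, so this is an inference. The following sketch, written in plain Python with a hypothetical `weighted_metrics` helper and made-up toy labels, shows how weighted precision, recall, and F1 can be computed and why weighted recall equals accuracy.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    # Per-class precision, recall, and F1, averaged by class support.
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual if actual else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = actual / n
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels, for illustration only.
y_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
y_pred = ["happy", "sad", "sad", "sad", "angry", "happy"]

metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```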

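🧪 Results

After training, the model was evaluated on the test dataset, and the results were tracked with Wandb.

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1 Score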
1.9343	0.9995	394	1.9277	0.2505	0.1425	0.2505	0.1691
1.7944	1.9990	788	1.6446	0.4574	0.5759	0.4574	0.4213
1.4601	2.9985	1182	1.3242	0.5953	0.6183	0.5953	0.5709
1.0551	3.9980	1576	1.0764	0.6623	0.6659	0.6623	0.6447
0.8934	5.0	1971	0.9209	0.7059	0.7172	0.7059	0.6825
1.1156	5.9995	2365	0.8292	0.7465	0.7635	0.7465	0.7442
0.6307	6.9990	2759	0.6439	0.8043	0.8090	0.8043	0.8020
0.774	7.9985	3153	0.6666	0.7921	0.8117	0.7921	0.7916
0.5537	8.9980	3547	0.5111	0.8245	0.8268	0.8245	0.8205
0.3762	10.0	3942	0.5506	0.8306	0.8390	0.8306	0.8296
0.716	10.9995	4336	0.5499	0.8276	0.8465	0.8276	0.8268
0.5372	11.9990	4730	0.5463	0.8377	0.8606	0.8377	0.8404
0.3746	12.9985	5124	0.4758	0.8611	0.8714	0.8611	0.8597
0.4317	13.9980	5518	0.4438	0.8742	0.8843	0.8742	0.8756
0.2104	15.0	5913	0.4426	0.8803	0.8864	0.8803	0.8806
0.3193	15.9995	6307	0.4741	0.8671	0.8751	0.8671	0.8683
0.3445	16.9990	6701	0.3850	0.9037	0.9047	0.9037	0.9038
0.2777	17.9985	7095	0.4802	0.8834	0.8923	0.8834	0.8836
0.4406	18.9980	7489	0.4053	0.9047	0.9096	0.9047	0.9043
0.1707	20.0	7884	0.4434	0.9067	0.9129	0.9067	0.9069
0.2138	20.9995	8278	0.5051	0.9037	0.9155	0.9037	0.9053
0.1812	21.9990	8672	0.4238	0.8955	0.9007	0.8955	0.8953
0.3639	22.9985	9066	0.4021	0.9138	0.9182	0.9138	0.9143
0.3193	23.9980	9460	0.4989	0.9168	0.9209	0.9168	0.9166
0.2067	24.9873	9850	0.4959	0.8976	0.9032	0.8976	0.8975
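
The last rows of the table show that the evaluation with the highest accuracy is not the one with the lowest validation loss. The sketch below picks the best row under each criterion. The tuples are copied from the last nine rows of the table; the variable names are illustrative.

```python
# (step, validation_loss, accuracy) copied from the last nine table rows.
rows = [
    (6701, 0.3850, 0.9037),
    (7095, 0.4802, 0.8834),
    (7489, 0.4053, 0.9047),
    (7884, 0.4434, 0.9067),
    (8278, 0.5051, 0.9037),
    (8672, 0.4238, 0.8955),
    (9066, 0.4021, 0.9138),
    (9460, 0.4989, 0.9168),
    (9850, 0.4959, 0.8976),
]

# Select the best evaluation under each criterion.
best_by_accuracy = max(rows, key=lambda row: row[2])
best_by_loss = min(rows, key=lambda row: row[1])

print(f"highest accuracy: step {best_by_accuracy[0]} ({best_by_accuracy[2]})")
print(f"lowest validation loss: step {best_by_loss[0]} ({best_by_loss[1]})")
```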