ruRoberta-large-ru-go-emotions open-source model: detects 27 emotion types in Russian text

Ruroberta Large Ru Go Emotions

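Developed by fyaronskiy

A multi-label emotion classification model fine-tuned from ruRoberta-large. It detects 27 emotion types in Russian text and is described as the best-performing open-source emotion detection model for Russian to date.

Task: text classification

Library: Transformers

License: MIT #Russian multi-emotion detection #high-accuracy sentiment analysis #multi-label classification

Downloads: 813

Released: 8/19/2024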

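Model overview

The model is fine-tuned on the ru_go_emotions dataset for multi-label emotion classification of Russian text. It recognizes 27 emotion types, such as admiration, anger, and joy, plus a neutral label (28 output labels in total).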

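Model features

Multi-label emotion classification

Detects several emotions in the same text at once instead of assigning a single emotion.

Per-label threshold optimization

A separate decision threshold is tuned for each emotion class on the validation set to maximize the macro F1 score.

Strong performance

Reported as the best result among current open-source models for Russian emotion detection (macro F1 of 0.48).

ONNX support

ONNX and INT8-quantized versions are provided, with inference up to 2.5x faster.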
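To illustrate why a multi-label model can report several emotions at once, the sketch below contrasts independent sigmoid scores with a softmax over the same logits. The logit values are hypothetical and chosen only for illustration; they are not taken from the model.

```python
import math

def sigmoid(x):
    """Independent probability for one label."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Competing probabilities that always sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three labels (illustrative values only).
logits = {"admiration": 1.45, "love": 0.15, "anger": -6.2}

# Multi-label: each label is scored on its own, so two labels can both be likely.
independent = {label: sigmoid(value) for label, value in logits.items()}

# Single-label: labels compete for one unit of probability mass.
competing = dict(zip(logits, softmax(list(logits.values()))))

print("sigmoid:", {k: round(v, 3) for k, v in independent.items()})
print("softmax:", {k: round(v, 3) for k, v in competing.items()})
print("sum of sigmoid scores:", round(sum(independent.values()), 3))
```

With sigmoid scoring the per-label probabilities need not sum to 1, which is why each label can be compared against its own threshold.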

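Model capabilities

Emotion analysis of Russian text

Multi-label emotion detection

Emotion probability prediction

Emotion intensity estimation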

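Use cases

Social media analysis

Emotion analysis of user comments

Analyze the emotional tone of user comments on social media.

Identifies the dominant emotions in a comment, such as joy or anger.

Customer service

Emotion detection in customer feedback

Automatically analyze the emotional state expressed in customer feedback.

Helps separate dissatisfied customers (anger, disappointment) from satisfied ones (gratitude, joy).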
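The customer-feedback use case can be sketched as a small post-processing step over predicted label lists. The grouping of labels into "dissatisfied" and "satisfied", the helper names `triage` and `summarize`, and the sample predictions are all assumptions made for this example; in practice the label lists would come from a function such as `predict_emotions`.

```python
from collections import Counter

# Assumed grouping, following the emotions named in the use case above.
NEGATIVE = {"anger", "disappointment"}
POSITIVE = {"gratitude", "joy"}

def triage(labels):
    """Map one text's predicted labels to a coarse customer state."""
    found = set(labels)
    if found & NEGATIVE:      # negative signals take priority in this sketch
        return "dissatisfied"
    if found & POSITIVE:
        return "satisfied"
    return "other"

def summarize(predictions):
    """Count customer states over many texts."""
    return Counter(triage(labels) for labels in predictions)

# Hypothetical model outputs for four feedback messages.
predictions = [
    ["anger", "annoyance"],
    ["gratitude"],
    ["neutral"],
    ["joy", "disappointment"],
]
print(summarize(predictions))
```

Giving negative labels priority is a design choice for this sketch: a message that mixes joy and disappointment is routed to the dissatisfied bucket so that it is not overlooked.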

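🚀 ruRoberta-large-ru-go-emotions emotion classification model

An open-source model for Russian that detects 27 emotion types in text.

🚀 Quick start

This is a multi-label classification model fine-tuned from ruRoberta-large on the ru_go_emotions dataset. It can be used to extract all emotions from a text or to detect specific emotions. The thresholds were selected on the validation set by maximizing the macro F1 score across all labels.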
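The threshold selection described above can be sketched as a per-label grid search. This is a minimal, dependency-free illustration and not the author's actual procedure: the function names and the toy validation data are hypothetical, and the 50-point grid is an assumption (the published thresholds are all multiples of 1/49, which is consistent with such a grid). Because macro F1 is the mean of the per-label F1 scores and each label's F1 depends only on its own threshold, tuning each label independently also maximizes macro F1.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 lists; returns 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probas, y_true, grid):
    """Return the first grid value that gives the highest F1 for one label."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p > t else 0 for p in probas]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

def search_thresholds(val_probas, val_labels, n_points=50):
    """Tune one threshold per label on validation probabilities."""
    grid = [i / (n_points - 1) for i in range(n_points)]  # assumed grid on [0, 1]
    n_labels = len(val_probas[0])
    return [
        best_threshold(
            [row[j] for row in val_probas],
            [row[j] for row in val_labels],
            grid,
        )
        for j in range(n_labels)
    ]

# Toy validation set: 4 texts, 2 labels (hypothetical numbers).
val_probas = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.6], [0.1, 0.05]]
val_labels = [[1, 0], [1, 0], [0, 1], [0, 0]]
print(search_thresholds(val_probas, val_labels))
```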

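✨ Key features

Multi-emotion detection: detects 27 different emotion types.
Multiple versions: a full-precision ONNX model and an INT8-quantized model are provided, which speed up inference while keeping the same F1 score.
Evaluation: per-label precision, recall, and F1 are reported on the test set, with a macro F1 of 0.48.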

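📦 Installation

To use this model, install the transformers library together with PyTorch, which the examples below import as torch:

pip install transformers torch

💻 Usage examples

Basic usage

The following basic example loads the model with the Hugging Face Transformers library: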

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("fyaronskiy/ruRoberta-large-ru-go-emotions")
model = AutoModelForSequenceClassification.from_pretrained("fyaronskiy/ruRoberta-large-ru-go-emotions")

best_thresholds = [0.36734693877551017, 0.2857142857142857, 0.2857142857142857, 0.16326530612244897, 0.14285714285714285, 0.14285714285714285, 0.18367346938775508, 0.3469387755102041, 0.32653061224489793, 0.22448979591836732, 0.2040816326530612, 0.2857142857142857, 0.18367346938775508, 0.2857142857142857, 0.24489795918367346, 0.7142857142857142, 0.02040816326530612, 0.3061224489795918, 0.44897959183673464, 0.061224489795918366, 0.18367346938775508, 0.04081632653061224, 0.08163265306122448, 0.1020408163265306, 0.22448979591836732, 0.3877551020408163, 0.3469387755102041, 0.24489795918367346]
LABELS = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
ID2LABEL = dict(enumerate(LABELS))

Advanced usage

Extract the emotions in a text

def predict_emotions(text):
    # Tokenize the text, truncating to at most 128 tokens
    inputs = tokenizer(text, truncation=True, add_special_tokens=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    # Independent per-label probabilities (multi-label, so sigmoid rather than softmax)
    probas = torch.sigmoid(logits).squeeze(dim=0)
    # Keep every label whose probability exceeds its own tuned threshold
    class_binary_labels = (probas > torch.tensor(best_thresholds)).int()
    return [ID2LABEL[label_id] for label_id, value in enumerate(class_binary_labels) if value == 1]

print(predict_emotions('У вас отличный сервис и лучший кофе в городе, обожаю вашу кофейню!'))

# ['admiration', 'love']

Get all emotions with their scores

def predict(text):
    inputs = tokenizer(text, truncation=True, add_special_tokens=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    probas = torch.sigmoid(logits).squeeze(dim=0).tolist()
    probas = [round(proba, 3) for proba in probas]    
    
    labels2probas = dict(zip(LABELS, probas))
    probas_dict_sorted = dict(sorted(labels2probas.items(), key=lambda x: x[1], reverse=True))
    return probas_dict_sorted

print(predict('У вас отличный сервис и лучший кофе в городе, обожаю вашу кофейню!'))
'''{'admiration': 0.81,
 'love': 0.538,
 'joy': 0.041,
 'gratitude': 0.031,
 'approval': 0.026,
 'excitement': 0.023,
 'neutral': 0.009,
 'curiosity': 0.006,
 'amusement': 0.005,
 'desire': 0.005,
 'realization': 0.005,
 'caring': 0.004,
 'confusion': 0.004,
 'surprise': 0.004,
 'disappointment': 0.003,
 'disapproval': 0.003,
 'anger': 0.002,
 'annoyance': 0.002,
 'disgust': 0.002,
 'fear': 0.002,
 'grief': 0.002,
 'optimism': 0.002,
 'pride': 0.002,
 'relief': 0.002,
 'sadness': 0.002,
 'embarrassment': 0.001,
 'nervousness': 0.001,
 'remorse': 0.001}
'''
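As a worked check, applying the per-label thresholds to the scores printed above reproduces the `['admiration', 'love']` result, and the same comparison can be used to test for one specific emotion. This is a sketch: `select_labels` and `has_emotion` are hypothetical helper names, and the thresholds are the two-decimal roundings listed in the evaluation table rather than the full-precision values.

```python
LABELS = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
          'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
          'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief',
          'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization',
          'relief', 'remorse', 'sadness', 'surprise', 'neutral']

# Thresholds rounded to two decimals, in the same order as LABELS.
ROUNDED_THRESHOLDS = [0.37, 0.29, 0.29, 0.16, 0.14, 0.14, 0.18, 0.35, 0.33, 0.22,
                      0.20, 0.29, 0.18, 0.29, 0.24, 0.71, 0.02, 0.31, 0.45, 0.06,
                      0.18, 0.04, 0.08, 0.10, 0.22, 0.39, 0.35, 0.24]
THRESHOLD_BY_LABEL = dict(zip(LABELS, ROUNDED_THRESHOLDS))

# Scores copied from the example output above.
scores = {
    'admiration': 0.81, 'love': 0.538, 'joy': 0.041, 'gratitude': 0.031,
    'approval': 0.026, 'excitement': 0.023, 'neutral': 0.009, 'curiosity': 0.006,
    'amusement': 0.005, 'desire': 0.005, 'realization': 0.005, 'caring': 0.004,
    'confusion': 0.004, 'surprise': 0.004, 'disappointment': 0.003,
    'disapproval': 0.003, 'anger': 0.002, 'annoyance': 0.002, 'disgust': 0.002,
    'fear': 0.002, 'grief': 0.002, 'optimism': 0.002, 'pride': 0.002,
    'relief': 0.002, 'sadness': 0.002, 'embarrassment': 0.001,
    'nervousness': 0.001, 'remorse': 0.001,
}

def has_emotion(scores, emotion):
    """True when the score for one emotion exceeds that emotion's threshold."""
    return scores[emotion] > THRESHOLD_BY_LABEL[emotion]

def select_labels(scores):
    """All labels above their thresholds, in LABELS order."""
    return [label for label in LABELS if has_emotion(scores, label)]

print(select_labels(scores))        # ['admiration', 'love']
print(has_emotion(scores, 'joy'))   # False: 0.041 is below the 0.31 threshold
```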

📚 Detailed documentation

Model evaluation results

Evaluation results on the ru_go_emotions test set:

Emotion	Precision	Recall	F1 score	Support	Threshold
admiration	0.63	0.75	0.69	504	0.37
amusement	0.76	0.91	0.83	264	0.29
anger	0.47	0.32	0.38	198	0.29
annoyance	0.33	0.39	0.36	320	0.16
approval	0.27	0.58	0.37	351	0.14
caring	0.32	0.59	0.41	135	0.14
confusion	0.41	0.52	0.46	153	0.18
curiosity	0.45	0.73	0.55	284	0.35
desire	0.54	0.31	0.40	83	0.33
disappointment	0.31	0.34	0.33	151	0.22
disapproval	0.31	0.57	0.40	267	0.20
disgust	0.44	0.40	0.42	123	0.29
embarrassment	0.48	0.38	0.42	37	0.18
excitement	0.29	0.43	0.34	103	0.29
fear	0.56	0.78	0.65	78	0.24
gratitude	0.95	0.85	0.89	352	0.71
grief	0.03	0.33	0.05	6	0.02
joy	0.48	0.58	0.53	161	0.31
love	0.73	0.84	0.78	238	0.45
nervousness	0.24	0.48	0.32	23	0.06
optimism	0.57	0.54	0.56	186	0.18
pride	0.67	0.38	0.48	16	0.04
realization	0.18	0.31	0.23	145	0.08
relief	0.30	0.27	0.29	11	0.10
remorse	0.53	0.84	0.65	56	0.22
sadness	0.56	0.53	0.55	156	0.39
surprise	0.55	0.57	0.56	141	0.35
neutral	0.59	0.79	0.68	1787	0.24
micro avg	0.50	0.66	0.57	6329
macro avg	0.46	0.55	0.48	6329
weighted avg	0.53	0.66	0.58	6329
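The summary rows can be re-derived from the per-label rows, which helps when reading the table. The sketch below uses the rounded F1 and support values printed above, so the results agree with the table only up to rounding: the macro and weighted averages come out at about 0.485 and 0.58. The micro average is pooled over all labels and cannot be rebuilt from per-label F1 alone, so it is checked from the micro precision and recall instead.

```python
# (label, F1, support) copied from the table above.
ROWS = [
    ("admiration", 0.69, 504), ("amusement", 0.83, 264), ("anger", 0.38, 198),
    ("annoyance", 0.36, 320), ("approval", 0.37, 351), ("caring", 0.41, 135),
    ("confusion", 0.46, 153), ("curiosity", 0.55, 284), ("desire", 0.40, 83),
    ("disappointment", 0.33, 151), ("disapproval", 0.40, 267), ("disgust", 0.42, 123),
    ("embarrassment", 0.42, 37), ("excitement", 0.34, 103), ("fear", 0.65, 78),
    ("gratitude", 0.89, 352), ("grief", 0.05, 6), ("joy", 0.53, 161),
    ("love", 0.78, 238), ("nervousness", 0.32, 23), ("optimism", 0.56, 186),
    ("pride", 0.48, 16), ("realization", 0.23, 145), ("relief", 0.29, 11),
    ("remorse", 0.65, 56), ("sadness", 0.55, 156), ("surprise", 0.56, 141),
    ("neutral", 0.68, 1787),
]

# Macro average: every label counts equally, however rare it is.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

# Weighted average: each label's F1 is weighted by its number of test examples.
total_support = sum(n for _, _, n in ROWS)
weighted_f1 = sum(f1 * n for _, f1, n in ROWS) / total_support

# Micro average: harmonic mean of the pooled precision and recall (0.50 and 0.66).
micro_p, micro_r = 0.50, 0.66
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(f"labels={len(ROWS)} support={total_support}")
print(f"macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f} micro F1={micro_f1:.3f}")
```

Rare labels such as grief (6 examples, F1 0.05) pull the macro average down much more than the weighted one, which is dominated by neutral.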

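ONNX and quantized model versions

Full-precision ONNX model (onnx/model.onnx): 1.5x faster than the Transformers model, with the same prediction quality.
INT8-quantized model (onnx/model_quantized.onnx): 2.5x faster than the Transformers model, with almost the same prediction quality.

Inference results on the 5427 samples of the test set:

Model	Size	Macro F1	Speedup	Inference time
Original model	1.4 GB	0.48	1x	44 min 55 s
model.onnx	1.4 GB	0.48	1.5x	29 min 52 s
model_quantized.onnx	0.36 GB	0.48	2.5x	18 min 10 s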
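The speedup column follows directly from the wall-clock times in the table. The short calculation below reproduces it and also derives throughput in samples per second, which the table does not list; the dictionary keys are descriptive names chosen for this example.

```python
N_SAMPLES = 5427  # test-set size stated above

# Inference times from the table as (minutes, seconds).
TIMES = {
    "original": (44, 55),
    "full-precision ONNX": (29, 52),
    "INT8 ONNX": (18, 10),
}

def to_seconds(minutes, seconds):
    return 60 * minutes + seconds

baseline = to_seconds(*TIMES["original"])

for name, (minutes, seconds) in TIMES.items():
    elapsed = to_seconds(minutes, seconds)
    speedup = baseline / elapsed        # relative to the original model
    throughput = N_SAMPLES / elapsed    # samples per second
    print(f"{name}: {elapsed} s, {speedup:.2f}x, {throughput:.2f} samples/s")
```

Rounded to one decimal, the ratios match the 1.5x and 2.5x figures in the table.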

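Using the ONNX model versions

Load the full-precision model

# Requires the optimum package with ONNX Runtime support: pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "fyaronskiy/ruRoberta-large-ru-go-emotions"
file_name = "onnx/model.onnx"
model = ORTModelForSequenceClassification.from_pretrained(model_id, file_name=file_name)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Load the INT8-quantized model

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "fyaronskiy/ruRoberta-large-ru-go-emotions"
file_name = "onnx/model_quantized.onnx"

model = ORTModelForSequenceClassification.from_pretrained(model_id, file_name=file_name)
tokenizer = AutoTokenizer.from_pretrained(model_id)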

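Once the model is loaded, inference with the ONNX model works the same way as with the regular Transformers model:

import torch  # still needed for the sigmoid and the threshold comparison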

best_thresholds = [0.36734693877551017, 0.2857142857142857, 0.2857142857142857, 0.16326530612244897, 0.14285714285714285, 0.14285714285714285, 0.18367346938775508, 0.3469387755102041, 0.32653061224489793, 0.22448979591836732, 0.2040816326530612, 0.2857142857142857, 0.18367346938775508, 0.2857142857142857, 0.24489795918367346, 0.7142857142857142, 0.02040816326530612, 0.3061224489795918, 0.44897959183673464, 0.061224489795918366, 0.18367346938775508, 0.04081632653061224, 0.08163265306122448, 0.1020408163265306, 0.22448979591836732, 0.3877551020408163, 0.3469387755102041, 0.24489795918367346]
LABELS = ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
ID2LABEL = dict(enumerate(LABELS))

def predict_emotions(text):
    inputs = tokenizer(text, truncation=True, add_special_tokens=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    probas = torch.sigmoid(logits).squeeze(dim=0)
    class_binary_labels = (probas > torch.tensor(best_thresholds)).int()
    return [ID2LABEL[label_id] for label_id, value in enumerate(class_binary_labels) if value == 1]

print(predict_emotions('У вас отличный сервис и лучший кофе в городе, обожаю вашу кофейню!'))
# ['admiration', 'love']