xlm-roberta-large-tydip: an open-source multilingual politeness classifier evaluated on 10 languages

Xlm Roberta Large Tydip

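Developed by Genius1237

A multilingual politeness classification model based on the xlm-roberta-large architecture. It is fine-tuned on the English subset of the TyDiP dataset and has been evaluated on 10 languages.

Text classification

Transformers

Multilingual | License: MIT | #MultilingualPolitenessAnalysis #CrossLingualTextClassification #HighAccuracyXLMR

Downloads: 929

Released: 4/20/2023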

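Model overview

The model classifies text as polite or impolite. It targets multilingual use and was evaluated on English and 9 other languages, with test accuracy ranging from 0.668 (Russian) to 0.892 (English).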

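Model features

Multilingual support

Classifies politeness in 10 languages, including languages written in non-Latin scripts such as Hindi and Korean.

High accuracy

Reaches 0.892 accuracy on the English test set; accuracy on the other nine languages ranges from 0.668 to 0.868.

Cross-lingual ability

Built on XLM-R (XLMR), which transfers well across languages, so the model may also work for languages beyond the 10 evaluated.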

Model capabilities

Multilingual text classification

Politeness classification

Cross-lingual transfer learning

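Use cases

Social media analysis

Comment politeness screening

Automatically identify how polite social media comments are

Can help filter out impolite content

Customer service systems

Quality monitoring of customer service replies

Assess the politeness of customer service replies

Improve customer service quality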

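🚀 Multilingual politeness classification model

This model is based on xlm-roberta-large and fine-tuned on the English subset of the TyDiP dataset; see the original paper (Srinivasan and Choi, 2022), cited at the end of this document, for details. It can be used for text classification, labelling text in multiple languages as polite or impolite.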

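🚀 Quick start

The model is fine-tuned from xlm-roberta-large on the English subset of the TyDiP dataset and can be used for politeness classification across languages.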

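✨ Key features

Multilingual support: In the paper, the model was evaluated on English and 9 other languages (Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, and Hungarian). Given the model's performance on these languages and XLM-R's cross-lingual ability, the fine-tuned model is likely to work for more languages as well.
Strong base model: Fine-tuned from xlm-roberta-large, so it draws on the language knowledge acquired during pretraining.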

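📦 Installation

The model card does not give specific installation steps. The model is used through the transformers library, so make sure it is installed (the advanced example further down also imports torch):

pip install transformers torch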

💻 Usage examples

Basic usage

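from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub as a text-classification pipeline
classifier = pipeline(task="text-classification", model="Genius1237/xlm-roberta-large-tydip")

# One English request and one Hindi request (the Hindi one starts in Latin script)
sentences = ["Could you please get me a glass of water", "mere liye पानी का एक गिलास ले आओ "]

# The pipeline returns one dict with 'label' and 'score' per input sentence
print(classifier(sentences))
# [{'label': 'polite', 'score': 0.9076159000396729}, {'label': 'impolite', 'score': 0.765066385269165}]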
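
The use cases listed earlier (screening comments, monitoring customer service replies) come down to post-processing the list of `{'label', 'score'}` dicts that the pipeline returns. The following is a minimal sketch of that step, not part of the original model card: the helper names `flag_impolite` and `polite_rate` and the `min_score` threshold of 0.7 are assumptions chosen for illustration. The predictions are copied from the example output above so the snippet runs without downloading the model.

```python
from typing import List, Tuple


def flag_impolite(texts: List[str], predictions: List[dict],
                  min_score: float = 0.7) -> List[Tuple[str, float]]:
    """Return (text, score) pairs predicted 'impolite' with score >= min_score."""
    flagged = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == "impolite" and pred["score"] >= min_score:
            flagged.append((text, pred["score"]))
    return flagged


def polite_rate(predictions: List[dict]) -> float:
    """Fraction of predictions labelled 'polite' (0.0 for an empty list)."""
    if not predictions:
        return 0.0
    return sum(p["label"] == "polite" for p in predictions) / len(predictions)


sentences = ["Could you please get me a glass of water", "mere liye पानी का एक गिलास ले आओ "]

# Copied from the example output; in practice this would be classifier(sentences).
predictions = [
    {"label": "polite", "score": 0.9076159000396729},
    {"label": "impolite", "score": 0.765066385269165},
]

print(flag_impolite(sentences, predictions))  # the second sentence is flagged
print(polite_rate(predictions))               # 0.5
```

Raising `min_score` trades recall for precision: with `min_score=0.8` the second sentence would no longer be flagged. A suitable threshold depends on the application and would need to be tuned on labelled data.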

Advanced usage

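from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the sequence-classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('Genius1237/xlm-roberta-large-tydip')
model = AutoModelForSequenceClassification.from_pretrained('Genius1237/xlm-roberta-large-tydip')

text = "Could you please get me a glass of water"
encoded_input = tokenizer(text, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**encoded_input)

# Index of the highest-scoring class for the single input sentence
prediction = torch.argmax(output.logits, dim=-1).item()

# Map the class index back to its label name
print(model.config.id2label[prediction])
# polite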
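
The advanced example prints only the top label. To get a probability for every label, as the pipeline does, a softmax can be applied to the logits. The sketch below is an illustration under stated assumptions rather than part of the original model card: `label_probabilities` is a hypothetical helper, and the logits and label order are made-up values so the snippet runs without the model. With the real model, the inputs would be `output.logits[0].tolist()` and `model.config.id2label`.

```python
import math
from typing import Dict, List


def label_probabilities(logits: List[float], id2label: Dict[int, str]) -> Dict[str, float]:
    """Apply a numerically stable softmax and key the result by label name."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


# Hypothetical values for illustration only; the real label order must be
# read from model.config.id2label.
example_logits = [-1.2, 1.1]
example_id2label = {0: "impolite", 1: "polite"}

probs = label_probabilities(example_logits, example_id2label)
print(max(probs, key=probs.get), probs)
```

With torch tensors the same result is available directly through `torch.softmax(output.logits, dim=-1)`; the pure-Python version is used here only so the example has no dependencies.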

📚 Detailed documentation

Evaluation results

Politeness classification accuracy on the TyDiP test set for the 10 languages:

Language	Accuracy
English (en)	0.892
Hindi (hi)	0.868
Korean (ko)	0.784
Spanish (es)	0.84
Tamil (ta)	0.78
French (fr)	0.82
Vietnamese (vi)	0.844
Russian (ru)	0.668
Afrikaans (af)	0.856
Hungarian (hu)	0.812
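
Because the model is fine-tuned on English only, the nine non-English scores measure zero-shot transfer. The short calculation below summarizes the table. The averages are derived here from the listed figures and are not reported in the model card, and the variable names are chosen for illustration.

```python
# Accuracy per language, copied from the table above.
accuracy = {
    "en": 0.892, "hi": 0.868, "ko": 0.784, "es": 0.84, "ta": 0.78,
    "fr": 0.82, "vi": 0.844, "ru": 0.668, "af": 0.856, "hu": 0.812,
}

# Unweighted mean over all 10 languages.
mean_all = sum(accuracy.values()) / len(accuracy)

# The nine languages the model was not fine-tuned on (zero-shot transfer).
zero_shot = {lang: acc for lang, acc in accuracy.items() if lang != "en"}
mean_zero_shot = sum(zero_shot.values()) / len(zero_shot)

# Weakest zero-shot language.
weakest = min(zero_shot, key=zero_shot.get)

print(f"mean over all 10 languages: {mean_all:.3f}")        # 0.816
print(f"mean over 9 zero-shot languages: {mean_zero_shot:.3f}")  # 0.808
print(f"weakest zero-shot language: {weakest} ({zero_shot[weakest]})")  # ru (0.668)
```

The zero-shot mean sits about 0.08 below the English score, and Russian is the clear outlier, so per-language validation is advisable before relying on the model for a specific language.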

Citation

@inproceedings{srinivasan-choi-2022-tydip,
    title = "{T}y{D}i{P}: A Dataset for Politeness Classification in Nine Typologically Diverse Languages",
    author = "Srinivasan, Anirudh  and
      Choi, Eunsol",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.420",
    doi = "10.18653/v1/2022.findings-emnlp.420",
    pages = "5723--5738",
    abstract = "We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be cultural-specific, yet existing computational linguistic study is limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels {--} they show a fairly robust zero-shot transfer ability, yet fall short of estimated human accuracy significantly. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy{'}s impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.",
}