🚀 Multilingual Politeness Classification Model
This model is based on xlm-roberta-large and fine-tuned on the English subset of the TyDiP dataset (see the original paper, cited below). It can be used for text classification, rating the politeness of text in multiple languages.
✨ Key Features
- Multilingual support: in the paper, the model is evaluated on English plus 9 other languages (Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, Hungarian). Given its strong performance and XLM-R's cross-lingual transfer ability, the fine-tuned model is likely to work on additional languages as well; see the sketch after this list.
- Strong base model: fine-tuned from xlm-roberta-large, taking full advantage of its pretrained multilingual knowledge.
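Languages beyond the ten evaluated in the paper are untested, so predictions there amount to zero-shot transfer. A minimal sketch probing one such language (German, chosen arbitrarily for illustration):

```python
from transformers import pipeline

# Politeness classifier fine-tuned on the English subset of TyDiP
classifier = pipeline(task="text-classification", model="Genius1237/xlm-roberta-large-tydip")

# German is not among the ten evaluated languages; this relies purely on
# XLM-R's cross-lingual transfer, so treat the prediction with care.
print(classifier("Könnten Sie mir bitte ein Glas Wasser bringen?"))
```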
📦 Installation
The original documentation gives no dedicated installation steps; the model is used through the transformers library, so make sure it is installed (the advanced example below additionally needs torch):

```bash
pip install transformers
```
💻 Usage Examples
Basic usage

```python
from transformers import pipeline

classifier = pipeline(task="text-classification", model="Genius1237/xlm-roberta-large-tydip")
sentences = ["Could you please get me a glass of water", "mere liye पानी का एक गिलास ले आओ "]
print(classifier(sentences))
```
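The pipeline returns one {label, score} dict per sentence, keeping only the top class. To see the scores for every class, the standard top_k=None option of the text-classification pipeline can be passed:

```python
# One list of {label, score} dicts per sentence, covering all classes
print(classifier(sentences, top_k=None))
```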
Advanced usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('Genius1237/xlm-roberta-large-tydip')
model = AutoModelForSequenceClassification.from_pretrained('Genius1237/xlm-roberta-large-tydip')

text = "Could you please get me a glass of water"
encoded_input = tokenizer(text, return_tensors='pt')

# Forward pass, then argmax over the logits to get the predicted class id
output = model(**encoded_input)
prediction = torch.argmax(output.logits).item()
print(model.config.id2label[prediction])
```
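When class probabilities are needed rather than just the argmax label, the logits can be normalized with a softmax. A minimal sketch reusing the tokenizer and model loaded above (the second sentence is a made-up impolite contrast for illustration):

```python
sentences = ["Could you please get me a glass of water", "Get me a glass of water now"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():  # inference only, no gradients needed
    logits = model(**batch).logits

probs = torch.softmax(logits, dim=-1)
for sentence, p in zip(sentences, probs):
    scores = {model.config.id2label[i]: round(p[i].item(), 3) for i in range(p.shape[-1])}
    print(sentence, "->", scores)
```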
📚 Documentation
Evaluation results
Politeness classification accuracy on the TyDiP test set for all 10 languages:

| Language | Accuracy |
|----------|----------|
| English (en) | 0.892 |
| Hindi (hi) | 0.868 |
| Korean (ko) | 0.784 |
| Spanish (es) | 0.840 |
| Tamil (ta) | 0.780 |
| French (fr) | 0.820 |
| Vietnamese (vi) | 0.844 |
| Russian (ru) | 0.668 |
| Afrikaans (af) | 0.856 |
| Hungarian (hu) | 0.812 |
Citation

```bibtex
@inproceedings{srinivasan-choi-2022-tydip,
    title = "{T}y{D}i{P}: A Dataset for Politeness Classification in Nine Typologically Diverse Languages",
    author = "Srinivasan, Anirudh  and
      Choi, Eunsol",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.420",
    doi = "10.18653/v1/2022.findings-emnlp.420",
    pages = "5723--5738",
    abstract = "We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be cultural-specific, yet existing computational linguistic study is limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels {--} they show a fairly robust zero-shot transfer ability, yet fall short of estimated human accuracy significantly. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy{'}s impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.",
}
```
📄 License
This project is released under the MIT License.