xlm-roberta-large-it-mnli open-source model: Italian zero-shot text classification with multilingual support

Xlm Roberta Large It Mnli

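Developed by Jiva

An Italian zero-shot classification model fine-tuned from xlm-roberta-large, with support for multilingual text classification

Text classification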

Transformers

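License: MIT (other open-source license) #Italian zero-shot classification #multilingual NLI #fine-tuned on machine-translated data

Downloads: 937

Released: 3/2/2022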

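Model overview

The model is fine-tuned on an Italian subset of the MNLI corpus obtained by machine translation. It targets zero-shot classification of Italian text and can also be used for classification in other languages.

Model features

Multilingual support

Built on XLM-RoBERTa-large, which was pre-trained on 100 languages, so it can also classify text in languages other than Italian

Zero-shot classification

Classifies text into new categories without domain-specific training

Multi-label classification

Can assign several relevant labels to the same text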

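Model capabilities

Italian text classification

Cross-lingual text classification

Multi-label classification

Natural language inference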

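Use cases

Text classification

Historical text classification

Classifies history-related text and identifies its topic

Distinguishes categories such as war and history

Geographic text classification

Classifies geography-related text

Identifies geography-related content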

🚀 XLM-roBERTa-large-it-mnli

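This model is based on xlm-roberta-large and fine-tuned on a subset of NLI data taken from a machine-translated version of the MNLI corpus. It is mainly intended for zero-shot text classification and can classify text in Italian as well as other languages.

🚀 Quick Start

Loading the model with the zero-shot classification pipeline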

from transformers import pipeline
# device=0 selects the first GPU; use device=-1 to run on CPU
classifier = pipeline("zero-shot-classification",
                      model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True, multi_label=True)

Classification example

# We will classify the following Wikipedia entry about Sardinia
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
# The candidate labels can be given in Italian
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
classifier(sequence_to_classify, candidate_labels)
# {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
# 'scores': [0.38871392607688904, 0.22633370757102966, 0.19398456811904907, 0.13735772669315338, 0.13708525896072388]}

Specifying a hypothesis template

sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
# Each candidate label replaces the {} placeholder to form one hypothesis
hypothesis_template = "si parla di {}"
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
# 'scores': [0.6068345904350281, 0.34715887904167175, 0.32433947920799255, 0.3068877160549164, 0.18744681775569916]}
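To make the role of the template concrete, the sketch below shows how each candidate label is substituted into the template to produce one NLI hypothesis per label. The helper name `build_hypotheses` is hypothetical and only mirrors the string-formatting step; the actual pipeline also tokenizes and batches the premise/hypothesis pairs.

```python
def build_hypotheses(candidate_labels, hypothesis_template="si parla di {}"):
    """Return one hypothesis string per candidate label (hypothetical helper)."""
    return [hypothesis_template.format(label) for label in candidate_labels]


candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
hypotheses = build_hypotheses(candidate_labels)
for hypothesis in hypotheses:
    print(hypothesis)
```

Each printed string is paired with the sequence to classify, which acts as the premise.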

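Using PyTorch manually

# Pose the sequence as the NLI premise and the label as the hypothesis
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nli_model = AutoModelForSequenceClassification.from_pretrained('Jiva/xlm-roberta-large-it-mnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('Jiva/xlm-roberta-large-it-mnli')
premise = sequence_to_classify  # the text defined in the examples above
label = 'geografia'             # one candidate label at a time
hypothesis = f'si parla di {label}.'
# Run the pair through the model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with torch.no_grad():
    logits = nli_model(x.to(device))[0]
# Drop "neutral" (dim 1) and take the probability of "entailment" (2)
# as the probability that the label is true
# (nli_model.config.id2label shows the index order actually used by the checkpoint)
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]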
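The scoring step can be illustrated without loading the model. The sketch below uses made-up logits and assumes the index order given in the manual example (0 = contradiction, 1 = neutral, 2 = entailment). It contrasts two normalizations: scoring each label independently, as in multi-label mode, and scoring all labels jointly so that the scores sum to 1. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_scores(logits_per_label, entail_idx=2, contra_idx=0):
    """Score each label on its own: softmax over (contradiction, entailment)."""
    return [softmax([row[contra_idx], row[entail_idx]])[1]
            for row in logits_per_label]


def single_label_scores(logits_per_label, entail_idx=2):
    """Score labels jointly: softmax over the entailment logits of all labels."""
    return softmax([row[entail_idx] for row in logits_per_label])


# Hypothetical logits, one row per candidate label:
# [contradiction, neutral, entailment]
example_logits = [[-1.0, 0.2, 2.0], [1.5, 0.1, -0.5], [0.0, 0.0, 0.0]]
independent = multi_label_scores(example_logits)
joint = single_label_scores(example_logits)
print(independent)
print(joint)
```

With independent scoring, several labels can score high at the same time, which is why multi-label scores need not sum to 1.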

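✨ Key Features

Multilingual support: the underlying xlm-roberta-large model was pre-trained on 100 different languages, so the model also shows some effectiveness on zero-shot text classification in languages other than Italian.
Zero-shot classification: can be used for zero-shot text classification without large amounts of task-specific labeled training data.
Fine-tuning: fine-tuned on a subset of NLI data taken from a machine-translated version of the MNLI corpus, which improves performance on Italian tasks.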

📦 Installation

The documentation gives no specific installation steps; the standard installation of the Hugging Face Transformers library applies:

pip install transformers

💻 Usage Examples

Basic usage

from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True, multi_label=True)
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
result = classifier(sequence_to_classify, candidate_labels)
print(result)

Advanced usage

# Specify a hypothesis template
sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
hypothesis_template = "si parla di {}"
result = classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
print(result)
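Because multi-label scores are independent per label, a common follow-up is to keep only the labels above a cutoff. The sketch below is one way to do that on the result dictionary; the helper name `labels_above` and the threshold of 0.2 are illustrative assumptions, and the dictionary is copied from the classification example output shown earlier.

```python
def labels_above(result, threshold=0.2):
    """Return (label, score) pairs whose score is at least `threshold`."""
    return [(label, score)
            for label, score in zip(result["labels"], result["scores"])
            if score >= threshold]


# Output copied from the classification example earlier in this document
result = {
    "labels": ["geografia", "moda", "politica", "macchine", "cibo"],
    "scores": [0.38871392607688904, 0.22633370757102966, 0.19398456811904907,
               0.13735772669315338, 0.13708525896072388],
}
selected = labels_above(result, threshold=0.2)
print(selected)
```

A suitable threshold depends on the label set and the hypothesis template, so it is best tuned on a few known examples.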

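📚 Documentation

Model description

The model takes xlm-roberta-large as its base and is fine-tuned on a subset of NLI data drawn from a machine-translated version of the MNLI corpus. It is intended for zero-shot text classification, for example with the Hugging Face ZeroShotClassificationPipeline.

Intended use

This model is intended for zero-shot classification of Italian text. Because the base model was pre-trained on 100 different languages, it has also shown some effectiveness in other languages. For the full list of pre-training languages, see Appendix A of the XLM-RoBERTa paper. For English-only classification, bart-large-mnli or a distilled BART MNLI model is recommended instead.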

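🔧 Technical Details

Version 0.1

The model has now been retrained on the full training set. About 1,000 sentence pairs were removed from the dataset because the translation model had mistranslated them.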

Metric	Value
Learning rate	4e-6
Optimizer	AdamW
Batch size	80
MCC	0.77
Training loss	0.34
Evaluation loss	0.40
Stopping step	9754
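For readers unfamiliar with MCC (Matthews correlation coefficient), the sketch below computes its standard multi-class generalization in plain Python. The source does not say how the reported value was computed, so this is only an illustration of the metric, and the labels in the example are made up.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient generalized to multiple classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    # Return 0.0 when the coefficient is undefined (a constant column)
    return cov_tp / denom if denom else 0.0


# Hypothetical NLI labels, for illustration only
y_true = ["entailment", "neutral", "contradiction",
          "entailment", "neutral", "contradiction"]
y_pred = ["entailment", "neutral", "contradiction",
          "neutral", "neutral", "contradiction"]
score = multiclass_mcc(y_true, y_pred)
print(round(score, 3))
```

The coefficient ranges from -1 to 1, with 1 for perfect agreement and 0 for predictions no better than chance.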

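Version 0.0

The base model was pre-trained on a dataset covering 100 languages, as described in the original paper. It was then fine-tuned for the NLI task on an Italian translation of the MNLI dataset (only 85% of the training set had been used at that point). The text was translated with Helsinki-NLP/opus-mt-en-it, with a maximum output sequence length of 120. The model was trained for 1 epoch with a learning rate of 4e-6 and a batch size of 80, and reached 82% accuracy on the remaining 15% of the training set.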