Bloomz-3b-nli Open-Source Model - Free English/French Natural Language Inference over Semantic Relations

Bloomz 3b Nli

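Developed by cmarkea

A natural language inference model fine-tuned from Bloomz-3b-chat-dpo that judges semantic relations between sentences in English and French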

Large language model

Transformers

Multilingual · License: Openrail · #zero-shot-classification #multilingual-inference #semantic-relation-recognition

Downloads: 22

Released: 11/28/2023

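Model overview

The model targets natural language inference: it determines the logical relation between two sentences (entailment, contradiction, or neutral) and can also be used for zero-shot classification. It was trained in a language-agnostic way and accepts any combination of English and French inputs.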

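Model features

Mixed bilingual inference

Accepts any combination of English and French inputs and stays accurate in cross-lingual settings (82.12% accuracy with a French hypothesis and an English premise)

Zero-shot classification

Classifies arbitrary text against multiple candidate labels without task-specific training, which suits open-domain use

Long-text understanding

Handles semantic analysis of long, structurally complex text better than traditional NLI models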

Model capabilities

Natural language inference

Cross-lingual text classification

Semantic relation judgment

Zero-shot learning

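Use cases

Sentiment analysis

Movie review sentiment classification

Classifies movie reviews as positive or negative

Reaches 89.06% accuracy on the Allociné dataset

Content classification

Multilingual news classification

Classifies mixed English and French news by topic (e.g. politics, technology, sports)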

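🚀 Bloomz-3b-NLI model

Bloomz-3b-NLI is trained on the natural language inference (NLI) task and is fine-tuned from the Bloomz-3b-chat-dpo base model. It is trained in a language-agnostic way, handles both English and French text, and performs well on zero-shot classification.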

🚀 Quick start

The following example uses the transformers library to run zero-shot classification with Bloomz-3b-NLI:

from transformers import pipeline

classifier = pipeline(
    task='zero-shot-classification',
    model="cmarkea/bloomz-3b-nli"
)
result = classifier(
    sequences="Le style très cinéphile de Quentin Tarantino "
    "se reconnaît entre autres par sa narration postmoderne "
    "et non linéaire, ses dialogues travaillés souvent "
    "émaillés de références à la culture populaire, et ses "
    "scènes hautement esthétiques mais d'une violence "
    "extrême, inspirées de films d'exploitation, d'arts "
    "martiaux ou de western spaghetti.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.8745610117912292,
            0.10403601825237274,
            0.014962797053158283,
            0.0064402492716908455]}

# Robustness in a cross-lingual setting (English text, French labels and template)
result = classifier(
    sequences="Quentin Tarantino's very cinephile style is "
    "recognized, among other things, by his postmodern and "
    "non-linear narration, his elaborate dialogues often "
    "peppered with references to popular culture, and his "
    "highly aesthetic but extremely violent scenes, inspired by "
    "exploitation films, martial arts or spaghetti western.",
    candidate_labels="cinéma, technologie, littérature, politique",
    hypothesis_template="Ce texte parle de {}."
)

result
{"labels": ["cinéma",
            "littérature",
            "technologie",
            "politique"],
 "scores": [0.9314399361610413,
            0.04960821941494942,
            0.013468802906572819,
            0.005483036395162344]}

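✨ Key features

Language agnosticism: the languages of the hypothesis and the premise are chosen at random between English and French, so each language combination has a probability of 25%.
Zero-shot classification: the model can classify any text without task-specific training.
Complex text handling: compared with models such as BERT, RoBERTa, or CamemBERT, it can model and extract information from longer and more complex text structures.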

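📚 Detailed documentation

Model introduction

Bloomz-3b-NLI is fine-tuned from the Bloomz-3b-chat-dpo base model for the natural language inference (NLI) task. NLI aims to determine the semantic relationship between a hypothesis and a set of premises, usually expressed as a sentence pair.

Language-agnostic approach

The languages of the hypothesis and the premise are chosen at random between English and French, so each language combination has a probability of 25%.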
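
The sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes each training example is available in both languages, and the field names (`premise_en`, `hypothesis_fr`, ...) and the helper `sample_language_pair` are hypothetical.

```python
import random

LANGS = ("en", "fr")

def sample_language_pair(example, rng=random):
    """Pick the premise and hypothesis languages independently and uniformly.

    Two independent fair choices give four combinations
    (en/en, en/fr, fr/en, fr/fr), each with probability 25%.
    """
    premise_lang = rng.choice(LANGS)
    hypothesis_lang = rng.choice(LANGS)
    return {
        "premise": example[f"premise_{premise_lang}"],
        "hypothesis": example[f"hypothesis_{hypothesis_lang}"],
        "label": example["label"],
        "langs": (premise_lang, hypothesis_lang),
    }

# Hypothetical parallel example (field names and sentences are assumptions).
example = {
    "premise_en": "A man is playing a guitar.",
    "premise_fr": "Un homme joue de la guitare.",
    "hypothesis_en": "A person is making music.",
    "hypothesis_fr": "Une personne fait de la musique.",
    "label": "entailment",
}

# Count how often each language combination is drawn.
rng = random.Random(0)
counts = {}
for _ in range(10_000):
    pair = sample_language_pair(example, rng)
    counts[pair["langs"]] = counts.get(pair["langs"], 0) + 1
```

Over many draws, each of the four combinations should appear in roughly a quarter of the samples.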

Performance evaluation

Natural language inference task

Class	Precision (%)	F1-score (%)	Support
Global	81.96	81.07	5,010
Contradiction	81.80	84.04	1,670
Entailment	84.82	81.96	1,670
Neutral	76.85	77.20	1,670

Benchmark

Hypothesis and premise both in French:

| Model | Accuracy (%) | MCC (x100) |
| ---- | ---- | ---- |
| cmarkea/distilcamembert-base-nli | 77.45 | 66.24 |
| BaptisteDoyen/camembert-base-xnli | 81.72 | 72.67 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 83.43 | 75.15 |
| cmarkea/bloomz-560m-nli | 68.70 | 53.57 |
| cmarkea/bloomz-3b-nli | 81.08 | 71.66 |
| cmarkea/bloomz-7b1-mt-nli | 83.13 | 74.89 |

Hypothesis in French, premise in English (cross-lingual setting):

| Model | Accuracy (%) | MCC (x100) |
| ---- | ---- | ---- |
| cmarkea/distilcamembert-base-nli | 16.89 | -26.82 |
| BaptisteDoyen/camembert-base-xnli | 74.59 | 61.97 |
| MoritzLaurer/mDeBERTa-v3-base-mnli-xnli | 85.15 | 77.74 |
| cmarkea/bloomz-560m-nli | 68.84 | 53.55 |
| cmarkea/bloomz-3b-nli | 82.12 | 73.22 |
| cmarkea/bloomz-7b1-mt-nli | 85.43 | 78.25 |

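Zero-shot classification task

The zero-shot classification task can be summarized as:

$$P(hypothesis=i\in\mathcal{C}|premise)=\frac{e^{P(premise=entailment\vert hypothesis=i)}}{\sum_{j\in\mathcal{C}}e^{P(premise=entailment\vert hypothesis=j)}}$$

where i denotes a hypothesis built from a template (for example, "This text is about {}.") and one of the #C candidate labels ("cinema", "politics", etc.). The set of hypotheses is therefore {"This text is about cinema.", "This text is about politics.", ...}. Each hypothesis is scored against the premise, which is the sentence to classify.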
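
As a rough illustration of the formula above, the sketch below builds the hypotheses from a template and applies the softmax to per-label entailment probabilities. The entailment values are made up for the example, and the helper names `build_hypotheses` and `zero_shot_scores` are hypothetical; in practice the probabilities would come from the NLI model. Note that the transformers pipeline applies its softmax to the model's raw entailment logits, so its scores will not match this toy calculation.

```python
import math

def build_hypotheses(template, labels):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in labels]

def zero_shot_scores(entailment_probs):
    """Softmax over the per-label entailment probabilities."""
    exps = [math.exp(p) for p in entailment_probs]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cinema", "politics"]
hypotheses = build_hypotheses("This text is about {}.", labels)

# Made-up values standing in for P(premise=entailment | hypothesis=i).
entailment_probs = [0.9, 0.1]
scores = zero_shot_scores(entailment_probs)

# Pair each label with its normalized score, best first.
ranking = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
```

Because the softmax normalizes across candidates, the scores always sum to 1, so adding or removing a candidate label changes every score.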

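Zero-shot classification performance

The model is evaluated on sentiment analysis using reviews from Allociné, a French movie review site. The dataset has two classes, positive and negative, over 20,000 reviews. We use the hypothesis template "Ce commentaire est {}." and the candidate classes "positif" and "negatif".

Model	Accuracy (%)	MCC (x100)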
cmarkea/distilcamembert-base-nli	80.59	63.71
BaptisteDoyen/camembert-base-xnli	86.37	73.74
MoritzLaurer/mDeBERTa-v3-base-mnli-xnli	84.97	70.05
cmarkea/bloomz-560m-nli	71.13	46.3
cmarkea/bloomz-3b-nli	89.06	78.10
cmarkea/bloomz-7b1-mt-nli	95.12	90.27
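
For readers unfamiliar with the two metrics in this table, here is a small sketch of how accuracy and the Matthews correlation coefficient (MCC) are derived from predictions in the binary case. The labels and predictions are made up, and the function names are illustrative; this is not the authors' evaluation script.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def binary_mcc(y_true, y_pred, positive="positif"):
    """Matthews correlation coefficient for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up reference labels and predictions for six reviews.
y_true = ["positif", "positif", "negatif", "negatif", "positif", "negatif"]
y_pred = ["positif", "negatif", "negatif", "negatif", "positif", "positif"]

acc = accuracy(y_true, y_pred)
mcc = binary_mcc(y_true, y_pred)
print(f"accuracy (%): {100 * acc:.2f}, MCC (x100): {100 * mcc:.2f}")
```

MCC ranges from -1 to 1 and stays near 0 for a classifier that guesses, which is why it is reported alongside accuracy.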

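🔧 Technical details

Natural language inference task

The goal is to predict textual entailment (does sentence A entail, contradict, or remain neutral with respect to sentence B?). This is a classification task: given two sentences, predict one of three labels. With sentence A as the premise and sentence B as the hypothesis, the model estimates the following probability:

$$P(premise=c\in\{contradiction, entailment, neutral\}\vert hypothesis)$$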

📄 License

This model is released under the bigscience-bloom-rail-1.0 license.

📖 Citation

@online{DeBloomzNLI,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-nli},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}

📋 Information table

Property	Details
Model type	Bloomz-3b-NLI
Training data	xnli
Base model	cmarkea/bloomz-3b-dpo-chat
Supported languages	French, English
Task type	Zero-shot classification
License	bigscience-bloom-rail-1.0