gliclass-x-base: Open-Source Zero-Shot Classifier for Efficient Multilingual Text Classification

Gliclass X Base

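Developed by knowledgator

GLiClass is an efficient zero-shot classifier that performs on par with cross-encoders while being more compute-efficient, and it supports multilingual text classification.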

Text classification

Safetensors

License: Apache-2.0 #zero-shot-classification #multilingual-understanding #efficient-inference

Downloads: 181

Released: July 17, 2025

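Model overview

GLiClass is a generalist, lightweight sequence classification model suited to tasks such as topic classification and sentiment analysis. It can also serve as a reranker in RAG pipelines.

Model features

Efficient zero-shot classification

Classification takes a single forward pass, which makes the model more compute-efficient than a traditional cross-encoder.

Multilingual support

Built on the mdeberta-v3-base backbone, which gives the model strong multilingual understanding.

Commercially friendly

Trained on synthetic data and on data licensed for commercial use, so the model is suitable for commercial applications.

Lightweight design

The parameter count is kept small to speed up inference while preserving performance.

Model capabilities

Zero-shot text classification

Multi-label classification

Multilingual text understanding

Reranking in RAG pipelines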

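Use cases

Text classification

Topic classification

Assigns topics such as travel, science, or politics to a text.

Average F1 score of 0.5737 across the general benchmarks.

Sentiment analysis

Determines the sentiment of a text.

F1 score of 0.8840 on the IMDb dataset.

RAG applications

Reranking retrieved results

Serves as a reranker in retrieval-augmented generation (RAG) pipelines.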

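🚀 ⭐ GLiClass: Generalist and Lightweight Model for Sequence Classification

GLiClass is an efficient zero-shot classifier inspired by the GLiNER work. It performs on par with a cross-encoder but is more compute-efficient, because classification takes a single forward pass.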
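To make the efficiency claim concrete, the sketch below counts forward passes under a simplified cost model. It assumes a cross-encoder scores every (text, label) pair in its own pass, while a GLiClass-style model reads the text together with all candidate labels in one pass. The function names are illustrative, and the model ignores batching and the longer input sequence that results from packing labels alongside the text.

```python
def cross_encoder_passes(num_texts: int, num_labels: int) -> int:
    """Assumed cost model: one forward pass per (text, label) pair."""
    return num_texts * num_labels


def single_pass_classifier_passes(num_texts: int, num_labels: int) -> int:
    """Assumed cost model: one forward pass per text, covering all labels."""
    return num_texts


if __name__ == "__main__":
    texts, labels = 1000, 5
    print("cross-encoder passes:", cross_encoder_passes(texts, labels))
    print("single-pass classifier passes:", single_pass_classifier_passes(texts, labels))
```

Under these assumptions the gap grows linearly with the number of candidate labels, which is why the difference matters most for large label sets.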

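The model can be used for topic classification and sentiment analysis, and as a reranker in RAG pipelines. It is trained on synthetic data and on data licensed for commercial use, so it can be applied in commercial settings. Its backbone is mdeberta-v3-base, which supports multilingual understanding and makes the model well suited to text tasks across languages.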
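The reranking role can be pictured as "score each retrieved passage against the query, then sort". The sketch below shows only that generic pattern. The source does not describe how GLiClass is wired in as a scorer, so `score_fn` is a hypothetical stand-in, and `keyword_overlap` is a toy scorer used purely so the example runs without a model.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the top_k, best first."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def keyword_overlap(query: str, passage: str) -> float:
    """Toy scorer (not a model): fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


if __name__ == "__main__":
    retrieved = [
        "cats sleep a lot",
        "travel the world by train",
        "world travel tips",
    ]
    for passage, score in rerank("world travel tips", retrieved, keyword_overlap, top_k=2):
        print(f"{score:.2f}  {passage}")
```

In a real pipeline, `score_fn` would be replaced by a function that calls the classifier and returns its relevance score for the passage.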

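🚀 Quick start

Install the GLiClass library

First, install the GLiClass library. The version specifier is quoted so that the shell does not treat ">" as an output redirection:

pip install gliclass
pip install -U "transformers>=4.48.0"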
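After installing, it can be useful to confirm that the environment satisfies the transformers>=4.48.0 requirement. The following is a minimal sketch that uses only the standard library. The helper names (`parse_release`, `meets_minimum`, `check_transformers`) are illustrative, and the version parsing is deliberately simple: it compares only the leading numeric components and ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple:
    """Turn '4.48.0' or '4.49.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.48.0") -> bool:
    """Compare numeric release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers() -> str:
    """Report whether the installed transformers version is recent enough."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    status = "OK" if meets_minimum(installed) else "too old"
    return f"transformers {installed}: {status}"


if __name__ == "__main__":
    print(check_transformers())
```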

Initialize the model and pipeline

The examples below show usage in several languages:

English

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

# Load the pretrained model and its tokenizer from the Hugging Face Hub
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
# 'multi-label' lets several labels pass the threshold for one text; set device='cpu' if no GPU is available
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "One day I will see the world!"
labels = ["travel", "dreams", "sport", "science", "politics"]
results = pipeline(text, labels, threshold=0.5)[0]  # index 0 because a single text was passed
for result in results:
    print(result["label"], "=>", result["score"])
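The `threshold=0.5` argument and the `'multi-label'` setting determine which labels come back. The sketch below illustrates the general idea on plain dictionaries: multi-label selection keeps every label whose score clears the threshold, while single-label selection keeps only the best one. It is a conceptual illustration, not the library's internal logic. The scores are made-up values rather than model output, and the use of `>=` at the boundary is an assumption.

```python
def select_multi_label(scores: dict, threshold: float = 0.5) -> list:
    """Keep every label scoring at or above the threshold, best first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def select_single_label(scores: dict) -> tuple:
    """Keep only the highest-scoring label."""
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Illustrative scores only; these are NOT real model outputs.
    example_scores = {
        "travel": 0.91,
        "dreams": 0.78,
        "sport": 0.04,
        "science": 0.10,
        "politics": 0.02,
    }
    print(select_multi_label(example_scores, threshold=0.5))
    print(select_single_label(example_scores))
```

Raising the threshold trades recall for precision: fewer labels are returned, but those that remain have higher scores.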

Spanish

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "¡Un día veré el mundo!"
labels = ["viajes", "sueños", "deportes", "ciencia", "política"]
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(result["label"], "=>", result["score"])

Italian

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "Un giorno vedrò il mondo!"
labels = ["viaggi", "sogni", "sport", "scienza", "politica"]
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(result["label"], "=>", result["score"])

French

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "Un jour, je verrai le monde!"
labels = ["voyage", "rêves", "sport", "science", "politique"]
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(result["label"], "=>", result["score"])

German

from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

text = "Eines Tages werde ich die Welt sehen!"
labels = ["Reisen", "Träume", "Sport", "Wissenschaft", "Politik"]
results = pipeline(text, labels, threshold=0.5)[0]
for result in results:
    print(result["label"], "=>", result["score"])
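The five examples above differ only in the text and label list, so the model needs to be loaded just once. The sketch below reuses one pipeline across all languages. `EXAMPLES`, `build_pipeline`, and `classify_examples` are illustrative names introduced here, not part of the gliclass API; the calls inside `build_pipeline` are the same ones used in the examples above.

```python
# Texts and labels copied from the per-language examples above.
EXAMPLES = {
    "English": ("One day I will see the world!",
                ["travel", "dreams", "sport", "science", "politics"]),
    "Spanish": ("¡Un día veré el mundo!",
                ["viajes", "sueños", "deportes", "ciencia", "política"]),
    "Italian": ("Un giorno vedrò il mondo!",
                ["viaggi", "sogni", "sport", "scienza", "politica"]),
    "French": ("Un jour, je verrai le monde!",
               ["voyage", "rêves", "sport", "science", "politique"]),
    "German": ("Eines Tages werde ich die Welt sehen!",
               ["Reisen", "Träume", "Sport", "Wissenschaft", "Politik"]),
}


def build_pipeline(device: str = "cuda:0"):
    """Load the model once; imports are local so this module loads without gliclass."""
    from gliclass import GLiClassModel, ZeroShotClassificationPipeline
    from transformers import AutoTokenizer

    model_id = "knowledgator/gliclass-x-base"
    model = GLiClassModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=True)
    return ZeroShotClassificationPipeline(
        model, tokenizer, classification_type="multi-label", device=device
    )


def classify_examples(pipeline, examples=EXAMPLES, threshold: float = 0.5) -> dict:
    """Run every example through the same pipeline and collect (label, score) pairs."""
    output = {}
    for language, (text, labels) in examples.items():
        results = pipeline(text, labels, threshold=threshold)[0]  # one text per call
        output[language] = [(r["label"], r["score"]) for r in results]
    return output
```

A typical call would be `classify_examples(build_pipeline())`, which downloads the model on first use.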

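📊 Benchmarks

The tables below report the model's F1 scores on several text classification datasets. None of the tested models were fine-tuned on these datasets; all were evaluated in a zero-shot setting.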
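For readers less familiar with the metric, here is a minimal pure-Python sketch of an F1 computation. The source does not state which averaging scheme (macro, micro, or weighted) was used for the reported scores, so the macro average shown here is an assumption chosen for illustration, and the labels and predictions are toy data.

```python
def f1_for_label(y_true: list, y_pred: list, label: str) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy data, not taken from any benchmark.
    truth = ["pos", "pos", "neg", "neg"]
    predicted = ["pos", "neg", "neg", "neg"]
    print(round(macro_f1(truth, predicted), 4))
```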

Multilingual benchmarks

Dataset	gliclass-x-base	gliclass-base-v3.0	gliclass-large-v3.0
FredZhang7/toxi-text-3M	0.5972	0.5072	0.6118
SetFit/xglue_nc	0.5014	0.5348	0.5378
Davlan/sib200_14classes	0.4663	0.2867	0.3173
uhhlt/GermEval2017	0.3999	0.4010	0.4299
dolfsai/toxic_es	0.1250	0.1399	0.1412
Average	0.41796	0.37392	0.4076

General benchmarks

Dataset	gliclass-x-base	gliclass-base-v3.0	gliclass-large-v3.0
SetFit/CR	0.8630	0.9127	0.9398
SetFit/sst2	0.8554	0.8959	0.9192
SetFit/sst5	0.3287	0.3376	0.4606
AmazonScience/massive	0.2611	0.5040	0.5649
stanfordnlp/imdb	0.8840	0.9251	0.9366
SetFit/20_newsgroups	0.4116	0.4759	0.5958
SetFit/enron_spam	0.5929	0.6760	0.7584
PolyAI/banking77	0.3098	0.4698	0.5574
takala/financial_phrasebank	0.7851	0.8971	0.9000
ag_news	0.6815	0.7279	0.7181
dair-ai/emotion	0.3667	0.4447	0.4506
MoritzLaurer/cap_sotu	0.3935	0.4614	0.4589
cornell/rotten_tomatoes	0.7252	0.7943	0.8411
Average	0.5737	0.6556	0.7001
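The "Average" rows are consistent with an unweighted mean over the datasets in each table. The short check below recomputes them for the gliclass-x-base column, with the per-dataset scores copied from the two tables. The variable names are illustrative.

```python
from statistics import mean

# gliclass-x-base scores, copied from the multilingual table (5 datasets).
MULTILINGUAL_X_BASE = [0.5972, 0.5014, 0.4663, 0.3999, 0.1250]

# gliclass-x-base scores, copied from the general table (13 datasets).
GENERAL_X_BASE = [
    0.8630, 0.8554, 0.3287, 0.2611, 0.8840, 0.4116, 0.5929,
    0.3098, 0.7851, 0.6815, 0.3667, 0.3935, 0.7252,
]

multilingual_avg = mean(MULTILINGUAL_X_BASE)
general_avg = mean(GENERAL_X_BASE)

print(round(multilingual_avg, 5))  # 0.41796
print(round(general_avg, 4))       # 0.5737
```

Because the mean is unweighted, small datasets count as much as large ones, so a low score on a single dataset such as dolfsai/toxic_es pulls the multilingual average down noticeably.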