Open-Source sbert-roberta-large-anli-mnli-snli Model for Sentence Similarity Comparison

Sbert Roberta Large Anli Mnli Snli

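Developed by usc-isi

A sentence-transformers model built on RoBERTa-large, designed for sentence similarity tasks and trained on the ANLI, MNLI, and SNLI datasets

Text Embedding

Transformers

English #sentence-embeddings #NLI-optimized #multi-dataset-training

Downloads: 38

Released: 3/2/2022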

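Model Overview

The model maps sentences and paragraphs to a 768-dimensional vector space and is suited to natural language processing tasks such as semantic search and clustering

Model Features

High-quality sentence embeddings

Built on the RoBERTa-large architecture to produce high-quality sentence embeddings

Multi-dataset training

Trained jointly on three established natural language inference datasets: ANLI, MNLI, and SNLI

Mean pooling strategy

Uses mean pooling to aggregate the token embeddings into a single sentence vector

Model Capabilities

Sentence vectorization

Semantic similarity computation

Text clustering

Semantic search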

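Use Cases

Information retrieval

Semantic search systems

Build search systems that match on meaning rather than keywords

Improve the relevance of search results

Text analysis

Document clustering

Automatically group semantically similar documents

Organize documents without supervision

Natural language understanding

Sentence similarity computation

Compute the semantic similarity between two sentences

Applicable to question answering systems, paraphrase detection, and similar applications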
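
The use cases above all come down to comparing embedding vectors. As a minimal sketch, assuming cosine similarity as the comparison measure (a common choice, not one this page specifies), a semantic search can rank a corpus against a query as shown below. The helper names `cosine_similarity` and `semantic_search`, and the 3-dimensional stand-in vectors, are hypothetical; real embeddings from the model would take their place, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k corpus vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional stand-ins for real sentence embeddings.
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query = [1.0, 0.0, 0.0]

hits = semantic_search(query, corpus, top_k=2)
for index, score in hits:
    print(index, round(score, 4))
```

The same `cosine_similarity` score, compared against a threshold chosen for the application, would also serve the paraphrase detection case.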

🚀 sbert-roberta-large-anli-mnli-snli

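This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.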

Model Information

Property	Details
Model type	Sentence similarity model
Training data	ANLI, Multi NLI, SNLI
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers

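Training Details

Learning rate: 2e-5
Batch size: 8
Pooling: Mean
Training time: about 20 hours on one NVIDIA GeForce RTX 2080 Ti

The model was initialized with RoBERTa-large weights and trained on ANLI (Nie et al., 2020), MNLI (Williams et al., 2018), and SNLI (Bowman et al., 2015) using the example script training_nli.py.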

🚀 Quick Start

Install Dependencies

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Usage Examples

Basic Usage (Sentence-Transformers)

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("usc-isi/sbert-roberta-large-anli-mnli-snli")
embeddings = model.encode(sentences)
print(embeddings)
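
Once `model.encode` has returned embeddings, grouping related sentences is a matter of comparing vectors. The sketch below shows one simple way to do that, not something the model card prescribes: a greedy, single-pass clustering on cosine similarity. The function `threshold_cluster`, the 0.8 threshold, and the 2-dimensional stand-in vectors are illustrative assumptions; real embeddings from the model would be substituted in.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def threshold_cluster(vectors, threshold=0.8):
    """Greedy clustering: each vector joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; its first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical 2-dimensional stand-ins for real sentence embeddings.
embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.0, 0.9],
]

clusters = threshold_cluster(embeddings)
print(clusters)  # [[0, 1], [2, 3]]
```

The result of a greedy pass depends on input order and on the threshold, so a larger collection would probably call for an established clustering algorithm instead.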

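Advanced Usage (Hugging Face Transformers)

Without sentence-transformers, you can use the model as follows: first pass the input through the Transformer model, then apply the right pooling operation on top of the contextualized word embeddings.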

import torch
from transformers import AutoModel, AutoTokenizer


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")
model = AutoModel.from_pretrained("usc-isi/sbert-roberta-large-anli-mnli-snli")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
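
To see why the attention mask matters in `mean_pooling`, the following sketch redoes the same computation on plain Python lists with hand-picked numbers. It is an illustration of the idea rather than a replacement for the torch version above; the input values are made up, and the third position plays the role of a padding token.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, skipping positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        summed = [0.0] * dim
        for vec, keep in zip(tokens, mask):
            if keep:
                summed = [s + v for s, v in zip(summed, vec)]
        # Mirrors the clamp in the torch version: guards against division by zero.
        count = max(sum(mask), 1e-9)
        pooled.append([s / count for s in summed])
    return pooled


# One sentence with three positions; the last one is padding (mask = 0).
token_embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]

result = mean_pooling(token_embeddings, attention_mask)
print(result)  # [[2.0, 3.0]]
```

Only the two real tokens contribute to the average, so the padding vector `[100.0, 100.0]` has no effect on the sentence embedding.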

📚 Detailed Documentation

Evaluation Results

For evaluation results, see Section 4.1 of our paper.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📖 Citation and Authors

For more information about the project, see our paper:

Ciosici, Manuel, et al. "Machine-Assisted Script Curation." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, 2021, pp. 8–17. ACLWeb, https://www.aclweb.org/anthology/2021.naacl-demos.2.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.