pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
language: en
license: apache-2.0
mutual information Contrastive Sentence Embedding (miCSE) for low-shot sentence embeddings
Paper accepted at ACL 2023

Model description
The miCSE language model is trained for computing sentence similarity. During contrastive learning, the model is trained by enforcing alignment between the attention patterns of different views (embeddings of dropout augmentations). Intuitively, miCSE learns by enforcing __syntactic consistency across dropout-augmented views__. In practical terms, this is achieved by regularizing the self-attention distribution. Regularizing self-attention during training makes representation learning much more sample efficient, so self-supervised learning remains tractable even when the training set is limited in size. This property makes miCSE particularly valuable in __real-world applications__, where training data is typically limited.
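As a rough illustration of this idea, the sketch below embeds the same input twice with dropout active (two augmented views), pulls the two _[CLS]_ embeddings together, and adds a simple alignment penalty between the self-attention maps of the two views. This is a minimal sketch for intuition only; the mean-squared attention penalty and the unweighted sum of the two terms are illustrative assumptions and do not reproduce the mutual-information-based regularizer from the paper.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# NOTE: illustrative sketch only -- NOT the exact miCSE objective.
tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
model.train()  # keep dropout active so two forward passes yield two "views"

batch = tokenizer(["A sentence used to form two dropout-augmented views."], return_tensors="pt")

# Two forward passes of the same input -> two dropout-augmented views
view1 = model(**batch, output_attentions=True, return_dict=True)
view2 = model(**batch, output_attentions=True, return_dict=True)

# Contrastive part: pull the [CLS] embeddings of the two views together
z1 = view1.last_hidden_state[:, 0]
z2 = view2.last_hidden_state[:, 0]
contrastive_term = -F.cosine_similarity(z1, z2, dim=-1).mean()

# Regularization part: align the self-attention patterns of the two views
# (mean squared difference is a stand-in for the regularizer in the paper)
attention_term = sum(F.mse_loss(a1, a2) for a1, a2 in zip(view1.attentions, view2.attentions))

loss = contrastive_term + attention_term  # weighting between the terms is omitted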
Intended uses
The model is intended for encoding sentences or short paragraphs. Given an input text, the model produces a vector embedding that captures its semantics. The sentence representation corresponds to the embedding of the _[CLS]_ token. The embedding can be used for numerous tasks such as retrieval, sentence similarity comparison (see example 1), or clustering (see example 2).
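A minimal sketch of how such an embedding can be obtained (fuller, runnable examples follow below):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")

# Encode one sentence and take the [CLS] token embedding as its representation
inputs = tokenizer("A short sentence to embed.", return_tensors="pt")
with torch.no_grad():
    cls_embedding = model(**inputs).last_hidden_state[:, 0]  # shape: (1, hidden_size)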
Training data
The model was trained on English sentences randomly collected from Wikipedia. The full training file can be downloaded here. The low-shot training data consists of splits of the SimCSE training corpus at different sizes (from 10% down to 0.0064%). Each split size comprises 5 files created with different seeds (indicated by the file name suffix). The data can be downloaded here.
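Once downloaded, a split can be loaded like any plain-text corpus with one sentence per line; the file name below is a hypothetical placeholder for one of the split files.

from datasets import load_dataset

# "train_split_seed1.txt" is a placeholder name for a downloaded low-shot split file
train_split = load_dataset("text", data_files={"train": "train_split_seed1.txt"})["train"]
print(len(train_split), train_split[0]["text"])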
Model training
To get the most out of miCSE's few-shot learning capability, the model should be trained on your own data. The source code and the data splits used in the paper are available here.
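The complete training objective (contrastive loss plus the attention regularizer) is implemented in the released source code. The loop below is only a generic sketch of contrastive fine-tuning on dropout-augmented views; it omits the miCSE-specific attention regularizer, and the learning rate, temperature, and number of steps are placeholder values.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

sentences = ["First training sentence.", "Second training sentence."]  # replace with your data
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=32)

model.train()  # dropout stays active
for step in range(10):  # placeholder number of steps
    # Two forward passes with dropout -> two views of each sentence
    z1 = model(**batch).last_hidden_state[:, 0]
    z2 = model(**batch).last_hidden_state[:, 0]

    # InfoNCE: the two views of a sentence are positives, other sentences are negatives
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / 0.05
    labels = torch.arange(sim.size(0))
    loss = F.cross_entropy(sim, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()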
Model usage
Example 1) - Sentence similarity
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn

# Load tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")

max_length = 32

# Sentences to embed
sentences = [
    "This is a sentence for testing miCSE.",
    "This is yet another test sentence for the mutual information Contrastive Sentence Embeddings model."
]

# Tokenize the batch
batch = tokenizer.batch_encode_plus(
    sentences,
    return_tensors='pt',
    padding=True,
    max_length=max_length,
    truncation=True
)

# Compute embeddings; the sentence representation is the [CLS] token embedding
with torch.no_grad():
    outputs = model(**batch, output_hidden_states=True, return_dict=True)
    embeddings = outputs.last_hidden_state[:, 0]

# Pairwise cosine similarity between the sentence embeddings
sim = nn.CosineSimilarity(dim=-1)
cos_sim = sim(embeddings.unsqueeze(1), embeddings.unsqueeze(0))
print(f"Similarity: {cos_sim[0, 1].detach().item()}")
Example 2) - Clustering
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from tqdm.auto import tqdm
from datasets import load_dataset
import umap
import umap.plot as umap_plot

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE").to(device)

# Tweet sentiment dataset used for this clustering illustration
dataset = load_dataset("tweet_eval", "sentiment")

batch_size = 50
max_length = 128

embedding_stack = []
classes = []
# Embed the training split batch by batch
for i in tqdm(range(0, len(dataset['train']), batch_size)):
    batch = tokenizer.batch_encode_plus(
        dataset['train'][i:i + batch_size]['text'],
        return_tensors='pt',
        padding=True,
        max_length=max_length,
        truncation=True
    ).to(device)
    classes += dataset['train'][i:i + batch_size]['label']
    with torch.no_grad():
        outputs = model(**batch, output_hidden_states=True, return_dict=True)
        # Sentence representation = [CLS] token embedding
        embedding_stack.append(outputs.last_hidden_state[:, 0].cpu())
embeddings = torch.vstack(embedding_stack)

# Project the embeddings to 2D with UMAP and plot them, colored by sentiment label
umap_model = umap.UMAP(n_neighbors=250, n_components=2, metric='cosine').fit(embeddings)
umap_plot.points(umap_model, labels=np.array(classes), theme='fire')
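Instead of (or in addition to) the UMAP visualization, the embeddings computed above can be clustered directly, for example with k-means from scikit-learn; the choice of three clusters below simply matches the three sentiment classes of tweet_eval.

from sklearn.cluster import KMeans

# Cluster the [CLS] embeddings from example 2 with k-means (3 clusters = 3 sentiment classes)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings.numpy())
cluster_ids = kmeans.labels_  # cluster assignment per tweet
print(cluster_ids[:10])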

Example 3) - Sentence similarity with SentenceTransformers
from sentence_transformers import SentenceTransformer, models
import torch.nn as nn

# Wrap the miCSE encoder as a SentenceTransformer with [CLS] pooling
# (the sentence representation is the [CLS] token, as described above)
word_embedding_model = models.Transformer('sap-ai-research/miCSE', max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Two lists of sentences to compare pairwise
sentences1 = ["A sentence for testing miCSE.", "Using the mutual information Contrastive Sentence Embeddings model."]
sentences2 = ["A miCSE test.", "Similarity with miCSE."]

embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of corresponding sentence pairs
cos_sim = nn.CosineSimilarity(dim=-1)(embeddings1, embeddings2)
for i in range(len(sentences1)):
    print(f"Similarity {cos_sim[i]:.2f}: {sentences1[i]} <<vs.>> {sentences2[i]}")
Benchmark
Results on the SentEval benchmark:
+-------+-------+-------+-------+-------+--------------+-----------------+--------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | S.Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+--------+
| 71.71 | 83.09 | 75.46 | 83.13 | 80.22 |    79.70     |      73.62      | 78.13  |
+-------+-------+-------+-------+-------+--------------+-----------------+--------+
Citation
If you use this code or refer to our work, please cite:
@inproceedings{klein-nabi-2023-micse,
title = "mi{CSE}: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings",
author = "Klein, Tassilo and Nabi, Moin",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2023",
pages = "6159--6177",
url = "https://aclanthology.org/2023.acl-long.339",
}
Authors: Tassilo Klein, Moin Nabi