GIST-small-Embedding-v0: a free, open-source text embedding model optimized for encoding retrieval queries

GIST Small Embedding V0

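Developed by avsolatorio

A text embedding model fine-tuned from BAAI/bge-small-en-v1.5 on the MEDI dataset and the MTEB Classification training datasets, with improved query encoding for retrieval tasks.

Text Embedding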

Safetensors

Language: English
License: MIT
Tags: #instruction-free embedding #cross-task fine-tuning #semantic similarity

Downloads: 945.68k

Released: 2/3/2024

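Model Overview

The model needs no instruction input to generate embeddings. Queries can be encoded directly, which suits text retrieval and similarity computation.

Model Features

Instruction-free

No prompt has to be written to generate embeddings; the query is encoded as-is.

Trained on multiple datasets

Fine-tuned on a combination of the MEDI dataset and the MTEB Classification training datasets.

Optimized for retrieval

Tuned for retrieval tasks, with significant gains on some of them.

Model Capabilities

Text embedding generation

Text similarity computation

Retrieval

Use Cases

Information retrieval

Document retrieval

Quickly retrieve relevant documents or passages.

Shows significant improvement on some MTEB tasks.

Similarity computation

Text similarity analysis

Compute the semantic similarity between two texts.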

🚀 GIST small Embedding v0

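GIST small Embedding v0 is a fine-tuned text embedding model that loads through the Sentence Transformers library and generates embeddings without any additional instructions. Fine-tuning on the datasets described below improved some tasks significantly, and the model can be used for text retrieval, similarity computation, and related natural language processing tasks.

🚀 Quick Start

The model can be loaded with the Sentence Transformers library.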

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with a specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0", revision=revision)

texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]

# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)

# Compute the cosine similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)

print(scores.cpu().numpy())

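✨ Key Features

Instruction-free: the model does not require any instruction to generate embeddings, so queries for retrieval tasks can be encoded directly without crafting instructions.
Fine-tuning gains: fine-tuning on the datasets described below yields significant improvements over the base model on some tasks, though performance may degrade on others.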
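The instruction-free behavior can be illustrated with a small retrieval sketch. It is not taken from the model card: the helper names `cosine`, `rank_documents`, and `search` are hypothetical, and the ranking is plain cosine similarity in pure Python. The point is only that the raw query string goes to `encode()` with no prefix.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search(query: str, documents: List[str]) -> List[Tuple[int, float]]:
    """Encode a raw query and the documents with the model, then rank.

    The library is imported lazily so the helpers above can be used
    (and tested) without it being installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")
    # No instruction prefix: the query string is passed to encode() as-is.
    query_vec = model.encode(query)
    doc_vecs = model.encode(documents)
    return rank_documents(query_vec.tolist(), doc_vecs.tolist())
```

Calling `search("forecasting human mobility", texts)` with a list of passages would return the passage indices ordered by similarity. For large collections a vectorized or indexed search would be the more practical choice.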

📦 Installation

Install the Sentence Transformers library with:

pip install sentence-transformers

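📚 Documentation

Data

The training data is a compilation of the MEDI dataset and the MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset is available, along with the specific revision used to train the model:

Dataset: avsolatorio/medi-data-mteb_avs_triplets
Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a task_type key, which can be used to select only the MTEB classification tasks (prefixed with mteb_).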
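Selecting the MTEB classification rows by their task_type prefix might look like the sketch below. The dataset ID, revision, and the `task_type` key come from the text above. The helper names, the `"train"` split name, and the example task_type values in the usage note are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

DATASET_ID = "avsolatorio/medi-data-mteb_avs_triplets"
DATASET_REVISION = "238a0499b6e6b690cc64ea56fde8461daa8341bb"


def is_mteb_classification(example: Dict) -> bool:
    """True when the row's task_type carries the mteb_ prefix."""
    return str(example["task_type"]).startswith("mteb_")


def select_mteb(examples: Iterable[Dict]) -> List[Dict]:
    """Filter an in-memory iterable of rows down to the mteb_ tasks."""
    return [ex for ex in examples if is_mteb_classification(ex)]


def load_mteb_subset(split: str = "train"):
    """Load the pinned dataset revision and keep only mteb_ rows.

    The split name is an assumption; check the dataset page for the
    splits that actually exist. `datasets` is imported lazily so the
    helpers above work without it.
    """
    from datasets import load_dataset

    ds = load_dataset(DATASET_ID, revision=DATASET_REVISION, split=split)
    return ds.filter(is_mteb_classification)
```

For example, `select_mteb([{"task_type": "mteb_example"}, {"task_type": "other"}])` keeps only the first row. Pinning `revision` means the filter runs on the same data snapshot that was used for training.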

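The MEDI dataset is published in the following paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.

MTEB benchmark results for the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset changed the model considerably, producing significant improvements on some tasks and degraded performance on others.

The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed degradation. The paper details some evidence that the thematic coverage of the fine-tuning data can affect downstream performance.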

Training Parameters

The following training parameters were used to fine-tune the model:

Epochs = 40
Warmup ratio = 0.1
Learning rate = 5e-6
Batch size = 16
Checkpoint step = 102000
Contrastive loss temperature = 0.01
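To give a feel for what a temperature of 0.01 does, here is a minimal sketch of a generic InfoNCE-style contrastive loss for a single query. It is not the GISTEmbed training objective, which additionally uses guided in-sample selection of negatives as described in the paper. The function name and the toy similarity values are hypothetical.

```python
import math
from typing import Sequence


def info_nce_loss(
    similarities: Sequence[float],
    positive_index: int,
    temperature: float = 0.01,
) -> float:
    """Cross-entropy of the positive candidate against all candidates.

    `similarities` holds the query's cosine similarity to each candidate;
    dividing by a small temperature sharpens the softmax over them.
    """
    logits = [s / temperature for s in similarities]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_sum_exp - logits[positive_index]
```

With a temperature of 0.01, a similarity gap of 0.8 between two candidates becomes a logit gap of 80. The loss is then close to zero when the positive is the higher-scoring candidate and close to 80 when it is the lower-scoring one.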

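Evaluation

The model was evaluated with the MTEB Evaluation suite.

🔧 Technical Details

The model is based on BAAI/bge-small-en-v1.5 and fine-tuned on the MEDI and MTEB Classification training datasets. No additional instructions were used during fine-tuning; texts are encoded directly to produce embeddings. For the technical paper, see: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning.

📄 License

This project is released under the MIT License.

📖 Citation

If you use GISTEmbed or the datasets we have published in your project or research, please cite our work. 🤗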

@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

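🙏 Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the Knowledge for Change Program (KCP) of the World Bank - RA-P503405-RESE-TF0C3444.

The findings, interpretations, and conclusions expressed in this material are entirely those of the authors and do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.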