🚀 Snowflake Arctic-embed-l
Snowflake Arctic-embed-l belongs to a family of text embedding models focused on high-quality, performance-optimized retrieval. The model targets both accuracy and efficiency in text retrieval, giving users more precise and efficient search over text.
🚀 Quick Start
Environment Setup
Make sure the required Python libraries, such as `sentence-transformers` and `transformers`, are installed.
Code Example
The following example calls the `snowflake-arctic-embed-l` model through the `sentence-transformers` library:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
✨ Key Features
- High-performance retrieval: each size variant achieves state-of-the-art performance on the MTEB/BEIR leaderboard.
- Multiple model sizes: `snowflake-arctic-embed-xs`, `snowflake-arctic-embed-s`, `snowflake-arctic-embed-m`, `snowflake-arctic-embed-m-long`, and `snowflake-arctic-embed-l` cover a range of deployment scenarios.
- Alternative to closed-source models: the largest model, `snowflake-arctic-embed-l`, serves as a natural drop-in replacement for closed-source embeddings.
📦 Installation
Using Sentence Transformers
```shell
pip install sentence-transformers
```
Using Hugging Face Transformers
```shell
pip install transformers
```
Using Transformers.js
```shell
npm i @xenova/transformers
```
💻 Usage Examples
Basic Usage
Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
Using Hugging Face Transformers
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-l', add_pooling_layer=False)
model.eval()

# Queries must carry the retrieval prefix; documents are encoded as-is
query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Take the CLS token of the last hidden state as the text embedding
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# L2-normalize so that dot products equal cosine similarities
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
Using Transformers.js
```javascript
import { pipeline, dot } from '@xenova/transformers';

// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Snowflake/snowflake-arctic-embed-l', {
    quantized: false,
});

// The first sentence is the prefixed query; the rest are documents
const sentences = [
    'Represent this sentence for searching relevant passages: Where can I get the best tacos?',
    'The Data Cloud!',
    'Mexico City of Course!',
];

const output = await extractor(sentences, { normalize: true, pooling: 'cls' });
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => dot(source_embeddings, x));
console.log(similarities);
```
Advanced Usage
OpenAI-compatible API deployment with Infinity
```shell
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
  michaelf34/infinity:0.0.70 \
  v2 --model-id Snowflake/snowflake-arctic-embed-l --dtype float16 --batch-size 32 --engine torch --port 7997
```
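Once the container is running, the service exposes an OpenAI-style embeddings API. As a minimal sketch, assuming the conventional `/embeddings` route on port 7997 and OpenAI-style field names (`model`, `input`; verify against your Infinity version), a request body can be built like this:

```python
import json

def build_embedding_request(texts, model="Snowflake/snowflake-arctic-embed-l"):
    """Build an OpenAI-style embeddings request body (field names are assumptions)."""
    return {"model": model, "input": texts}

payload = build_embedding_request(["what is snowflake?"])
# POST this as JSON to http://localhost:7997/embeddings
print(json.dumps(payload))
```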
📚 Documentation
Model Overview
`snowflake-arctic-embed` is a suite of text embedding models built on existing open-source text representation models (such as `bert-base-uncased`) and trained through a multi-stage pipeline to optimize retrieval performance.
Model Comparison
Comparison with closed-source models

| Model | MTEB Retrieval Score (NDCG@10) |
| --- | --- |
| snowflake-arctic-embed-l | 55.98 |
| Google-gecko-text-embedding | 55.7 |
| text-embedding-3-large | 55.44 |
| Cohere-embed-english-v3.0 | 55.00 |
| bge-large-en-v1.5 | 54.29 |
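For reference, the scores above use normalized discounted cumulative gain at rank 10. A minimal sketch of the standard metric (not Snowflake-specific code):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevances (best-first order is ideal)."""
    def dcg(rels):
        # Discounted cumulative gain: relevance discounted by log2 of rank + 1
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0; misordered rankings score lower
print(ndcg_at_k([3, 2, 1]))  # → 1.0
```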
🔧 Technical Details
Training Process
The model is trained in two stages:
- Pretraining: training on a large set of query-document pairs, with negatives derived in-batch. Pretraining draws on roughly 400 million samples mixed from public datasets and proprietary web search data.
- Fine-tuning: longer training on a smaller dataset (about 1 million samples) of query / positive document / negative document triplets; negative mining and data curation are critical to retrieval accuracy.
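The in-batch-negatives objective used in the pretraining stage can be sketched as follows. This is illustrative only, assuming an InfoNCE-style contrastive loss with a temperature; the actual Arctic training recipe may differ in its details:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.02):
    """InfoNCE-style loss: the i-th document is the positive for the i-th query;
    all other documents in the batch serve as negatives."""
    q = F.normalize(query_emb, dim=1)
    d = F.normalize(doc_emb, dim=1)
    logits = q @ d.T / temperature      # (B, B) cosine-similarity matrix
    labels = torch.arange(q.size(0))    # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

loss = in_batch_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8))
```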
Technical Report
The detailed technical report can be found here.
📄 License
Arctic is licensed under Apache 2.0. The released models can be used for commercial purposes free of charge.