OpenSearch Neural Sparse Encoding Model v1 (Open Source): Efficient Search Relevance and Document Retrieval

Opensearch Neural Sparse Encoding V1

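Developed by opensearch-project

OpenSearch neural sparse encoding model v1 encodes queries and documents into 30522-dimensional sparse vectors for efficient, relevant search and retrieval.

Text embedding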

Transformers

Language: English. License: Apache-2.0. #sparse-vector-retrieval #zero-shot-search #efficient-semantic-matching

Downloads: 10.20k

Released: 3/7/2024

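Model introduction

This is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors and performs well on both search relevance and retrieval efficiency. The model is trained on the MS MARCO dataset and supports learned sparse retrieval with a Lucene inverted index.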

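Model features

Efficient sparse encoding

Queries and documents are encoded into 30522-dimensional sparse vectors. The index of each non-zero dimension identifies a token in the vocabulary, and the weight expresses how important that token is.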
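To make this representation concrete, here is a minimal sketch that turns a dense weight vector into a token-to-weight mapping. It assumes a hypothetical 8-token vocabulary in place of the real 30522-entry one, and the tokens and weights are invented for illustration.

```python
# Hypothetical toy vocabulary; the real model uses a 30522-token vocabulary.
vocab = ["[PAD]", "weather", "ny", "york", "rain", "sun", "now", "city"]

# One weight per vocabulary entry; most entries are zero (invented values).
dense = [0.0, 2.5, 2.9, 2.0, 0.0, 0.0, 0.0, 0.4]


def to_sparse(dense_vector, vocabulary):
    """Keep only the non-zero dimensions, keyed by the token each index stands for."""
    return {vocabulary[i]: w for i, w in enumerate(dense_vector) if w != 0.0}


sparse = to_sparse(dense, vocab)
print(sparse)  # {'weather': 2.5, 'ny': 2.9, 'york': 2.0, 'city': 0.4}
```

Only four of the eight dimensions survive, so storage and scoring cost depend on the number of non-zero tokens rather than on the vocabulary size.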

Strong relevance

The model averages an NDCG@10 of 0.524 across a subset of the BEIR benchmark datasets.
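For readers unfamiliar with the metric, the sketch below shows one common way to compute NDCG@10 for a single query. It assumes the linear-gain form of DCG; evaluation toolkits can differ in details, so treat it as an illustration rather than the exact procedure behind the reported numbers. The function names and relevance labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the returned documents, in ranked order (invented values).
perfect = ndcg_at_k([1, 1, 0], [1, 1, 0])
swapped = ndcg_at_k([0, 1], [0, 1])  # the only relevant document is ranked second
print(perfect, swapped)
```

A benchmark score is then the mean of this value over all queries in a dataset.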

OpenSearch integration

The model is designed for OpenSearch clusters and supports efficient retrieval with a Lucene inverted index.
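The reason sparse vectors fit an inverted index can be shown with a small pure-Python sketch. This is a simplified illustration of the idea, not how Lucene or OpenSearch is implemented; the document IDs, tokens, and weights are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, token_weights in docs.items():
        for token, weight in token_weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_weights, top_k=3):
    """Score documents by summing query_weight * doc_weight over shared tokens."""
    scores = defaultdict(float)
    for token, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical sparse encodings of two documents and one query.
docs = {
    "d1": {"york": 2.0, "rain": 1.5, "weather": 1.0},
    "d2": {"paris": 2.0, "museum": 1.0},
}
query = {"weather": 2.0, "york": 1.0, "ny": 3.0}

index = build_inverted_index(docs)
results = search(index, query)
print(results)  # [('d1', 4.0)]
```

Only the postings lists of the query's tokens are visited, so documents that share no token with the query are never scored.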

Zero-shot performance

The model also performs well on datasets it has not seen and can be used without fine-tuning.

Model capabilities

Sparse text encoding

Information retrieval

Query-document matching

Zero-shot transfer learning

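Use cases

Search engines

Document retrieval

Efficiently retrieve relevant documents from large document collections

Average NDCG@10 of 0.524 on the BEIR benchmark

Question answering systems

Match user questions against candidate answers

NDCG@10 of 0.553 on the NQ dataset

Domain-specific search

Scientific literature retrieval

Retrieve relevant papers from scientific literature databases

NDCG@10 of 0.723 on the SciFact dataset

Medical information retrieval

Retrieve medical documents and information

NDCG@10 of 0.771 on the TrecCovid dataset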

🚀 opensearch-neural-sparse-encoding-v1

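This project is a learned sparse retrieval model. It encodes queries and documents into 30522-dimensional sparse vectors and performs well on both search relevance and retrieval efficiency.

🚀 Quick Start

The model is intended to run inside an OpenSearch cluster, but it can also be used outside the cluster through the HuggingFace models API. A usage example follows: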

import itertools
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


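# convert the masked-LM output (batch_size * seq_len * vocab_size)
# into one sparse vector per input (batch_size * vocab_size)
def get_sparse_vector(feature, output):
    # max-pool over the sequence; the attention mask zeroes out padding positions
    values, _ = torch.max(output*feature["attention_mask"].unsqueeze(-1), dim=1)
    # log(1 + relu(x)) keeps the weights non-negative and dampens large values
    values = torch.log(1 + torch.relu(values))
    # zero out special tokens; the id list is attached to this function below
    values[:, get_sparse_vector.special_token_ids] = 0
    return values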
    
# transform the sparse vector to a dict of (token, weight)
def transform_sparse_vector_to_dict(sparse_vector):
    sample_indices,token_indices=torch.nonzero(sparse_vector,as_tuple=True)
    non_zero_values = sparse_vector[(sample_indices,token_indices)].tolist()
    number_of_tokens_for_each_sample = torch.bincount(sample_indices).cpu().tolist()
    tokens = [transform_sparse_vector_to_dict.id_to_token[_id] for _id in token_indices.tolist()]

    output = []
    end_idxs = list(itertools.accumulate([0]+number_of_tokens_for_each_sample))
    for i in range(len(end_idxs)-1):
        token_strings = tokens[end_idxs[i]:end_idxs[i+1]]
        weights = non_zero_values[end_idxs[i]:end_idxs[i+1]]
        output.append(dict(zip(token_strings, weights)))
    return output
    

# load the model
model = AutoModelForMaskedLM.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")
tokenizer = AutoTokenizer.from_pretrained("opensearch-project/opensearch-neural-sparse-encoding-v1")

# set the special tokens and id_to_token transform for post-process
special_token_ids = [tokenizer.vocab[token] for token in tokenizer.special_tokens_map.values()]
get_sparse_vector.special_token_ids = special_token_ids
id_to_token = ["" for i in range(tokenizer.vocab_size)]
for token, _id in tokenizer.vocab.items():
    id_to_token[_id] = token
transform_sparse_vector_to_dict.id_to_token = id_to_token



query = "What's the weather in ny now?"
document = "Currently New York is rainy."

# encode the query & document
feature = tokenizer([query, document], padding=True, truncation=True, return_tensors='pt', return_token_type_ids=False)
output = model(**feature)[0]
sparse_vector = get_sparse_vector(feature, output)

# get similarity score
sim_score = torch.matmul(sparse_vector[0],sparse_vector[1])
print(sim_score)   # tensor(22.3299, grad_fn=<DotBackward0>)


query_token_weight, document_query_token_weight = transform_sparse_vector_to_dict(sparse_vector)
for token in sorted(query_token_weight, key=lambda x:query_token_weight[x], reverse=True):
    if token in document_query_token_weight:
        print("score in query: %.4f, score in document: %.4f, token: %s"%(query_token_weight[token],document_query_token_weight[token],token))
        

        
# result:
# score in query: 2.9262, score in document: 2.1335, token: ny
# score in query: 2.5206, score in document: 1.5277, token: weather
# score in query: 2.0373, score in document: 2.3489, token: york
# score in query: 1.5786, score in document: 0.8752, token: cool
# score in query: 1.4636, score in document: 1.5132, token: current
# score in query: 0.7761, score in document: 0.8860, token: season
# score in query: 0.7560, score in document: 0.6726, token: 2020
# score in query: 0.7222, score in document: 0.6292, token: summer
# score in query: 0.6888, score in document: 0.6419, token: nina
# score in query: 0.6451, score in document: 0.8200, token: storm
# score in query: 0.4698, score in document: 0.7635, token: brooklyn
# score in query: 0.4562, score in document: 0.1208, token: julian
# score in query: 0.3484, score in document: 0.3903, token: wow
# score in query: 0.3439, score in document: 0.4160, token: usa
# score in query: 0.2751, score in document: 0.8260, token: manhattan
# score in query: 0.2013, score in document: 0.7735, token: fog
# score in query: 0.1989, score in document: 0.2961, token: mood
# score in query: 0.1653, score in document: 0.3437, token: climate
# score in query: 0.1191, score in document: 0.1533, token: nature
# score in query: 0.0665, score in document: 0.0600, token: temperature
# score in query: 0.0552, score in document: 0.3396, token: windy

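The code above is an example of neural sparse search. Although the original query and document share no overlapping tokens, the model still matches them well.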
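The similarity score is a dot product, so it can be broken down into per-token contributions. The sketch below recomputes it from the rounded weights printed in the result listing above. Because those values are rounded to four decimals, the total should land close to, but not exactly on, the printed score of 22.3299.

```python
# (token, weight in query, weight in document), copied from the result listing above.
shared_tokens = [
    ("ny", 2.9262, 2.1335), ("weather", 2.5206, 1.5277), ("york", 2.0373, 2.3489),
    ("cool", 1.5786, 0.8752), ("current", 1.4636, 1.5132), ("season", 0.7761, 0.8860),
    ("2020", 0.7560, 0.6726), ("summer", 0.7222, 0.6292), ("nina", 0.6888, 0.6419),
    ("storm", 0.6451, 0.8200), ("brooklyn", 0.4698, 0.7635), ("julian", 0.4562, 0.1208),
    ("wow", 0.3484, 0.3903), ("usa", 0.3439, 0.4160), ("manhattan", 0.2751, 0.8260),
    ("fog", 0.2013, 0.7735), ("mood", 0.1989, 0.2961), ("climate", 0.1653, 0.3437),
    ("nature", 0.1191, 0.1533), ("temperature", 0.0665, 0.0600), ("windy", 0.0552, 0.3396),
]

# Each shared token contributes query_weight * document_weight to the score.
contributions = sorted(
    ((token, q * d) for token, q, d in shared_tokens),
    key=lambda item: item[1],
    reverse=True,
)
total = sum(value for _, value in contributions)

for token, value in contributions[:3]:
    print(f"{token}: {value:.4f}")
print(f"total: {total:.4f}")
```

The largest contributions come from expansion tokens such as "ny" and "york", which is how the model bridges a query and a document that have no literal tokens in common.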

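✨ Key Features

Multi-dataset evaluation: The model's zero-shot performance was benchmarked on a subset of the BEIR benchmark, including TrecCovid, NFCorpus, NQ, and other datasets.
Performance: Overall, the v2 series of models outperforms the v1 series in search relevance, efficiency, and inference speed, although the specific strengths and weaknesses may vary across datasets.
Sparse vector encoding: Queries and documents are encoded into 30522-dimensional sparse vectors. The index of each non-zero dimension identifies a token in the vocabulary, and the weight expresses how important that token is.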

📚 Documentation

Select the model

Model selection should weigh search relevance, model inference cost, and retrieval efficiency (FLOPS). The models compare as follows:

Model	Inference-free for retrieval	Model parameters	AVG NDCG@10	AVG FLOPS
opensearch-neural-sparse-encoding-v1		133M	0.524	11.4
opensearch-neural-sparse-encoding-v2-distill		67M	0.528	8.3
opensearch-neural-sparse-encoding-doc-v1	✔️	133M	0.490	2.3
opensearch-neural-sparse-encoding-doc-v2-distill	✔️	67M	0.504	1.8
opensearch-neural-sparse-encoding-doc-v2-mini	✔️	23M	0.497	1.7
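One way to apply the table is to filter by hard constraints first and then maximize relevance. The sketch below does that with the table's own numbers; the helper `pick_model` and its parameters are hypothetical, and a real decision would also consider latency, hardware, and the target domain.

```python
# Rows copied from the table above: (name, inference_free, avg_ndcg_at_10, avg_flops)
MODELS = [
    ("opensearch-neural-sparse-encoding-v1", False, 0.524, 11.4),
    ("opensearch-neural-sparse-encoding-v2-distill", False, 0.528, 8.3),
    ("opensearch-neural-sparse-encoding-doc-v1", True, 0.490, 2.3),
    ("opensearch-neural-sparse-encoding-doc-v2-distill", True, 0.504, 1.8),
    ("opensearch-neural-sparse-encoding-doc-v2-mini", True, 0.497, 1.7),
]


def pick_model(models, require_inference_free=False, max_flops=None):
    """Return the name with the highest average NDCG@10 that meets the constraints."""
    candidates = [
        m for m in models
        if (m[1] or not require_inference_free)
        and (max_flops is None or m[3] <= max_flops)
    ]
    return max(candidates, key=lambda m: m[2])[0] if candidates else None


print(pick_model(MODELS))
print(pick_model(MODELS, require_inference_free=True))
print(pick_model(MODELS, require_inference_free=True, max_flops=1.7))
```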

Detailed search relevance

Model	Average	Trec Covid	NFCorpus	NQ	HotpotQA	FiQA	ArguAna	Touche	DBPedia	SCIDOCS	FEVER	Climate FEVER	SciFact	Quora
opensearch-neural-sparse-encoding-v1	0.524	0.771	0.360	0.553	0.697	0.376	0.508	0.278	0.447	0.164	0.821	0.263	0.723	0.856
opensearch-neural-sparse-encoding-v2-distill	0.528	0.775	0.347	0.561	0.685	0.374	0.551	0.278	0.435	0.173	0.849	0.249	0.722	0.863
opensearch-neural-sparse-encoding-doc-v1	0.490	0.707	0.352	0.521	0.677	0.344	0.461	0.294	0.412	0.154	0.743	0.202	0.716	0.788
opensearch-neural-sparse-encoding-doc-v2-distill	0.504	0.690	0.343	0.528	0.675	0.357	0.496	0.287	0.418	0.166	0.818	0.224	0.715	0.841
opensearch-neural-sparse-encoding-doc-v2-mini	0.497	0.709	0.336	0.510	0.666	0.338	0.480	0.285	0.407	0.164	0.812	0.216	0.699	0.837
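As a quick consistency check, the Average column can be recomputed from the per-dataset scores. The sketch below does this for the v1 and v2-distill rows and also lists the datasets where v1 scores higher, which illustrates that the ranking between models varies by dataset.

```python
DATASETS = ["Trec Covid", "NFCorpus", "NQ", "HotpotQA", "FiQA", "ArguAna", "Touche",
            "DBPedia", "SCIDOCS", "FEVER", "Climate FEVER", "SciFact", "Quora"]

# Per-dataset NDCG@10 values copied from the table above.
v1 = [0.771, 0.360, 0.553, 0.697, 0.376, 0.508, 0.278,
      0.447, 0.164, 0.821, 0.263, 0.723, 0.856]
v2_distill = [0.775, 0.347, 0.561, 0.685, 0.374, 0.551, 0.278,
              0.435, 0.173, 0.849, 0.249, 0.722, 0.863]

v1_avg = round(sum(v1) / len(v1), 3)
v2_avg = round(sum(v2_distill) / len(v2_distill), 3)
print(v1_avg, v2_avg)  # 0.524 0.528

# Datasets where v1 is strictly ahead of v2-distill.
v1_wins = [name for name, a, b in zip(DATASETS, v1, v2_distill) if a > b]
print(v1_wins)
```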

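Model overview

Paper: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Fine-tuning sample: opensearch-sparse-model-tuning-sample

This is a learned sparse retrieval model trained on the MS MARCO dataset. The OpenSearch neural sparse feature supports learned sparse retrieval with a Lucene inverted index; see https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/ . Indexing and search can be performed through the OpenSearch high-level API.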
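For orientation, the sketch below builds the JSON body of a `neural_sparse` query as a plain Python dictionary, without contacting a cluster. The field name `passage_embedding` and the `<model_id>` placeholder are assumptions for illustration; the actual field depends on your index mapping, and the model ID is assigned when the model is deployed in your cluster. Check the linked documentation for the parameters your OpenSearch version supports.

```python
import json


def build_neural_sparse_query(field, query_text, model_id, size=10):
    """Build a search request body that uses the neural_sparse query clause."""
    return {
        "size": size,
        "query": {
            "neural_sparse": {
                field: {
                    "query_text": query_text,
                    "model_id": model_id,
                }
            }
        },
    }


# Hypothetical field name and placeholder model ID.
body = build_neural_sparse_query(
    field="passage_embedding",
    query_text="What's the weather in ny now?",
    model_id="<model_id>",
)
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the search endpoint of an index whose `passage_embedding` field holds the sparse vectors produced at ingestion time.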