🚀 Bhasha embed v0
Bhasha embed v0 is an embedding model for Hindi (Devanagari script), English, and romanised Hindi text. Many multilingual embedding models handle Hindi and English text well, but lack the following capabilities:
- Support for romanised Hindi: this is the first embedding model to support romanised Hindi (transliterated Hindi / hin_Latn).
- Cross-lingual alignment: the model outputs language-agnostic embeddings, which enables querying over a multilingual candidate pool containing Hindi, English, and romanised Hindi texts.
✨ Key features
- Supported languages: Hindi, English, romanised Hindi.
- Base model: google/muril-base-cased
- Training GPU: 1x RTX 4090
- Training method: distillation from an English embedding model, followed by fine-tuning on triplet data.
- Maximum sequence length: 512 tokens
- Output dimensions: 768
- Similarity function: cosine similarity
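Since the usage examples below L2-normalize the embeddings, the cosine similarity listed here reduces to a plain dot product. A minimal NumPy sketch with synthetic 768-dimensional vectors (stand-ins for model outputs, not real embeddings) showing the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 768-dimensional vectors standing in for model outputs
a, b = rng.normal(size=(2, 768))

# Cosine similarity as a dot product of L2-normalized vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
cosine = float(a_n @ b_n)

# Equivalent to the textbook cosine-similarity formula
reference = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
assert abs(cosine - reference) < 1e-9
```

This is why the code below can compute the full query-document similarity matrix with a single matrix product.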
Model sources
📚 Documentation
Results
English-Hindi cross-lingual alignment results
Relevant for tasks over a corpus containing both Hindi and English texts.

Results on romanised-Hindi tasks
Relevant for tasks involving romanised Hindi text.

Results on retrieval over a multilingual corpus
Relevant for retrieval tasks over a corpus containing Hindi, English, and romanised Hindi texts.

Results on Hindi tasks
Relevant for tasks involving Hindi (Devanagari script) text.

Additional information
Example outputs
Example 1

Example 2

Example 3

Example 4

💻 Usage examples
Basic usage
The examples below show how to encode queries and passages and compute similarity scores with Sentence Transformers and with 🤗 Transformers.
Using Sentence Transformers
First install the Sentence Transformers library (`pip install -U sentence-transformers`), then run the following code:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

# With normalized embeddings, the dot product equals cosine similarity
query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
similarity_matrix = query_embeddings @ document_embeddings.T
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
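Each row of the similarity matrix above scores one query against all six documents, so ranking documents per query is a single argsort away. A minimal sketch on a synthetic matrix of the same shape (hypothetical scores, not real model outputs):

```python
import numpy as np

# Synthetic stand-in for the 3x6 query-document similarity matrix above
similarity_matrix = np.array([
    [0.99, 0.91, 0.94, 0.62, 0.57, 0.60],
    [0.91, 0.99, 0.92, 0.58, 0.63, 0.59],
    [0.94, 0.92, 0.99, 0.60, 0.58, 0.64],
])

# For each query, document indices sorted by descending similarity
ranking = np.argsort(-similarity_matrix, axis=1)
print(ranking[:, 0])  # top-1 document index per query: [0 1 2]
```

With the real matrix, the same argsort gives a cross-lingual retrieval ranking, since queries and documents in any of the three languages share one embedding space.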
Using 🤗 Transformers
```python
import numpy as np
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padded positions, then average over real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

model_id = "AkshitaS/bhasha-embed-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

input_texts = queries + documents
batch_dict = tokenizer(input_texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = (embeddings[:len(queries)] @ embeddings[len(queries):].T).detach().numpy()
print(similarity_matrix.shape)  # (3, 6)
print(np.round(similarity_matrix, 2))
```
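The average_pool helper above masks out padded positions before averaging, so padding tokens never contribute to the embedding. A small self-contained check with toy tensors (not real model outputs) illustrating this:

```python
import torch
from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padded positions, then average over real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# One sequence of length 4 whose last position is padding with an extreme value
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 1, 0]])

pooled = average_pool(hidden, mask)
print(pooled)  # mean over the three real tokens only: [[3.0, 4.0]]
```

Without the mask, the padded position would pull the mean far off; with it, only real tokens are averaged.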
Citation
To cite this model, use the following entry:

```bibtex
@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
```
📄 License
This model is released under the Apache-2.0 license.