language:
- pt
thumbnail: Portuguese BERT for the Legal Domain
tags:
- sentence-transformers
- transformers
- bert
- pytorch
- sentence-similarity
license: mit
pipeline_tag: sentence-similarity
datasets:
- stjiris/portuguese-legal-sentences-v0
- assin
- assin2
- stsb_multi_mt
widget:
- source_sentence: "The lawyer presented the evidence to the judge."
  sentences:
  - "The judge read the evidence."
  - "The judge read the appeal."
  - "The judge threw a stone."
model-index:
- name: BERTimbau
  results:
  - task:
      name: STS
      type: STS
    metrics:
    - name: Pearson Correlation - assin Dataset
      type: Pearson Correlation
      value: 0.80743090316288
    - name: Pearson Correlation - assin2 Dataset
      type: Pearson Correlation
      value: 0.8404118493167052
    - name: Pearson Correlation - stsb_multi_mt pt Dataset
      type: Pearson Correlation
      value: 0.7829399973091388


Work developed as part of Project IRIS.
Paper: A Semantic Search System for the Supremo Tribunal de Justiça
stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0 (Legal BERTimbau)
This is a sentence-transformers model: it maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.
stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v0 derives from stjiris/bert-large-portuguese-cased-legal-tsdae (the legal variant of BERTimbau large).
It was trained with the TSDAE technique at a learning rate of 1e-5 for 212k training steps over roughly 30,000 legal sentences (the configuration that performed best in our semantic search system implementation).
It was then subjected to Generative Pseudo Labeling (GPL) training.
The model was trained on NLI data with a batch size of 16 and a learning rate of 2e-5.
It was trained for Semantic Textual Similarity, being fine-tuned on the assin, assin2, and stsb_multi_mt pt datasets with a learning rate of 1e-5.
Finally, the model was subjected to Metadata Knowledge Distillation, a technique that attempts to improve information retrieval with dense vectors. Repository
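For illustration only, below is a minimal sketch of how the TSDAE stage described above could be set up with sentence-transformers. The input file name, batch size, number of epochs, output path, and the choice of neuralmind/bert-large-portuguese-cased as the starting checkpoint are assumptions for the sketch, not the exact training script; only the 1e-5 learning rate comes from this card.

from sentence_transformers import SentenceTransformer, models, datasets, losses
from torch.utils.data import DataLoader

# Hypothetical file with one legal sentence per line (assumption, not the original corpus file).
with open("legal_sentences.txt", encoding="utf-8") as f:
    train_sentences = [line.strip() for line in f if line.strip()]

# Build a SentenceTransformer on top of a BERTimbau-large checkpoint with mean pooling.
word_embedding_model = models.Transformer("neuralmind/bert-large-portuguese-cased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TSDAE: the dataset injects noise (token deletion) and the loss reconstructs the original sentence.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="neuralmind/bert-large-portuguese-cased",
    tie_encoder_decoder=True,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 1e-5},  # learning rate reported in this card
    show_progress_bar=True,
)
model.save("output/legal-bertimbau-tsdae")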
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example", "This is another example"]
model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')
embeddings = model.encode(sentences)
print(embeddings)
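As a follow-up, here is a small, illustrative sketch of the semantic-search use case mentioned above, ranking a corpus against a query with sentence_transformers.util.cos_sim. The query and corpus strings are examples only, not part of the original card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')

# Illustrative corpus and query (assumptions for the sketch).
corpus = ["The judge read the evidence.", "The judge read the appeal.", "The judge threw a stone."]
query = "The lawyer presented the evidence to the judge."

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}\t{sentence}")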
Usage (HuggingFace Transformers)
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, taking the attention mask into account.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
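Continuing from the snippet above, the pooled embeddings can be L2-normalized so that a plain dot product yields cosine similarities; a minimal sketch, assuming sentence_embeddings from the previous block:

import torch.nn.functional as F

# L2-normalize so the dot product between rows equals cosine similarity.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)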
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Citing & Authors
Contributors
@rufimelo99
If you use this work, please cite:
@InProceedings{MeloSemantic,
author="Melo, Rui
and Santos, Pedro A.
and Dias, João",
editor="Moniz, Nuno
and Vale, Zita
and Cascalho, José
and Silva, Catarina
and Sebastião, Raquel",
title="A Semantic Search System for the Supremo Tribunal de Justiça",
booktitle="Progress in Artificial Intelligence",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="142--154",
abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
isbn="978-3-031-49011-8"
}
@inproceedings{souza2020bertimbau,
author = {Fábio Souza and
Rodrigo Nogueira and
Roberto Lotufo},
title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
year = {2020}
}
@inproceedings{fonseca2016assin,
title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
pages={13--15},
year={2016}
}
@inproceedings{real2020assin,
title={The assin 2 shared task: a quick overview},
author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
booktitle={International Conference on Computational Processing of the Portuguese Language},
pages={406--412},
year={2020},
organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}