language:
- pt
thumbnail: Portuguese BERT for the Legal Domain
tags:
- sentence-transformers
- transformers
- bert
- pytorch
- sentence-similarity
license: mit
pipeline_tag: sentence-similarity
datasets:
- stjiris/portuguese-legal-sentences-v0
- assin
- assin2
- stsb_multi_mt
widget:
- source_sentence: "The lawyer presented the evidence to the judge."
  sentences:
  - "The judge read the evidence."
  - "The judge read the appeal."
  - "The judge threw a stone."
model-index:
- name: BERTimbau
  results:
  - task:
      name: STS
      type: STS
    metrics:
    - name: Pearson Correlation - assin Dataset
      type: Pearson Correlation
      value: 0.80743090316288
    - name: Pearson Correlation - assin2 Dataset
      type: Pearson Correlation
      value: 0.8404118493167052
    - name: Pearson Correlation - stsb_multi_mt pt Dataset
      type: Pearson Correlation
      value: 0.7829399973091388


Work developed as part of Project IRIS.
Paper: A Semantic Search System for the Supremo Tribunal de Justiça
stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0 (Legal BERTimbau)
This is a sentence-transformers model: it maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.
stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-v0 derives from stjiris/bert-large-portuguese-cased-legal-tsdae (the legal variant of BERTimbau large).
It was trained with the TSDAE technique at a learning rate of 1e-5 for 212k training steps over roughly 30,000 legal sentences (the configuration that performed best in our semantic search system implementation).
It was then subjected to Generative Pseudo Labeling (GPL) training.
The model was trained on NLI data with a batch size of 16 and a learning rate of 2e-5.
It was trained for Semantic Textual Similarity, being fine-tuned on the assin, assin2, and stsb_multi_mt pt datasets with a learning rate of 1e-5.
Finally, the model was subjected to Metadata Knowledge Distillation, a technique that attempts to improve information retrieval with dense vectors. Repository
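For illustration only, below is a minimal sketch of how the TSDAE stage described above could be set up with sentence-transformers. The input file name, batch size, number of epochs, output path, and the choice of neuralmind/bert-large-portuguese-cased as the starting checkpoint are assumptions for the sketch, not the exact training script; only the 1e-5 learning rate comes from this card.

from sentence_transformers import SentenceTransformer, models, datasets, losses
from torch.utils.data import DataLoader

# Hypothetical file with one legal sentence per line (assumption, not the original corpus file).
with open("legal_sentences.txt", encoding="utf-8") as f:
    train_sentences = [line.strip() for line in f if line.strip()]

# Build a SentenceTransformer on top of a BERTimbau-large checkpoint with mean pooling.
word_embedding_model = models.Transformer("neuralmind/bert-large-portuguese-cased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TSDAE: the dataset injects noise (token deletion) and the loss reconstructs the original sentence.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="neuralmind/bert-large-portuguese-cased",
    tie_encoder_decoder=True,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 1e-5},  # learning rate reported in this card
    show_progress_bar=True,
)
model.save("output/legal-bertimbau-tsdae")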
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example", "This is another example"]
model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')
embeddings = model.encode(sentences)
print(embeddings)
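As a follow-up, here is a small, illustrative sketch of the semantic-search use case mentioned above, ranking a corpus against a query with sentence_transformers.util.cos_sim. The query and corpus strings are examples only, not part of the original card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')

# Illustrative corpus and query (assumptions for the sketch).
corpus = ["The judge read the evidence.", "The judge read the appeal.", "The judge threw a stone."]
query = "The lawyer presented the evidence to the judge."

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}\t{sentence}")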
Usage (HuggingFace Transformers)
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, taking the attention mask into account.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
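Continuing from the snippet above, the pooled embeddings can be L2-normalized so that a plain dot product yields cosine similarities; a minimal sketch, assuming sentence_embeddings from the previous block:

import torch.nn.functional as F

# L2-normalize so the dot product between rows equals cosine similarity.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)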
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
Citing & Authors
Contributors
@rufimelo99
If you use this work, please cite:
@InProceedings{MeloSemantic,
author="Melo, Rui
and Santos, Pedro A.
and Dias, João",
editor="Moniz, Nuno
and Vale, Zita
and Cascalho, José
and Silva, Catarina
and Sebastião, Raquel",
title="A Semantic Search System for the Supremo Tribunal de Justiça",
booktitle="Progress in Artificial Intelligence",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="142--154",
abstract="Many information retrieval systems use lexical approaches to retrieve information. Such approaches have multiple limitations, and these constraints are exacerbated when tied to specific domains, such as the legal one. Large language models, such as BERT, deeply understand a language and may overcome the limitations of older methodologies, such as BM25. This work investigated and developed a prototype of a Semantic Search System to assist the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice) in its decision-making process. We built a Semantic Search System that uses specially trained BERT models (Legal-BERTimbau variants) and a Hybrid Search System that incorporates both lexical and semantic techniques by combining the capabilities of BM25 and the potential of Legal-BERTimbau. In this context, we obtained a 335{\%} increase on the discovery metric when compared to BM25 for the first query result. This work also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence and introduces a new technique of Metadata Knowledge Distillation.",
isbn="978-3-031-49011-8"
}
@inproceedings{souza2020bertimbau,
author = {Fábio Souza and
Rodrigo Nogueira and
Roberto Lotufo},
title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
year = {2020}
}
@inproceedings{fonseca2016assin,
title={ASSIN: Avaliacao de similaridade semantica e inferencia textual},
author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal},
pages={13--15},
year={2016}
}
@inproceedings{real2020assin,
title={The assin 2 shared task: a quick overview},
author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
booktitle={International Conference on Computational Processing of the Portuguese Language},
pages={406--412},
year={2020},
organization={Springer}
}
@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}