---
language:
- pt
license: other
tags:
- bert
- pytorch
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- fill-mask
- NSP
- Next Sentence Prediction
datasets:
- brwac
library_name: transformers
pipeline_tag: fill-mask
---
# BERTugues Base (aka "BERTugues-base-portuguese-cased")

## Introduction

The BERTugues model was pretrained following the recipe of the original BERT paper: 1 million training steps over more than 20 GB of text, on the two core objectives of masked language modeling (MLM) and next sentence prediction (NSP). Training details are given in the published paper. As with Bertimbau, the tokenizer and model were trained on the BrWAC corpus and the Portuguese Wikipedia, with three main improvements to the training pipeline:
- Removal of characters rare in Portuguese: more than 7,000 of Bertimbau's 29,794 tokens contain East Asian and other special characters (e.g. "##漫", "##켝", "##前"); BERTugues filters such characters out before training its tokenizer (see the sketch after this list);
- 😀 Addition of frequently used emojis: Wikipedia text contains few emojis, yet the literature shows they matter for a range of tasks, so we added them to the tokenizer;
- Quality filtering of the BrWAC corpus: we applied the heuristics proposed in Google's Gopher paper to remove low-quality text.
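As an illustration of the first improvement, here is a minimal sketch of this kind of pre-tokenizer character filtering. The Unicode ranges below are an assumption for illustration; the exact rules used for BERTugues are described in the paper (emojis, which were added to the tokenizer, would be whitelisted separately):

```python
import re

# Hypothetical whitelist: keep only characters from the Latin blocks
# commonly used in Portuguese; everything else (East Asian scripts,
# Hangul, etc.) is dropped before the tokenizer is trained.
NOT_ALLOWED = re.compile(r"[^\u0000-\u024F\u1E00-\u1EFF]")

def filter_rare_chars(line: str) -> str:
    """Remove characters that are rare in Portuguese text."""
    return NOT_ALLOWED.sub("", line)

print(filter_rare_chars("João assistiu ao filme 漫画 ontem"))
# -> 'João assistiu ao filme  ontem'
```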
## Tokenizer

By replacing tokens that are infrequent in Portuguese, we substantially reduced how often text is split into multiple tokens. On the assin2 dataset (the test set used in the Bertimbau master's thesis), the average number of splits per text fell from 3.8 to 3.0, whereas the multilingual BERT model sits at 7.4.
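A fragmentation comparison along these lines can be reproduced with the public checkpoints (a sketch: the two sample sentences are arbitrary, and the script reports mean tokens per text rather than the exact metric used in the paper):

```python
from transformers import AutoTokenizer

texts = [
    "Eduardo abriu os olhos, mas não quis se levantar.",
    "O tribunal julgou o recurso na semana passada.",
]

# BERTugues vs. Bertimbau vs. multilingual BERT tokenizers.
for name in ["ricardoz/BERTugues-base-portuguese-cased",
             "neuralmind/bert-base-portuguese-cased",
             "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    avg = sum(len(tok.tokenize(t)) for t in texts) / len(texts)
    print(f"{name}: {avg:.1f} tokens per text")
```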

## Performance

We evaluated text classification on a Portuguese-translated version of the IMDB movie review dataset, feeding sentence representations produced by BERTugues into a random forest classifier. Following the evaluation protocol of the JurisBERT paper, we also added BERTugues to its legal-text comparison code, where the task is to decide whether two texts concern the same legal subject. A sketch of the classification setup appears after the results table.
| Model             | IMDB (F1) | STJ (F1) | PJERJ (F1) | TJMS (F1) | Average F1 |
|-------------------|-----------|----------|------------|-----------|------------|
| Multilingual BERT | 72.0%     | 30.4%    | 63.8%      | 65.0%     | 57.8%      |
| Bertimbau-Base    | 82.2%     | 35.6%    | 63.9%      | 71.2%     | 63.2%      |
| Bertimbau-Large   | 85.3%     | 43.0%    | 63.8%      | 74.0%     | 66.5%      |
| BERTugues-Base    | 84.0%     | 45.2%    | 67.5%      | 70.0%     | 66.7%      |
BERTugues outperformed Bertimbau-Base on 3 of the 4 tasks, and outperformed Bertimbau-Large, a model roughly three times its size, on 2 of the 4 tasks.
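A minimal sketch of the IMDB classification setup described above, assuming the reviews and labels are already loaded (the toy data and the random forest hyperparameters here are illustrative, not the ones used in the evaluation):

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)
model = BertModel.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
model.eval()

def embed(texts):
    """Return the [CLS] embedding of each text as a numpy array."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state[:, 0].numpy()

# Toy data; replace with the translated IMDB reviews and labels.
train_texts = ["Filme excelente, recomendo!", "Péssimo filme, perdi meu tempo."]
train_labels = [1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["Gostei muito do filme."])))
```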
## Usage examples

More usage examples are available in the GitHub repository; below are two typical ones.

Masked language modeling:
```python
from transformers import BertTokenizer, BertForMaskedLM, pipeline

model = BertForMaskedLM.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)

# Return the 3 most likely completions for the [MASK] token.
pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer, top_k=3)
pipe('[CLS] Eduardo abriu os [MASK], mas não quis se levantar. Ficou deitado e viu que horas eram.')
```
Generating sentence embeddings:
```python
from transformers import BertTokenizer, BertModel
import torch

model = BertModel.from_pretrained("ricardoz/BERTugues-base-portuguese-cased")
tokenizer = BertTokenizer.from_pretrained("ricardoz/BERTugues-base-portuguese-cased", do_lower_case=False)

input_ids = tokenizer.encode('[CLS] Eduardo abriu os olhos, mas não quis se levantar. Ficou deitado e viu que horas eram.', return_tensors='pt')

# Use the hidden state of the first ([CLS]) token as the sentence embedding.
with torch.no_grad():
    last_hidden_state = model(input_ids).last_hidden_state[:, 0]
```
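The resulting [CLS] vectors can be compared directly, for instance with cosine similarity. A small sketch continuing from the block above (the second sentence is an arbitrary example):

```python
import torch.nn.functional as F

def embed(text):
    # [CLS] embedding of a single sentence, reusing model/tokenizer above.
    ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        return model(ids).last_hidden_state[:, 0]

a = embed('Eduardo abriu os olhos, mas não quis se levantar.')
b = embed('Ele continuou deitado na cama.')
print(F.cosine_similarity(a, b))  # similarity between the two sentences
```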
## Citation

If you use this model, please cite:
```bibtex
@article{Zago2024bertugues,
  title = {BERTugues: A Novel BERT Transformer Model Pre-trained for Brazilian Portuguese},
  volume = {45},
  url = {https://ojs.uel.br/revistas/uel/index.php/semexatas/article/view/50630},
  DOI = {10.5433/1679-0375.2024.v45.50630},
  journal = {Semina: Ciências Exatas e Tecnológicas},
  author = {Mazza Zago, Ricardo and Agnoletti dos Santos Pedotti, Luciane},
  year = {2024},
  month = {Dec.},
  pages = {e50630}
}
```
## More information

Visit the project's GitHub page for the full documentation!