Language:
- Slovak
Inference: not supported
Tags:
- BERT
- HPLT
- encoder
License: Apache-2.0
Datasets:
- HPLT/hplt_monolingual_v1_2
HPLT Bert for Slovak
This is one of the encoder-only monolingual language models from the first release of the HPLT project. It is a so-called masked language model. In particular, we used a modification of the classic BERT model named LTG-BERT.
A monolingual LTG-BERT model was trained for each major language in the HPLT 1.2 data release (75 models in total).
All HPLT encoder-only models use the same hyperparameters, roughly following the BERT-base setup:
- hidden size: 768
- attention heads: 12
- layers: 12
- vocabulary size: 32768
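These values can also be read back from the model configuration. Below is a minimal sketch, assuming the custom configuration exposes the usual BERT-style attribute names (hidden_size, num_attention_heads, num_hidden_layers, vocab_size):
from transformers import AutoConfig
# The custom configuration ships with the model, hence trust_remote_code=True.
# The attribute names below are assumptions based on standard BERT-style configs.
config = AutoConfig.from_pretrained("HPLT/hplt_bert_base_sk", trust_remote_code=True)
print(config.hidden_size, config.num_attention_heads, config.num_hidden_layers, config.vocab_size)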
Every model uses its own tokenizer trained on language-specific HPLT data. For details such as training corpus sizes and evaluation results, see our language model training report.
Training code
Training statistics of all 75 runs
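As a quick illustration of the language-specific tokenizer mentioned above, the following sketch prints the subword pieces produced for a Slovak sentence (the example sentence is arbitrary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_sk")
# Show how the Slovak-specific vocabulary segments an arbitrary sentence.
print(tokenizer.tokenize("Bratislava je hlavné mesto Slovenska."))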
Example usage
This model currently needs a custom wrapper from modeling_ltgbert.py, so you should load it with trust_remote_code=True.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_sk")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", trust_remote_code=True)
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)
print(tokenizer.decode(output_text[0].tolist()))
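The example above replaces the [MASK] position with the single most likely token. As a small variation, the following sketch (reloading the same model and tokenizer) prints the top five candidate tokens for the masked position instead; the choice of k=5 is only illustrative:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_sk")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", trust_remote_code=True)
mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
inputs = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
logits = model(**inputs).logits
# Locate the [MASK] position and list its five most probable replacements.
mask_pos = (inputs.input_ids[0] == mask_id).nonzero(as_tuple=True)[0].item()
top5_ids = torch.topk(logits[0, mask_pos], k=5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5_ids))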
The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering and AutoModelForMultipleChoice.
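For instance, a classification head can be instantiated through the same remote-code path. A minimal sketch follows; num_labels=2 is only an illustrative choice, and the head weights are freshly initialized rather than pretrained:
from transformers import AutoModelForSequenceClassification
# Loads the pretrained encoder and adds a randomly initialized two-way
# classification head; num_labels is just an example value.
model = AutoModelForSequenceClassification.from_pretrained(
    "HPLT/hplt_bert_base_sk", num_labels=2, trust_remote_code=True
)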
Intermediate checkpoints
We release 10 intermediate checkpoints for each model, in separate branches, at intervals of every 3125 training steps. The naming convention is stepXXX, for example step18750.
You can load a specific model revision with transformers using the revision argument:
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", revision="step21875", trust_remote_code=True)
To see all available revisions:
from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_sk")
print([b.name for b in out.branches])
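If you want to iterate over the intermediate checkpoints in training order, the branch names can be filtered and sorted numerically; a small sketch building on the listing above:
from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_sk")
# Keep only the stepXXX branches and order them by the number of training steps.
step_branches = sorted(
    (b.name for b in out.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)
print(step_branches)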
Cite us
@inproceedings{samuel-etal-2023-trained,
title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
author = "Samuel, David and
Kutuzov, Andrey and
{\O}vrelid, Lilja and
Velldal, Erik",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.146",
doi = "10.18653/v1/2023.findings-eacl.146",
pages = "1954--1974"
}
@inproceedings{de-gibert-etal-2024-new-massive,
title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
author = {de Gibert, Ona and
Nail, Graeme and
Arefyev, Nikolay and
Ba{\~n}{\'o}n, Marta and
van der Linde, Jelmer and
Ji, Shaoxiong and
Zaragoza-Bernabeu, Jaume and
Aulamo, Mikko and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Kutuzov, Andrey and
Pyysalo, Sampo and
Oepen, Stephan and
Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.100",
pages = "1116--1128",
abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}