🚀 Sundanese RoBERTa Base
Sundanese RoBERTa Base is a masked language model based on the RoBERTa architecture. It was trained on four datasets: the unshuffled_deduplicated_su subset of OSCAR, the Sundanese subset of mC4, the Sundanese subset of CC100, and Sundanese Wikipedia. The model provides pre-trained representations tailored to Sundanese and can help improve performance on downstream Sundanese natural language processing tasks.
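For readers who want a feel for the pre-training text, the sketch below loads the OSCAR Sundanese subset with the 🤗 `datasets` library. The loader name follows the public OSCAR configuration; the exact snapshots and preprocessing used for this model are assumptions here, not documented facts.

```python
from datasets import load_dataset

# Sketch: inspect one of the pre-training corpora (OSCAR, Sundanese subset).
# The exact dataset versions/preprocessing used for training are not specified here.
oscar_su = load_dataset("oscar", "unshuffled_deduplicated_su", split="train")
print(oscar_su)                    # number of documents and available columns
print(oscar_su[0]["text"][:200])   # first 200 characters of the first document
```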
✨ Key Features
- Based on the RoBERTa architecture, giving it strong language-understanding capability.
- Trained on several large-scale Sundanese datasets with broad data coverage.
- Trained from scratch, reaching an evaluation loss of 1.952 and an accuracy of 63.98%.
📦 Installation
No dedicated installation steps are provided; follow the standard Hugging Face setup (e.g. `pip install transformers`) to install the required libraries.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

# Load a fill-mask pipeline backed by the pretrained Sundanese RoBERTa model
pretrained_name = "w11wo/sundanese-roberta-base"
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Predict candidates for the masked token
fill_mask("Budi nuju <mask> di sakola.")
```
Advanced Usage
```python
from transformers import RobertaModel, RobertaTokenizerFast

# Load the pretrained encoder and its fast tokenizer
pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

# Encode a Sundanese sentence and obtain its hidden states
prompt = "Budi nuju diajar di sakola."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
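The `output` object exposes `last_hidden_state`, one 768-dimensional vector per token. As a sketch, one common way to collapse these into a single sentence vector is mean pooling over the attention mask; this pooling choice is an illustration, not something prescribed by the model authors:

```python
import torch
from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

encoded = tokenizer("Budi nuju diajar di sakola.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# last_hidden_state has shape (batch, seq_len, hidden_size)
token_embeddings = output.last_hidden_state
# Mean-pool over non-padding tokens (illustrative pooling choice only)
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```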
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model type | sundanese-roberta-base |
| Number of parameters | 124M |
| Architecture | RoBERTa |
| Training/validation data (text) | OSCAR, mC4, CC100, Wikipedia (758 MB) |
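The parameter count above can be checked directly from the published weights; a quick sketch (the exact figure may differ slightly depending on whether the pooler layer is counted):

```python
from transformers import RobertaModel

model = RobertaModel.from_pretrained("w11wo/sundanese-roberta-base")
# Sum all parameter tensors; expect roughly 124M for a RoBERTa-base encoder
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```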
Evaluation Results
The model was trained for 50 epochs; the final results at the end of training are as follows:
| Train loss | Valid loss | Valid accuracy | Total time |
|------------|------------|----------------|------------|
| 1.965 | 1.952 | 0.6398 | 6:24:51 |
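The losses above are cross-entropy values over masked tokens, so a rough perplexity can be derived as exp(loss). The conversion below is a derived figure for illustration, not a number reported by the authors:

```python
import math

valid_loss = 1.952  # validation loss from the table above
# Perplexity of a masked language model is the exponential of its cross-entropy loss
print(f"Approximate validation perplexity: {math.exp(valid_loss):.2f}")  # ~7.04
```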
🔧 Technical Details
- The model was trained with Hugging Face's Flax framework (see the loading sketch after this list).
- All scripts used for training are available in the Files and versions tab of the model repository.
- The training metrics logged via TensorBoard can also be viewed there.
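Because training was done in Flax, the checkpoint can typically be loaded with the Flax model classes as well as the PyTorch ones. A hedged sketch, assuming the repository still hosts the Flax weights:

```python
from transformers import FlaxRobertaForMaskedLM, RobertaTokenizerFast

# Load the checkpoint with the Flax masked-LM head (training used the Flax framework)
model = FlaxRobertaForMaskedLM.from_pretrained("w11wo/sundanese-roberta-base")
tokenizer = RobertaTokenizerFast.from_pretrained("w11wo/sundanese-roberta-base")

inputs = tokenizer("Budi nuju diajar di sakola.", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, seq_len, vocab_size)
```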
📄 License
This project is released under the MIT license.
⚠️ Important Notes
Do consider the biases from the four datasets, which may carry over into this model's results.
📖 Citation
@article{rs-907893,
author = {Wongso, Wilson
and Lucky, Henry
and Suhartono, Derwin},
journal = {Journal of Big Data},
year = {2022},
month = {Feb},
day = {26},
abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
issn = {2693-5015},
doi = {10.21203/rs.3.rs-907893/v1},
url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
}
👨💻 Author
Sundanese RoBERTa Base was trained and evaluated by Wilson Wongso.