robertuito-base-uncased
RoBERTuito
A pre-trained language model for social media text in Spanish
Paper
GitHub repository

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa's guidelines on 500 million tweets. RoBERTuito comes in three flavors: cased, uncased, and uncased+deaccented.
We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for Spanish such as BETO, BERTin and RoBERTa-BNE. The four tasks selected for evaluation were: hate speech detection (using the HatEval dataset from SemEval 2019 Task 5), sentiment and emotion analysis (using the TASS 2020 datasets), and irony detection (using the IrosVa 2019 dataset).
| Model | Hate Speech Detection | Sentiment Analysis | Emotion Analysis | Irony Detection | Score |
| :---- | :-------------------- | :----------------- | :--------------- | :-------------- | :---- |
| robertuito-uncased | 0.801 ± 0.010 | 0.707 ± 0.004 | 0.551 ± 0.011 | 0.736 ± 0.008 | 0.6987 |
| robertuito-deacc | 0.798 ± 0.008 | 0.702 ± 0.004 | 0.543 ± 0.015 | 0.740 ± 0.006 | 0.6958 |
| robertuito-cased | 0.790 ± 0.012 | 0.701 ± 0.012 | 0.519 ± 0.032 | 0.719 ± 0.023 | 0.6822 |
| roberta-bne | 0.766 ± 0.015 | 0.669 ± 0.006 | 0.533 ± 0.011 | 0.723 ± 0.017 | 0.6726 |
| bertin | 0.767 ± 0.005 | 0.665 ± 0.003 | 0.518 ± 0.012 | 0.716 ± 0.008 | 0.6666 |
| beto-cased | 0.768 ± 0.012 | 0.665 ± 0.004 | 0.521 ± 0.012 | 0.706 ± 0.007 | 0.6651 |
| beto-uncased | 0.757 ± 0.012 | 0.649 ± 0.005 | 0.521 ± 0.006 | 0.702 ± 0.008 | 0.6571 |
We release the pre-trained models on the HuggingFace model hub:
Masked LM
When testing the masked LM, be aware that spaces are encoded inside the SentencePiece tokens. So, if you want to test

Este es un día<mask>

do not put a space between `día` and `<mask>`.
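As a quick check, the masked LM can be queried through the standard fill-mask pipeline. This is a minimal sketch, not taken from the original instructions; it assumes the uncased variant is published on the hub as `pysentimiento/robertuito-base-uncased`:

```python
from transformers import pipeline

# Minimal sketch: the hub id "pysentimiento/robertuito-base-uncased" is an assumption.
fill_mask = pipeline("fill-mask", model="pysentimiento/robertuito-base-uncased")

# No space between "día" and <mask>: the space is already encoded in the SentencePiece tokens.
for prediction in fill_mask("Este es un día<mask>"):
    print(prediction["token_str"], prediction["score"])
```

This prompt contains no user handles, hashtags or emojis, so it can be fed in directly; for general tweets, apply the preprocessing step described below first.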
Usage
IMPORTANT -- READ THIS FIRST
RoBERTuito is not yet fully integrated into huggingface/transformers. To use it, first install pysentimiento:
pip install pysentimiento
and preprocess the text with pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer:
```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
# Replace user handles, hashtags and emojis with special tokens before tokenizing
preprocessed_text = preprocess_tweet(text)
tokenizer.tokenize(preprocessed_text)
```
We are working on integrating this preprocessing step into a Tokenizer inside the transformers library.
A text classification example can be found in this notebook:
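In the meantime, the sketch below shows how RoBERTuito could be loaded as a backbone for sequence classification; the uncased hub id and num_labels=2 are illustrative assumptions, and the classification head is randomly initialized until fine-tuned:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pysentimiento.preprocessing import preprocess_tweet

# Illustrative sketch: the uncased hub id and num_labels=2 are assumptions.
model_name = "pysentimiento/robertuito-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Preprocess the tweet before tokenizing, as described above
text = preprocess_tweet("Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # meaningful only after fine-tuning on a labeled dataset
print(logits)
```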
Citation
If you use RoBERTuito, please cite our paper:
@inproceedings{perez-etal-2022-robertuito,
title = "{R}o{BERT}uito: a pre-trained language model for social media text in {S}panish",
author = "P{\'e}rez, Juan Manuel and
Furman, Dami{\'a}n Ariel and
Alonso Alemany, Laura and
Luque, Franco M.",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.785",
pages = "7235--7243",
abstract = "Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.",
}