robertuito-base-uncased
RoBERTuito
A pre-trained language model for social media text in Spanish
Paper
GitHub repository

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa's guidelines on 500 million tweets. RoBERTuito comes in three flavors: cased, uncased, and uncased+deaccented.
We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for Spanish such as BETO, BERTin and RoBERTa-BNE. The four tasks selected for evaluation were: hate speech detection (using the HatEval dataset from SemEval 2019 Task 5), sentiment and emotion analysis (using the TASS 2020 datasets), and irony detection (using the IrosVa 2019 dataset).
| Model | Hate Speech Detection | Sentiment Analysis | Emotion Analysis | Irony Detection | Score |
| :---- | :-------------------- | :----------------- | :--------------- | :-------------- | :---- |
| robertuito-uncased | 0.801 ± 0.010 | 0.707 ± 0.004 | 0.551 ± 0.011 | 0.736 ± 0.008 | 0.6987 |
| robertuito-deacc | 0.798 ± 0.008 | 0.702 ± 0.004 | 0.543 ± 0.015 | 0.740 ± 0.006 | 0.6958 |
| robertuito-cased | 0.790 ± 0.012 | 0.701 ± 0.012 | 0.519 ± 0.032 | 0.719 ± 0.023 | 0.6822 |
| roberta-bne | 0.766 ± 0.015 | 0.669 ± 0.006 | 0.533 ± 0.011 | 0.723 ± 0.017 | 0.6726 |
| bertin | 0.767 ± 0.005 | 0.665 ± 0.003 | 0.518 ± 0.012 | 0.716 ± 0.008 | 0.6666 |
| beto-cased | 0.768 ± 0.012 | 0.665 ± 0.004 | 0.521 ± 0.012 | 0.706 ± 0.007 | 0.6651 |
| beto-uncased | 0.757 ± 0.012 | 0.649 ± 0.005 | 0.521 ± 0.006 | 0.702 ± 0.008 | 0.6571 |
We release the pre-trained models on the HuggingFace model hub:
Masked LM
When testing the masked LM, be aware that spaces are encoded inside the SentencePiece tokens. So, if you want to test

Este es un día<mask>

do not put a space between `día` and `<mask>`.
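As a quick check, the masked LM can be queried through the standard fill-mask pipeline. This is a minimal sketch, not taken from the original instructions; it assumes the uncased variant is published on the hub as `pysentimiento/robertuito-base-uncased`:

```python
from transformers import pipeline

# Minimal sketch: the hub id "pysentimiento/robertuito-base-uncased" is an assumption.
fill_mask = pipeline("fill-mask", model="pysentimiento/robertuito-base-uncased")

# No space between "día" and <mask>: the space is already encoded in the SentencePiece tokens.
for prediction in fill_mask("Este es un día<mask>"):
    print(prediction["token_str"], prediction["score"])
```

This prompt contains no user handles, hashtags or emojis, so it can be fed in directly; for general tweets, apply the preprocessing step described below first.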
Usage
IMPORTANT -- READ THIS FIRST
RoBERTuito is not yet fully integrated into huggingface/transformers. To use it, first install pysentimiento:
pip install pysentimiento
and preprocess the text with pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer:
```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
# Replace user handles, hashtags and emojis with special tokens before tokenizing
preprocessed_text = preprocess_tweet(text)
tokenizer.tokenize(preprocessed_text)
```
We are working on integrating this preprocessing step into a Tokenizer inside the transformers library.
A text classification example can be found in this notebook:
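In the meantime, the sketch below shows how RoBERTuito could be loaded as a backbone for sequence classification; the uncased hub id and num_labels=2 are illustrative assumptions, and the classification head is randomly initialized until fine-tuned:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pysentimiento.preprocessing import preprocess_tweet

# Illustrative sketch: the uncased hub id and num_labels=2 are assumptions.
model_name = "pysentimiento/robertuito-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Preprocess the tweet before tokenizing, as described above
text = preprocess_tweet("Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # meaningful only after fine-tuning on a labeled dataset
print(logits)
```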
Citation
If you use RoBERTuito, please cite our paper:
@inproceedings{perez-etal-2022-robertuito,
title = "{R}o{BERT}uito: a pre-trained language model for social media text in {S}panish",
author = "P{\'e}rez, Juan Manuel and
Furman, Dami{\'a}n Ariel and
Alonso Alemany, Laura and
Luque, Franco M.",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.785",
pages = "7235--7243",
abstract = "Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.",
}