it5-base: an open-source Italian language model built on the T5 architecture

It5 Base

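Developed by gsarti

IT5 is the first attempt at large-scale pretraining of a sequence-to-sequence Transformer model for Italian, based on the T5 architecture.

Category: Large Language Model, Other | License: Apache-2.0 | Tags: #Italian-generation #sequence-to-sequence #large-scale-pretraining

Downloads: 389

Released: 3/2/2022

Model Overview

This is the base version of the Italian text-to-text model. It targets Italian language understanding and generation tasks and must be fine-tuned on a downstream task before it can be used.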

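Model Features

Italian-specific pretraining

The first sequence-to-sequence Transformer model pretrained at large scale specifically for Italian

Based on the improved T5 architecture

Uses the improved google/t5-v1_1-base configuration with a gated-GELU activation function

Large-scale training data

Trained on a cleaned Italian mC4 corpus (about 41 billion words)

Multi-framework support

Available as PyTorch, Flax and TensorFlow checkpoints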
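
The gated-GELU feed-forward block mentioned above can be illustrated with a small, dependency-free sketch. It is not the model's actual implementation: the function names (`gelu_tanh`, `matvec`, `gated_gelu_ffn`, `relu_ffn`) and the toy weights are hypothetical, and the tanh approximation of GELU is assumed here because that is what the Hugging Face T5 configuration selects for `gated-gelu`. The plain ReLU block is shown for contrast.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (assumed variant for this sketch)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(weights, vector):
    """Multiply a (rows x cols) matrix, given as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def gated_gelu_ffn(x, w_gate, w_linear, w_out):
    """Gated feed-forward: (GELU(W_gate x) * (W_linear x)) projected by W_out."""
    gate = [gelu_tanh(h) for h in matvec(w_gate, x)]   # non-linear branch
    linear = matvec(w_linear, x)                        # linear branch
    hidden = [g * l for g, l in zip(gate, linear)]      # element-wise gating
    return matvec(w_out, hidden)


def relu_ffn(x, w_in, w_out):
    """Ungated feed-forward with ReLU, for comparison."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


if __name__ == "__main__":
    # Toy 2 -> 3 -> 2 block with made-up weights, only to show the shapes.
    x = [1.0, -0.5]
    w_gate = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5]]
    w_linear = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    w_out = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
    print(gated_gelu_ffn(x, w_gate, w_linear, w_out))
    print(relu_ffn(x, w_gate, w_out))
```

The gated variant needs two input projections instead of one, so it carries more parameters per block than the ReLU variant at the same hidden size.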

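Model Capabilities

Italian text understanding

Italian text generation

Sequence-to-sequence transformation

Use Cases

Text generation

News summarization

Automatic summarization of Italian news articles

Requires fine-tuning

Text transformation

Paraphrasing

Rewriting and simplification of Italian text

Requires fine-tuning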

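🚀 Italian T5 Base 🇮🇹

The Italian T5 (IT5) model family is the first attempt at large-scale pretraining of sequence-to-sequence Transformer models for Italian, following the approach of the original T5 model. The models target Italian natural language processing tasks such as text generation and understanding.

🚀 Quick Start

Model variants

This repository contains the checkpoint for the base version of the model. The model was trained for one epoch (1.05M steps) on a thoroughly cleaned version of the Italian mC4 corpus (about 41 billion words, about 275 GB) using 🤗 Datasets and the improved google/t5-v1_1-base configuration. Another version, trained on the OSCAR corpus, is available under the name gsarti/it5-base-oscar. The training procedure is available on GitHub.

The following table summarizes the parameters of all available models: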

Property	`it5-small`	`it5-base` (this model)	`it5-large`	`it5-base-oscar`
Dataset	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`gsarti/clean_mc4_it`	`oscar/unshuffled_deduplicated_it`
Architecture	`google/t5-v1_1-small`	`google/t5-v1_1-base`	`google/t5-v1_1-large`	`t5-base`
Learning rate	5e-3	5e-3	5e-3	1e-2
Steps	1,050,000	1,050,000	2,100,000	258,000
Training time	36 hours	101 hours	370 hours	98 hours
Feed-forward projection	`gated-gelu`	`gated-gelu`	`gated-gelu`	`relu`
Tied embeddings	`false`	`false`	`false`	`true`
Optimizer	adafactor	adafactor	adafactor	adafactor
Max sequence length	512	512	512	512
Per-device batch size	16	16	8	16
Total batch size	128	128	64	128
Weight decay	1e-3	1e-3	1e-2	1e-3
Validation split size	15,000 examples	15,000 examples	15,000 examples	15,000 examples

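The long training time of it5-base-oscar was caused by a bug in the training script. For the full parameter list of an individual model, see the config.json file in its repository.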
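
A few quantities can be derived from the table above with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the source does not state: that the total batch size equals the per-device batch size times the number of devices (no gradient accumulation), and that every sequence is filled to the maximum length, which makes the token count an upper bound. The names `RUNS` and `derive` are hypothetical.

```python
# Values copied from the table above.
RUNS = {
    "it5-small": {"steps": 1_050_000, "hours": 36, "per_device_batch": 16, "total_batch": 128},
    "it5-base": {"steps": 1_050_000, "hours": 101, "per_device_batch": 16, "total_batch": 128},
    "it5-large": {"steps": 2_100_000, "hours": 370, "per_device_batch": 8, "total_batch": 64},
    "it5-base-oscar": {"steps": 258_000, "hours": 98, "per_device_batch": 16, "total_batch": 128},
}
MAX_SEQ_LEN = 512  # "Max sequence length" row, identical for all four models


def derive(run):
    """Derive device count, sequences seen, a token upper bound and throughput."""
    # Assumption: no gradient accumulation, so devices = total / per-device batch.
    devices = run["total_batch"] // run["per_device_batch"]
    sequences = run["steps"] * run["total_batch"]
    return {
        "devices": devices,
        "sequences": sequences,
        # Upper bound: assumes every sequence uses all MAX_SEQ_LEN positions.
        "max_input_tokens": sequences * MAX_SEQ_LEN,
        "steps_per_hour": run["steps"] / run["hours"],
    }


for name, run in RUNS.items():
    stats = derive(run)
    print(
        f"{name}: {stats['devices']} devices, {stats['sequences']:,} sequences, "
        f"<= {stats['max_input_tokens']:,} input tokens, "
        f"{stats['steps_per_hour']:,.0f} steps/hour"
    )
```

Under these assumptions, it5-small, it5-base and it5-large each process 134,400,000 sequences, which would be consistent with a single pass over the same corpus. it5-base-oscar comes out at roughly 2,600 steps per hour against roughly 10,400 for it5-base, which fits the note about its training time being inflated by a bug.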

Using the model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the pretrained (not yet fine-tuned) checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")

⚠️ Important

The model must be fine-tuned on a downstream sequence-to-sequence task before use. See the example here.
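
To make the fine-tuning requirement concrete, here is a minimal sketch of a single supervised training step, assuming PyTorch and the Hugging Face `transformers` API. It is not the authors' training recipe: the helper names `seq2seq_training_step` and `main`, the Italian sentence pair, the AdamW optimizer and its learning rate are all illustrative assumptions (the pretraining runs in the table used adafactor). A real fine-tuning run would add batching, padding masks, evaluation and many steps.

```python
def seq2seq_training_step(model, tokenizer, optimizer, source, target, max_length=512):
    """Run one gradient step on a single (source, target) text pair.

    Returns the loss as a Python float.
    """
    # Tokenize the input and the target together; `text_target` puts the
    # target token ids under the "labels" key of the returned batch.
    batch = tokenizer(
        source,
        text_target=target,
        truncation=True,
        max_length=max_length,  # 512 matches the max sequence length in the table
        return_tensors="pt",
    )
    outputs = model(**batch)   # seq2seq models return a loss when labels are passed
    outputs.loss.backward()    # backpropagate
    optimizer.step()           # update weights
    optimizer.zero_grad()      # clear gradients for the next step
    return outputs.loss.item()


def main():
    # Heavy imports live here so the helper above can be imported on its own.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")
    model.train()

    # Assumed optimizer and learning rate, chosen only for illustration.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

    # Hypothetical summarization-style pair, made up for this sketch.
    source = "Il governo ha approvato ieri una nuova legge sul clima dopo mesi di discussioni."
    target = "Approvata la nuova legge sul clima."

    loss = seq2seq_training_step(model, tokenizer, optimizer, source, target)
    print(f"loss after one step: {loss:.3f}")
```

Calling `main()` downloads the checkpoint and performs one update; in practice the step would run inside a loop over a task-specific dataset.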

Flax and TensorFlow versions of the model are also available:

from transformers import FlaxT5ForConditionalGeneration, TFT5ForConditionalGeneration

model_flax = FlaxT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")
model_tf = TFT5ForConditionalGeneration.from_pretrained("gsarti/it5-base")

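🔧 Limitations

Because IT5 models were trained on a web-scraped corpus, using them may reproduce and amplify biases already present in the data, resulting in potentially harmful content such as racial or gender stereotypes and conspiracist views. For this reason, we encourage the study of such biases, and model usage should ideally be restricted to research-oriented projects that are not directly user-facing.

📄 License

This model is released under the Apache 2.0 license.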

🛠️ Model Maintainer

For problems with or updates to this model, please contact gabriele.sarti996@gmail.com.

📚 Citation

@inproceedings{sarti-nissim-2024-it5-text,
    title = "{IT}5: Text-to-text Pretraining for {I}talian Language Understanding and Generation",
    author = "Sarti, Gabriele  and
      Nissim, Malvina",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.823",
    pages = "9422--9433",
    abstract = "We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.",
}