🚀 IndicBERT
IndicBERT is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278 million parameters and supports 23 Indic languages plus English. It is trained with multiple objectives and datasets.
Supported Languages

| Property | Details |
|---|---|
| Supported languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
Model Tags
- indicbert2
- ai4bharat
- multilingual
License
This project is released under the MIT License.
Evaluation Metrics
Task Type
Fill-Mask
🚀 Quick Start
Model List
- IndicBERT-MLM [Model] - a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [Model] - trained with TLM as an additional objective using the Samanantar parallel corpus [Paper] | [Dataset]
- +Back-Translation [Model] - trained with TLM as an additional objective, using the Indic portions of the IndicCorp v2 dataset translated into English with the IndicTrans model [Model]
- IndicBERT-SS [Model] - a BERT-style model trained with the MLM objective after converting the scripts of the Indic languages to Devanagari, to encourage better lexical sharing across languages
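All variants are BERT-style checkpoints, so they can be loaded and queried with the standard transformers API. Below is a minimal fill-mask sketch; the Hub ID `ai4bharat/IndicBERTv2-MLM-only` is an assumption, so substitute the ID or local path of whichever variant you use:

```python
from transformers import pipeline

# Hub ID is an assumption; swap in the checkpoint you actually use.
fill = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Build the masked sentence with the tokenizer's own mask token
# (Hindi for "India is a great [MASK].").
masked = f"भारत एक महान {fill.tokenizer.mask_token} है।"
for pred in fill(masked, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```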
📦 Installation
The fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```bash
conda create -n finetuning python=3.9
# Activate the environment before installing dependencies.
conda activate finetuning
pip install -r requirements.txt
```
💻 Usage Examples
Basic Usage
All tasks follow the same structure; check the individual files for detailed hyperparameter choices. The following command runs fine-tuning for a given task:
```bash
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```
Parameter Description
- MODEL_NAME: the model to fine-tune; either a local path or a model from the HuggingFace Hub
- TASK_NAME: one of [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]
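For example, a concrete invocation might look like the following; the checkpoint ID is an assumption, and any local path or Hub model works:

```bash
# Hypothetical example: fine-tune the MLM checkpoint on the sentiment task.
export MODEL_NAME=ai4bharat/IndicBERTv2-MLM-only
export TASK_NAME=sentiment
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```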
⚠️ Important Note
For the MASSIVE task, please follow the instructions provided in the official repository.
📚 Citation
```bibtex
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}
```