bert-base-arabic-camelbert-msa-did-nadi open-source model: identifies 21 Arabic dialects


Bert Base Arabic Camelbert Msa Did Nadi

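Developed by CAMeL-Lab

A dialect identification model fine-tuned from the CAMeLBERT Modern Standard Arabic model. It classifies text into 21 Arabic dialects.

Text classification

Transformers

Arabic | License: Apache-2.0 | #Arabic dialect identification #Country-level classification #NADI dataset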


Released: 3/2/2022

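Model overview

This is a dialect identification (DID) model built by fine-tuning the CAMeLBERT Modern Standard Arabic (MSA) model. It is designed to identify which dialectal variety of Arabic a text is written in.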

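Model features

Multi-dialect support

Identifies 21 different Arabic dialect varieties.

Fine-tuned from CAMeLBERT

Fine-tuned from the CAMeLBERT-MSA base model, which was pre-trained on Modern Standard Arabic.

Trained on the NADI dataset

Trained on the NADI country-level dialect dataset, which has 21 labels.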

Model capabilities

Arabic dialect identification

Text classification

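Use cases

Language research

Arabic dialect analysis

Identify the Arabic dialect variety used in a text

Assigns each input one of 21 country-level dialect labels, together with a confidence score

Social media analysis

User region analysis

Estimate a user's likely region of origin from their posts

For example, by recognizing features of the dialects of Egypt or Saudi Arabia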
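The user region use case above can be sketched as a small aggregation step over per-post predictions. This is only an illustration under stated assumptions: the function name `likely_region` and the sum-of-scores heuristic are hypothetical and not part of the model or of CAMeL Tools, and dialect is at best a weak proxy for where a user is located.

```python
from collections import defaultdict


def likely_region(post_predictions):
    """Pick the label with the highest summed score across a user's posts.

    `post_predictions` is a list of dicts shaped like the pipeline output,
    e.g. {'label': 'Egypt', 'score': 0.92}. The summing heuristic is an
    assumption made for illustration only.
    """
    totals = defaultdict(float)
    for pred in post_predictions:
        totals[pred["label"]] += pred["score"]
    if not totals:
        return None  # no posts, so no estimate
    return max(totals, key=totals.get)


# Hand-written predictions for three posts by one user (not real model output).
posts = [
    {"label": "Egypt", "score": 0.9},
    {"label": "Saudi_Arabia", "score": 0.4},
    {"label": "Egypt", "score": 0.6},
]
region = likely_region(posts)
```

Summing scores rather than counting votes lets a few confident predictions outweigh several uncertain ones. Whether that suits a given analysis is a judgment call.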

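🚀 CAMeLBERT-MSA DID NADI model

The CAMeLBERT-MSA DID NADI model is a dialect identification (DID) model built by fine-tuning the CAMeLBERT Modern Standard Arabic (MSA) model. It was fine-tuned on the NADI country-level dataset, which has 21 labels. The fine-tuning procedure and the hyperparameters used are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models". The authors have also released the fine-tuning code.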
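To see which 21 labels the classifier can emit, one option is to read the `id2label` mapping from the model configuration. The sketch below assumes the checkpoint exposes that mapping through `transformers.AutoConfig`, as fine-tuned classification checkpoints usually do. The helper names `ordered_labels` and `load_label_names` are hypothetical.

```python
def ordered_labels(id2label):
    """Return label names sorted by class id.

    Keys may be ints (as on a loaded config object) or strings (as in a raw
    config.json), so they are compared through int().
    """
    return [id2label[key] for key in sorted(id2label, key=int)]


def load_label_names(model_id="CAMeL-Lab/bert-base-arabic-camelbert-msa-did-nadi"):
    """Download the model config and list its labels (needs network access)."""
    from transformers import AutoConfig  # imported lazily; only needed here

    config = AutoConfig.from_pretrained(model_id)
    return ordered_labels(config.id2label)
```

Calling `load_label_names()` should return a list of 21 names if the configuration matches the description above. The only two labels confirmed on this page are `Egypt` and `Saudi_Arabia`.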

✨ Key features

A fine-tuned pre-trained model for Arabic dialect identification.
Usable through the transformers pipeline. Integration into CAMeL Tools is planned.

📦 Installation

Using this model requires transformers>=3.5.0. Otherwise, you can download the model manually.
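A quick way to check the version requirement before loading the model is sketched below. It uses only the standard library. The helper names are hypothetical, and the parser deliberately ignores pre-release suffixes such as `rc1` or `dev0`, which is a simplification.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "3.5.0"  # minimum version stated in this model card


def version_tuple(version):
    """Turn '4.41.2' or '3.5.0.dev0' into a tuple of at least three ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))  # pad so '3.5' compares equal to '3.5.0'
    return tuple(parts)


def meets_minimum(installed, required=MIN_TRANSFORMERS):
    """Return True when the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


def transformers_is_recent_enough():
    """Check the locally installed transformers package, if there is one."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed)
```

If `transformers_is_recent_enough()` returns `False`, either upgrade the package or fall back to the manual download mentioned above.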

💻 Usage example

Basic usage

>>> from transformers import pipeline
>>> did = pipeline('text-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-did-nadi')
>>> sentences = ['عامل ايه ؟', 'شلونك ؟ شخبارك ؟']
>>> did(sentences)
[{'label': 'Egypt', 'score': 0.9242768287658691}, 
{'label': 'Saudi_Arabia', 'score': 0.3400847613811493}]
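In the output above, the second prediction has a score of only about 0.34, so it may be worth separating confident predictions from uncertain ones before using them. The following sketch post-processes the pipeline output shown above. The helper name `split_by_confidence` and the 0.5 threshold are assumptions chosen for illustration, not values recommended by the model authors.

```python
def split_by_confidence(sentences, predictions, threshold=0.5):
    """Split (sentence, label, score) records at an assumed score threshold."""
    confident, uncertain = [], []
    for sentence, pred in zip(sentences, predictions):
        record = (sentence, pred["label"], pred["score"])
        if pred["score"] >= threshold:
            confident.append(record)
        else:
            uncertain.append(record)
    return confident, uncertain


# Inputs and outputs copied from the basic usage example above.
sentences = ['عامل ايه ؟', 'شلونك ؟ شخبارك ؟']
predictions = [
    {'label': 'Egypt', 'score': 0.9242768287658691},
    {'label': 'Saudi_Arabia', 'score': 0.3400847613811493},
]

confident, uncertain = split_by_confidence(sentences, predictions)
```

With this threshold the `Egypt` prediction lands in `confident` and the `Saudi_Arabia` prediction in `uncertain`. A suitable cutoff depends on the application and should be tuned on held-out data.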

📚 Documentation

Intended use

You can use the CAMeLBERT-MSA DID NADI model as part of a transformers pipeline. The model will also be available in CAMeL Tools soon.

Citation

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}