Open-source CAMeLBERT-CA model: a free tool for processing Classical Arabic text

Bert Base Arabic Camelbert Ca

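Developed by CAMeL-Lab

CAMeLBERT is a collection of BERT models optimized for Arabic language variants; the CA version is pre-trained specifically on Classical Arabic text.

Large language model · Arabic · License: Apache-2.0 · #ClassicalArabicProcessing #MultiTaskFineTuning #ArabicNLP

Downloads: 1,128

Release date: 3/2/2022

Model overview

A BERT model pre-trained on a Classical Arabic (CA) dataset, intended for fine-tuning on Arabic NLP tasks.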

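Model features

Optimized for Classical Arabic

Pre-trained on 6 GB of Classical Arabic text. On the APCD poetry classification task it scores 80.9% F1, the highest among the CAMeLBERT models.

Multi-task adaptation

Suitable for fine-tuning on five Arabic NLP tasks (NER, POS tagging, sentiment analysis, dialect identification, and poetry classification), evaluated across 12 datasets.

Variant-sensitive processing

The tokenizer neither lowercases letters nor strips accents, and pre-training uses whole-word masking.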

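Model capabilities

Masked language modeling

Next sentence prediction

Named entity recognition

POS tagging

Sentiment analysis

Dialect identification

Poetry classification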

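Use cases

Classical literature analysis

Arabic poetry classification

Automatically classify classical Arabic poetry

Reaches an F1 score of 80.9% on the APCD dataset

Linguistic research

Classical text analysis

Analyze the linguistic features of Classical Arabic texts

Educational technology

Arabic learning aid

Help learners understand Classical Arabic grammar and vocabulary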

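🚀 CAMeLBERT: a collection of pre-trained models for Arabic NLP tasks

CAMeLBERT is a collection of pre-trained models for Arabic NLP tasks. The models are pre-trained on Arabic text of different sizes and variants, and can be fine-tuned for tasks such as named entity recognition, POS tagging, and sentiment analysis.

🚀 Quick start

You can use this model directly with a pipeline for masked language modeling: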

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-ca')
>>> unmasker("الهدف من الحياة هو [MASK] .")
[{'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
  'score': 0.11048116534948349,
  'token': 3696,
  'token_str': 'الحياة'},
 {'sequence': '[CLS] الهدف من الحياة هو الإسلام. [SEP]',
  'score': 0.03481195122003555,
  'token': 4677,
  'token_str': 'الإسلام'},
 {'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
  'score': 0.03402028977870941,
  'token': 4295,
  'token_str': 'الموت'},
 {'sequence': '[CLS] الهدف من الحياة هو العلم. [SEP]',
  'score': 0.027655426412820816,
  'token': 2789,
  'token_str': 'العلم'},
 {'sequence': '[CLS] الهدف من الحياة هو هذا. [SEP]',
  'score': 0.023059621453285217,
  'token': 2085,
  'token_str': 'هذا'}]
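
The pipeline returns a list of dictionaries with the shape shown above. As a minimal sketch, the candidates can be filtered by score with plain Python. The helper name `top_predictions` and its default threshold are hypothetical, chosen only for illustration, and are not part of transformers.

```python
def top_predictions(results, min_score=0.03):
    """Return (token_str, score) pairs whose score is at least min_score.

    `results` is assumed to be a list of dicts with 'token_str' and
    'score' keys, as in the fill-mask output above. The helper and its
    default threshold are hypothetical.
    """
    kept = [(r["token_str"], r["score"]) for r in results if r["score"] >= min_score]
    # Sort explicitly by descending score rather than relying on input order.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Entries copied from the output above, with scores rounded to four decimals.
sample = [
    {"token_str": "الحياة", "score": 0.1105},
    {"token_str": "الإسلام", "score": 0.0348},
    {"token_str": "الموت", "score": 0.0340},
    {"token_str": "العلم", "score": 0.0277},
    {"token_str": "هذا", "score": 0.0231},
]
print(top_predictions(sample))
```

With the default threshold, only the first three candidates are kept.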

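Note: to download our models, you need transformers>=3.5.0. Otherwise, you can download the models manually.

Here is how to use this model to get the features of a given text in PyTorch: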

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
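
The model output contains one hidden vector per token. A common way to reduce these to a single sentence vector is mean pooling over the non-padding tokens. The model card does not prescribe this, so treat it as one option among several. The sketch below uses plain Python lists in place of tensors so that it runs without downloading the model; `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    hidden_states: list of per-token vectors (lists of floats), standing in
    for one sequence of the model's last hidden layer.
    attention_mask: list of 0/1 flags, one per token.
    """
    kept = [vec for vec, flag in zip(hidden_states, attention_mask) if flag == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three tokens with 2-dimensional vectors; the last token is padding.
toy_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
print(masked_mean_pool(toy_states, toy_mask))
```

With real model output the same computation would be done with tensor operations over the batch and sequence dimensions.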

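✨ Key features

Multiple language variants: pre-trained models are available for Modern Standard Arabic (MSA), dialectal Arabic (DA), Classical Arabic (CA), and a mix of the three.
Models of different sizes: in addition to the full-size models, scaled-down models pre-trained on subsets of the MSA data (half, quarter, eighth, and sixteenth) are available.
Broad task coverage: the models can be used for masked language modeling and next sentence prediction, and are suited to fine-tuning on NLP tasks such as named entity recognition, POS tagging, sentiment analysis, dialect identification, and poetry classification.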

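📦 Installation

To use these models, install the transformers library, version 3.5.0 or later. The quotes keep the shell from treating `>` as an output redirect:

pip install "transformers>=3.5.0"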
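
To confirm from Python that the installed version meets this requirement, something like the following can be used. It is a sketch: `parse_version` and `transformers_is_recent_enough` are hypothetical helpers written for this example, and the simplified parser only looks at the first three numeric components.

```python
from importlib import metadata


def parse_version(text):
    """Turn '3.5.0' into (3, 5, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_is_recent_enough(minimum="3.5.0"):
    """Return True if transformers is installed and meets the minimum version."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


print(transformers_is_recent_enough())
```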

💻 Usage examples

Basic usage

from transformers import pipeline
unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-arabic-camelbert-ca')
result = unmasker("الهدف من الحياة هو [MASK] .")
print(result)

Advanced usage

# Get text features in PyTorch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
text = "مرحبا يا عالم."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# The output can be processed further, for example to extract features

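📚 Documentation

Model description

CAMeLBERT is a collection of BERT models pre-trained on Arabic text of different sizes and variants. We release pre-trained language models for Modern Standard Arabic (MSA), dialectal Arabic (DA), and Classical Arabic (CA), in addition to a model pre-trained on a mix of the three. We also provide additional models pre-trained on scaled-down subsets of the MSA data. For details, see the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models".

This model card describes CAMeLBERT-CA (bert-base-arabic-camelbert-ca), a model pre-trained on the Classical Arabic (CA) dataset.

Property	Details
Model type	`bert-base-arabic-camelbert-ca`
Training data	CA (Classical Arabic): [OpenITI (Version 2020.1.2)](https://zenodo.org/record/3891466#.YEX4-F0zbzc)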

Details of each model (✔ marks the model described on this page):

	Model	Variant	Size	Words
	`bert-base-arabic-camelbert-mix`	CA,DA,MSA	167GB	17.3B
✔	`bert-base-arabic-camelbert-ca`	CA	6GB	847M
	`bert-base-arabic-camelbert-da`	DA	54GB	5.8B
	`bert-base-arabic-camelbert-msa`	MSA	107GB	12.6B
	`bert-base-arabic-camelbert-msa-half`	MSA	53GB	6.3B
	`bert-base-arabic-camelbert-msa-quarter`	MSA	27GB	3.1B
	`bert-base-arabic-camelbert-msa-eighth`	MSA	14GB	1.6B
	`bert-base-arabic-camelbert-msa-sixteenth`	MSA	6GB	746M
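
The abstract of the cited paper suggests that how close the pre-training variant is to the fine-tuning data matters more than pre-training data size. A minimal sketch of choosing a model by variant follows. The `CAMeL-Lab/` prefix matches the usage examples in this card; the `pick_model` helper and its fallback to the mixed model are assumptions made for illustration, not an official selection rule.

```python
# Hub identifiers for the four full-size models listed in the table above.
# The mapping is a hypothetical convenience, not part of any library.
MODEL_BY_VARIANT = {
    "CA": "CAMeL-Lab/bert-base-arabic-camelbert-ca",
    "DA": "CAMeL-Lab/bert-base-arabic-camelbert-da",
    "MSA": "CAMeL-Lab/bert-base-arabic-camelbert-msa",
    "MIX": "CAMeL-Lab/bert-base-arabic-camelbert-mix",
}


def pick_model(variant):
    """Return the model id for a variant label, falling back to the mix model."""
    key = variant.strip().upper()
    # Assumption: when the variant is unknown, the mixed model is a neutral default.
    return MODEL_BY_VARIANT.get(key, MODEL_BY_VARIANT["MIX"])


print(pick_model("ca"))
```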

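Intended uses

You can use the released model for either masked language modeling or next sentence prediction. However, it is mostly intended to be fine-tuned on an NLP task, such as NER, POS tagging, sentiment analysis, dialect identification, or poetry classification. We release our fine-tuning code [here](https://github.com/CAMeL-Lab/CAMeLBERT).

Training data

CA (Classical Arabic): [OpenITI (Version 2020.1.2)](https://zenodo.org/record/3891466#.YEX4-F0zbzc)

Training procedure

We use [the original implementation](https://github.com/google-research/bert) released by Google for pre-training. We follow the original English BERT model's hyperparameters for pre-training, unless otherwise specified.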

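Preprocessing

After extracting the raw text from each corpus, we apply the following preprocessing steps:
- We first remove invalid characters and normalize whitespace using the utilities provided by [the original BERT implementation](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297).
- We remove lines that contain no Arabic characters.
- We remove diacritics and kashida using [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools).
- We then split each line into sentences with a heuristics-based sentence segmenter.
- We train a WordPiece tokenizer with a vocabulary size of 30,000 on the entire dataset (167 GB of text) using HuggingFace's tokenizers.
- We do not lowercase letters or strip accents.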
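
The cleaning steps above can be approximated in a few lines. This is only a sketch under stated assumptions: it uses regular expressions as a stand-in for the BERT utilities and CAMeL Tools that the authors used, the Unicode ranges are my own choice and may not match those tools exactly, and sentence segmentation and tokenizer training are left out.

```python
import re

# Arabic letters hamza..ghain and feh..yeh (assumed range; excludes tatweel).
ARABIC_LETTER = re.compile("[\u0621-\u063A\u0641-\u064A]")
# Tanwin, short vowels, shadda, sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Kashida (tatweel), the elongation character.
KASHIDA = "\u0640"


def clean_lines(lines):
    """Drop lines without Arabic letters, then strip diacritics and kashida."""
    cleaned = []
    for line in lines:
        line = " ".join(line.split())  # normalize whitespace
        if not ARABIC_LETTER.search(line):
            continue  # no Arabic characters on this line
        line = DIACRITICS.sub("", line).replace(KASHIDA, "")
        cleaned.append(line)
    return cleaned


raw = [
    "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F",            # vocalized word
    "Hello 123",                                                    # no Arabic letters
    "\u0643\u0640\u0640\u062A\u0627\u0628   \u062C\u062F\u064A\u062F",  # kashida, extra spaces
]
print(clean_lines(raw))
```

The second line is dropped, and the other two come back without diacritics, kashida, or repeated spaces.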

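Pre-training

The model was trained on a single cloud TPU (v3-8) for one million steps in total.
The first 90,000 steps were trained with a batch size of 1,024 and the rest with a batch size of 256.
The sequence length was limited to 128 tokens for 90% of the steps and 512 tokens for the remaining 10%.
We use whole-word masking and a duplicate factor of 10.
We set the maximum predictions per sequence to 20 for the dataset with a maximum sequence length of 128 tokens and to 80 for the dataset with a maximum sequence length of 512 tokens.
We use a random seed of 12345, a masked language model probability of 0.15, and a short sequence probability of 0.1.
The optimizer is Adam with a learning rate of 1e-4, \(\beta_{1} = 0.9\), \(\beta_{2} = 0.999\), a weight decay of 0.01, learning rate warmup for 10,000 steps, and linear decay of the learning rate afterwards.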
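
To make the schedule above concrete, the sketch below expresses the learning rate and batch size as functions of the training step. Two points are assumptions rather than statements from the model card: steps are counted from zero, and the linear decay is measured over all one million steps and reaches zero at the final step, which is how I understand the reference BERT optimizer to behave.

```python
TOTAL_STEPS = 1_000_000       # total training steps
WARMUP_STEPS = 10_000         # linear warmup length
PEAK_LR = 1e-4                # configured learning rate
LARGE_BATCH_STEPS = 90_000    # steps trained with batch size 1,024


def learning_rate_at(step):
    """Linear warmup, then linear decay to zero at TOTAL_STEPS (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0.0, 1.0 - step / TOTAL_STEPS)
    return PEAK_LR * remaining


def batch_size_at(step):
    """Batch size 1,024 for the first 90,000 steps, 256 afterwards."""
    return 1024 if step < LARGE_BATCH_STEPS else 256


def total_sequences():
    """Number of training sequences consumed over the whole run."""
    return LARGE_BATCH_STEPS * 1024 + (TOTAL_STEPS - LARGE_BATCH_STEPS) * 256


print(learning_rate_at(5_000), batch_size_at(100_000), total_sequences())
```

Under these settings the run consumes 325,120,000 training sequences in total.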

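Evaluation results

We evaluate our pre-trained language models on five NLP tasks: NER, POS tagging, sentiment analysis (SA), dialect identification (DID), and poetry classification.
We fine-tune and evaluate the models using 12 datasets.
We use Hugging Face's transformers library to fine-tune the CAMeLBERT models.
We use transformers v3.1.0 along with PyTorch v1.5.1.
Fine-tuning is done by adding a fully connected linear layer to the last hidden state.
We use the \(F_{1}\) score as the metric for all tasks.
The code used for fine-tuning is available [here](https://github.com/CAMeL-Lab/CAMeLBERT).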

Results

Task	Dataset	Variant	Mix	CA	DA	MSA	MSA-1/2	MSA-1/4	MSA-1/8	MSA-1/16
NER	ANERcorp	MSA	80.8%	67.9%	74.1%	82.4%	82.0%	82.1%	82.6%	80.8%
POS	PATB (MSA)	MSA	98.1%	97.8%	97.7%	98.3%	98.2%	98.3%	98.2%	98.2%
	ARZTB (EGY)	DA	93.6%	92.3%	92.7%	93.6%	93.6%	93.7%	93.6%	93.6%
	Gumar (GLF)	DA	97.3%	97.7%	97.9%	97.9%	97.9%	97.9%	97.9%	97.9%
SA	ASTD	MSA	76.3%	69.4%	74.6%	76.9%	76.0%	76.8%	76.7%	75.3%
	ArSAS	MSA	92.7%	89.4%	91.8%	93.0%	92.6%	92.5%	92.5%	92.3%
	SemEval	MSA	69.0%	58.5%	68.4%	72.1%	70.7%	72.8%	71.6%	71.2%
DID	MADAR-26	DA	62.9%	61.9%	61.8%	62.6%	62.0%	62.8%	62.0%	62.2%
	MADAR-6	DA	92.5%	91.5%	92.2%	91.9%	91.8%	92.2%	92.1%	92.0%
	MADAR-Twitter-5	MSA	75.7%	71.4%	74.2%	77.6%	78.5%	77.3%	77.7%	76.2%
	NADI	DA	24.7%	17.3%	20.1%	24.9%	24.6%	24.6%	24.9%	23.8%
Poetry	APCD	CA	79.8%	80.9%	79.6%	79.7%	79.9%	80.0%	79.7%	79.8%

Results (averaged)

	Variant	Mix	CA	DA	MSA	MSA-1/2	MSA-1/4	MSA-1/8	MSA-1/16
Variant-wise average [[1]](#footnote-1)	MSA	82.1%	75.7%	80.1%	83.4%	83.0%	83.3%	83.2%	82.3%
	DA	74.4%	72.1%	72.9%	74.2%	74.0%	74.3%	74.1%	73.9%
	CA	79.8%	80.9%	79.6%	79.7%	79.9%	80.0%	79.7%	79.8%
Macro-average	ALL	78.7%	74.7%	77.1%	79.2%	79.0%	79.2%	79.1%	78.6%

[1]: Variant-wise average refers to averaging over a group of tasks in the same language variant.
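
As a worked check of these definitions, the sketch below recomputes the averages for the CA model from the per-dataset scores in the results table. The macro-average is taken here as the mean over all 12 datasets, which is my reading of the table rather than a stated definition. Because the table entries are rounded, recomputing other columns this way may differ from the published averages in the last digit.

```python
# (dataset, variant, F1) for the CA model, copied from the results table.
CA_SCORES = [
    ("ANERcorp", "MSA", 67.9),
    ("PATB (MSA)", "MSA", 97.8),
    ("ARZTB (EGY)", "DA", 92.3),
    ("Gumar (GLF)", "DA", 97.7),
    ("ASTD", "MSA", 69.4),
    ("ArSAS", "MSA", 89.4),
    ("SemEval", "MSA", 58.5),
    ("MADAR-26", "DA", 61.9),
    ("MADAR-6", "DA", 91.5),
    ("MADAR-Twitter-5", "MSA", 71.4),
    ("NADI", "DA", 17.3),
    ("APCD", "CA", 80.9),
]


def variant_averages(scores):
    """Mean F1 per language variant of the fine-tuning dataset."""
    by_variant = {}
    for _, variant, f1 in scores:
        by_variant.setdefault(variant, []).append(f1)
    return {v: sum(vals) / len(vals) for v, vals in by_variant.items()}


def macro_average(scores):
    """Mean F1 over all datasets (assumed definition of the macro-average)."""
    return sum(f1 for _, _, f1 in scores) / len(scores)


print({v: round(avg, 1) for v, avg in variant_averages(CA_SCORES).items()})
print(round(macro_average(CA_SCORES), 1))
```

The rounded results match the CA column of the table: 75.7 (MSA), 72.1 (DA), 80.9 (CA), and 74.7 overall.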

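🔧 Technical details

Pre-training implementation

Pre-training uses [the original implementation](https://github.com/google-research/bert) released by Google and follows the original English BERT model's hyperparameters, unless otherwise specified.

Data processing

Preprocessing applies several steps to the raw text: removing invalid characters, normalizing whitespace, removing lines without Arabic characters, removing diacritics and kashida, splitting lines into sentences, and training a WordPiece tokenizer.

Pre-training parameters

Training ran on a single cloud TPU (v3-8) with two batch sizes (1,024, then 256), two sequence lengths (128 and 512 tokens), whole-word masking, and the Adam optimizer.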
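
Whole-word masking means that when a word is split into several WordPiece tokens, all of its pieces are masked together. The sketch below illustrates the idea using the masking probability (0.15), the prediction cap (20), and the seed (12345) reported for this model. It is a simplified illustration, not the reference BERT code: it always substitutes `[MASK]` (the reference implementation sometimes keeps or randomizes the token), it assumes the input has no `[CLS]` or `[SEP]` tokens, and the transliterated example tokens are made up.

```python
import random


def group_word_pieces(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, max_predictions=20, seed=12345):
    """Mask whole words until roughly mask_prob of the tokens are covered."""
    rng = random.Random(seed)
    groups = group_word_pieces(tokens)
    rng.shuffle(groups)
    budget = min(max_predictions, max(1, round(len(tokens) * mask_prob)))
    masked = list(tokens)
    covered = 0
    for group in groups:
        if covered >= budget:
            break
        if covered + len(group) > budget:
            continue  # masking this word would exceed the budget
        for i in group:
            masked[i] = "[MASK]"
        covered += len(group)
    return masked


# Made-up transliterated word pieces, for illustration only.
example = ["al", "##kitab", "jadid", "jid", "##dan", "fi", "al", "##maktaba"]
print(whole_word_mask(example, mask_prob=0.5))
```

Whatever the random order, each word in the output is either fully masked or left untouched.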

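📄 License

This project is released under the Apache-2.0 license.

Acknowledgements

This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Citation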

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}