bert-base-romanian-cased-v1: open-source BERT model for Romanian language processing, free to use

Bert Base Romanian Cased V1

Developed by dumitrescustefan

A cased BERT base model for Romanian, trained on a 15 GB corpus.

Large language model · Other · License: MIT · #RomanianBERT #Cased #NLP

Downloads: 6,466

Released: 3/2/2022

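Model overview

A pretrained Romanian model based on the BERT architecture, suitable for a range of natural language processing tasks.

Model features

Romanian-specific

Trained specifically on Romanian text; it outperforms multilingual BERT on the UPOS, XPOS, NER and LAS evaluations reported in the Evaluation section.

Cased

The model distinguishes uppercase from lowercase letters.

Large-scale training data

Trained on a 15 GB Romanian corpus drawn from OPUS, OSCAR and Wikipedia.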

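Model capabilities

Text encoding

Language understanding

Named entity recognition

Part-of-speech tagging

Use cases

Natural language processing

Part-of-speech tagging

Tags Romanian text with part-of-speech labels.

Reaches 98.00% accuracy on the UPOS task.

Named entity recognition

Identifies named entities in Romanian text.

Reaches an F1 score of 85.88% on the RONEC dataset.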

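🚀 Romanian BERT base cased model v1

This is the BERT base, cased model for Romanian, trained on a 15 GB corpus. It can be used for Romanian natural language processing tasks such as part-of-speech tagging and named entity recognition.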

🚀 Quick start

How to use

from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# Tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(input_ids)
# Get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple

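⚠️ Important

Always sanitize your text! Replace the s and t cedilla letters with their comma-below counterparts:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

The model was not trained on the cedilla forms of s and t. If you skip this replacement, performance drops because of <UNK> tokens, and the number of tokens per word increases.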
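
The replacement above can also be wrapped in a small reusable helper. The following is a minimal sketch, assuming you want a single-pass mapping; the names `CEDILLA_TO_COMMA` and `normalize_romanian` are hypothetical and not part of the model's repository. It uses Unicode escapes so the four character pairs cannot be confused visually.

```python
# Hypothetical helper: map the cedilla forms to the comma-below letters
# that the model was trained on.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla  -> s with comma below
    "\u0163": "\u021b",  # t with cedilla  -> t with comma below
    "\u015e": "\u0218",  # S with cedilla  -> S with comma below
    "\u0162": "\u021a",  # T with cedilla  -> T with comma below
})


def normalize_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)


# "Stiinta si tehnologie" written with the cedilla characters
sample = "\u015etiin\u0163\u0103 \u015fi tehnologie"
cleaned = normalize_romanian(sample)
print(cleaned)
```

Calling the helper on already clean text leaves it unchanged, so it is safe to apply to every input before tokenization.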

💻 Usage examples

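Advanced usage

# Sanitize the text, then encode it (reuses the tokenizer and model loaded above)
text = "Acesta este un test."
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)  # batch size 1
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(input_ids)
last_hidden_states = outputs[0]  # shape: (1, sequence_length, hidden_size)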
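
The per-token encodings are often reduced to one vector per sentence. The model card does not prescribe a pooling method; the sketch below assumes masked mean pooling, a common choice, and the function name `mean_pool` is hypothetical. It runs on dummy tensors standing in for `last_hidden_states` and an attention mask, so it does not need the model to be downloaded.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Dummy data: 2 sentences, 4 token positions, hidden size 3 (illustrative only)
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],   # all four tokens are real
                     [1, 1, 0, 0]])  # last two positions are padding
sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
```

With real inputs, the attention mask would come from the tokenizer output when padding a batch of sentences.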

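📚 Documentation

Evaluation

Evaluation is performed on Universal Dependencies Romanian RRT for UPOS, XPOS and LAS, and on a named entity recognition (NER) task based on RONEC. Details, along with more in-depth tests not shown here, are available on the dedicated evaluation page.

The baseline is the multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that worked on Romanian.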

Model	UPOS	XPOS	NER	LAS
bert-base-multilingual-cased	97.87	96.16	84.13	88.04
bert-base-romanian-cased-v1	98.00	96.46	85.88	89.69
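
To make the comparison concrete, the gain over the multilingual baseline can be computed per task. This is a small illustrative sketch that uses only the numbers from the table above; the variable names are arbitrary.

```python
# Scores copied from the evaluation table above
scores = {
    "bert-base-multilingual-cased": {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04},
    "bert-base-romanian-cased-v1": {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69},
}

baseline = scores["bert-base-multilingual-cased"]
romanian = scores["bert-base-romanian-cased-v1"]

# Absolute gain in points, rounded to the table's two-decimal precision
gains = {task: round(romanian[task] - baseline[task], 2) for task in baseline}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} points")
```

The differences are largest on NER (+1.75) and LAS (+1.65), and smallest on UPOS (+0.13), where the baseline is already close to 98%.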

Corpus

The model was trained on the following corpora (the statistics in the table are computed after cleaning):

Corpus	Lines (M)	Words (M)	Chars (B)	Size (GB)
OPUS	55.05	635.04	4.045	3.8
OSCAR	33.56	1725.82	11.411	11
Wikipedia	1.54	60.47	0.411	0.4
Total	90.15	2421.33	15.867	15.2
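
As a quick sanity check on the table, the per-corpus figures can be summed and each corpus's share of the total size computed. This is an illustrative sketch using only the numbers above; the dictionary keys are made-up names.

```python
# Per-corpus statistics copied from the table above
corpora = {
    "OPUS": {"lines_m": 55.05, "words_m": 635.04, "chars_b": 4.045, "size_gb": 3.8},
    "OSCAR": {"lines_m": 33.56, "words_m": 1725.82, "chars_b": 11.411, "size_gb": 11.0},
    "Wikipedia": {"lines_m": 1.54, "words_m": 60.47, "chars_b": 0.411, "size_gb": 0.4},
}

columns = ["lines_m", "words_m", "chars_b", "size_gb"]

# Column sums, rounded to the table's precision to hide float noise
totals = {col: round(sum(stats[col] for stats in corpora.values()), 3) for col in columns}

# Share of the total size contributed by each corpus, in percent
size_share = {
    name: round(100 * stats["size_gb"] / totals["size_gb"], 1)
    for name, stats in corpora.items()
}

print(totals)
print(size_share)
```

The sums agree with the Total row, and OSCAR accounts for roughly 72% of the training data by size.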

Citation

If you use this model in a research paper, please cite the following paper:

Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.

Or, in BibTeX format:

@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}