language: fr
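🚨 Update: this checkpoint is deprecated, please use https://huggingface.co/almanach/camembert-base instead 🚨
CamemBERT: a Tasty French Language Model
Introduction
CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions, which vary in number of parameters, amount of pretraining data, and pretraining data source domain.
For further information or requests, please visit the Camembert website.
Pre-trained models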
| Model | #params | Arch. | Training data |
|---|---|---|---|
| camembert-base | 110M | Base | OSCAR (138 GB of text) |
| camembert/camembert-large | 335M | Large | CCNet (135 GB of text) |
| camembert/camembert-base-ccnet | 110M | Base | CCNet (135 GB of text) |
| camembert/camembert-base-wikipedia-4gb | 110M | Base | Wikipedia (4 GB of text) |
| camembert/camembert-base-oscar-4gb | 110M | Base | Subsample of OSCAR (4 GB of text) |
| camembert/camembert-base-ccnet-4gb | 110M | Base | Subsample of CCNet (4 GB of text) |
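To pick a checkpoint programmatically, the table above can be mirrored as plain data and filtered. The sketch below is only an illustration: the `Checkpoint` class, the `CHECKPOINTS` list and the `find_checkpoints` helper are hypothetical names introduced here, and the values are copied from the table, not queried from Hugging Face.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One row of the pre-trained models table (hypothetical helper type)."""
    name: str             # model identifier as listed in the table
    params_millions: int  # number of parameters, in millions
    arch: str             # "Base" or "Large"
    corpus: str           # pretraining data source
    corpus_gb: int        # size of the pretraining text, in GB


# Values transcribed from the table above.
CHECKPOINTS = [
    Checkpoint("camembert-base", 110, "Base", "OSCAR", 138),
    Checkpoint("camembert/camembert-large", 335, "Large", "CCNet", 135),
    Checkpoint("camembert/camembert-base-ccnet", 110, "Base", "CCNet", 135),
    Checkpoint("camembert/camembert-base-wikipedia-4gb", 110, "Base", "Wikipedia", 4),
    Checkpoint("camembert/camembert-base-oscar-4gb", 110, "Base", "OSCAR", 4),
    Checkpoint("camembert/camembert-base-ccnet-4gb", 110, "Base", "CCNet", 4),
]


def find_checkpoints(corpus=None, arch=None, max_corpus_gb=None):
    """Return the rows that satisfy every filter that is not None."""
    matches = []
    for ckpt in CHECKPOINTS:
        if corpus is not None and ckpt.corpus != corpus:
            continue
        if arch is not None and ckpt.arch != arch:
            continue
        if max_corpus_gb is not None and ckpt.corpus_gb > max_corpus_gb:
            continue
        matches.append(ckpt)
    return matches


# Example: all Base-sized models pretrained on CCNet data.
for ckpt in find_checkpoints(corpus="CCNet", arch="Base"):
    print(ckpt.name, f"{ckpt.corpus_gb} GB")
```

The returned `name` can be passed directly to `from_pretrained`, as in the usage section.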
How to use CamemBERT with HuggingFace
Load CamemBERT and its sub-word tokenizer:
from transformers import CamembertModel, CamembertTokenizer
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert.eval()
Filling masks using pipeline
from transformers import pipeline
camembert_fill_mask = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
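In recent versions of transformers, a fill-mask pipeline called on a single-mask sentence returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. A minimal sketch for reading that list follows; `top_fill_mask_tokens` is a hypothetical helper, and `sample_results` holds made-up placeholder values, not real CamemBERT predictions.

```python
def top_fill_mask_tokens(results, k=3):
    """Return the k highest-scoring (token, score) pairs.

    `results` is assumed to be the list of dicts produced by a
    transformers "fill-mask" pipeline for an input with one <mask>.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), r["score"]) for r in ranked[:k]]


# Placeholder input (made-up tokens, ids and scores, NOT model output),
# shaped like the list the pipeline returns.
sample_results = [
    {"score": 0.2, "token": 2, "token_str": "B",
     "sequence": "Le camembert est un fromage de B!"},
    {"score": 0.5, "token": 1, "token_str": "A",
     "sequence": "Le camembert est un fromage de A!"},
    {"score": 0.3, "token": 3, "token_str": "C",
     "sequence": "Le camembert est un fromage de C!"},
]

print(top_fill_mask_tokens(sample_results, k=2))
```

With a real pipeline, pass `results` from the call above in place of `sample_results`.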
Extract contextual embedding features from Camembert output
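import torch
# Tokenize into sub-word units
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# Map the tokens to ids and add the special tokens <s> and </s>
encoded_sentence = tokenizer.encode(tokenized_sentence)
# Add a batch dimension: shape (1, sequence_length)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
# Inference only, so gradients are not tracked
with torch.no_grad():
    embeddings = camembert(encoded_sentence).last_hidden_state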
Extract contextual embedding features from all Camembert layers
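from transformers import CamembertConfig
# Reload the model with a config that returns every layer's hidden states
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)
with torch.no_grad():
    outputs = camembert(encoded_sentence)
embeddings = outputs.last_hidden_state
# Tuple with one tensor per layer, starting with the embedding layer output
all_layer_embeddings = outputs.hidden_states
all_layer_embeddings[5]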
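When `output_hidden_states=True`, transformers returns the hidden states as a tuple whose first entry is the embedding layer output, followed by one entry per transformer layer. The sketch below spells out what a given index refers to. `describe_hidden_state` is a hypothetical helper, and the default of 12 layers is an assumption based on the usual Base architecture, not a figure stated in this card.

```python
def describe_hidden_state(index, num_hidden_layers=12):
    """Describe which layer a hidden-states tuple index points at.

    Assumes the tuple holds num_hidden_layers + 1 tensors: the embedding
    output at index 0, then the output of each transformer layer in order.
    """
    total = num_hidden_layers + 1
    if not -total <= index < total:
        raise IndexError(f"index {index} out of range for {total} hidden states")
    index %= total  # normalise negative indices such as -1
    if index == 0:
        return "embedding output"
    return f"output of transformer layer {index}"


print(describe_hidden_state(5))
print(describe_hidden_state(-1))
```

Under these assumptions, index 5 selects the output of the fifth transformer layer, and index -1 holds the same tensor as the final-layer embeddings.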
Authors
CamemBERT was trained and evaluated by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
Citation
If you use our work, please cite:
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}