XGLM-564M: Open-Source Multilingual Language Model Trained on 30 Languages for Diverse Text Applications

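XGLM-564M

Developed by facebook

XGLM-564M is a multilingual autoregressive language model with 564 million parameters, trained on a balanced corpus of 30 languages totaling 500 billion sub-tokens.

Large language model | Multilingual | License: MIT | #MultilingualGeneration #ZeroShotLearning #LowResourceOptimization

Downloads: 11.13k

Released: 4/25/2025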

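Model Overview

XGLM-564M is a multilingual autoregressive language model that supports 30 languages and is suited to multilingual text generation and understanding tasks.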

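Model Features

Multilingual support

Supports 30 languages spanning multiple language families, including low-resource languages.

Balanced corpus

Trained on a balanced corpus of 30 languages totaling 500 billion sub-tokens.

Autoregressive model

Uses an autoregressive architecture, which suits text generation tasks.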

Model Capabilities

Multilingual text generation

Multilingual text understanding

Zero-shot learning

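Use Cases

Natural language processing

Multilingual text generation

Generates coherent text in multiple languages.

Zero-shot and few-shot learning

Performs a task when given no examples (zero-shot) or only a handful of examples (few-shot).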

🚀 XGLM-564M

XGLM-564M is a multilingual autoregressive language model (with 564 million parameters) trained on a balanced corpus of 30 diverse languages totaling 500 billion sub-tokens. It was introduced in the paper Few-shot Learning with Multilingual Language Models by Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (*equal contribution). The authors also released the original implementation.

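✨ Key Features

Multilingual support: covers 30 languages, including English, Russian, Chinese, German, and Spanish.
Trained on a large-scale corpus: uses a balanced corpus of 500 billion sub-tokens.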

📚 Documentation

Training Data Statistics

The training data statistics of XGLM-564M are shown in the table below.

ISO-639-1 code	Family	Language	# Tokens	Ratio	Ratio with low-resource language upsampling
en	Indo-European	English	803526736124	0.489906	0.3259
ru	Indo-European	Russian	147791898098	0.0901079	0.0602
zh	Sino-Tibetan	Chinese	132770494630	0.0809494	0.0483
de	Indo-European	German	89223707856	0.0543992	0.0363
es	Indo-European	Spanish	87303083105	0.0532282	0.0353
fr	Indo-European	French	77419639775	0.0472023	0.0313
ja	Japonic	Japanese	66054364513	0.040273	0.0269
it	Indo-European	Italian	41930465338	0.0255648	0.0171
pt	Indo-European	Portuguese	36586032444	0.0223063	0.0297
el	Indo-European	Greek (modern)	28762166159	0.0175361	0.0233
ko	Koreanic	Korean	20002244535	0.0121953	0.0811
fi	Uralic	Finnish	16804309722	0.0102455	0.0681
id	Austronesian	Indonesian	15423541953	0.00940365	0.0125
tr	Turkic	Turkish	12413166065	0.00756824	0.0101
ar	Afro-Asiatic	Arabic	12248607345	0.00746791	0.0099
vi	Austroasiatic	Vietnamese	11199121869	0.00682804	0.0091
th	Tai-Kadai	Thai	10842172807	0.00661041	0.044
bg	Indo-European	Bulgarian	9703797869	0.00591635	0.0393
ca	Indo-European	Catalan	7075834775	0.0043141	0.0287
hi	Indo-European	Hindi	3448390110	0.00210246	0.014
et	Uralic	Estonian	3286873851	0.00200399	0.0133
bn	Indo-European	Bengali	1627447450	0.000992245	0.0066
ta	Dravidian	Tamil	1476973397	0.000900502	0.006
ur	Indo-European	Urdu	1351891969	0.000824241	0.0055
sw	Niger-Congo	Swahili	907516139	0.000553307	0.0037
te	Dravidian	Telugu	689316485	0.000420272	0.0028
eu	Language isolate	Basque	105304423	6.42035e-05	0.0043
my	Sino-Tibetan	Burmese	101358331	6.17976e-05	0.003
ht	Creole	Haitian Creole	86584697	5.27902e-05	0.0035
qu	Quechuan	Quechua	3236108	1.97304e-06	0.0001
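
The Ratio column can be recomputed from the token counts, which makes it easier to see how much upsampling shifts each language's share. The snippet below is a minimal sketch: the counts and the three upsampled ratios are copied from the table above, while the names `TOKEN_COUNTS`, `UPSAMPLED`, and `raw_ratios` are hypothetical and exist only for this illustration. It does not reproduce the upsampling procedure itself, which the table does not describe.

```python
# Token counts copied from the table above, keyed by ISO-639-1 code.
TOKEN_COUNTS = {
    "en": 803526736124, "ru": 147791898098, "zh": 132770494630,
    "de": 89223707856, "es": 87303083105, "fr": 77419639775,
    "ja": 66054364513, "it": 41930465338, "pt": 36586032444,
    "el": 28762166159, "ko": 20002244535, "fi": 16804309722,
    "id": 15423541953, "tr": 12413166065, "ar": 12248607345,
    "vi": 11199121869, "th": 10842172807, "bg": 9703797869,
    "ca": 7075834775, "hi": 3448390110, "et": 3286873851,
    "bn": 1627447450, "ta": 1476973397, "ur": 1351891969,
    "sw": 907516139, "te": 689316485, "eu": 105304423,
    "my": 101358331, "ht": 86584697, "qu": 3236108,
}

# Upsampled ratios for three languages, also copied from the table.
UPSAMPLED = {"en": 0.3259, "ko": 0.0811, "sw": 0.0037}


def raw_ratios(counts):
    """Return each language's share of the corpus before upsampling."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


ratios = raw_ratios(TOKEN_COUNTS)
for lang, upsampled in UPSAMPLED.items():
    # A factor below 1 means the language was downweighted, above 1 upweighted.
    factor = upsampled / ratios[lang]
    print(f"{lang}: raw={ratios[lang]:.6f} upsampled={upsampled:.4f} factor={factor:.2f}")
```

In this table English ends up with a smaller share after upsampling, while Korean and Swahili end up with larger ones.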

Model Card

For the intended use of the model, refer to the model card released by the XGLM-564M development team.

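💻 Usage Example

Basic Usage

The following snippet shows how to evaluate the model (GPT-3 style, zero-shot) on the Choice of Plausible Alternatives (COPA) task, using examples in English, Chinese, and Haitian Creole.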

import torch
import torch.nn.functional as F

from transformers import XGLMTokenizer, XGLMForCausalLM

tokenizer = XGLMTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ],
    'ht': [  # the sample text is Haitian Creole (ISO-639-1 code "ht")
        {
            "premise": "M te vle konsève enèji.",
            "choice1": "Mwen te fin baleye chanm lib la.",
            "choice2": "Mwen te femen limyè nan chanm lib la.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "Flam bouji a te etenn.",
            "choice1": "Mwen te soufle bouji a.",
            "choice2": "Mwen te limen mèch bouji a.",
            "question": "cause",
            "label": "0"
        }
    ]
}

def get_logprobs(prompt):
    """Return the log-probability of each token in `prompt` given the tokens before it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # The logits at position t score the token at position t + 1,
    # so the targets are the input ids shifted left by one.
    input_ids, output_ids = inputs["input_ids"], inputs["input_ids"][:, 1:]
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs, labels=input_ids)
    logits = outputs.logits
    # Pick, at each position, the log-probability assigned to the actual next token.
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, output_ids.unsqueeze(2))
    return logprobs

# Zero-shot evaluation for the Choice of Plausible Alternatives (COPA) task.
# A return value of 0 indicates that the first alternative is more plausible,
# while 1 indicates that the second alternative is more plausible.
def COPA_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 0 if lprob1 > lprob2 else 1

for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        print(f'{lang}-{idx}', predict, example['label'])

# Each output line is: <lang>-<index> <prediction> <label>
# en-0 1 1
# en-1 0 0
# zh-0 1 1
# zh-1 0 0
# ht-0 1 1
# ht-1 0 0
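
The one-position shift inside `get_logprobs` is easy to misread, so here is a minimal, dependency-free sketch of the same alignment, with plain Python lists standing in for tensors. The helper names `log_softmax` and `sequence_logprob` are hypothetical, and the toy logits are made-up numbers rather than real model output.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def sequence_logprob(logits, token_ids):
    """Sum the log-probabilities of token_ids[1:].

    logits[t] scores the token at position t + 1, so the first token is
    never scored and the last row of logits is never used.
    """
    total = 0.0
    for t, next_id in enumerate(token_ids[1:]):
        total += log_softmax(logits[t])[next_id]
    return total


# Toy example: vocabulary of 3 tokens, sequence of 3 token ids.
toy_logits = [
    [2.0, 0.0, 0.0],  # scores for the token at position 1
    [0.0, 2.0, 0.0],  # scores for the token at position 2
    [0.0, 0.0, 0.0],  # would score position 3, which does not exist
]
toy_ids = [1, 0, 1]

score = sequence_logprob(toy_logits, toy_ids)
print(score)
```

`COPA_eval` applies this kind of summed score to the premise followed by each alternative and picks the alternative with the higher total.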