X-ALMA-13B-Pretrain Open-Source Translation Model: Plug-and-Play Translation for 50 Languages

X ALMA 13B Pretrain

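Developed by haoranxu

X-ALMA is a multilingual machine translation model that extends ALMA-R to 50 languages, using a plug-and-play architecture with language-specific modules.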

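Large language model

Transformers

Multilingual | License: MIT | #Multilingual machine translation #Plug-and-play architecture #50-language support

Downloads: 2,928

Release date: 6/27/2024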

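Model Overview

X-ALMA is a multilingual machine translation model that extends ALMA-R, raising the number of supported languages from 6 to 50. It uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.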

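Model Features

Multilingual support

Supports 50 languages spanning several language families.

Plug-and-play architecture

Uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.

Modular design

You can load the base model together with a language-specific module, or load an already merged model, depending on your needs.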

Model Capabilities

Machine translation

Multilingual open-ended question answering

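Use Cases

Machine translation

Chinese-to-English translation

Translates Chinese text into English.

High-quality translation output

Multilingual translation

Supports translation among the 50 supported languages.

Broad language coverage and high-quality translation

Question answering

Multilingual open-ended question answering

Answers open-ended questions in multiple languages.

Accurate answers and broad language support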

🚀 X-ALMA

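X-ALMA extends ALMA-R, raising the number of supported languages from 6 to 50. It uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe. This release contains the X-ALMA pretrained base model.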

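🚀 Quick Start

There are three ways to load X-ALMA for translation. The example below translates "我爱机器翻译。" into English (X-ALMA can also handle multilingual open-ended question answering).

Method 1: load the merged model, in which the language-specific module has already been merged into the base model (recommended)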

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from peft import PeftModel

GROUP2LANG = {
    1: ["da", "nl", "de", "is", "no", "sv", "af"],
    2: ["ca", "ro", "gl", "it", "pt", "es"],
    3: ["bg", "mk", "sr", "uk", "ru"],
    4: ["id", "ms", "th", "vi", "mg", "fr"],
    5: ["hu", "el", "cs", "pl", "lt", "lv"],
    6: ["ka", "zh", "ja", "ko", "fi", "et"],
    7: ["gu", "hi", "mr", "ne", "ur"],
    8: ["az", "kk", "ky", "tr", "uz", "ar", "he", "fa"],
}
LANG2GROUP = {lang: str(group) for group, langs in GROUP2LANG.items() for lang in langs}
group_id = LANG2GROUP["zh"]

model = AutoModelForCausalLM.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"

# X-ALMA needs chat template but ALMA and ALMA-R don't need it.
chat_style_prompt = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(chat_style_prompt, tokenize=False, add_generation_prompt=True)

input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
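
The checkpoint name and prompt in the example above are hard-coded for Chinese to English. A minimal sketch of how they might be generalized to other languages follows. The helper names `merged_repo_for` and `build_prompt` are hypothetical and not part of the X-ALMA release; the group table and the prompt template are copied from the example above.

```python
# Language groups copied from the quick-start example.
GROUP2LANG = {
    1: ["da", "nl", "de", "is", "no", "sv", "af"],
    2: ["ca", "ro", "gl", "it", "pt", "es"],
    3: ["bg", "mk", "sr", "uk", "ru"],
    4: ["id", "ms", "th", "vi", "mg", "fr"],
    5: ["hu", "el", "cs", "pl", "lt", "lv"],
    6: ["ka", "zh", "ja", "ko", "fi", "et"],
    7: ["gu", "hi", "mr", "ne", "ur"],
    8: ["az", "kk", "ky", "tr", "uz", "ar", "he", "fa"],
}
LANG2GROUP = {lang: group for group, langs in GROUP2LANG.items() for lang in langs}


def merged_repo_for(lang_code: str) -> str:
    """Return the merged-checkpoint repo id for a grouped (non-English) language code."""
    if lang_code not in LANG2GROUP:
        raise ValueError(f"{lang_code!r} is not in any X-ALMA language group")
    return f"haoranxu/X-ALMA-13B-Group{LANG2GROUP[lang_code]}"


def build_prompt(src_name: str, tgt_name: str, text: str) -> str:
    """Fill the translation template used in the example above."""
    return f"Translate this from {src_name} to {tgt_name}:\n{src_name}: {text}\n{tgt_name}:"


print(merged_repo_for("zh"))
print(build_prompt("Chinese", "English", "我爱机器翻译。"))
```

English is not listed in any group, so the lookup is keyed on the non-English language of the pair.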

Method 2: load the base model and the language-specific module separately (recommended)

# Reuses the imports and `group_id` defined in the first example.
model = AutoModelForCausalLM.from_pretrained("haoranxu/X-ALMA-13B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, f"haoranxu/X-ALMA-13B-Group{group_id}")
tokenizer = AutoTokenizer.from_pretrained(f"haoranxu/X-ALMA-13B-Group{group_id}", padding_side='left')
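
Both loading snippets above request `torch.float16` weights for a 13B model. As a back-of-the-envelope estimate of the GPU memory the base weights alone might need, the sketch below assumes roughly 13 billion parameters (inferred from the model name, not a measured count) and ignores the language-specific modules, activations, and the KV cache.

```python
def fp16_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GiB (float16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


# Assumption: about 13e9 parameters, taken from the "13B" in the model name.
print(f"{fp16_weight_gib(13e9):.1f} GiB")  # 24.2 GiB
```

Actual usage during generation will be higher, particularly with `num_beams=5`.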

Method 3: load the base model with all language-specific modules, like a mixture-of-experts (MoE) model (requires a GPU with large memory)

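# Reuses `torch`, `AutoTokenizer`, and `input_ids` from the first example.
# `modeling_xalma` is not part of transformers; the module must be importable locally.
from modeling_xalma import XALMAForCausalLM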
model = XALMAForCausalLM.from_pretrained("haoranxu/X-ALMA", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/X-ALMA", padding_side='left')

# Add `lang="zh"`: specify the language to instruct the model on which group to use for the third loading method during generation.
generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9, lang="zh")
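
For a decoder-only model, `generate` returns each prompt's token ids followed by the newly generated ids, so decoding the full sequence echoes the prompt before the translation. A minimal sketch of dropping the prompt part is shown below on plain Python lists; `strip_prompt_tokens` is a hypothetical helper, and the token ids are toy values.

```python
def strip_prompt_tokens(generated_ids, prompt_length):
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [list(seq[prompt_length:]) for seq in generated_ids]


# Toy ids: a 3-token prompt followed by 2 newly generated tokens.
toy_output = [[101, 102, 103, 7, 8]]
print(strip_prompt_tokens(toy_output, prompt_length=3))  # [[7, 8]]
```

With the tensors from the examples above, the equivalent slice is `generated_ids[:, input_ids.shape[1]:]`, applied before `tokenizer.batch_decode`. Left padding keeps every prompt in a batch the same length, so a single offset works for all rows.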

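✨ Key Features

Multilingual support: building on ALMA-R, X-ALMA extends the supported languages from 6 to 50, spanning several language families.
Plug-and-play architecture: uses a plug-and-play architecture with language-specific modules, paired with a carefully designed training recipe.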

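📦 Model Information

Property	Details
Base model	haoranxu/ALMA-13B-Pretrain
Training datasets	oscar-corpus/OSCAR-2301, allenai/nllb, Helsinki-NLP/opus-100
Supported languages	English (en), Danish (da), Dutch (nl), German (de), Icelandic (is), Norwegian (no), Swedish (sv), Afrikaans (af), Catalan (ca), Romanian (ro), Galician (gl), Italian (it), Portuguese (pt), Spanish (es), Bulgarian (bg), Macedonian (mk), Serbian (sr), Ukrainian (uk), Russian (ru), Indonesian (id), Malay (ms), Thai (th), Vietnamese (vi), Malagasy (mg), French (fr), Hungarian (hu), Greek (el), Czech (cs), Polish (pl), Lithuanian (lt), Latvian (lv), Georgian (ka), Chinese (zh), Japanese (ja), Korean (ko), Finnish (fi), Estonian (et), Gujarati (gu), Hindi (hi), Marathi (mr), Nepali (ne), Urdu (ur), Azerbaijani (az), Kazakh (kk), Kyrgyz (ky), Turkish (tr), Uzbek (uz), Arabic (ar), Hebrew (he), Persian (fa)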
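
As a quick sanity check on the figure of 50 languages, the codes in the table above can be tallied. The sketch below simply transcribes the codes from the table and verifies that none is repeated.

```python
# Language codes transcribed from the table above, in the same order.
CODES = (
    "en da nl de is no sv af ca ro gl it pt es bg mk sr uk ru "
    "id ms th vi mg fr hu el cs pl lt lv ka zh ja ko fi et "
    "gu hi mr ne ur az kk ky tr uz ar he fa"
).split()

# No code should appear twice.
assert len(CODES) == len(set(CODES))

print(len(CODES))  # 50
```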

📚 Detailed Documentation

Citation

@misc{xu2024xalmaplugplay,
      title={X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale}, 
      author={Haoran Xu and Kenton Murray and Philipp Koehn and Hieu Hoang and Akiko Eriguchi and Huda Khayrallah},
      year={2024},
      eprint={2410.03115},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03115}, 
}

Model Links

All X-ALMA checkpoints are released on Hugging Face:

Model	Model link	Description
X-ALMA	[haoranxu/X-ALMA](https://huggingface.co/haoranxu/X-ALMA)	X-ALMA model with all modules
X-ALMA-13B-Pretrain	[haoranxu/X-ALMA-13B-Pretrain](https://huggingface.co/haoranxu/X-ALMA-13B-Pretrain)	X-ALMA 13B multilingual pretrained base model
X-ALMA-Group1	[haoranxu/X-ALMA-13B-Group1](https://huggingface.co/haoranxu/X-ALMA-13B-Group1)	X-ALMA Group 1 language-specific module and the merged model
X-ALMA-Group2	[haoranxu/X-ALMA-13B-Group2](https://huggingface.co/haoranxu/X-ALMA-13B-Group2)	X-ALMA Group 2 language-specific module and the merged model
X-ALMA-Group3	[haoranxu/X-ALMA-13B-Group3](https://huggingface.co/haoranxu/X-ALMA-13B-Group3)	X-ALMA Group 3 language-specific module and the merged model
X-ALMA-Group4	[haoranxu/X-ALMA-13B-Group4](https://huggingface.co/haoranxu/X-ALMA-13B-Group4)	X-ALMA Group 4 language-specific module and the merged model
X-ALMA-Group5	[haoranxu/X-ALMA-13B-Group5](https://huggingface.co/haoranxu/X-ALMA-13B-Group5)	X-ALMA Group 5 language-specific module and the merged model
X-ALMA-Group6	[haoranxu/X-ALMA-13B-Group6](https://huggingface.co/haoranxu/X-ALMA-13B-Group6)	X-ALMA Group 6 language-specific module and the merged model
X-ALMA-Group7	[haoranxu/X-ALMA-13B-Group7](https://huggingface.co/haoranxu/X-ALMA-13B-Group7)	X-ALMA Group 7 language-specific module and the merged model
X-ALMA-Group8	[haoranxu/X-ALMA-13B-Group8](https://huggingface.co/haoranxu/X-ALMA-13B-Group8)	X-ALMA Group 8 language-specific module and the merged model