Dragoman: Open-Source English-to-Ukrainian Translation Model for Free, Efficient, Accurate Sentence-Level Translation

Dragoman

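Developed by lang-uk

Dragoman is a sentence-level English-to-Ukrainian translation model trained with a two-stage pipeline. It reaches a state-of-the-art BLEU score of 32.34 on the English-Ukrainian devtest subset of FLORES-101.

Machine translation

Safetensors

License: Apache-2.0 #English-Ukrainian translation #two-stage training #BLEU 32.34

Downloads: 407

Released: 4/14/2024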

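Model overview

A model designed for sentence-level English-to-Ukrainian translation, built on the Mistral-7B base model with a two-stage training pipeline.

Model features

Two-stage training pipeline

First pretrained on the Paracrawl dataset, then trained on data chosen from Multi30k-uk by unsupervised data selection

State-of-the-art performance

BLEU of 32.34 on the English-Ukrainian devtest subset of FLORES-101

Efficient fine-tuning

Fine-tuned with PEFT (Parameter-Efficient Fine-Tuning)

Model capabilities

English-to-Ukrainian sentence translation

High-quality machine translation

Use cases

Machine translation

Sentence-level translation

Translates English sentences into Ukrainian

Reaches a BLEU score of 32.34 on the FLORES-101 test set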

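🚀 Dragoman: English-Ukrainian Machine Translation Model

Dragoman is a state-of-the-art sentence-level English-to-Ukrainian translation model. It translates one English sentence at a time into Ukrainian.

🚀 Quick Start

Running the model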

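# pip install bitsandbytes transformers peft torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftConfig, PeftModel
import torch

# Adapter configuration (not used further in this snippet)
config = PeftConfig.from_pretrained("lang-uk/dragoman")
# Load the base model in 4-bit NF4 precision to reduce GPU memory use
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# Load the quantized base model, then attach the Dragoman adapter to it
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
)
model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

input_text = "[INST] who holds this neighborhood? [/INST]"  # model input must follow this format
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Beam search with 10 beams. max_new_tokens is an addition to the original
# snippet (the value 100 is an assumption, borrowed from the mlx-lm command);
# without it, generate() falls back to a short default length limit.
outputs = model.generate(**input_ids, num_beams=10, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))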
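
The snippet above hard-codes the prompt and prints the full decoded sequence, which includes the prompt itself. Below is a minimal sketch of two small helpers for wrapping a sentence in the `[INST] ... [/INST]` format and pulling out only the text after the closing tag. The names `build_prompt` and `extract_translation` are hypothetical, not part of the Dragoman repository, and the assumption that special tokens appear as the literal strings `<s>` and `</s>` in the decoded text may not hold for every tokenizer configuration.

```python
INST_OPEN = "[INST]"
INST_CLOSE = "[/INST]"


def build_prompt(sentence: str) -> str:
    """Wrap one English sentence in the [INST] ... [/INST] input format."""
    return f"{INST_OPEN} {sentence.strip()} {INST_CLOSE}"


def extract_translation(decoded: str) -> str:
    """Return the text after the last [/INST], without <s> / </s> markers.

    If [/INST] is absent, the whole decoded string is returned (stripped).
    """
    _, _, tail = decoded.rpartition(INST_CLOSE)
    for marker in ("<s>", "</s>"):
        tail = tail.replace(marker, "")
    return tail.strip()


if __name__ == "__main__":
    print(build_prompt("who holds this neighborhood?"))
    # The decoded string below is a made-up placeholder, not real model output.
    print(extract_translation("[INST] hello [/INST] TRANSLATION GOES HERE</s>"))
```

With these helpers, `input_text = build_prompt(sentence)` would replace the hard-coded prompt, and `extract_translation(tokenizer.decode(outputs[0]))` would replace the final `print` argument.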

Running the model on Apple computers with mlx-lm

We merged the Dragoman PT adapter into the base model and uploaded a quantized version of the model to https://huggingface.co/lang-uk/dragoman-4bit. You can run it with mlx-lm.

python -m mlx_lm.generate --model lang-uk/dragoman-4bit --prompt '[INST] who holds this neighborhood? [/INST]' --temp 0 --max-tokens 100

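MLX is the recommended way to run this language model on Apple computers with an M1 chip or newer.

Running the model with llama.cpp

We converted the Dragoman PT adapter to the GGLA format. You can download the Mistral-7B-v0.1 base model in GGUF format (for example, mistral-7b-v0.1.Q4_K_M.gguf) and use ggml-adapter-model.bin from this repository like this: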

./main -ngl 32 -m mistral-7b-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0 --repeat_penalty 1.1 -n -1 -p "[INST] who holds this neighborhood? [/INST]" --lora ./ggml-adapter-model.bin

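✨ Key Features

Dragoman is a state-of-the-art sentence-level English-to-Ukrainian translation model trained with a two-phase pipeline: pretraining on the cleaned Paracrawl dataset, followed by an unsupervised data selection phase on turuta/Multi30k-uk.
With this two-phase data cleaning and data selection approach, it achieves SOTA performance on the English-Ukrainian devtest subset of FLORES-101, with a BLEU score of 32.34.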

📦 Installation

Install the dependencies before running the model:

pip install bitsandbytes transformers peft torch

📚 Documentation

Model details

Property	Details
Developers	Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
Model type	Translation model
Source language	English
Target language	Ukrainian
License	Apache 2.0

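Model use cases

This model is designed for sentence-level English-to-Ukrainian translation. Performance on multi-sentence texts is not guaranteed.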
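
Because the model is intended for single sentences, one way to handle a longer passage is to split it into sentences first and build one prompt per sentence. The following is a minimal sketch under that assumption; the regex-based splitter and the function names `split_sentences` and `prompts_for_text` are illustrative and not something the model card prescribes. The splitter is deliberately naive and will break incorrectly after abbreviations such as "Dr." or "e.g.".

```python
import re

# Split after ., ! or ? when followed by whitespace (naive; see caveat above).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split a passage into sentences using a simple punctuation rule."""
    parts = _SENTENCE_END.split(text.strip())
    return [p for p in parts if p]


def prompts_for_text(text: str) -> list[str]:
    """Build one [INST] ... [/INST] prompt per sentence of the passage."""
    return [f"[INST] {s} [/INST]" for s in split_sentences(text)]


if __name__ == "__main__":
    for prompt in prompts_for_text("Hello there. How are you? Fine!"):
        print(prompt)
```

Each resulting prompt can then be passed to the model separately. Translating sentences in isolation loses cross-sentence context, so pronouns and terminology may come out inconsistent across the passage.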

Training datasets and resources

Training code: lang-uk/dragoman
Cleaned Paracrawl: lang-uk/paracrawl_3m
Cleaned Multi30K: lang-uk/multi30k-extended-17k

Benchmark results against other models on the FLORES-101 devtest set

Model	BLEU $\uparrow$	spBLEU	chrF	chrF++
Fine-tuned models
Dragoman P, 10 beams	30.38	37.93	59.49	56.41
Dragoman PT, 10 beams	32.34	39.93	60.72	57.82
Zero-shot and few-shot models
LLaMa-2-7B 2-shot	20.1	26.78	49.22	46.29
RWKV-5-World-7B 0-shot	21.06	26.20	49.46	46.46
gpt-4 10-shot	29.48	37.94	58.37	55.38
gpt-4-turbo-preview 0-shot	30.36	36.75	59.18	56.19
Google Translate 0-shot	25.85	32.49	55.88	52.48
Pretrained models
NLLB 3B, 10 beams	30.46	37.22	58.11	55.32
OPUS-MT, 10 beams	32.2	39.76	60.23	57.38
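
To make the gaps in the table easier to read, here is a small sketch that ranks the systems by BLEU and computes score differences. The numbers are copied from the table above; the helper names `rank_by_bleu` and `bleu_gap` are hypothetical and exist only for this illustration.

```python
# (BLEU, spBLEU, chrF, chrF++) per system, copied from the benchmark table.
RESULTS = {
    "Dragoman P, 10 beams": (30.38, 37.93, 59.49, 56.41),
    "Dragoman PT, 10 beams": (32.34, 39.93, 60.72, 57.82),
    "LLaMa-2-7B 2-shot": (20.1, 26.78, 49.22, 46.29),
    "RWKV-5-World-7B 0-shot": (21.06, 26.20, 49.46, 46.46),
    "gpt-4 10-shot": (29.48, 37.94, 58.37, 55.38),
    "gpt-4-turbo-preview 0-shot": (30.36, 36.75, 59.18, 56.19),
    "Google Translate 0-shot": (25.85, 32.49, 55.88, 52.48),
    "NLLB 3B, 10 beams": (30.46, 37.22, 58.11, 55.32),
    "OPUS-MT, 10 beams": (32.2, 39.76, 60.23, 57.38),
}


def rank_by_bleu() -> list[str]:
    """Return system names sorted from highest to lowest BLEU."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][0], reverse=True)


def bleu_gap(system_a: str, system_b: str) -> float:
    """BLEU of system_a minus BLEU of system_b, rounded to two decimals."""
    return round(RESULTS[system_a][0] - RESULTS[system_b][0], 2)


if __name__ == "__main__":
    for name in rank_by_bleu():
        print(f"{RESULTS[name][0]:6.2f}  {name}")
    print(bleu_gap("Dragoman PT, 10 beams", "Dragoman P, 10 beams"))
    print(bleu_gap("Dragoman PT, 10 beams", "OPUS-MT, 10 beams"))
```

Read this way, the second training phase (PT versus P) adds 1.96 BLEU, while the margin over the strongest baseline in the table, OPUS-MT, is 0.14 BLEU.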

📄 License

This model is released under the Apache 2.0 license.

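🔧 Technical Details

The model is trained with a two-phase pipeline:

Pretraining phase: pretraining on the cleaned Paracrawl dataset.
Unsupervised data selection phase: training on data selected from turuta/Multi30k-uk. With this two-phase data cleaning and data selection approach, the model achieves SOTA performance on the English-Ukrainian devtest subset of FLORES-101.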
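
The paper abstract cited at the end of this page describes the data selection step as k-fold perplexity filtering. The sketch below illustrates that general idea only, under loud assumptions: it scores sentences with a toy add-one-smoothed unigram model instead of the fine-tuned language model the authors would have used, and the fold count `k` and `keep_fraction` are made-up parameters, not values from the paper. Each sentence is scored by a model fitted on the other folds, and the lowest-perplexity sentences are kept.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Fit a toy unigram model: token counts and the total token count."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-token perplexity under the add-one-smoothed unigram model."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def kfold_perplexity_filter(sentences, k=3, keep_fraction=0.5):
    """Score each sentence with a model fitted on the other folds,
    then keep the keep_fraction of sentences with the lowest perplexity."""
    scored = []
    for fold in range(k):
        held_out = sentences[fold::k]
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        counts, total = train_unigram(train)
        scored.extend((perplexity(s, counts, total), s) for s in held_out)
    scored.sort(key=lambda pair: pair[0])
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [s for _, s in scored[:n_keep]]


if __name__ == "__main__":
    data = [
        "a dog runs in the park",
        "a dog sits in the park",
        "a cat runs in the park",
        "a cat sits in the park",
        "a dog runs in the yard",
        "zxq vbn qwe rty uio plm",  # noise line, expected to be dropped
    ]
    for kept in kfold_perplexity_filter(data):
        print(kept)
```

Scoring held-out folds keeps a sentence from being judged by a model that has already seen it, which would otherwise make every example look artificially likely.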

📚 Citation

@inproceedings{paniv-etal-2024-dragoman,
    title = "Setting up the Data Printer with Improved {E}nglish to {U}krainian Machine Translation",
    author = "Paniv, Yurii  and
      Chaplynskyi, Dmytro  and
      Trynus, Nikita  and
      Kyrylov, Volodymyr",
    editor = "Romanyshyn, Mariana  and
      Romanyshyn, Nataliia  and
      Hlybovets, Andrii  and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.6",
    pages = "41--50",
    abstract = "To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.",
}