Language support:
- Assamese
- Bengali
- Bodo
- Dogri
- English
- Konkani
- Gujarati
- Hindi
- Kannada
- Kashmiri (Arabic script)
- Kashmiri (Devanagari script)
- Maithili
- Malayalam
- Marathi
- Manipuri (Bengali script)
- Manipuri (Meitei script)
- Nepali
- Odia
- Punjabi
- Sanskrit
- Santali (Ol Chiki script)
- Sindhi (Arabic script)
- Sindhi (Devanagari script)
- Tamil
- Telugu
- Urdu
Language details:
asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
Tags:
License: MIT
Datasets:
- flores-200
- IN22-Gen
- IN22-Conv
Metrics:
Inference: not supported
IndicTrans2
This is the model card of the IndicTrans2 En-Indic Distilled 200M variant.
Please refer to section 7.6: Distilled Models in the TMLR submission for further details on model training, data, and metrics.
Usage
Please refer to the GitHub repository for detailed instructions on how to use the Hugging Face-compatible IndicTrans2 models for inference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

# Run on GPU if available, otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# FLORES-200 style codes: English (Latin script) -> Hindi (Devanagari script).
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# flash_attention_2 requires a supported CUDA GPU and the flash-attn package.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Prepend the source/target language tags that the model expects.
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode token ids back into text.
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Convert the raw decoded output into clean target-language text.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
📢 Long Context IT2 Models
- Newer RoPE-based IndicTrans2 models capable of handling sequence lengths up to 2048 tokens are available here.
- These models can be used by simply changing the model_name parameter (see the sketch after this list). Please read the model card of the RoPE-IT2 models for further information about generation.
- It is recommended to run these models with flash_attention_2 for efficient generation.
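For illustration, only the model loading step changes; the checkpoint identifier below is a hypothetical placeholder, not a confirmed model name (check the hub page linked above), and the snippet reuses the imports from the example above:

# Hypothetical placeholder for a RoPE long-context checkpoint; substitute the
# actual identifier from the AI4Bharat hub page.
model_name = "ai4bharat/<rope-it2-en-indic-checkpoint>"

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for these models
).to(DEVICE)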
Citation
If you use our work, please cite:
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}