IndicTrans2: Open-Source Indic-to-Indic Translation Model with Free Support for Translation Among 22 Official Indian Languages

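IndicTrans2 Indic-Indic Dist 320M

Developed by ai4bharat

IndicTrans2 is a high-quality machine translation model that supports translation among the 22 official languages of India. This is the distilled variant with 320M parameters.

Machine Translation

Transformers

License: MIT #Indic-to-Indic-translation #multilingual-translation #high-accuracy-translation

Downloads: 4,254

Released: 11/28/2023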

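Model Overview

This model is built for translation among India's 22 official languages. It is a distilled variant, which aims to keep translation quality high while reducing model size and inference cost.

Model Features

Multilingual support

Supports translation among the 22 official languages of India

High-quality translation

Uses distillation to retain translation quality in a smaller model

Efficient inference

Supports flash_attention for faster inference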
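
Because flash attention is optional, loading code may want to request it only when it is actually usable. The following is a minimal sketch under stated assumptions: `attention_kwargs` is a hypothetical helper (not part of Transformers or IndicTransToolkit), and it assumes that the importability of the `flash_attn` package on a CUDA device is a good enough signal.

```python
import importlib.util
from typing import Optional


def attention_kwargs(device: str, has_flash_attn: Optional[bool] = None) -> dict:
    """Hypothetical helper: build extra keyword arguments for from_pretrained.

    Requests flash_attention_2 only when running on CUDA and the flash_attn
    package is importable; otherwise returns an empty dict so the library's
    default attention implementation is used.
    """
    if has_flash_attn is None:
        # Assumption: being able to locate the package means it is usable.
        has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if device == "cuda" and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    return {}
```

The returned dict could then be unpacked into the model loading call, for example `AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True, **attention_kwargs(DEVICE))`.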

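Model Capabilities

Text translation

Multilingual translation

Cross-lingual conversion

Use Cases

Cross-lingual communication

Government document translation

Translating government documents between Indian languages

News localization

Translating news content into regional languages

Education

Teaching material translation

Translating educational material into different language versions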

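🚀 IndicTrans2

IndicTrans2 is a translation model for Indic languages. This is the Indic-Indic Distilled 320M variant, adapted after stitching together the Indic-En Distilled 200M and En-Indic Distilled 200M variants. It enables high-quality translation among Indian languages and supports multilingual communication.

✨ Key Features

Multilingual support: covers 22 Indic languages, including Assamese (as) and Bengali (bn).
Multi-domain data: associated with datasets such as flores-200, IN22-Gen, and IN22-Conv.
Multiple evaluation metrics: BLEU, chrF, chrF++, and COMET.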
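
To make the character-level metric above more concrete, here is a simplified, sentence-level sketch of the chrF idea: average character n-gram precision and recall over several n-gram orders, then combine them with an F-beta score. This is an illustration only, not the sacreBLEU implementation; the function names, the whitespace handling, `max_n = 6`, and `beta = 2` are assumptions made for this sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in [0, 1] (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

For reported scores, an established metric implementation should be used instead; this sketch only shows the shape of the computation.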

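📦 Installation

The source model card gives no specific installation steps. See the GitHub repository for detailed instructions on running inference with the HF-compatible IndicTrans2 models.

💻 Usage Example

Basic usage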

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
# Recommended: run this on a GPU with flash_attn installed.
# Don't set attn_implementation if you don't have flash_attn.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

src_lang, tgt_lang = "hin_Deva", "tam_Taml"
model_name = "ai4bharat/indictrans2-indic-indic-dist-320M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    trust_remote_code=True, 
    torch_dtype=torch.float16, # performance might slightly vary for bfloat16
    attn_implementation="flash_attention_2"
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

batch = ip.preprocess_batch(
    input_sentences,
    src_lang=src_lang,
    tgt_lang=tgt_lang,
)

# Tokenize the sentences and generate input encodings
inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode the generated tokens into text
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Postprocess the translations, including entity replacement
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
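
The example above translates four sentences in a single batch. For longer inputs, one option is to split the sentences into fixed-size chunks and run the same preprocess, generate, and postprocess steps on each chunk. The sketch below shows only the chunking logic; `batched`, `translate_all`, and the `translate_batch` callback are hypothetical names introduced here, and the default batch size of 8 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List


def batched(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def translate_all(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate sentences chunk by chunk, preserving input order.

    translate_batch is a caller-supplied function (hypothetical here) that
    wraps the preprocess / tokenize / generate / decode / postprocess steps
    for one batch and returns one translation per input sentence.
    """
    translations: List[str] = []
    for chunk in batched(sentences, batch_size):
        translations.extend(translate_batch(chunk))
    return translations
```

Smaller chunks lower peak memory use at the cost of more `generate` calls; a suitable value depends on the hardware and sentence lengths.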

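📚 Documentation

For more details on model training, data, and evaluation metrics, please refer to the blog.

📄 License

This model is released under the MIT license.

📖 Citation

If you use this work, please cite it as follows: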

@article{gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}

Model Information Table

Property	Details
Supported languages	as, bn, brx, doi, gom, gu, hi, kn, ks, mai, ml, mr, mni, ne, or, pa, sa, sat, snd, ta, te, ur
Language details	asm_Beng, ben_Beng, brx_Deva, doi_Deva, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, mai_Deva, mal_Mlym, mar_Deva, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Deva, tam_Taml, tel_Telu, urd_Arab
Tags	indictrans2, translation, ai4bharat, multilingual
License	MIT
Datasets	flores-200, IN22-Gen, IN22-Conv
Evaluation metrics	bleu, chrf, chrf++, comet
Inference feature	Not supported
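
The usage example passes script-tagged codes such as `hin_Deva` and `tam_Taml`, while the table also lists short codes such as `hi` and `ta`. A small lookup can bridge the two. The sketch below assumes the two language rows of the table are listed in the same order and pairs them by position; `LANG_CODE_MAP` and `to_model_code` are hypothetical names, not part of any library.

```python
# Short codes and script-tagged codes copied from the table above.
# Assumption: both rows are in the same order, so they pair up by position.
SHORT_CODES = (
    "as bn brx doi gom gu hi kn ks mai ml mr mni ne or pa sa sat snd ta te ur"
).split()
MODEL_CODES = (
    "asm_Beng ben_Beng brx_Deva doi_Deva gom_Deva guj_Gujr hin_Deva kan_Knda "
    "kas_Arab mai_Deva mal_Mlym mar_Deva mni_Mtei npi_Deva ory_Orya pan_Guru "
    "san_Deva sat_Olck snd_Deva tam_Taml tel_Telu urd_Arab"
).split()

LANG_CODE_MAP = dict(zip(SHORT_CODES, MODEL_CODES))


def to_model_code(short_code: str) -> str:
    """Map a short language code (e.g. "hi") to its script-tagged code."""
    try:
        return LANG_CODE_MAP[short_code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {short_code!r}") from None
```

With this in place, `src_lang, tgt_lang = to_model_code("hi"), to_model_code("ta")` yields the same pair used in the usage example.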