Language support:
- Assamese
- Bengali
- Bodo
- Dogri
- English
- Konkani
- Gujarati
- Hindi
- Kannada
- Kashmiri (Arabic script)
- Kashmiri (Devanagari script)
- Maithili
- Malayalam
- Marathi
- Manipuri (Bengali script)
- Manipuri (Meitei script)
- Nepali
- Odia
- Punjabi
- Sanskrit
- Santali (Ol Chiki script)
- Sindhi (Arabic script)
- Sindhi (Devanagari script)
- Tamil
- Telugu
- Urdu
Language details:
asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
Tags:
License: MIT
Datasets:
- flores-200
- IN22-Gen
- IN22-Conv
Metrics:
Inference: not supported
IndicTrans2
This is the model card of the IndicTrans2 En-Indic Distilled 200M variant.
Please refer to section 7.6: Distilled Models in the TMLR submission for further details on model training, data, and metrics.
Usage
Please refer to the GitHub repository for detailed instructions on how to use the Hugging Face-compatible IndicTrans2 models for inference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor

# Run on GPU if available, otherwise fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# FLORES-200 style codes: English (Latin script) -> Hindi (Devanagari script).
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# flash_attention_2 requires a supported CUDA GPU and the flash-attn package.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(DEVICE)

ip = IndicProcessor(inference=True)

input_sentences = [
    "When I was young, I used to go to the park every day.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

# Prepend the source/target language tags that the model expects.
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

inputs = tokenizer(
    batch,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(DEVICE)

# Generate translations with beam search.
with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
    )

# Decode token ids back into text.
generated_tokens = tokenizer.batch_decode(
    generated_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# Convert the raw decoded output into clean target-language text.
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")
📢 Long Context IT2 Models
- Newer RoPE-based IndicTrans2 models capable of handling sequence lengths up to 2048 tokens are available here.
- These models can be used by simply changing the model_name parameter (see the sketch after this list). Please read the model card of the RoPE-IT2 models for further information about generation.
- It is recommended to run these models with flash_attention_2 for efficient generation.
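For illustration, only the model loading step changes; the checkpoint identifier below is a hypothetical placeholder, not a confirmed model name (check the hub page linked above), and the snippet reuses the imports from the example above:

# Hypothetical placeholder for a RoPE long-context checkpoint; substitute the
# actual identifier from the AI4Bharat hub page.
model_name = "ai4bharat/<rope-it2-en-indic-checkpoint>"

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # recommended for these models
).to(DEVICE)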
Citation
If you use our work, please cite:
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}