library_name: transformers
tags:
- seq2seq
license: apache-2.0
datasets:
- Helsinki-NLP/europarl
- Helsinki-NLP/opus-100
language:
- en
- it
base_model:
- bigscience/mt0-small
pipeline_tag: translation
metrics:
- bleu
🍀 Quadrifoglio - A small model for English-to-Italian translation
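Quadrifoglio is an encoder-decoder transformer model based on the bigscience/mt0-small architecture and focused on English-to-Italian text translation. It was trained on the English-Italian parallel data from Helsinki-NLP/opus-100 and Helsinki-NLP/europarl.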
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("LeonardPuettmann/Quadrifoglio-mt-en-it")
model = AutoModelForSeq2SeqLM.from_pretrained("LeonardPuettmann/Quadrifoglio-mt-en-it")

def generate_response(input_text):
    # Prepend the task prefix, then generate and decode the translation
    input_ids = tokenizer("translate English to Italian:" + input_text, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text_to_translate = "I would like a cup of green tea, please."
response = generate_response(text_to_translate)
print(response)
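Because the model is optimized for translating single sentences, split longer texts into sentences first (spaCy is recommended), translate each sentence, and then join the results: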
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import spacy

# English pipeline, used here only for sentence splitting
# (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

tokenizer = AutoTokenizer.from_pretrained("LeonardPuettmann/Quadrifoglio-mt-en-it")
model = AutoModelForSeq2SeqLM.from_pretrained("LeonardPuettmann/Quadrifoglio-mt-en-it")

def generate_response(input_text):
    # The model translates English to Italian, so the prefix names that direction
    input_ids = tokenizer("translate English to Italian: " + input_text, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "How are you doing? The weather is nice today. I hope all is well with you."

# Split the text into sentences
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

# Translate each sentence, then join the translations with spaces
sentence_translations = []
for sentence in sentences:
    sentence_translations.append(generate_response(sentence))

full_translation = " ".join(sentence_translations)
print(full_translation)
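The split, translate, and join steps above can also be wrapped in a small reusable helper. The following is a minimal sketch and not part of the model's API: `split_sentences` and `translate_long_text` are hypothetical names, the regex splitter is a rough dependency-free stand-in for spaCy, and the `translate` argument is assumed to be any function that maps one English sentence to its translation, such as `generate_response`.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Rough splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def translate_long_text(text: str, translate: Callable[[str], str]) -> str:
    """Translate `text` sentence by sentence and join the results with spaces."""
    return " ".join(translate(sentence) for sentence in split_sentences(text))


# Stand-in translator for demonstration only (hypothetical);
# pass generate_response instead to run the actual model.
def fake_translate(sentence: str) -> str:
    return "<" + sentence + ">"


result = translate_long_text("How are you? The weather is nice today.", fake_translate)
print(result)
```

Passing the translation function as an argument keeps the splitting logic testable without loading the model. The regex splitter will mis-split abbreviations such as "e.g.", which is one reason to prefer spaCy for real input.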
Evaluation
Evaluated on the Opus 100 test set.
BLEU metrics
| Metric | Quadrifoglio (this model) | mt0-small | DeepL |
|---|---|---|---|
| BLEU score | 0.4816 | 0.0159 | 0.5210 |
| 1-gram precision | 0.7305 | 0.2350 | 0.7613 |
| 2-gram precision | 0.5413 | 0.0290 | 0.5853 |
| 3-gram precision | 0.4289 | 0.0076 | 0.4800 |
| 4-gram precision | 0.3417 | 0.0013 | 0.3971 |
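For readers unfamiliar with the rows of this table: BLEU combines the modified 1-gram to 4-gram precisions through a geometric mean and multiplies the result by a brevity penalty. The following is a minimal pure-Python sketch of that standard definition, assuming whitespace tokenization, no smoothing, and one reference per sentence. It is not the evaluation script used for the numbers above; `corpus_bleu` is a hypothetical helper name and the example sentences are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str],
                max_n: int = 4) -> Tuple[float, List[float]]:
    """Return (BLEU, [1-gram .. max_n-gram precisions]) on a 0-1 scale."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if hyp_len == 0 or min(precisions) == 0.0:
        return 0.0, precisions

    # The brevity penalty lowers the score when hypotheses are shorter
    # than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geo_mean, precisions


score, precisions = corpus_bleu(
    ["vorrei una tazza di tè verde"],
    ["vorrei una tazza di tè verde per favore"],
)
print(f"BLEU: {score:.4f}", [round(p, 4) for p in precisions])
```

In this example every n-gram of the hypothesis appears in the reference, so all four precisions are 1.0, yet the score stays below 1 because the hypothesis is shorter than the reference and the brevity penalty applies.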