---
license: other
license_name: ntt-license
license_link: LICENSE
language:
- ja
- en
pipeline_tag: translation
library_name: fairseq
tags:
- nmt
---
Sugoi v4, a Japanese→English neural machine translation model developed by MingShiba
- Website: https://sugoitoolkit.com
- Blog: https://blog.sugoitoolkit.com
- Support the developer: https://www.patreon.com/mingshiba
How to download this model using Python
- Install Python: https://www.python.org/downloads/
- Open a command prompt (cmd)
- Check the Python version:

```
python --version
```

- Install the required library:

```
python -m pip install huggingface_hub
```

- Run this Python code:

```python
import huggingface_hub

huggingface_hub.snapshot_download('entai2965/sugoi-v4-ja-en-ctranslate2', local_dir='sugoi-v4-ja-en-ctranslate2')
```
How to run this model (batch syntax)
- Reference: https://opennmt.net/CTranslate2/guides/fairseq.html#fairseq
- Open a command prompt (cmd)
- Install the required libraries:

```
python -m pip install ctranslate2 sentencepiece
```

- Run this Python code:

```python
import ctranslate2
import sentencepiece

model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'
device = 'cpu'  # or 'cuda'

string1 = 'は静かに前へと歩み出た。'
string2 = '悲しいGPTと話したことがありますか?'
raw_list = [string1, string2]

translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.en.nopretok.model')

# Tokenize every input string into subword pieces
tokenized_batch = []
for text in raw_list:
    tokenized_batch.append(tokenizer_for_source_language.encode(text, out_type=str))

# Translate the whole batch in one call
translated_batch = translator.translate_batch(source=tokenized_batch, beam_size=5)
assert len(raw_list) == len(translated_batch)

# Decode the best hypothesis for each entry and drop any <unk> tokens
for count, tokens in enumerate(translated_batch):
    translated_batch[count] = tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')

for text in translated_batch:
    print(text)
```
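The batch example above translates a fixed two-sentence list. For longer inputs, text is usually split into sentences before batching, since the model translates sentence by sentence. A minimal sketch of such a splitter (a helper of my own, not part of the toolkit), using only the standard library:

```python
import re

def split_japanese_sentences(text):
    """Split text after Japanese sentence-final punctuation,
    keeping each punctuation mark attached to its sentence."""
    parts = re.split(r'(?<=[。!?？])', text)
    return [part.strip() for part in parts if part.strip()]

# The resulting list can then be fed into the tokenize/translate loop as raw_list
raw_list = split_japanese_sentences('は静かに前へと歩み出た。悲しいGPTと話したことがありますか?')
print(raw_list)
```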
Functional programming version

```python
import ctranslate2
import sentencepiece

model_path = 'sugoi-v4-ja-en-ctranslate2'
sentencepiece_model_path = model_path + '/spm'
device = 'cpu'

string1 = 'は静かに前へと歩み出た。'
string2 = '悲しいGPTと話したことがありますか?'
raw_list = [string1, string2]

translator = ctranslate2.Translator(model_path, device=device)
tokenizer_for_source_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.ja.nopretok.model')
tokenizer_for_target_language = sentencepiece.SentencePieceProcessor(sentencepiece_model_path + '/spm.en.nopretok.model')

# Tokenize, translate, and decode in a single comprehension
translated_batch = [
    tokenizer_for_target_language.decode(tokens.hypotheses[0]).replace('<unk>', '')
    for tokens in translator.translate_batch(
        source=[tokenizer_for_source_language.encode(text, out_type=str) for text in raw_list],
        beam_size=5,
    )
]
assert len(raw_list) == len(translated_batch)

for text in translated_batch:
    print(text)
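Both versions strip leftover `<unk>` tokens with a plain `str.replace`, which can leave doubled spaces behind. A small sketch of a slightly more thorough post-processing helper (the function name is my own, not part of CTranslate2 or sentencepiece):

```python
import re

def clean_translation(text):
    """Drop leftover <unk> tokens, then collapse repeated spaces."""
    text = text.replace('<unk>', '')
    return re.sub(r' {2,}', ' ', text).strip()

print(clean_translation('He walked <unk> forward quietly.'))
```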