---
tags:
- generated_from_trainer
datasets:
- UrukHan/wav2vec2-russian
widget:
- text: "After Russia launched its special operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions severe but said Russia had prepared for them in advance."
model-index:
- name: t5-russian-summarization
  results: []
---
# t5-russian-summarization
A model for summarizing text recognized from audio. My speech recognition model is available at https://huggingface.co/UrukHan/wav2vec2-russian, and its output can be fed directly into this model (see the sketch after the example table below). The test data was taken from random YouTube videos.
| Input | Output |
|---|---|
| After Russia launched its special operation to demilitarize Ukraine, the West imposed several rounds of new economic sanctions. The Kremlin called the new restrictions severe but said Russia had prepared for them in advance. | The West imposes new sanctions on Russia |
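Since the summarizer is meant to run on the output of the speech recognition model, the snippet below is a minimal sketch of chaining the two models with the `transformers` ASR pipeline. The audio file name is a hypothetical placeholder and this chaining is not part of the original notebook.

```python
from transformers import pipeline, AutoModelForSeq2SeqLM, T5TokenizerFast

# Transcribe speech with the ASR model mentioned above (file name is hypothetical)
asr = pipeline("automatic-speech-recognition", model="UrukHan/wav2vec2-russian")
transcript = asr("example.wav")["text"]

# Summarize the transcript with this model
tokenizer = T5TokenizerFast.from_pretrained("UrukHan/t5-russian-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("UrukHan/t5-russian-summarization")

encoded = tokenizer("Spell correct: " + transcript,
                    max_length=256, truncation=True, return_tensors="pt")
summary_ids = model.generate(**encoded)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```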
Training dataset:

UrukHan/t5-russian-summarization: https://huggingface.co/datasets/UrukHan/t5-russian-summarization
## Usage example and Colab walkthrough

A fully annotated Colab notebook: https://colab.research.google.com/drive/1ame2va9_NflYqy4RZ07HYmQ0moJYy7w2?usp=sharing
```python
!pip install transformers

from transformers import AutoModelForSeq2SeqLM, T5TokenizerFast

MODEL_NAME = 'UrukHan/t5-russian-summarization'
MAX_INPUT = 256  # maximum input length in tokens

# Load the tokenizer and the model
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# One or more transcripts to summarize
input_sequences = ['After Russia launched its special operation to demilitarize Ukraine, the West...']

task_prefix = "Spell correct: "  # prefix the model expects on every input
encoded = tokenizer(
    [task_prefix + seq for seq in input_sequences],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)

# Generate and decode the summaries
predicts = model.generate(**encoded)
tokenizer.batch_decode(predicts, skip_special_tokens=True)
```
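By default `generate` uses greedy decoding. If longer or higher-quality summaries are needed, the standard generation arguments can be tuned as sketched below; the specific values are illustrative assumptions, not settings from the original notebook.

```python
# Optional: beam search with a length cap (values are illustrative assumptions)
predicts = model.generate(
    **encoded,
    max_length=64,        # cap the summary length in tokens
    num_beams=4,          # beam search instead of greedy decoding
    early_stopping=True,  # stop once all beams have finished
)
print(tokenizer.batch_decode(predicts, skip_special_tokens=True))
```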
## Training setup guide

A customizable Colab training notebook: https://colab.research.google.com/drive/1H4IoasDqa2TEjGivVDp-4Pdpm0oxrCWd?usp=sharing
```python
!pip install datasets transformers sentencepiece rouge_score
!apt install git-lfs

from transformers import Seq2SeqTrainer, T5TokenizerFast, AutoModelForSeq2SeqLM

REPO = "t5-russian-summarization"                # repository name for the trained model
MODEL_NAME = "UrukHan/t5-russian-summarization"  # checkpoint to fine-tune from
MAX_INPUT = 256   # maximum input length in tokens
MAX_OUTPUT = 64   # maximum summary length in tokens

tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
prefix = "Spell correct: "  # task prefix prepended to every input

def preprocess_function(examples):
    # Tokenize the documents (with the task prefix) and the target summaries
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=MAX_OUTPUT, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
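The snippet above only defines the preprocessing step. As a rough sketch of how it plugs into training, the code below loads the dataset and wires up `Seq2SeqTrainer`. The split and column names, hyperparameters, and collator choice are assumptions for illustration; the Colab notebook linked above is the authoritative recipe.

```python
from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments

# Assumption: the dataset exposes a "train" split with "document" and "summary" columns,
# matching the fields used in preprocess_function above.
dataset = load_dataset("UrukHan/t5-russian-summarization")
tokenized = dataset.map(preprocess_function, batched=True)

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads inputs and labels per batch

# Hyperparameters below are placeholders, not the values used for the published checkpoint
args = Seq2SeqTrainingArguments(
    output_dir=REPO,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```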
Data format conversion example:
```python
import torch

# Raw training pairs: source texts and reference summaries
input_data = ['After Russia launched its special operation to demilitarize Ukraine, the West...']
output_data = ['The West imposes new sanctions']

task_prefix = "Spell correct: "
encoding = tokenizer(
    [task_prefix + seq for seq in input_data],
    padding="longest",
    max_length=MAX_INPUT,
    truncation=True,
    return_tensors="pt",
)

# Tokenize the targets and replace padding with -100 so it is ignored by the loss
labels = tokenizer(output_data, padding="longest", max_length=MAX_OUTPUT, truncation=True).input_ids
labels = torch.tensor(labels)
labels[labels == tokenizer.pad_token_id] = -100
```
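For context (this step is not shown in the original snippet), these tensors would be consumed by the model as sketched below; label positions set to -100 are skipped by the cross-entropy loss.

```python
# Forward pass with labels returns the training loss; -100 label positions are ignored
outputs = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    labels=labels,
)
print(outputs.loss)
```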