chatgpt_paraphraser_on_T5_base: Open-Source Text Paraphrasing Model

Chatgpt Paraphraser On T5 Base

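Developed by humarin

A text paraphrasing model trained on the T5-base architecture that generates high-quality paraphrases. It is billed as one of the best paraphrasing models on Hugging Face.

Text Generation

Transformers

Language: English · License: OpenRAIL · #varied sentence structures #high-diversity generation #T5 architecture optimization

Downloads: 115.08k

Released: 3/17/2023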

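Model Overview

The model uses transfer learning to mimic ChatGPT's paraphrasing ability. It was trained on data drawn from the Quora, SQuAD 2.0, and CNN news datasets and is intended mainly for text rewriting and paraphrasing tasks.

Model Features

Multi-source training data

Combines three high-quality datasets: Quora paraphrase questions, SQuAD 2.0, and CNN news.

Advanced generation control

Supports text generation parameters such as beam search and a diversity penalty.

High-quality paraphrasing

Mimics ChatGPT's paraphrasing through transfer learning, producing varied wording that preserves the original meaning.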

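Model Capabilities

Text paraphrasing

Meaning-preserving rewriting

Generation of varied phrasings

Use Cases

Content creation

Travel guide rewriting

Produces varied descriptions of tourist attractions.

Generates five differently worded introductions to the same attraction.

News summary rewriting

Paraphrases news content without repeating the original wording.

Produces several versions that keep the original meaning.

Educational support

Diversifying learning materials

Generates different explanations of the same concept.

Helps students understand a concept from several angles.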

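🚀 Text Paraphrasing Model

This project builds on the T5-base model and uses transfer learning so that the model generates high-quality paraphrases in the manner of ChatGPT. It is among the stronger paraphrasing models available on Hugging Face.

🚀 Quick Start

The model was trained on the ChatGPT paraphrases dataset, which is built from the Quora paraphrase questions, SQuAD 2.0, and CNN news datasets.

Deployment example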

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    # The input text is sent to the model with the "paraphrase: " task prefix.
    input_ids = tokenizer(
        f'paraphrase: {question}',
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)

    # Diverse (group) beam search: num_beam_groups and diversity_penalty push the
    # returned candidates apart. temperature only affects sampling, so with
    # sampling left off it likely has no effect here.
    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    # Decode the generated token IDs back into strings.
    res = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return res

💻 Usage Examples

Basic usage

# Example input
text = 'What are the best places to see in New York?'
paraphrase(text)

# Example output
['What are some must-see places in New York?',
 'Can you suggest some must-see spots in New York?',
 'Where should one go to experience the best NYC has to offer?',
 'Which places should I visit in New York?',
 'What are the top destinations to explore in New York?']

Advanced usage

# Example input
text = "Rammstein's album Mutter was recorded in the south of France in May and June 2000, and mixed in Stockholm in October of that year."
paraphrase(text)

# Example output
['In May and June 2000, Rammstein travelled to the south of France to record his album Mutter, which was mixed in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year. It',
 'Mutter, the album released by Rammstein, was recorded in southern France during May and June 2000, with mixing taking place between October and September.',
 'In May and June 2000, Rammstein recorded his album Mutter in the south of France, with the mix being made at Stockholm during October.']
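
In the output above, the third candidate repeats the second and then trails off with a dangling "It". Where that matters, a small post-processing step can filter such candidates. The helper below is a hypothetical sketch, not part of the model or of `transformers`: the name `drop_prefix_duplicates` is made up for illustration. It keeps the first candidate seen and drops any later one that merely extends it, or is a truncated prefix of it, after whitespace and case normalization.

```python
def drop_prefix_duplicates(candidates):
    """Keep the first of any group of candidates where one is a prefix of another."""
    kept = []  # list of (normalized, original) pairs, in input order
    for cand in candidates:
        # Normalize: collapse runs of whitespace and ignore case.
        norm = " ".join(cand.split()).lower()
        if not norm:
            continue  # skip empty candidates
        # Drop the candidate if it extends, or is a prefix of, something already kept.
        if any(norm.startswith(k) or k.startswith(norm) for k, _ in kept):
            continue
        kept.append((norm, cand))
    return [original for _, original in kept]


# Shortened stand-ins for paraphrase() output, for illustration only.
candidates = [
    "The album was recorded in France.",
    "The album was recorded in France. It",
    "In France, the album was recorded.",
]
print(drop_prefix_duplicates(candidates))
```

Because the first occurrence wins, the result depends on the order in which candidates are returned. A truncated candidate that happens to come before its complete counterpart would be the one that is kept.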

🔧 Technical Details

Training parameters

epochs = 5
batch_size = 64
max_length = 128
lr = 5e-5
batches_qty = 196465
betas = (0.9, 0.999)
eps = 1e-08
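
The values above can be gathered into a small config object, which makes simple derived quantities easy to check. The sketch below is illustrative only. `TrainingConfig` and `max_examples_per_pass` are hypothetical names, and the calculation assumes that `batches_qty` counts the batches in a single pass over the data, which the source does not state. The `betas` and `eps` values look like Adam-family optimizer settings, but the optimizer itself is not named here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the training parameters listed above.
    epochs: int = 5
    batch_size: int = 64
    max_length: int = 128
    lr: float = 5e-5
    batches_qty: int = 196465
    betas: tuple = (0.9, 0.999)
    eps: float = 1e-08

    def max_examples_per_pass(self) -> int:
        # Assumption: batches_qty is the number of batches in one pass.
        # This is an upper bound, since the last batch may be only partly filled.
        return self.batches_qty * self.batch_size


cfg = TrainingConfig()
print(cfg.max_examples_per_pass())  # 12573760
```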

BibTeX Citation

@inproceedings{chatgpt_paraphraser,
  author={Vladimir Vorobev and Maxim Kuznetsov},
  title={A paraphrasing model based on ChatGPT paraphrases},
  year={2023}
}

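📄 License

This project is released under the OpenRAIL license.

📦 Model Information

Property	Details
Model type	Text paraphrasing model based on T5-base
Training data	The ChatGPT paraphrases dataset, built from the Quora paraphrase questions, SQuAD 2.0, and CNN news datasets
Inference parameters	num_beams: 5; num_beam_groups: 5; num_return_sequences: 5; repetition_penalty: 10.01; diversity_penalty: 3.01; no_repeat_ngram_size: 2; temperature: 0.7; max_length: 128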
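
The inference parameters are not independent of one another. As a rough illustration, the hypothetical helper `check_group_beam_args` below encodes three constraints that diverse (group) beam search in `transformers` is generally understood to require: the beam count is divisible by the number of groups, no more sequences are returned than there are beams, and the diversity penalty is positive when more than one group is used. It is a standalone sketch and does not call the library, so treat the exact rules as assumptions to verify against the installed version.

```python
def check_group_beam_args(num_beams, num_beam_groups, num_return_sequences, diversity_penalty):
    """Return a list of problems; an empty list means the settings look consistent."""
    if num_beams < 1 or num_beam_groups < 1:
        return ["num_beams and num_beam_groups must both be at least 1"]

    problems = []
    if num_beams % num_beam_groups != 0:
        problems.append("num_beams should be divisible by num_beam_groups")
    if num_return_sequences > num_beams:
        problems.append("num_return_sequences should not exceed num_beams")
    if num_beam_groups > 1 and diversity_penalty <= 0.0:
        problems.append("diversity_penalty should be positive when using several beam groups")
    return problems


# Settings taken from the inference parameters above.
print(check_group_beam_args(
    num_beams=5, num_beam_groups=5, num_return_sequences=5, diversity_penalty=3.01
))  # []
```

Running a check like this before calling `paraphrase` with custom arguments gives a clearer message than a failure deep inside generation.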