Propositionizer-Wiki-Flan-T5-Large: a free, open-source proposition segmentation model that splits text into standalone propositions

Propositionizer Wiki Flan T5 Large

Developed by chentong00

A proposition segmentation model built on Flan-T5-Large that decomposes text into standalone propositions.

Large Language Model

Transformers

License: Apache-2.0 #text-proposition-segmentation #knowledge-intensive-retrieval #structured-JSON-output

Downloads: 892

Released: 11/11/2023

Model Overview

The model breaks complex passages into short, self-contained propositions, which makes the text easier to retrieve and analyze.

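Model Features

Proposition segmentation

Decomposes complex text into standalone propositions that can be processed and analyzed individually.

Structured output

Returns the propositions as a JSON list, which is straightforward to consume programmatically.

Multi-level input

Accepts a title, a section name, and the passage content together, so the model has the context it needs to segment accurately.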

Model Capabilities

Text segmentation

Information extraction

Structured output

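Use Cases

Information retrieval

Wikipedia content analysis

Decomposes Wikipedia articles into standalone propositions so that a finer-grained retrieval index can be built.

Intended benefit: higher precision and recall for the retrieval system

Knowledge graph construction

Knowledge unit extraction

Extracts standalone knowledge units from text for use in building a knowledge graph.

Intended benefit: more efficient construction and a higher-quality knowledge graph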
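As an illustration of the retrieval use case, the sketch below indexes propositions instead of whole passages and maps each hit back to its source passage. It is a toy under stated assumptions: the propositions are hand-written stand-ins for model output, the `<passage>#<index>` ids and the `search` helper are made up for this example, and plain token overlap stands in for whatever retriever a real system would use.

```python
import re

# Hand-written propositions stand in for model output. The ids are made up
# and encode "<source passage>#<proposition index>".
PROPOSITIONS = {
    "pisa#0": "Leaning Tower of Pisa now leans at about 3.99 degrees.",
    "pisa#1": "The top of Leaning Tower of Pisa is displaced horizontally "
              "3.9 meters from the center.",
    "eiffel#0": "The Eiffel Tower is located in Paris.",
}


def tokens(text):
    """Return the set of lowercase word tokens in `text` (toy tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def search(query, propositions, top_k=1):
    """Rank propositions by how many distinct tokens they share with the query.

    Returns (proposition_id, source_passage_id) pairs for the top_k hits.
    """
    q = tokens(query)
    ranked = sorted(
        propositions.items(),
        key=lambda item: len(q & tokens(item[1])),
        reverse=True,
    )
    return [(pid, pid.split("#")[0]) for pid, _ in ranked[:top_k]]


hits = search("At what angle in degrees does Pisa lean now", PROPOSITIONS)
print(hits)
```

Because each indexed unit carries a single fact, a match points at the specific statement that answers the query, and the id still leads back to the full passage when more context is needed.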

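🚀 Proposition Segmentation Model

This is the proposition segmentation model introduced by Chen et al. (2023) in the paper "Dense X Retrieval: What Retrieval Granularity Should We Use?". It decomposes the input text into a list of propositions and returns them in JSON format.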

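🚀 Quick Start

The model expects an input prompt of the form Title: {title}. Section: {section}. Content: {content} and outputs a JSON-formatted list of propositions.

For example, given the following passage (the section name is empty here):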

Title: Leaning Tower of Pisa. Section: . Content: Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the tower is displaced horizontally 3.9 meters (12 ft 10 in) from the center.

the output will be:

["Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.", "Leaning Tower of Pisa now leans at about 3.99 degrees.", "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center."]
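The prompt template above can be wrapped in a small helper so that every passage is formatted the same way. This is a minimal sketch: `build_prompt` is a hypothetical name, not part of the model's published code, and it only reproduces the template shown above.

```python
def build_prompt(title: str, section: str, content: str) -> str:
    """Format one passage with the Title/Section/Content template.

    Illustrative helper, not part of the model release. An empty `section`
    is passed through unchanged, which yields "Section: ." as in the
    example above.
    """
    return f"Title: {title}. Section: {section}. Content: {content}"


prompt = build_prompt(
    "Leaning Tower of Pisa",
    "",
    "The tower now leans at about 3.99 degrees.",
)
print(prompt)
```

Keeping the template in one place avoids small deviations, such as a missing period after the title, that would make the input differ from the format the model expects.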

💻 Usage Example

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import json

# Load the tokenizer and model; use the GPU when one is available.
model_name = "chentong00/propositionizer-wiki-flan-t5-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

title = "Leaning Tower of Pisa"
section = ""
content = "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees, but the tower now leans at about 3.99 degrees. This means the top of the tower is displaced horizontally 3.9 meters (12 ft 10 in) from the center."

# Build the prompt in the format the model expects.
input_text = f"Title: {title}. Section: {section}. Content: {content}"

# Tokenize, generate, and move the result back to the CPU for decoding.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids.to(device), max_new_tokens=512).cpu()

# Decode the generated text and parse it as a JSON list of propositions.
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
try:
    prop_list = json.loads(output_text)
except json.JSONDecodeError:
    # Catch only JSON parsing failures instead of using a bare except.
    prop_list = []
    print("[ERROR] Failed to parse output text as JSON.")
print(json.dumps(prop_list, indent=2))

Expected output:

[
  "Prior to restoration work performed between 1990 and 2001, Leaning Tower of Pisa leaned at an angle of 5.5 degrees.",
  "Leaning Tower of Pisa now leans at about 3.99 degrees.",
  "The top of Leaning Tower of Pisa is displaced horizontally 3.9 meters (12 ft 10 in) from the center."
]
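Since the model's output is free-form generated text, it may not always be a well-formed JSON list, for instance if generation stops at `max_new_tokens` before the closing bracket. The sketch below shows one way to parse defensively; `parse_propositions` is a hypothetical helper, and dropping non-string items and falling back to an empty list are design assumptions rather than documented behavior of the model.

```python
import json
from typing import List


def parse_propositions(output_text: str) -> List[str]:
    """Parse decoded model output into a list of proposition strings.

    Returns an empty list when the text is not valid JSON or is not a JSON
    array. Non-string items and blank strings are dropped.
    """
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]


good = '["Leaning Tower of Pisa now leans at about 3.99 degrees."]'
truncated = '["Leaning Tower of Pisa now leans at'
print(parse_propositions(good))
print(parse_propositions(truncated))
```

Returning an empty list keeps a batch pipeline running when one passage fails; a caller that needs to distinguish "no propositions" from "parse failure" could raise or log instead.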

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you use this model in your research, please cite the following paper:

@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Tong Chen and Hongwei Wang and Sihao Chen and Wenhao Yu and Kaixin Ma and Xinran Zhao and Hongming Zhang and Dong Yu},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023},
  url={https://arxiv.org/pdf/2312.06648.pdf}
}