Triplex open-source model: free to deploy, builds knowledge graphs from unstructured data efficiently at 98% lower cost


Triplex

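Developed by SciPhi

Triplex is a model fine-tuned by SciPhi.AI from Phi3-3.8B. It is designed for building knowledge graphs from unstructured data and can cut the cost of creating a knowledge graph by 98%.

#Knowledge graph #Low-cost knowledge graph #Triple extraction #Unstructured data processing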

Downloads: 1,808

Release date: 7/10/2024

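Model overview

Triplex is a large language model built specifically for knowledge graph construction. It extracts triples (simple statements made up of a subject, a predicate, and an object) from text or other data sources, which significantly lowers the cost of building a knowledge graph.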
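To make the notion of a triple concrete, here is a minimal Python sketch. The `Triple` type, its field names, and the `as_sentence` helper are illustrative assumptions and are not part of Triplex or R2R. The example fact is taken from the San Francisco passage used in the code sample later on this page.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str  # named `obj` to avoid shadowing the built-in `object`


def as_sentence(t: Triple) -> str:
    """Render a triple as a readable arrow expression."""
    return f"{t.subject} --{t.predicate}--> {t.obj}"


fact = Triple("San Francisco", "POPULATION", "808,437")
print(as_sentence(fact))
```

A knowledge graph is then simply a collection of such triples, with subjects and objects acting as nodes and predicates as labeled edges.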

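Model features

Low-cost knowledge graph construction

Delivers better performance than GPT-4 at one-sixtieth of the price, cutting the cost of creating a knowledge graph by 98%

Efficient triple extraction

Efficiently extracts subject-predicate-object triples from unstructured data

Local deployment support

Supports building knowledge graphs locally through SciPhi's R2R framework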

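Model capabilities

Named entity recognition

Relation extraction

Knowledge graph construction

Text understanding

Structured information extraction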

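Use cases

Knowledge management

Building an enterprise knowledge base

Extract structured knowledge from enterprise documents to build a knowledge graph

Lowers knowledge management costs and makes information retrieval more efficient

Intelligent search

Enhancing RAG systems

Provide structured knowledge to retrieval-augmented generation systems

Improves search accuracy and relevance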
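As a rough illustration of how extracted triples could feed a retrieval step, the sketch below indexes triples by subject and returns the stored facts for an entity as short sentences that a RAG prompt could include. The function names `build_index` and `facts_for`, and the `LOCATED_IN` predicate, are assumptions made for this example. This is not how R2R implements graph retrieval.

```python
from collections import defaultdict


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) pairs."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def facts_for(index, entity):
    """Return the stored facts about `entity` as plain strings for a prompt."""
    return [f"{entity} {predicate} {obj}" for predicate, obj in index.get(entity, [])]


# Sample triples based on the San Francisco passage used elsewhere on this page.
triples = [
    ("San Francisco", "POPULATION", "808,437"),
    ("San Francisco", "LOCATED_IN", "California"),
]
index = build_index(triples)
print(facts_for(index, "San Francisco"))
```

A production system would normally add entity normalization and a real graph store. The dictionary here only shows the lookup pattern.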

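🚀 Triplex: a SOTA large language model for knowledge graph construction

Triplex is a large language model developed by SciPhi.AI specifically for knowledge graph construction. It is fine-tuned from Phi3-3.8B and efficiently creates knowledge graphs from unstructured data. Building knowledge graphs is expensive today: Microsoft's Graph RAG, for example, strengthens RAG methods but is costly to build. Triplex cuts the cost of creating a knowledge graph by 98%, delivers better performance than GPT-4 at one-sixtieth of the cost, and supports local graph construction through SciPhi's R2R.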
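The two cost figures above line up if the 98% reduction is read as the saving implied by paying one-sixtieth of the GPT-4 price. That reading is an assumption; the page does not spell out how the 98% was derived. A quick check of the arithmetic:

```python
def cost_reduction(price_ratio: float) -> float:
    """Fractional saving when the new cost is `price_ratio` times the old cost."""
    return 1.0 - price_ratio


saving = cost_reduction(1 / 60)
print(f"{saving:.1%}")  # prints 98.3%
```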

🚀 Quick start

Resources

Blog: https://www.sciphi.ai/blog/triplex
Demo: kg.sciphi.ai
Cookbook: https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph

Python code example

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def triplextract(model, tokenizer, text, entity_types, predicates):

    input_format = """Perform Named Entity Recognition (NER) and extract knowledge graph triplets from the text. NER identifies named entities of given entity types, and triple extraction identifies relationships between entities using specified predicates.
      
        **Entity Types:**
        {entity_types}
        
        **Predicates:**
        {predicates}
        
        **Text:**
        {text}
        """

    message = input_format.format(
                entity_types = json.dumps({"entity_types": entity_types}),
                predicates = json.dumps({"predicates": predicates}),
                text = text)

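    # Wrap the prompt as a single user turn and tokenize it with the model's chat template.
    messages = [{'role': 'user', 'content': message}]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt = True, return_tensors="pt").to("cuda")
    # max_length caps prompt plus generated tokens. generate() returns the prompt tokens
    # followed by the completion, so the decoded string also contains the prompt text.
    output = tokenizer.decode(model.generate(input_ids=input_ids, max_length=2048)[0], skip_special_tokens=True)
    return output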

model = AutoModelForCausalLM.from_pretrained("sciphi/triplex", trust_remote_code=True).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained("sciphi/triplex", trust_remote_code=True)

entity_types = [ "LOCATION", "POSITION", "DATE", "CITY", "COUNTRY", "NUMBER" ]
predicates = [ "POPULATION", "AREA" ]
text = """
San Francisco,[24] officially the City and County of San Francisco, is a commercial, financial, and cultural center in Northern California. 

With a population of 808,437 residents as of 2022, San Francisco is the fourth most populous city in the U.S. state of California behind Los Angeles, San Diego, and San Jose.
"""

prediction = triplextract(model, tokenizer, text, entity_types, predicates)
print(prediction)
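The printed prediction is a plain string, so it usually needs parsing before it can go into a graph. The sketch below assumes that the model answers with a JSON object whose `entities_and_triples` list holds entity lines such as `[1], CITY:San Francisco` and relation lines such as `[1] POPULATION [2]`. That format is an assumption and should be checked against real output. The `parse_prediction` helper and the hand-written `sample` string are hypothetical. Because the decoded string also contains the prompt, which itself embeds JSON, the parser scans for the last object that carries the expected key.

```python
import json
import re

# Assumed line formats inside "entities_and_triples" (verify against real output).
ENTITY_RE = re.compile(r"^\[(\d+)\],\s*([A-Z_]+):(.+)$")
TRIPLE_RE = re.compile(r"^\[(\d+)\]\s+([A-Z_]+)\s+\[(\d+)\]$")


def parse_prediction(raw, key="entities_and_triples"):
    """Return (entities, triples) parsed from the last JSON object containing `key`."""
    decoder = json.JSONDecoder()
    items = None
    pos = raw.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            pos = raw.find("{", pos + 1)  # not valid JSON here, try the next brace
            continue
        if key in obj:
            items = obj[key]
        pos = raw.find("{", end)
    if items is None:
        raise ValueError(f"no JSON object with key {key!r} found")

    entities, id_triples = {}, []
    for item in items:
        item = item.strip()
        if m := ENTITY_RE.match(item):
            entities[m.group(1)] = (m.group(2), m.group(3).strip())
        elif m := TRIPLE_RE.match(item):
            id_triples.append((m.group(1), m.group(2), m.group(3)))

    # Replace numeric ids with entity names, skipping triples with unknown ids.
    triples = [
        (entities[s][1], p, entities[o][1])
        for s, p, o in id_triples
        if s in entities and o in entities
    ]
    return entities, triples


# Hand-written sample in the assumed format, not real model output.
sample = '''**Predicates:** {"predicates": ["POPULATION", "AREA"]}
{"entities_and_triples": ["[1], CITY:San Francisco", "[2], NUMBER:808,437", "[1] POPULATION [2]"]}'''
entities, triples = parse_prediction(sample)
print(triples)
```

Lines that match neither pattern are ignored, which keeps the parser tolerant of small format drift at the price of silently dropping unexpected items.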

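✨ Key features

Much lower cost: cuts the cost of creating a knowledge graph by 98% and delivers better performance than GPT-4 at one-sixtieth of the cost.
Local construction: builds knowledge graphs locally through SciPhi's R2R.
Efficient triple extraction: extracts triples (simple statements made up of a subject, a predicate, and an object) from text or other data sources.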

📊 Benchmarks

[Image: benchmark results]

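📄 License

The model weights are licensed under CC-BY-NC-SA-4.0. However, we will waive these restrictions for any organization with total revenue under USD 5 million in the most recent 12 months. If you want to remove the GPL license requirements (dual licensing) and/or use the weights commercially above the revenue limit, please contact our team at founders@sciphi.ai.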

📖 Citation

@misc{pimpalgaonkar2024triplex,
author = {Pimpalgaonkar, Shreyas and Tremelling, Nolan and Colegrove, Owen},
title = {Triplex: a SOTA LLM for knowledge graph construction},
year = {2024},
url = {https://huggingface.co/sciphi/triplex}
}