Open-source Phi-3-mini-128k-instruct-graph model: accurate entity-relationship extraction from general text

Phi 3 Mini 128k Instruct Graph

Developed by EmergentMethods

Phi-3-mini-128k-instruct-graph is a fine-tuned version of Microsoft's Phi-3-mini-128k-instruct, specialized for extracting entity relationships from general text data.

Knowledge graph

Transformers

English · #entity-relation-extraction #structured-JSON-output #high-throughput-processing

Downloads: 117

Released: 7/20/2024

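Model overview

The model generates structured JSON that represents the entity relationships in general text data. It is suited to tasks such as information retrieval, trend analysis, and predictive modeling.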

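Model features

Efficient entity-relationship extraction

Optimized specifically for extracting entities and their relationships from text and emitting them as structured JSON.

Quality comparable to GPT-4

Aims to match GPT-4 in the quality and accuracy of the entity-relationship graphs it generates.

High-throughput processing

Designed for processing text data at scale with improved efficiency.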

Model capabilities

Entity recognition

Relationship extraction

Structured data generation

Large-scale text processing

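Use cases

Information retrieval

Enhanced text-database retrieval

Enhances information retrieval across a variety of text databases by extracting entity relationships.

Trend analysis

Advanced predictive modeling

Advanced predictive modeling for trend analysis across a variety of text sources.

Content analysis

Document relationship exploration

Explores temporal relationships and evolving narratives across different types of documents.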

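🚀 Phi-3-mini-128k-instruct-graph model card

This model is a fine-tuned version of Microsoft's Phi-3-mini-128k-instruct, specialized for extracting entity relationships from general text data. It aims to match the quality and accuracy of GPT-4 in generating entity-relationship graphs while being more efficient for processing at scale.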

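📚 Detailed documentation

Model details

Property	Details
Developed by	Emergent Methods
Funded by	Emergent Methods
Shared by	Emergent Methods
Model type	microsoft/phi-3-mini-128k-instruct (fine-tuned)
Language	English
License	Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Fine-tuned from	microsoft/phi-3-mini-128k-instruct

For more information, see our blog post: 📰 Blog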

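Uses

This model is intended to generate structured JSON that represents the entity relationships in general text data. It can be used to:

Enhance information retrieval across a variety of text databases.
Explore temporal relationships and evolving narratives across different types of documents.
Support advanced predictive modeling for trend analysis across a variety of text sources.

The model is particularly well suited to applications that need high-throughput processing of large volumes of text, such as content aggregation platforms, research databases, and comprehensive text analysis systems.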
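For high-throughput use, each document has to be wrapped in the `-------Text begin-------` / `-------Text end-------` delimiters that the system prompt in the usage example expects. The sketch below shows one possible way to do that and to group documents into batches. `build_messages`, `batched`, and the batch size are illustrative names and values, not part of the model's published interface, and `system_prompt` stands in for the full system prompt from the usage example.

```python
from itertools import islice
from typing import Iterable, Iterator

# Delimiters taken from the system prompt in the usage example.
TEXT_BEGIN = "-------Text begin-------"
TEXT_END = "-------Text end-------"


def build_messages(system_prompt: str, text: str) -> list[dict[str, str]]:
    """Wrap one document in the delimiters the system prompt describes."""
    user_content = f"\n{TEXT_BEGIN}\n{text.strip()}\n{TEXT_END}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


if __name__ == "__main__":
    docs = ["First article.", "Second article.", "Third article."]
    for batch in batched(docs, size=2):
        # "<system prompt here>" is a placeholder, not the real prompt.
        conversations = [build_messages("<system prompt here>", d) for d in batch]
        print(len(conversations), "conversations ready")
```

Each conversation built this way has the same shape as the `messages` list in the usage example, so it can be passed to the same pipeline call.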

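Bias, risks, and limitations

Although the dataset was designed to reduce bias and improve diversity, it is still skewed toward Western languages and countries. This limitation stems from the capabilities of Llama2 in translation and summary generation. In addition, because Llama2 was used to summarize the open-web articles, any bias present in Llama2's training data is also present in this dataset. Likewise, any bias present in Microsoft Phi-3 will also be present in the current dataset.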

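Training details

Training data: more than 7,000 stories and updates from AskNews, curated to avoid topic overlap.
Training procedure: fine-tuned using the Transformers library, SFTTrainer, PEFT, and QLoRA.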

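Evaluation results

Scores for the fine-tuned model, Claude Sonnet 3.5, and the base Phi-3 model, each measured against GPT-4o output as ground truth:

Metric	Phi-3 fine-tuned	Claude Sonnet 3.5	Phi-3 (base)
Node similarity	0.78	0.64	0.64
Edge similarity	0.49	0.41	0.30
JSON consistency	0.99	0.97	0.96
JSON similarity	0.75	0.67	0.63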
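The gap between the fine-tuned and base models can be expressed as a relative improvement per metric. The short calculation below uses only the fine-tuned and base Phi-3 scores from the table. The `relative_gain` helper is an illustrative name, and the calculation says nothing about how the underlying similarity metrics themselves were computed.

```python
# Scores copied from the evaluation table (fine-tuned vs. base Phi-3).
scores = {
    "Node similarity": {"fine_tuned": 0.78, "base": 0.64},
    "Edge similarity": {"fine_tuned": 0.49, "base": 0.30},
    "JSON consistency": {"fine_tuned": 0.99, "base": 0.96},
    "JSON similarity": {"fine_tuned": 0.75, "base": 0.63},
}


def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0


gains = {
    metric: relative_gain(s["fine_tuned"], s["base"])
    for metric, s in scores.items()
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.1f}% over base Phi-3")
```

On these numbers the largest relative gain is in edge similarity (about 63%), while JSON consistency was already high for the base model and improves by only about 3%.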

Environmental impact

Hardware type: 1x A100 SXM
Hours used: 3
Carbon emitted: 0.44 kg (according to the Machine Learning Impact calculator)

💻 Usage example

Basic usage

The following code snippet shows how to quickly run the model on a GPU:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed so repeated runs are reproducible.
torch.random.manual_seed(0)

# Load the fine-tuned model onto the GPU; torch_dtype="auto" uses the dtype stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "EmergentMethods/Phi-3-mini-128k-instruct-graph",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("EmergentMethods/Phi-3-mini-128k-instruct-graph")

messages = [ 
    {"role": "system", "content": """
A chat between a curious user and an artificial intelligence Assistant. The Assistant is an expert at identifying entities and relationships in text. The Assistant responds in JSON output only.

The User provides text in the format:

-------Text begin-------
<User provided text>
-------Text end-------

The Assistant follows the following steps before replying to the User:

1. **identify the most important entities** The Assistant identifies the most important entities in the text. These entities are listed in the JSON output under the key "nodes", they follow the structure of a list of dictionaries where each dict is:

"nodes":[{"id": <entity N>, "type": <type>, "detailed_type": <detailed type>}, ...]

where "type": <type> is a broad categorization of the entity. "detailed type": <detailed_type>  is a very descriptive categorization of the entity.

2. **determine relationships** The Assistant uses the text between -------Text begin------- and -------Text end------- to determine the relationships between the entities identified in the "nodes" list defined above. These relationships are called "edges" and they follow the structure of:

"edges":[{"from": <entity 1>, "to": <entity 2>, "label": <relationship>}, ...]

The <entity N> must correspond to the "id" of an entity in the "nodes" list.

The Assistant never repeats the same node twice. The Assistant never repeats the same edge twice.
The Assistant responds to the User in JSON only, according to the following JSON schema:

{"type":"object","properties":{"nodes":{"type":"array","items":{"type":"object","properties":{"id":{"type":"string"},"type":{"type":"string"},"detailed_type":{"type":"string"}},"required":["id","type","detailed_type"],"additionalProperties":false}},"edges":{"type":"array","items":{"type":"object","properties":{"from":{"type":"string"},"to":{"type":"string"},"label":{"type":"string"}},"required":["from","to","label"],"additionalProperties":false}}},"required":["nodes","edges"],"additionalProperties":false}
     """}, 
    {"role": "user", "content": """
-------Text begin-------
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its mission is to develop "safe and beneficial" artificial general intelligence, which it defines as "highly autonomous systems that outperform humans at most economically valuable work".[4] As a leading organization in the ongoing AI boom,[5] OpenAI is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora.[6][7] Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI.
-------Text end-------
"""}
] 

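# Wrap the model and tokenizer in a text-generation pipeline, which applies the chat template to `messages`.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,  # return only the newly generated JSON, not the prompt
    "temperature": 0.0,
    "do_sample": False,  # greedy decoding; temperature has no effect when sampling is off
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])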

# Output:

# {
#     "nodes": [
#         {
#             "id": "OpenAI",
#             "type": "organization",
#             "detailed_type": "ai research organization"
#         },
#         {
#             "id": "GPT family",
#             "type": "technology",
#             "detailed_type": "large language models"
#         },
#         {
#             "id": "DALL-E series",
#             "type": "technology",
#             "detailed_type": "text-to-image models"
#         },
#         {
#             "id": "Sora",
#             "type": "technology",
#             "detailed_type": "text-to-video model"
#         },
#         {
#             "id": "ChatGPT",
#             "type": "technology",
#             "detailed_type": "generative ai"
#         },
#         {
#             "id": "San Francisco",
#             "type": "location",
#             "detailed_type": "city"
#         },
#         {
#             "id": "California",
#             "type": "location",
#             "detailed_type": "state"
#         },
#         {
#             "id": "December 2015",
#             "type": "date",
#             "detailed_type": "foundation date"
#         },
#         {
#             "id": "November 2022",
#             "type": "date",
#             "detailed_type": "release date"
#         }
#     ],
#     "edges": [
#         {
#             "from": "OpenAI",
#             "to": "San Francisco",
#             "label": "headquartered in"
#         },
#         {
#             "from": "San Francisco",
#             "to": "California",
#             "label": "located in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "December 2015",
#             "label": "founded in"
#         },
#         {
#             "from": "OpenAI",
#             "to": "GPT family",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "DALL-E series",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "Sora",
#             "label": "developed"
#         },
#         {
#             "from": "OpenAI",
#             "to": "ChatGPT",
#             "label": "released"
#         },
#         {
#             "from": "ChatGPT",
#             "to": "November 2022",
#             "label": "released in"
#         }
#     ]
# }
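
Because the model replies with a JSON string, downstream code usually needs to parse it and confirm that it has the shape described in the system prompt before using it. The following is a minimal sketch using only the standard library. `parse_graph` and `adjacency` are hypothetical helpers (not part of the model or of Transformers), and the checks mirror the rules stated in the prompt: the required keys, edges that refer to known node ids, and no repeated nodes or edges.

```python
import json
from collections import defaultdict

NODE_KEYS = {"id", "type", "detailed_type"}
EDGE_KEYS = {"from", "to", "label"}


def parse_graph(raw: str) -> dict:
    """Parse model output and check it against the rules in the system prompt."""
    graph = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(graph, dict) or set(graph) != {"nodes", "edges"}:
        raise ValueError("expected an object with exactly 'nodes' and 'edges'")

    node_ids = set()
    for node in graph["nodes"]:
        if set(node) != NODE_KEYS:
            raise ValueError(f"unexpected node keys: {sorted(node)}")
        if node["id"] in node_ids:
            raise ValueError(f"duplicate node: {node['id']}")
        node_ids.add(node["id"])

    seen_edges = set()
    for edge in graph["edges"]:
        if set(edge) != EDGE_KEYS:
            raise ValueError(f"unexpected edge keys: {sorted(edge)}")
        if edge["from"] not in node_ids or edge["to"] not in node_ids:
            raise ValueError(f"edge refers to an unknown node: {edge}")
        triple = (edge["from"], edge["label"], edge["to"])
        if triple in seen_edges:
            raise ValueError(f"duplicate edge: {triple}")
        seen_edges.add(triple)

    return graph


def adjacency(graph: dict) -> dict:
    """Map each source node id to a list of (label, target id) pairs."""
    result = defaultdict(list)
    for edge in graph["edges"]:
        result[edge["from"]].append((edge["label"], edge["to"]))
    return dict(result)


if __name__ == "__main__":
    # A two-node excerpt of the example output, serialized as the model would return it.
    sample = json.dumps({
        "nodes": [
            {"id": "OpenAI", "type": "organization", "detailed_type": "ai research organization"},
            {"id": "San Francisco", "type": "location", "detailed_type": "city"},
        ],
        "edges": [
            {"from": "OpenAI", "to": "San Francisco", "label": "headquartered in"},
        ],
    })
    print(adjacency(parse_graph(sample)))
```

In the usage example, the string to validate would be `output[0]['generated_text']`. Rejecting malformed replies early keeps bad graphs out of any index or database built on top of the model.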