🚀 🖼️🔗 Diagram to Knowledge Graph Model
This model is a research-driven project developed during an internship at Zackariya Solution. It focuses on extracting structured data (JSON) from images, specifically nodes, edges, and their sub-attributes, so that visual information can be represented as a knowledge graph.
🚀 Note: This model is intended for learning purposes only, not for production applications. The extracted structured data may vary depending on project requirements.
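For illustration, the target output is a single JSON object containing lists of nodes and edges with their attributes. The field names in the sketch below are assumptions chosen for this example, not a guaranteed schema:

```python
# Illustrative only: the actual schema may differ depending on project requirements.
example_output = {
    "nodes": [
        {"id": "n1", "label": "Start", "type": "start_event"},
        {"id": "n2", "label": "Review order", "type": "task"},
    ],
    "edges": [
        {"source": "n1", "target": "n2", "label": ""},
    ],
}
```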
🚀 Quick Start
Install the dependencies
```
%pip install -q "transformers>=4.49.0" accelerate datasets "qwen-vl-utils[decord]==0.0.8"
```
Run the inference code
```python
import json

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

MODEL_ID = "zackriya/diagram2graph-adapters"
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28

# Load the fine-tuned model and the matching processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS,
)

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges. each of them have their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""


def run_inference(image):
    # Build a chat-style prompt that pairs the system message with the diagram image
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {
                    "type": "text",
                    "text": "Extract data in JSON format, Only give the JSON",
                },
            ],
        },
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")  # assumes a CUDA-capable GPU

    # Generate, then drop the prompt tokens so only the newly generated output remains
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text


# `eval_dataset` is assumed to be an already-loaded dataset of diagram images
image = eval_dataset[9]["image"]
output = run_inference(image)
json.loads(output[0])
```
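The snippet above ends by reading an image from an `eval_dataset` object that is not defined in this card. As a minimal sketch, assuming a local diagram file instead (the path `diagram.png` is a placeholder), the same function can be called on a PIL image:

```python
from PIL import Image

# "diagram.png" is a placeholder path; use any local process/flow diagram image.
image = Image.open("diagram.png").convert("RGB")

output = run_inference(image)
graph_json = json.loads(output[0])  # may raise if the model emits non-JSON text
print(graph_json)
```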
✨ Key Features
- Focuses on extracting structured data (JSON) from images and representing the visual information as a knowledge graph (a graph-building sketch follows this list).
- Suitable for experimenting with diagram-to-knowledge-graph conversion and for understanding AI-driven structured extraction from images.
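As a sketch of the knowledge-graph representation mentioned above, the extracted JSON can be loaded into a graph library. `networkx` is not part of this project, and the field names below are carried over from the illustrative (assumed) schema shown earlier:

```python
import networkx as nx


def json_to_graph(graph_json: dict) -> nx.DiGraph:
    """Build a directed graph from extracted nodes/edges (field names assumed)."""
    g = nx.DiGraph()
    for node in graph_json.get("nodes", []):
        g.add_node(node["id"], **{k: v for k, v in node.items() if k != "id"})
    for edge in graph_json.get("edges", []):
        g.add_edge(edge["source"], edge["target"], label=edge.get("label", ""))
    return g


# Example using the illustrative schema from above
g = json_to_graph({
    "nodes": [{"id": "n1", "label": "Start"}, {"id": "n2", "label": "Review order"}],
    "edges": [{"source": "n1", "target": "n2", "label": ""}],
})
print(g.number_of_nodes(), g.number_of_edges())  # 2 1
```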
📋 Model Details
| Attribute | Details |
|-----------|---------|
| Developed by | Zackariya Solution internship team (Mohammed Safvan) |
| Fine-tuned from | Qwen/Qwen2.5-VL-3B-Instruct |
| License | Apache 2.0 |
| Language(s) | Multilingual (focused on structured extraction) |
| Model type | Vision-language transformer (PEFT fine-tuned; see the loading sketch below the table) |
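The Quick Start above loads the adapters directly via `from_pretrained`. Since the model is a PEFT fine-tune of Qwen/Qwen2.5-VL-3B-Instruct, the adapters can also be attached to the base model explicitly. A minimal sketch, assuming the repository ships standard PEFT (LoRA) adapter files and that the `peft` package is installed (it is not in the install command above):

```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model, then attach the fine-tuned adapters on top of it.
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "zackriya/diagram2graph-adapters")

# Optionally fold the adapter weights into the base model for faster inference.
model = model.merge_and_unload()
```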
🎯 Use Cases
✅ Direct Use
- Experimenting with diagram-to-knowledge-graph conversion 📊
- Understanding AI-driven structured extraction from images
🚀 Downstream Use (Potential)
- Enhancing BPMN/flowchart analysis 🏗️
- Supporting automated document processing 📄
❌ Out-of-Scope Use
- Not intended for real-world production deployment ⚠️
- May not generalize well to all diagram types
🏗️ Training Details
- Dataset: internally curated diagram dataset 🖼️
- Fine-tuning method: LoRA-based optimization ⚡ (an illustrative configuration sketch follows this list)
- Precision: bf16 mixed-precision training 🎯
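For reference, a LoRA setup of this kind is commonly described with `peft.LoraConfig`. The rank, alpha, dropout, and target modules below are illustrative assumptions, not the values actually used for training:

```python
from peft import LoraConfig

# Illustrative hyperparameters only; the actual training configuration is not published here.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```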
📈 Evaluation
Evaluation Metrics
- Metric: F1 score 🏆 (a minimal computation sketch follows this list)
- Limitation: may struggle with complex, dense diagrams ⚠️
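As a sketch of how a node (or edge) F1 score of this kind can be computed, extraction is treated as exact matching between predicted and ground-truth items. The matching rule actually used for the results below is not documented here, so the set-based comparison is an assumption:

```python
def f1_score(predicted: set, ground_truth: set) -> float:
    """F1 over exact matches between predicted and ground-truth items."""
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# e.g. predicted vs. annotated node labels for one diagram
print(f1_score({"Start", "Review order", "End"}, {"Start", "Approve order", "End"}))  # ~0.667
```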
Evaluation Results

| Sample | Node F1 (base) | Node F1 (fine-tuned) | Edge F1 (base) | Edge F1 (fine-tuned) |
|--------|----------------|----------------------|----------------|----------------------|
| image_sample_1 | 0.46 | 1.0 | 0.59 | 0.71 |
| image_sample_2 | 0.67 | 0.57 | 0.25 | 0.25 |
| image_sample_3 | 1.0 | 1.0 | 0.25 | 0.75 |
| image_sample_4 | 0.5 | 0.83 | 0.15 | 0.62 |
| image_sample_5 | 0.72 | 0.78 | 0.0 | 0.48 |
| image_sample_6 | 0.75 | 0.75 | 0.29 | 0.67 |
| image_sample_7 | 0.6 | 1.0 | 1.0 | 1.0 |
| image_sample_8 | 0.6 | 1.0 | 1.0 | 1.0 |
| image_sample_9 | 1.0 | 1.0 | 0.55 | 0.77 |
| image_sample_10 | 0.67 | 0.8 | 0.0 | 1.0 |
| image_sample_11 | 0.8 | 0.8 | 0.5 | 1.0 |
| image_sample_12 | 0.67 | 1.0 | 0.62 | 0.75 |
| image_sample_13 | 1.0 | 1.0 | 0.73 | 0.67 |
| image_sample_14 | 0.74 | 0.95 | 0.56 | 0.67 |
| image_sample_15 | 0.86 | 0.71 | 0.67 | 0.67 |
| image_sample_16 | 0.75 | 1.0 | 0.8 | 0.75 |
| image_sample_17 | 0.8 | 1.0 | 0.63 | 0.73 |
| image_sample_18 | 0.83 | 0.83 | 0.33 | 0.43 |
| image_sample_19 | 0.75 | 0.8 | 0.06 | 0.22 |
| image_sample_20 | 0.81 | 1.0 | 0.23 | 0.75 |
| Average | 0.749 | 0.891 | 0.4605 | 0.6945 |
🤝 Collaboration
If you are interested in fine-tuning this model for your own use case, or would like to explore how we can help, we welcome collaboration.
Zackriya Solutions
🔗 References
🚀 Stay curious, keep exploring! 🚀