DeepSeek-R1-Qwen2.5-1.5b-SFT-R1 Open-Source Model - Free Conversion of Unstructured Text to Structured JSON

Deepseek R1 Qwen2.5 1.5b SFT R1 JSON Unstructured To Structured

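Developed by MasterControlAIML

This model is based on the Qwen2 architecture and is designed to convert unstructured text into structured output that follows a predefined JSON schema.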

Large Language Model

Transformers

English | License: Apache-2.0 | #TextStructuring #JSONSchemaMapping #DocumentParsing

Downloads: 399

Released: 2/7/2025

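Model Overview

The model is fine-tuned to map unstructured text (such as manuals and Q&A documents) to structured JSON, and supports hierarchical data extraction against a given JSON schema.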

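Model Features

Efficient structured conversion

Maps unstructured text accurately onto a predefined JSON schema

Hierarchical data extraction

Supports extraction into complex JSON schemas with hierarchical relationships

Fast inference

Uses the Unsloth library for efficient model inference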

Model Capabilities

Unstructured text parsing

JSON schema matching

Hierarchical data extraction

Structured output generation

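Use Cases

Document processing

Structuring quality manuals

Converts an unstructured quality assurance manual into JSON that conforms to a specific schema

Produces a standardized document structure that is easier to process and analyze downstream

Data standardization

Converting Q&A data

Converts unstructured data in question-and-answer form into structured JSON

Enables standardized storage and retrieval of Q&A data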

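🚀 MasterControlAIML R1-Qwen2.5-1.5b SFT R1 JSON Unstructured-to-Structured Model

This project provides a fine-tuned Qwen2 model optimized for converting unstructured text into structured JSON output that conforms to a predefined schema. It is fine-tuned from the base model MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured, using LoRA for efficient adaptation.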

Key highlights:

Developer: bhaviktheslider

License: Apache-2.0

Fine-tuned from: MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured

Accelerated training: trained 2x faster with Unsloth and Hugging Face's TRL library.

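🚀 Quick Start

This model is suited to tasks that require mapping unstructured text (such as manuals or Q&A documents) to structured JSON. It supports hierarchical data extraction against a given JSON schema, so that the generated output strictly follows the structure and rules the schema defines.

Fast inference with Unsloth

The Unsloth library lets you load the model and run inference quickly. Here is a basic example: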

from unsloth import FastLanguageModel
import torch

MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)

# Switch the model to inference mode
FastLanguageModel.for_inference(model)

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""

# Example instruction and prompt
instruction = ""  # replace with your instruction text (see examples below)
prompt = ALPACA_PROMPT.format(instruction, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=2000)

# Print the generated text
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
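
Because the full output sequence is decoded, the printed string contains the prompt followed by the completion. The following is a minimal sketch of isolating the completion, assuming the Alpaca template above is used unchanged. `extract_response` and `RESPONSE_MARKER` are hypothetical names introduced for illustration; they are not part of Unsloth or Transformers.

```python
RESPONSE_MARKER = "### Response:"  # assumed to match the Alpaca template above


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the last response marker, or the whole string if absent."""
    # rpartition splits on the LAST occurrence of the marker, so a marker that
    # appears inside the instruction text does not cut the completion short.
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Stand-in for a decoded model output (prompt + completion).
decoded_example = (
    "Below is an instruction...\n"
    "### Instruction:\nExtract the data.\n"
    "### Response:\n<think>Direct mapping.</think><answer>{}</answer>"
)
print(extract_response(decoded_example))
```

If the model itself repeats the marker in its completion, this sketch keeps only the text after the final occurrence, so treat it as a starting point rather than a robust parser.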

Inference with Transformers

You can also use the Hugging Face Transformers library directly:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""

TEXT = ""  # replace with your input text (see examples below)
prompt = ALPACA_PROMPT.format(TEXT, "")
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)  # follow the device chosen by device_map="auto"
text_streamer = TextStreamer(tokenizer)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=2000,
        do_sample=True,  # enable sampling explicitly so temperature/top_p are applied
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        streamer=text_streamer,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

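✨ Key Features

Efficient inference: uses the Unsloth library for fast model inference.
Structured output: maps text input onto a strict JSON schema with hierarchical relationships.
Flexible integration: example snippets show how to use both the Unsloth API and Hugging Face Transformers.
Advanced prompting: detailed examples using the Alpaca prompt template and a LangChain prompt template are provided below.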

📦 Installation

Prerequisites

Python: 3.8+
PyTorch: (CUDA support recommended)
Required libraries: transformers, torch, unsloth, langchain (for advanced usage)

Install command

Install the required Python packages with:

pip install torch transformers unsloth langchain
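
After installing, it can help to confirm that the packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list assumes that `langchain_core` is installed as a dependency of `langchain`.

```python
from importlib.util import find_spec

# Import names used by the examples in this document (assumed list).
REQUIRED_MODULES = ["torch", "transformers", "unsloth", "langchain_core"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the module itself.
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```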

💻 Usage Examples

Advanced data extraction example (using LangChain)

from langchain_core.prompts import PromptTemplate

# Sample text input with a slightly different structure
TEXT1 = """
Quality Assurance Manual Overview

This document provides an introduction to the quality assurance procedures for manufacturing processes.

## Introduction to Manufacturing Processes

This section covers the basics of manufacturing processes. It includes definitions and a brief history.

| Name                | Description               |
|---------------------|---------------------------|
| Process 1           | Initial process in manufacturing. |

### Detailed Process 1

This process refines material to meet specific standards.

| Parameter | Range    |
|-----------|----------|
| Temperature | 200-300°C |

**Operational Parameters**

The operational parameters are critical for the success of this process.

| Setting | Class      |
|---------|------------|
| Critical | High Precision |
"""

# A different example JSON schema
SCHEMA1 = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Dot-separated integers representing the hierarchical id of the element."
    },
    "title": {
      "type": "string",
      "description": "Descriptive title of the section or component."
    },
    "level": {
      "type": "integer",
      "description": "Hierarchy level starting from 0 for the root."
    },
    "level_type": {
      "type": "string",
      "enum": ["ROOT", "SECTION", "SUBSECTION", "DETAIL_N"],
      "description": "Type of the hierarchal level."
    },
    "component": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "idc": {
            "type": "integer",
            "description": "Sequential unique component ID."
          },
          "component_type": {
            "type": "string",
            "enum": ["PARAGRAPH", "TABLE", "CALCULATION", "CHECKBOX"],
            "description": "Type of the component."
          },
          "metadata": {
            "type": "string",
            "description": "Additional metadata token (may be <title>, <note>, or <overview>)."
          },
          "properties": {
            "type": "object",
            "properties": {
              "variables": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "idx": {
                      "type": "string",
                      "description": "Unique identifier, X.Y (X represents row, Y represents column for Tables, 0 otherwise)."
                    },
                    "name": {
                      "type": "string",
                      "description": "Attribute name."
                    },
                    "value": {
                      "type": "string",
                      "description": "Attribute value."
                    },
                    "unit": {
                      "type": "string",
                      "description": "Optional unit."
                    },
                    "metrics": {
                      "type": "boolean",
                      "description": "Flag indicating if variable is a metric."
                    },
                    "formula": {
                      "type": "boolean",
                      "description": "Flag indicating if variable has an equation."
                    }
                  }
                }
              },
              "content": {
                "type": "array",
                "items": {
                  "type": "string"
                },
                "description": "Text content (e.g., notes, MCQs, questions, points, etc.)."
              }
            }
          }
        }
      }
    },
    "children": {
      "type": "array",
      "items": {
        "$ref": "#"
      },
      "description": "Nested elements."
    }
  },
  "required": ["id", "title", "level", "level_type", "component", "children"]
}
"""

# LangChain prompt that guides the model
SYSTEM_PROMPT = """
### Role:
You are an expert data extractor specializing in mapping hierarchical text data into a given JSON Schema.

### DATA INPUT:
- **Text:** ```{TEXT}```  
- **Blank JSON Schema:** ```{SCHEMA}```  

### TASK REQUIREMENT:
1. Analyze the given text and map all relevant information strictly into the provided JSON Schema.
2. Provide your output in **two mandatory sections**:
   - **`<answer>`:** The filled JSON object  
   - **`<think>`:** Reasoning for the mapping decisions  

### OUTPUT STRUCTURE:

`<think> /* Explanation of mapping logic */ </think>`
`<answer> /* Completed JSON Object */ </answer>`



### STRICT RULES FOR GENERATING OUTPUT:
1. **Both Tags Required:**  
   - Always provide both the `<think>` and the `<answer>` sections.  
   - If reasoning is minimal, state: "Direct mapping from text to schema."
2. **JSON Schema Mapping:**  
   - Strictly map the text data to the given JSON Schema without modification or omissions.
3. **Hierarchy Preservation:**  
   - Maintain proper parent-child relationships and follow the schema's hierarchical structure.
4. **Correct Mapping of Attributes:**  
   - Map key attributes, including `id`, `idc`, `idx`, `level_type`, and `component_type`.
5. **JSON Format Compliance:**  
   - Escape quotes (`\"`), replace newlines with `\\n`, avoid trailing commas, and use double quotes exclusively.
6. **Step-by-Step Reasoning:**  
   - Explain your reasoning within the `<think>` tag.

### IMPORTANT:
If either the `<think>` or `<answer>` tags is missing, the response will be considered incomplete.
"""

system_prompt_template = PromptTemplate(template=SYSTEM_PROMPT, input_variables=["TEXT", "SCHEMA"])

# Format the prompt with the chosen text and JSON schema
system_prompt_str = system_prompt_template.format(TEXT=TEXT1, SCHEMA=SCHEMA1)

# Now pass `system_prompt_str` to your inference pipeline (Unsloth, Transformers, etc.)
# to get output containing the reasoning and the mapped JSON object.
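
The prompt asks for a `<think>` section and an `<answer>` section, so downstream code usually needs to separate the two and parse the JSON. The following is a minimal sketch using only the standard library. It assumes `response` holds only the generated completion: the prompt itself contains example `<think>` and `<answer>` tags, so it should be stripped before parsing. `parse_tagged_response` is a hypothetical helper, not part of any library used here.

```python
import json
import re

TAG_PATTERN = "<{tag}>(.*?)</{tag}>"


def parse_tagged_response(response: str) -> dict:
    """Split a model completion into its reasoning text and parsed JSON answer."""
    sections = {}
    for tag in ("think", "answer"):
        # DOTALL lets "." span the newlines inside multi-line sections.
        match = re.search(TAG_PATTERN.format(tag=tag), response, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"Missing <{tag}> section in model response")
        sections[tag] = match.group(1).strip()
    # json.loads raises json.JSONDecodeError if the answer is not valid JSON.
    return {"think": sections["think"], "answer": json.loads(sections["answer"])}


# Stand-in completion for illustration.
sample = (
    "<think>Direct mapping from text to schema.</think>\n"
    '<answer>{"id": "0.0", "children": []}</answer>'
)
parsed = parse_tagged_response(sample)
print(parsed["think"])
print(parsed["answer"]["id"])
```

Raising on a missing tag mirrors the prompt's rule that a response without both sections is incomplete; callers can catch the error and retry generation.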

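📚 Documentation

Sample responses with chain-of-thought explanations

Below are two sample outputs, each with a chain-of-thought explanation (<think>) and the corresponding filled JSON schema (<answer>). They show how the model should articulate its reasoning and produce the final mapped schema.

Sample response 1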

<think>
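Chain-of-thought explanation:
To fill the blank schema from the text so that it exactly matches the filled schema, I will start by understanding what each part represents.

First, looking at the blank schema, I see that it has placeholders that need to be filled with actual content. The goal is to replace these placeholders with the appropriate data extracted from the text, so that the completed schema is identical to the filled schema.

I notice that the two schemas share a similar structure and differ only in their component types. The first step is to determine how each component type maps back to elements in the text.

For example, in the "Introduction" section under "Key Metrics", there is a table listing two variables: a defect rate of 0.5% and a yield rate of 98%. In the filled schema, these become entries in the "variables" array of the table component. I will therefore extract these values from the corresponding part of the text and insert them at the matching positions in the blank schema.

Next, in the subsection titled "Quality Assurance Checklist", there are checkboxes for confirming whether defects were identified and corrected and whether audits are conducted regularly. Looking at the text again, I see bullet points describing whether specific checks were carried out, such as confirming whether defects were found and then recording items that were not confirmed. From this I can infer that each checkbox corresponds to one confirmation statement.

Then there is the detail section titled "Standards Overview". It contains another table comparing ISO 9001 with an ASQ-certified quality management system. In the text, under "Comparison of Quality Assurance Standards", checking these two options is mentioned, with their true/false status stated explicitly. I will therefore replicate this structure in the blank schema by adding rows with the same comparison details.

Throughout the process, my focus is on matching each piece of text information to the correct position in the schema without changing anything beyond what is specified. By systematically identifying and filling each placeholder according to the text content, the final result should be identical to the filled schema, because every element of the filled version is accounted for during extraction.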
</think>

<answer>{
    "id": "0.0",
    "title": "Quality Assurance Manual for Manufacturing Process",
    "level": 0,
    "level_type": "ROOT",
    "component": [
        {
            "idc": 0,
            "component_type": "PARAGRAPH",
            "metadata": "<overview>This manual covers the QA procedures and guidelines for various manufacturing processes.</overview>",
            "properties": {
                "variables": [],
                "content": []
            }
        }
    ],
    "children": [
        {
            "id": "1.0",
            "title": "Introduction to Quality Assurance",
            "level": 1,
            "level_type": "SECTION",
            "component": [
                {
                    "idc": 0,
                    "component_type": "PARAGRAPH",
                    "metadata": "",
                    "properties": {
                        "variables": [],
                        "content": [
                            "Quality Assurance (QA) involves systematic activities designed to ensure consistent production of reliable and safe products.",
                            "It aims to reduce variability in the final product and improve overall efficiency."
                        ]
                    }
                },
                {
                    "idc": 1,
                    "component_type": "TABLE",
                    "metadata": "",
                    "properties": {
                        "variables": [
                            {
                                "idx": "0.0",
                                "name": "Metric",
                                "value": "Defect Rate",
                                "unit": "%",
                                "metrics": true,
                                "formula": false
                            },
                            {
                                "idx": "0.1",
                                "name": "Target",
                                "value": 0.5,
                                "unit": null,
                                "metrics": true,
                                "formula": false
                            }
                        ],
                        "content": []
                    }
                }
            ],
            "children": [
                {
                    "id": "1.1",
                    "title": "QA in Manufacturing Processes",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "PARAGRAPH",
                            "metadata": "",
                            "properties": {
                                "variables": [],
                                "content": [
                                    "Manufacturing processes require strict adherence to QA procedures to ensure product reliability and safety."
                                ]
                            }
                        },
                        {
                            "idc": 1,
                            "component_type": "CHECKBOX",
                            "metadata": "",
                            "properties": {
                                "variables": [
                                    {
                                        "idx": "0.0",
                                        "name": "Process Control",
                                        "value": true,
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    },
                                    {
                                        "idx": "0.1",
                                        "name": "Raw Material Inspection",
                                        "value": false,
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    }
                                ],
                                "content": [
                                    "Check if the following QA procedures are followed:",
                                    "\u2713 Process Control \u2713 Raw Material Inspection"
                                ]
                            }
                        }
                    ],
                    "children": [
                        {
                            "id": "1.1.1",
                            "title": "Standard QA Practices",
                            "level": 3,
                            "level_type": "DETAIL_N",
                            "component": [
                                {
                                    "idc": 0,
                                    "component_type": "PARAGRAPH",
                                    "metadata": "",
                                    "properties": {
                                        "variables": [],
                                        "content": [
                                            "Adherence to ISO 9001 standards helps maintain quality control."
                                        ]
                                    }
                                },
                                {
                                    "idc": 1,
                                    "component_type": "TABLE",
                                    "metadata": "",
                                    "properties": {
                                        "variables": [
                                            {
                                                "idx": "0.0",
                                                "name": "Comparison",
                                                "value": "ISO 9001 vs ASQ-Certified QM",
                                                "unit": null,
                                                "metrics": false,
                                                "formula": false
                                            }
                                        ],
                                        "content": []
                                    }
                                }
                            ],
                            "children": []
                        }
                    ]
                }
            ]
        }
    ]
}</answer>
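
Once an `<answer>` object like the one above has been parsed into a Python dict, its `children` arrays form a tree that can be walked recursively. The following is a minimal sketch of a depth-first traversal that prints an outline. `iter_nodes` is a hypothetical helper, and the `answer` dict is an abbreviated stand-in that reuses titles from the sample above with its components omitted.

```python
def iter_nodes(node: dict):
    """Yield every node in the hierarchy in depth-first, document order."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


# Abbreviated stand-in for a parsed <answer> object (components omitted).
answer = {
    "id": "0.0", "title": "Quality Assurance Manual for Manufacturing Process",
    "level": 0, "level_type": "ROOT", "component": [],
    "children": [{
        "id": "1.0", "title": "Introduction to Quality Assurance",
        "level": 1, "level_type": "SECTION", "component": [],
        "children": [{
            "id": "1.1", "title": "QA in Manufacturing Processes",
            "level": 2, "level_type": "SUBSECTION", "component": [], "children": [],
        }],
    }],
}

# Print an indented outline, one line per node.
for node in iter_nodes(answer):
    indent = "  " * node["level"]
    print(f'{indent}{node["id"]} {node["title"]} ({node["level_type"]})')
```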

Sample response 2

<think>
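Chain-of-thought explanation:
To fill the blank schema from the provided text and ensure that it exactly matches the filled schema, we can proceed as follows:

1. **Identify the components**: Recognize that both schemas contain similar structures with different component types (tables, paragraphs, checkboxes).
2. **Extract data from the text**:
   - Locate the table in the introduction that contains variables such as defect rate and yield rate.
   - Identify the confirmation statements about defect identification and regular audits in the subsection.
3. **Fill in the variables**: Populate the table component under "Key Metrics" with the extracted variable names and values (defect rate = 0.5%, yield rate = 98%).
4. **Fill in the checkboxes**: Convert the bullet points into boolean values for the checkboxes; set the confirmations to reflect the findings (defects found is true, audits not conducted is false).
5. **Add the standards table**: Create a second table in the detail section comparing ISO 9001 with the ASQ-certified quality management system, consistent with the source text.

By aligning each extracted detail precisely with its designated position in the blank schema, and ensuring that no data is added or omitted, the resulting schema will match the filled schema in both structure and content.

**Cross-validation**: Each step corresponds directly to filling a specific field, with no deviation. Following this structured approach therefore ensures that the converted schema yields the same result as the target filled schema.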
</think>

<answer>{
    "id": "0.0",
    "title": "Quality Assurance Manual",
    "level": 0,
    "level_type": "ROOT",
    "component": [
        {
            "idc": 0
        },
        {
            "idc": 1
        }
    ],
    "children": [
        {
            "id": "1.0",
            "title": "Introduction to Manufacturing Processes",
            "level": 1,
            "level_type": "SECTION",
            "component": [],
            "children": [
                {
                    "id": "1.1",
                    "title": "Detailed Process 1",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [],
                    "children": []
                },
                {
                    "id": "1.2",
                    "title": "Operational Details",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "PARAGRAPH",
                            "metadata": "<overview>This section outlines the operational aspects of Process 1.</overview>",
                            "properties": {}
                        }
                    ],
                    "children": []
                }
            ]
        },
        {
            "id": "2.0",
            "title": "Advanced Manufacturing Techniques",
            "level": 1,
            "level_type": "SECTION",
            "component": [],
            "children": [
                {
                    "id": "2.1",
                    "title": "High-Temperature Processing",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "TABLE",
                            "metadata": "<title>Table of Specifications</title>",
                            "properties": {
                                "variables": [
                                    {
                                        "idx": "0.0",
                                        "name": "Parameter",
                                        "value": "Temperature",
                                        "unit": "\u00b0C",
                                        "metrics": true,
                                        "formula": false
                                    },
                                    {
                                        "idx": "1.0",
                                        "name": "Range",
                                        "value": "[200, 300]",
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    }
                                ],
                                "content": [
                                    "Note: Ensure temperature control within specified range."
                                ]
                            }
                        }
                    ],
                    "children": []
                }
            ]
        }
    ]
}</answer>
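
Generated answers can be checked against the schema's `required` list and `level_type` enum before they are stored. The following is a minimal sketch of such a structural check using only the standard library. It is a deliberately simplified stand-in for full JSON Schema validation and covers only the top-level node fields; `check_node` and the sample dicts are hypothetical and introduced for illustration.

```python
REQUIRED_KEYS = ("id", "title", "level", "level_type", "component", "children")
LEVEL_TYPES = {"ROOT", "SECTION", "SUBSECTION", "DETAIL_N"}


def check_node(node: dict, path: str = "root") -> list:
    """Return human-readable problems found in this node and its descendants."""
    problems = [f"{path}: missing '{key}'" for key in REQUIRED_KEYS if key not in node]
    if "level_type" in node and node["level_type"] not in LEVEL_TYPES:
        problems.append(f"{path}: unexpected level_type {node['level_type']!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    level = node.get("level", 0)
    if isinstance(level, bool) or not isinstance(level, int):
        problems.append(f"{path}: level should be an integer")
    # Recurse into children, which the schema defines as nodes of the same shape.
    for index, child in enumerate(node.get("children", [])):
        problems.extend(check_node(child, f"{path}.children[{index}]"))
    return problems


good = {"id": "0.0", "title": "Quality Assurance Manual", "level": 0,
        "level_type": "ROOT", "component": [], "children": []}
bad = {"id": "1.0", "title": "Untitled", "level": 1,
       "level_type": "CHAPTER", "component": [],
       "children": [{"id": "1.1", "level": 2}]}

print(check_node(good))
for problem in check_node(bad):
    print(problem)
```

For production use, a dedicated JSON Schema validator run against the full draft-07 schema would also cover the nested `component` and `variables` definitions, which this sketch skips.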