qwen-vl-2.5-3B-finetuned-cheque: a free, open-source vision-language model for extracting financial information from cheques

Qwen Vl 2.5 3B Finetuned Cheque

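Developed by AJNG

A vision-language model built to extract structured financial information from cheque images. It outputs JSON containing key fields such as the cheque number, payee, amount, and issue date.

Image-to-Text

Transformers

English #Cheque information extraction #Structured JSON output #Financial document processing

Downloads: 170

Released: 2/18/2025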

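Model Overview

This model is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct. It focuses on cheque image processing, extracting financial information and producing structured JSON output.

Model Features

Targeted optimization

Fine-tuned on a personal cheque dataset specifically to extract structured financial information from cheque images

Structured output

After processing a cheque image, produces JSON containing key fields such as the cheque number, payee, amount, and issue date

Multi-domain use

Applicable to banking and financial services, accounting and payroll systems, AI OCR pipelines, and enterprise document management

Efficient fine-tuning

Fine-tuned with LoRA (Low-Rank Adaptation) to reduce memory overhead

Model Capabilities

Cheque image analysis

Financial information extraction

Structured JSON generation

Vision-language understanding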

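Use Cases

Banking and financial services

Automated cheque verification

Verifies cheque information automatically to speed up processing

Reduces time spent on manual verification

Cheque processing automation

Processes cheque images in batches and extracts key information

Improves processing speed and accuracy

Accounting and payroll systems

Financial record keeping

Extracts cheque information automatically for accounting records

Reduces manual data-entry errors

AI OCR pipelines

Enhancing traditional OCR systems

Adds structured output to traditional OCR systems

Provides richer output information

Enterprise document management

Financial data extraction

Extracts financial data automatically from scanned cheques

Simplifies document management workflows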

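🚀 Qwen2.5-VL-3B-Instruct fine-tuned on a personal cheque dataset

This model is a vision-language model (VLM) built to extract structured financial information from cheque images. It processes a cheque image and produces JSON containing key fields such as the cheque number, payee, amount, and issue date.

🚀 Quick Start

Install dependencies

pip install -q git+https://github.com/huggingface/transformers accelerate peft bitsandbytes "qwen-vl-utils[decord]==0.0.8"

Chat with the model using transformers

The following snippet shows how to run the model with the transformers and qwen_vl_utils libraries: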

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

MODEL_ID = "AJNG/qwen-vl-2.5-3B-finetuned-cheque"

# Load the fine-tuned weights in bfloat16 and let accelerate place them on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16)

# Bound the image resolution the processor will pass to the model
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)

# A single user turn: one cheque image plus the extraction instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/kaggle/input/testch/Handwritten-legal-amount.png",
            },
            {"type": "text", "text": "extract in json"},
        ],
    }
]

# Prepare the inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
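
The decoded output is a list of strings, so downstream code usually needs to turn the first entry into a dictionary. The sketch below is one way to do that, assuming the model may or may not wrap its answer in Markdown code fences. The helper name parse_cheque_json and the sample string are hypothetical and are not part of the model card.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example readable


def parse_cheque_json(generated: str) -> dict:
    """Return the first JSON object found in the model's decoded output."""
    # Drop optional Markdown fences (with or without a "json" tag) around the payload.
    cleaned = re.sub(r"`{3}(?:json)?", "", generated).strip()
    # Keep only the span from the first "{" to the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start : end + 1])


# Hypothetical decoded output, for illustration only (not a real model response).
sample = (
    FENCE + "json\n"
    '{"check_reference": 1234, "beneficiary": "Jane Doe", "total_amount": 250.0}\n'
    + FENCE
)
fields = parse_cheque_json(sample)
print(fields["beneficiary"])  # prints: Jane Doe
```

With the snippet above, the call would be parse_cheque_json(output_text[0]). If generation stops before the closing brace (for example because max_new_tokens is too small), json.loads raises an error, which is a useful signal to raise the token limit.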

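✨ Key Features

Targeted optimization: Qwen2.5-VL-3B-Instruct fine-tuned on a personal cheque dataset, specifically to extract structured financial information from cheque images.
Structured output: after processing a cheque image, produces JSON containing key fields such as the cheque number, payee, amount, and issue date.
Multi-domain use: applicable to banking and financial services, accounting and payroll systems, AI OCR pipelines, and enterprise document management.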

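📚 Documentation

Model Details

Model description

Qwen2.5-VL-3B-Instruct fine-tuned on a personal cheque dataset is a vision-language model (VLM) designed to extract structured financial information from cheque images. It processes a cheque image and outputs structured JSON containing key fields such as the cheque number, payee, total amount, and issue date. The model follows the ChatML format and was fine-tuned on a dedicated cheque dataset to improve accuracy on financial document processing.

Developed by: independent fine-tune of Qwen2.5-VL-3B-Instruct
Model type: vision-language model for cheque information extraction
Language: primarily English (optimized for financial terminology)
License: [More information needed]
Fine-tuned from: Qwen/Qwen2.5-VL-3B-Instruct
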
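Uses

The model is intended for automated cheque processing and structured data extraction. It analyzes a cheque image and produces JSON containing the key financial information. It can be applied in the following areas:

Banking and financial services: automated cheque verification and processing.
Accounting and payroll systems: extracting financial information for record keeping.
AI OCR pipelines: adding structured output to traditional OCR systems.
Enterprise document management: automatically extracting financial data from scanned cheques.

Direct use

The model can be fine-tuned further or integrated into larger applications, for example:

Custom AI tools for financial processing
Multi-document parsing workflows at financial institutions
Intelligent chatbots for banking automation

Out-of-scope use

General OCR unrelated to cheques: the model is optimized for cheque images and may perform poorly on other document types.
Handwritten cheque recognition: the model mainly handles printed cheques and may struggle with cursive handwriting.
Non-English cheques: the model was trained in an English financial context and may not generalize well to cheques in other languages.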

Training Details

Training data

The dataset consists of cheque images and corresponding JSON annotations in the following format:

{
  "image": "1.png", 
  "prefix": "Format the json as shown below",  
  "suffix": "{\"check_reference\": , \"beneficiary\": \"\", \"total_amount\": , \"customer_issue_date\": \"\", \"date_issued_by_bank\": \"\"}"
}

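Image folder: contains the corresponding cheque images.
Annotations: structured JSON specifying cheque details such as the cheque number, payee, amount, customer issue date, and bank issue date.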
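
As a rough illustration of how one annotation record could be turned into a chat-style training example, the sketch below maps the prefix to the user turn and the suffix to the assistant turn. That mapping, the function name record_to_messages, the images directory, and the filled-in field values are all assumptions made for this example. The model card does not show the actual preprocessing code.

```python
import json
from pathlib import Path


def record_to_messages(record: dict, image_dir: str) -> list:
    """Build a two-turn conversation (user prompt with image, assistant answer) from one record."""
    image_path = str(Path(image_dir) / record["image"])
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": record["prefix"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": record["suffix"]}],
        },
    ]


# Hypothetical annotation with made-up values, serialized the way a JSON Lines file would store it.
line = json.dumps({
    "image": "1.png",
    "prefix": "Format the json as shown below",
    "suffix": json.dumps({"check_reference": 1234, "beneficiary": "Jane Doe", "total_amount": 250.0}),
})
messages = record_to_messages(json.loads(line), image_dir="images")
print(messages[0]["content"][1]["text"])  # prints: Format the json as shown below
```

The resulting list has the same shape as the messages structure used for inference, so it could be passed through the processor's chat template in the same way.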

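Training procedure

The model configuration sets minimum and maximum pixel limits for image processing, ensuring compatibility with Qwen2_5_VLProcessor. The processor is initialized from the pretrained model ID with these constraints. The Qwen2_5_VLForConditionalGeneration model is then loaded with the torch dtype set to bfloat16 for better performance.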
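
To make the pixel limits more concrete, the sketch below converts them into an approximate visual-token budget, using the values 256 * 28 * 28 and 1280 * 28 * 28 from the quick-start snippet. It assumes each visual token covers a 28 x 28 pixel area, which is what the 28 * 28 factor suggests. The helper approx_visual_tokens is hypothetical and ignores the processor's aspect-ratio-preserving resize and rounding, so treat the numbers as estimates.

```python
PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token
MIN_PIXELS = 256 * PATCH * PATCH
MAX_PIXELS = 1280 * PATCH * PATCH


def approx_visual_tokens(width: int, height: int) -> int:
    """Estimate the visual token count after clamping the image area to the pixel limits."""
    area = min(max(width * height, MIN_PIXELS), MAX_PIXELS)
    return area // (PATCH * PATCH)


print(MIN_PIXELS, MAX_PIXELS)            # prints: 200704 1003520
print(approx_visual_tokens(200, 100))    # small image is scaled up to the minimum: 256
print(approx_visual_tokens(1600, 700))   # large scan is scaled down to the maximum: 1280
```

Under these assumptions, every cheque image costs between 256 and 1280 visual tokens regardless of its original resolution, which keeps memory use predictable during both training and inference.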

Finally, LoRA (Low-Rank Adaptation) is applied to the model with get_peft_model, which reduces memory overhead when fine-tuning specific layers.
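
The memory saving comes from training two small low-rank matrices per adapted layer instead of the full weight. The arithmetic below is a sketch of that idea only. The layer size (2048 x 2048) and the rank (8) are illustrative assumptions, since the model card does not state the LoRA rank or which layers were adapted.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # illustrative projection size, not taken from the model card
rank = 8             # assumed LoRA rank

full = d_in * d_out                          # parameters in the frozen weight matrix
added = lora_param_count(d_in, d_out, rank)  # trainable parameters introduced by LoRA

print(full, added)             # prints: 4194304 32768
print(f"{added / full:.4%}")   # prints: 0.7812%
```

With these example numbers, the trainable parameters for the layer are under one percent of the full matrix, so gradients and optimizer state shrink by roughly the same factor.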

config = {
    "max_epochs": 4,
    "batch_size": 1,
    "lr": 2e-4,
    "check_val_every_n_epoch": 2,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
    "result_path": "qwen2.5-3b-instruct-cheque-manifest"
}
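
The interaction between batch_size, accumulate_grad_batches, and warmup_steps is easier to see with numbers. The sketch below derives the effective batch size and the share of training spent in warmup. The dataset size (800 samples) and the single-GPU setup are hypothetical, as the model card does not report either, and the formula assumes gradients are accumulated over accumulate_grad_batches batches before each optimizer step.

```python
import math

# Values copied from the training configuration above.
config = {
    "max_epochs": 4,
    "batch_size": 1,
    "accumulate_grad_batches": 8,
    "num_nodes": 1,
    "warmup_steps": 50,
}

NUM_GPUS = 1       # assumption: one GPU per node
NUM_SAMPLES = 800  # hypothetical training-set size, for illustration only

# Samples contributing to each optimizer step.
effective_batch = (
    config["batch_size"] * config["accumulate_grad_batches"] * config["num_nodes"] * NUM_GPUS
)
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * config["max_epochs"]
warmup_fraction = config["warmup_steps"] / total_steps

print(effective_batch)   # prints: 8
print(total_steps)       # prints: 400
print(warmup_fraction)   # prints: 0.125
```

A per-device batch of 1 keeps peak memory low for high-resolution images, while accumulation over 8 batches gives each update the gradient signal of 8 samples.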

Compute infrastructure

GPU: NVIDIA A100

📄 License

[More information needed]

📚 Citation

If you find this work helpful, feel free to cite it.

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}