Florence-2-large-TableDetection: Open-Source Table Detection Model

Florence 2 Large TableDetection

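Developed by ucsahin

A multimodal table detection model fine-tuned from Florence-2 that locates table regions in images

Image-to-Text

Transformers

License: MIT #table-detection #multimodal-model #document-processing

Downloads: 1,993

Released: June 24, 2024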

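Model Overview

This is a multimodal language model fine-tuned to detect tables in images given a text prompt. It takes an image and a text prompt as input and predicts bounding boxes around the tables in that image.

Model Features

Multimodal input

Processes the image and the text prompt together to detect tables

High-accuracy detection

Fine-tuned specifically to identify table regions in images

End-to-end solution

Covers the full path from input image to output bounding boxes

Model Capabilities

Table detection in images

Bounding box prediction

Multimodal processing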

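Use Cases

Document processing

PDF table extraction

Automatically detect tables in scanned PDF documents as the first step of extracting them

Identifies table locations so that downstream data extraction can target them

Data extraction

Table data digitization

Convert tables in paper documents into digital form

Speeds up data entry and reduces manual work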

🚀 Florence-2-large-TableDetection

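This project is a multimodal language model that detects tables from image and text input. It is fine-tuned to locate tables in images and can be applied to document processing, data extraction, and similar tasks.

🚀 Quick Start

In Transformers, you can load the model and run inference as follows. (Note that trust_remote_code=True is required to run the model; it only downloads the external custom code from the original HuggingFaceM4/Florence-2-DocVQA.)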

from transformers import AutoProcessor, AutoModelForCausalLM
import matplotlib.pyplot as plt
import matplotlib.patches as patches

model_id = "ucsahin/Florence-2-large-TableDetection"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="cuda") # load the model on GPU
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_example(task_prompt, image, max_new_tokens=128):
    prompt = task_prompt
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
      input_ids=inputs["input_ids"].cuda(),
      pixel_values=inputs["pixel_values"].cuda(),
      max_new_tokens=max_new_tokens,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer

def plot_bbox(image, data):
    # Create a figure and axes
    fig, ax = plt.subplots()
    # Display the image
    ax.imshow(image)
    # Plot each bounding box
    for bbox, label in zip(data['bboxes'], data['labels']):
        # Unpack the bounding box coordinates
        x1, y1, x2, y2 = bbox
        # Create a Rectangle patch
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1, edgecolor='r', facecolor='none')
        # Add the rectangle to the Axes
        ax.add_patch(rect)
        # Annotate the label
        ax.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))
    # Remove the axis ticks and labels
    ax.axis('off')
    # Show the plot
    plt.show()

########### Inference
from datasets import load_dataset

dataset = load_dataset("ucsahin/pubtables-detection-1500-samples")

example_id = 5
image = dataset["train"][example_id]["image"]

parsed_answer = run_example("<OD>", image=image)
plot_bbox(image, parsed_answer["<OD>"])
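
For downstream data extraction, the detected boxes usually need to become integer pixel regions that can be cropped. The following is a minimal sketch, assuming `parsed_answer["<OD>"]` has the `bboxes`/`labels` layout that `plot_bbox` consumes above. The helper name `to_crop_boxes`, its `padding` parameter, and the sample detections are hypothetical and are not part of the model's API.

```python
import math

def to_crop_boxes(detections, image_width, image_height, padding=0):
    """Convert float [x1, y1, x2, y2] boxes into clamped integer crop boxes."""
    crops = []
    for bbox, label in zip(detections["bboxes"], detections["labels"]):
        x1, y1, x2, y2 = bbox
        # Round outward, expand by the padding, and clamp to the image bounds.
        left = max(0, math.floor(x1) - padding)
        top = max(0, math.floor(y1) - padding)
        right = min(image_width, math.ceil(x2) + padding)
        bottom = min(image_height, math.ceil(y2) + padding)
        # Skip degenerate boxes that have no area after clamping.
        if right > left and bottom > top:
            crops.append((label, (left, top, right, bottom)))
    return crops

# Made-up detections in the same layout as parsed_answer["<OD>"] (assumption).
example = {
    "bboxes": [[10.4, 20.7, 200.2, 150.9], [590.0, 10.0, 700.0, 90.0]],
    "labels": ["table", "table"],
}
crops = to_crop_boxes(example, image_width=640, image_height=480, padding=5)
```

Each resulting `(left, top, right, bottom)` tuple can be passed to `image.crop(...)` on the PIL image loaded from the dataset, for example with `to_crop_boxes(parsed_answer["<OD>"], image.width, image.height)`.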

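✨ Key Features

This is a multimodal language model fine-tuned to detect tables in images given a text prompt. It combines image and text input to predict bounding boxes around the tables in the provided image.
The model's primary purpose is to help automate table detection in images. It can be used in applications such as document processing, data extraction, and image analysis, where identifying tables in images is essential.

📚 Documentation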

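Training Hyperparameters

The following hyperparameters were used during training:

Learning rate: 1e-06
Train batch size: 8
Eval batch size: 8
Seed: 42
Optimizer: Adam (β1 = 0.9, β2 = 0.999, ε = 1e-08)
LR scheduler type: linear
Number of epochs: 10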
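
To make the linear scheduler concrete, here is a rough sketch of how the learning rate would decay over the run. It assumes 1690 total steps (10 epochs of 169 steps, as reported in the training results) and no warmup; the warmup setting is not listed in the hyperparameters, so `warmup_steps=0` is an assumption, and `linear_lr` is a hypothetical helper rather than the training code.

```python
def linear_lr(step, base_lr=1e-6, total_steps=1690, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Print the learning rate at the start, midpoint, and end of training.
for epoch in (0, 5, 10):
    step = epoch * 169  # 169 optimizer steps per epoch (from the results table)
    print(f"epoch {epoch:2d} (step {step:4d}): lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate starts at the configured 1e-06, is halved by the end of epoch 5, and reaches zero at the final step.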

Training Results

Training Loss	Epoch	Step	Validation Loss
1.3199	1.0	169	1.0372
0.7922	2.0	338	0.9169
0.6824	3.0	507	0.8411
0.6109	4.0	676	0.8168
0.5752	5.0	845	0.7915
0.5605	6.0	1014	0.7862
0.5291	7.0	1183	0.7740
0.517	8.0	1352	0.7683
0.5139	9.0	1521	0.7642
0.5005	10.0	1690	0.7601
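
The validation loss falls in every epoch, with the largest gains early in training. A short script, using only the validation losses copied from the table above, makes the per-epoch improvement explicit; the variable names are illustrative.

```python
# Validation loss per epoch, copied from the training results table.
val_loss = [1.0372, 0.9169, 0.8411, 0.8168, 0.7915,
            0.7862, 0.7740, 0.7683, 0.7642, 0.7601]

# Improvement from one epoch to the next (positive means the loss went down).
deltas = [round(prev - cur, 4) for prev, cur in zip(val_loss, val_loss[1:])]

# Epoch (1-indexed) with the lowest validation loss, and the overall reduction.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
total_drop = round(val_loss[0] - val_loss[-1], 4)

print("per-epoch improvement:", deltas)
print("best epoch:", best_epoch, "total drop:", total_drop)
```

Most of the overall reduction comes from the first three epoch-to-epoch steps, and the last epoch has the lowest validation loss.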