Florence-2-FT-DocVQA: Open-Source Document Visual Question Answering Model


Florence 2 FT DocVQA

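Developed by sahilnishad

A document visual question answering model fine-tuned from Florence-2-base, built specifically for answering questions about document images.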

Image-to-Text

Transformers

Language: English | License: MIT | Tags: #document-image-QA #multimodal-processing #Florence-2-fine-tuning

Downloads: 4,928

Released: 11/2/2024

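Model Overview

The model was fine-tuned on the DocumentVQA dataset. It reads the content of a document image and answers questions about it, which makes it applicable to a range of document analysis scenarios.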

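Model Features

Document image understanding

Parses and interprets the content and structure of document images

Question answering

Answers questions grounded in the document's content

Multimodal processing

Processes visual and textual information together for cross-modal understanding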

Model Capabilities

Document image analysis

Visual question answering

Text extraction

Cross-modal understanding

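Use Cases

Document processing

Contract analysis

Extract key terms and conditions from contract documents

Invoice processing

Identify amounts, dates, and vendor information on invoices

Education

Exam grading

Automatically grade student answer sheets and extract the answers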
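
The invoice use case above can be sketched as a fixed set of questions, one per field, each sent to the model separately. This is only an illustration: the field names, the question wording, and the `ask` callable are hypothetical and not part of the model card. A canned stand-in replaces the model so the sketch runs without downloading any weights.

```python
from typing import Callable, Dict

# Hypothetical field-to-question mapping; adjust the wording to your documents.
INVOICE_QUESTIONS: Dict[str, str] = {
    "total_amount": "What is the total amount?",
    "invoice_date": "What is the invoice date?",
    "vendor": "Who is the vendor?",
}


def extract_fields(ask: Callable[[str], str], questions: Dict[str, str]) -> Dict[str, str]:
    """Ask one question per field and collect the whitespace-stripped answers."""
    return {field: ask(question).strip() for field, question in questions.items()}


def fake_ask(question: str) -> str:
    """Stand-in for a real model call (assumption: returns a plain answer string)."""
    canned = {
        "What is the total amount?": " $1,250.00 ",
        "What is the invoice date?": "03/14/2024",
        "Who is the vendor?": "Acme Corp",
    }
    return canned.get(question, "")


fields = extract_fields(fake_ask, INVOICE_QUESTIONS)
print(fields)
```

With the real model, `ask` could be a thin wrapper such as `lambda q: run_inference("<DocVQA>", q, image)`, using the `run_inference` function from the Quick Start section. Answers are free-form text, so amounts and dates would likely still need parsing and validation downstream.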

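🚀 Florence-2 Fine-Tuned on the DocumentVQA Dataset

This project fine-tunes the Florence-2 model on the DocumentVQA dataset so that it can answer questions about document images. The model handles multimodal input and can be used for tasks such as image-to-text conversion and visual question answering.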

🚀 Quick Start

Install dependencies

!pip install torch transformers datasets flash_attn

Load the model and processor

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("sahilnishad/Florence-2-FT-DocVQA", trust_remote_code=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Run inference

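def run_inference(task_prompt, question, image):
    """Answer `question` about a PIL `image`, using `task_prompt` as the task token."""
    # The task token is prepended directly to the question, e.g. "<DocVQA>What is ...?"
    prompt = task_prompt + question

    # Convert non-RGB scans (grayscale, CMYK, RGBA) to RGB before preprocessing
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

    # Generate with beam search; gradients are not needed at inference time
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3
        )
    # Decode the first (and only) sequence in the batch, dropping special tokens
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text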

Example

from PIL import Image
from datasets import load_dataset

data = load_dataset("HuggingFaceM4/DocumentVQA")

question = "What do you see in this image?"
image = data['train'][0]['image']
print(run_inference("<DocVQA>", question, image))
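
To check predictions against the dataset's reference answers, DocVQA results are commonly reported with ANLS (Average Normalized Levenshtein Similarity). The model card does not say which metric the author used, so the following is only a sketch of that metric in plain Python. It assumes each sample provides a list of acceptable reference strings (for example an `answers` field, which is an assumption about the dataset schema) and uses the usual threshold of 0.5.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def anls_score(prediction: str, answers: List[str], threshold: float = 0.5) -> float:
    """Score one prediction: best (1 - normalized distance) over all references,
    counted as 0 when the normalized distance reaches the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        ref = answer.strip().lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def mean_anls(predictions: List[str], references: List[List[str]]) -> float:
    """Average the per-sample scores over a set of predictions."""
    scores = [anls_score(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

A prediction would come from `run_inference("<DocVQA>", sample["question"], sample["image"])`, and the reference list from the same sample. Matching is case-insensitive here, which follows common practice but is a choice of this sketch.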

📚 Documentation

Project GitHub repository: click to view

📄 License

This project is released under the MIT License.

📚 Citation

@misc{sahilnishad_florence_2_ft_docvqa,
  author       = {Sahil Nishad},
  title        = {Fine-Tuning Florence-2 For Document Visual Question-Answering},
  year         = {2024},
  url          = {https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA},
  note         = {Model available on HuggingFace Hub},
  howpublished = {\url{https://huggingface.co/sahilnishad/Florence-2-FT-DocVQA}},
}

📦 Model Information

Property	Details
Model type	Fine-tuned model based on Florence-2
Training data	HuggingFaceM4/DocumentVQA
Base model	microsoft/Florence-2-base
Tags	transformers, florence2, document-vqa, vqa, image-to-text, multimodal, question-answering