Qari-OCR-0.1-VL-2B-Instruct open-source model: free to deploy, accurate full-page Arabic text recognition

Qari OCR 0.1 VL 2B Instruct

Developed by NAMAA-Space

An Arabic OCR model fine-tuned from Qwen2 VL and optimized for full-page Arabic text recognition

Text recognition

Transformers

Arabic | License: Apache-2.0 | #ArabicOCR #FullPageTextRecognition #HighAccuracyCharacterExtraction

Downloads: 2,965

Release date: 2/28/2025

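Model overview

A vision-language model optimized for optical character recognition (OCR) of full-page Arabic text. It was fine-tuned on an Arabic OCR dataset, which markedly improves recognition accuracy over the base model.

Model features

High-accuracy Arabic OCR

Recognition tuned for full-page Arabic text, with a reported WER of 0.068 and CER of 0.019

Full-page text processing

Trained specifically on full-page Arabic text, so it can process the content of a complete page

Quantization

Uses 4-bit quantization to reduce resource usage while preserving performance

Font-specific optimization

Specifically optimized for common Arabic fonts such as Almarai, Amiri, and Cairo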

Model capabilities

Printed Arabic text recognition

Full-page text extraction

High-accuracy character recognition

Multi-font support

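Use cases

Document digitization

Digitizing Arabic historical books

Converting Arabic historical books and manuscripts into editable text

Reported character accuracy: 98.1% (CER 0.019)

Business document processing

Processing Arabic contracts, invoices, and other business documents

Word error rate 84% lower than Tesseract OCR

Educational applications

Textbook digitization

Converting Arabic textbooks and academic papers into digital text

Reported BLEU score: 0.860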

🚀 Qari-OCR-0.1-VL-2B-Instruct

This model is a version of unsloth/Qwen2-VL-2B-Instruct fine-tuned on an Arabic OCR dataset. It is optimized for high-accuracy Arabic optical character recognition (OCR) of full-page text.

🚀 Quick start

You can load this model with the transformers and qwen_vl_utils libraries:

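# In a notebook, prefix each command with "!". The version specifier is quoted
# so the shell does not treat ">=" as an output redirection.
pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
pip install -U bitsandbytes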

from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
import os
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor from the Hugging Face Hub.
model_name = "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# "page.png" is a placeholder path: replace it with the page image to read.
image = Image.open("page.png")
# Save a temporary copy; the message below refers to the image by this path.
src = "image.png"
image.save(src)

# One user turn containing the page image followed by the OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to whichever device the model was placed on.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
os.remove(src)  # delete the temporary copy, not the original page image
print(output_text)
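
The message structure used above can be wrapped in a small helper when several pages need to be processed. The sketch below is only an illustration: `build_messages` is a hypothetical name that is not part of transformers or qwen_vl_utils, and it assumes POSIX-style paths and the same `file://` prefix convention as the quick-start code.

```python
from pathlib import Path


def build_messages(image_path, prompt):
    """Build a single-turn chat message that pairs one page image with a prompt.

    Hypothetical helper (not a library function). It resolves the path to an
    absolute one and fails early if the file is missing, so a wrong path is
    reported before the model is invoked.
    """
    path = Path(image_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Page image not found: {path}")
    return [
        {
            "role": "user",
            "content": [
                # Same "file://" prefix convention as the quick-start code.
                {"type": "image", "image": f"file://{path}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the inline `messages` literal.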

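✨ Key features

Fine-tuned from Qwen2 VL and trained on an Arabic OCR dataset.
Extracts full-page Arabic text with high accuracy.
Evaluated with standard OCR metrics, with strong WER, CER, and BLEU scores.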

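📚 Detailed documentation

Model details

Attribute	Details
Base model	Qwen2 VL
Fine-tuning dataset	Arabic OCR dataset
Objective	High-accuracy extraction of full-page Arabic text
Supported language	Arabic
Task	Optical character recognition (OCR)
Dataset size	5,000 records
Training epochs	1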

Performance evaluation

The model was evaluated on standard OCR metrics: word error rate (WER), character error rate (CER), and BLEU score.
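
As a reminder of what the two error rates measure, here is a minimal sketch of WER and CER based on Levenshtein edit distance. It is not the authors' evaluation code: the whitespace tokenization for WER and the decision to count spaces as characters for CER are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "السلام عليكم ورحمة الله"
    hypothesis = "السلام عليكم ورحمه الله"  # one wrong character in the third word
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because insertions count as errors, both rates can exceed 1.0 when a system outputs far more text than the reference contains, which is how a score such as the base model's WER of 1.344 can arise.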

Metrics

Model	Word error rate (WER) ↓	Character error rate (CER) ↓	BLEU score ↑
Qari v0.1	0.068	0.019	0.860
Qwen2 VL 2B	1.344	1.191	0.201
EasyOCR	0.908	0.617	0.152
Tesseract OCR	0.428	0.226	0.410

Key results

Word error rate (WER): 0.068 (93.2% word accuracy)
Character error rate (CER): 0.019 (98.1% character accuracy)
BLEU score: 0.860

Performance comparison

WER is 95% lower than the base model.
CER is 98% lower than the base model.
BLEU score is 328% higher than the base model.
WER is 84% lower than Tesseract OCR.
WER is 92% lower than EasyOCR.
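
The percentages above can be reproduced from the metrics table. The sketch below simply redoes that arithmetic on the rounded values published in the table; the function names are illustrative. Computed this way, the EasyOCR figure comes out at roughly 92.5%, so the 92% stated above is consistent with the table only up to rounding.

```python
# Scores copied from the metrics table (already rounded to three decimals).
SCORES = {
    "Qari v0.1": {"wer": 0.068, "cer": 0.019, "bleu": 0.860},
    "Qwen2 VL 2B": {"wer": 1.344, "cer": 1.191, "bleu": 0.201},
    "EasyOCR": {"wer": 0.908, "cer": 0.617, "bleu": 0.152},
    "Tesseract OCR": {"wer": 0.428, "cer": 0.226, "bleu": 0.410},
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


def relative_increase(baseline, value):
    """Percentage by which `value` is higher than `baseline`."""
    return (value - baseline) / baseline * 100.0


def accuracy_from_error_rate(rate):
    """Accuracy in percent, defined here as 1 minus the error rate."""
    return (1.0 - rate) * 100.0


if __name__ == "__main__":
    qari = SCORES["Qari v0.1"]
    base = SCORES["Qwen2 VL 2B"]
    print(f"WER vs base model: {relative_reduction(base['wer'], qari['wer']):.1f}% lower")
    print(f"CER vs base model: {relative_reduction(base['cer'], qari['cer']):.1f}% lower")
    print(f"BLEU vs base model: {relative_increase(base['bleu'], qari['bleu']):.1f}% higher")
    for name in ("Tesseract OCR", "EasyOCR"):
        drop = relative_reduction(SCORES[name]["wer"], qari["wer"])
        print(f"WER vs {name}: {drop:.1f}% lower")
    print(f"Word accuracy: {accuracy_from_error_rate(qari['wer']):.1f}%")
    print(f"Character accuracy: {accuracy_from_error_rate(qari['cer']):.1f}%")
```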

Performance comparison charts

[Figure: word error rate (WER) and character error rate (CER) comparison]

[Figure: BLEU score comparison]

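Limitations

Although this Arabic OCR model performs well under specific conditions, it has several limitations:

Font dependence: the model was trained on a limited set of fonts (Almarai-Regular, Amiri-Regular, Cairo-Regular, Tajawal-Regular, and NotoNaskhArabic-Regular). Accuracy may therefore drop on text set in other fonts, especially decorative or stylized ones.
Font size constraint: training used a fixed font size of 16. Variations in font size, particularly very small or very large text, may reduce recognition accuracy.
No diacritics support: the model does not support Arabic diacritics (Tashkeel). Text that relies on diacritics for disambiguation may not be recognized correctly.
No handwriting recognition: the model was not trained on handwritten text and is suitable for printed documents only.
Full-page processing: the model was trained on full-page text, which may affect its performance on segmented text, cropped regions, or complex layouts such as tables and multi-column formats.

Take these limitations into account when deploying the model in real-world applications.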
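
Since the model does not output diacritics, reference text that contains Tashkeel will register spurious errors if it is compared directly against the model's output. One possible workaround, shown here as a sketch rather than as the authors' preprocessing, is to strip the diacritic marks from the reference before scoring. The character range used (U+064B through U+0652, plus U+0670) is an assumption about which marks to treat as Tashkeel.

```python
import re

# Combining marks treated as Tashkeel in this sketch (an assumption):
# U+064B (fathatan) through U+0652 (sukun), plus U+0670 (superscript alef).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text):
    """Return `text` with Arabic diacritic marks removed; base letters are kept."""
    return TASHKEEL.sub("", text)


if __name__ == "__main__":
    vocalized = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"  # "marhaban" with diacritics
    print(strip_tashkeel(vocalized))
```

Stripping diacritics this way discards information, so it only suits cases where unvocalized text is acceptable downstream.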

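📄 License

This model follows the license terms of the original Qwen2 VL model. Review those terms carefully before commercial use.

Citation

If you use this model in your research, please cite: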

@misc{QariOCR2025,
  title={Qari-OCR: A High-Accuracy Model for Arabic Optical Character Recognition},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct}},
  note={Accessed: 2025-03-03}
}