Nanonets-OCR-s-GGUF Open-Source OCR Model - Intelligently Convert Documents to Markdown for Free

Nanonets OCR S GGUF

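Developed by unsloth

Nanonets-OCR-s is an advanced image-to-Markdown optical character recognition (OCR) model that converts documents into structured Markdown, with intelligent content recognition and semantic tagging.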

Image-to-Text

Transformers

English | #Structured Markdown conversion | #LaTeX equation recognition | #Intelligent document parsing

Downloads: 2,300

Release date: 6/16/2025

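Model Overview

Nanonets-OCR-s is an OCR model designed to convert documents into structured Markdown. Beyond extracting text, it recognizes and tags complex content such as LaTeX equations, images, signatures, and watermarks, which makes its output well suited to downstream processing by large language models (LLMs).

Model Features

LaTeX equation recognition

Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing inline equations from display equations.

Intelligent image description

Describes images within the document using structured <img> tags so that large language models can process them.

Signature detection and isolation

Identifies signatures in the document and isolates them, outputting them inside <signature> tags.

Watermark extraction

Detects watermark text in the document and places it inside <watermark> tags.

Smart checkbox handling

Converts form checkboxes and radio buttons into standardized Unicode symbols.

Complex table extraction

Accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.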

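Model Capabilities

Document OCR

LaTeX equation recognition

Image content description

Signature detection

Watermark extraction

Table extraction

Checkbox handling

Use Cases

Document processing

Academic paper processing

Converts academic papers containing mathematical formulas and tables into structured Markdown.

Preserves the structure and semantics of the original document for later analysis and processing.

Business contract processing

Extracts text, signatures, and watermark information from contracts.

Automates the processing of legal documents to improve efficiency.

Form processing

Recognizes and converts checkboxes and radio buttons in forms.

Standardizes form data for later analysis.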
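
As a rough illustration of the form-processing case, the sketch below turns checkbox symbols in OCR output into (label, state) pairs. It assumes the output uses the Unicode symbols ☐, ☑, and ☒ and that each symbol directly precedes its label; the helper name `parse_checkboxes`, the state names, and the sample text are hypothetical, and reading ☒ as "crossed" is an interpretation rather than something the model card defines.

```python
import re

# Assumed mapping from checkbox symbol to a state name (the names are illustrative).
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}

# A symbol, optional whitespace, then the label up to the next symbol or line end.
CHECKBOX_PATTERN = re.compile(r"([☐☑☒])\s*([^☐☑☒\n]+)")


def parse_checkboxes(ocr_text):
    """Return (label, state) pairs for every checkbox symbol found in the text."""
    return [
        (label.strip(), CHECKBOX_STATES[symbol])
        for symbol, label in CHECKBOX_PATTERN.findall(ocr_text)
    ]


# Made-up sample text standing in for real model output.
sample_form = "☑ Email updates\n☐ SMS updates\n☒ Phone calls"
for label, state in parse_checkboxes(sample_form):
    print(f"{label}: {state}")
```

Keeping the state as a string rather than a boolean leaves room for the three-way distinction; collapse it to True/False if a form only needs checked versus unchecked.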

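🚀 Nanonets-OCR-s Image-to-Text Model

Nanonets-OCR-s is a powerful, state-of-the-art image-to-Markdown optical character recognition (OCR) model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, making the output well suited to downstream processing by large language models (LLMs).

Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantized models.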

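✨ Key Features

Nanonets-OCR-s includes a set of features designed to handle complex documents:

LaTeX equation recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing inline equations ($...$) from display equations ($$...$$).
Intelligent image description: Describes images within the document using structured <img> tags so that large language models can process them. It can describe many image types, including logos, charts, and graphs, and details their content, style, and context.
Signature detection and isolation: Identifies signatures in the document and isolates them, outputting them inside <signature> tags. This is crucial for processing legal and business documents.
Watermark extraction: Detects watermark text in the document and places it inside <watermark> tags.
Smart checkbox handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
Complex table extraction: Accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.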
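
As a rough illustration of how the semantic tags above might be consumed downstream, the following sketch collects the contents of <img>, <signature>, <watermark>, and <page_number> spans with regular expressions. The helper name `extract_tagged_spans` and the sample string are hypothetical, and the sketch assumes the model emits plain, attribute-free, non-nested tags like those in the prompt examples.

```python
import re

# Tag names taken from the feature list and the prompt; the helper is illustrative.
SEMANTIC_TAGS = ("img", "signature", "watermark", "page_number")


def extract_tagged_spans(markdown_text):
    """Map each semantic tag name to the list of inner texts found for it."""
    spans = {}
    for tag in SEMANTIC_TAGS:
        # Non-greedy match so adjacent tags of the same kind stay separate.
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        spans[tag] = [inner.strip() for inner in pattern.findall(markdown_text)]
    return spans


# Made-up sample text standing in for real model output.
sample_output = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "# Service Agreement\n"
    "<img>Company logo with a blue globe</img>\n"
    "<signature>J. Smith</signature>\n"
    "<page_number>9/22</page_number>\n"
)
spans = extract_tagged_spans(sample_output)
print(spans["watermark"])
print(spans["page_number"])
```

A regular expression is enough here only because the tags are assumed to be flat; output with nested or malformed tags would call for a real parser.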

Read the full announcement | Hugging Face Space demo

🚀 Quick Start

Using the transformers library

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

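model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    # flash_attention_2 requires the flash-attn package and a supported GPU;
    # remove this argument to use the default attention implementation.
    attn_implementation="flash_attention_2",
)
model.eval()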

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
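
The prompt above asks for tables in HTML, so the returned text may contain <table> markup mixed into the Markdown. The sketch below, which uses only the standard library's `html.parser`, shows one way such rows might be pulled out as lists of cell strings. The class name `TableRowCollector`, the function `html_tables_to_rows`, and the sample string are hypothetical; the sketch ignores rowspan/colspan and merges the rows of all tables into one list.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect cell text from every <tr> in the fed text, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def html_tables_to_rows(ocr_text):
    """Return all table rows found in the OCR output as lists of cell strings."""
    collector = TableRowCollector()
    collector.feed(ocr_text)
    collector.close()
    return collector.rows


# Made-up sample text standing in for real model output.
sample = (
    "# Invoice\n"
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Paper</td><td>2</td></tr></table>"
)
rows = html_tables_to_rows(sample)
print(rows)
```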

Using vLLM

Start the vLLM server.

vllm serve nanonets/Nanonets-OCR-s

Run predictions with the model

from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
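
The request above hard-codes image/png in the data URL even though the sample path ends in .jpg. Whether that mismatch matters depends on how the server decodes images, so treat the following as an optional sketch: it derives the MIME type from the file name with the standard library's `mimetypes` module. The helper name `build_image_data_url` is hypothetical.

```python
import base64
import mimetypes


def build_image_data_url(image_bytes, filename):
    """Build a base64 data URL whose MIME type is guessed from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Tiny placeholder bytes; a real call would pass the contents of the image file.
print(build_image_data_url(b"abc", "document.jpg"))
```

To use it with the client code above, read the file in binary mode and pass the resulting string as the "url" value of the image_url entry.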

Using docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

See GitHub for more details.

📚 Documentation

BibTeX citation

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}