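🚀 Nanonets-OCR-s Image-to-Text Model
Nanonets-OCR-s is a powerful, state-of-the-art image-to-Markdown optical character recognition (OCR) model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes the output well suited to downstream processing by large language models (LLMs).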
✨ Key features
Nanonets-OCR-s provides a set of features designed for handling complex documents:
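- LaTeX equation recognition: automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing inline equations ($...$) from display equations ($$...$$).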
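- Intelligent image description: describes images within a document using structured <img> tags so that LLMs can process them. It can describe many image types, including logos, charts, and so on, detailing their content, style, and context.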
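- Signature detection and isolation: identifies signatures in a document and isolates them, outputting them inside a <signature> tag. This is crucial for processing legal and business documents.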
- Watermark extraction: detects watermark text in a document and places it inside a <watermark> tag.
- Smart checkbox handling: converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
- Complex table extraction: accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.
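Because the semantic tags and checkbox symbols listed above are plain text in the model output, downstream code can pull them out with ordinary string handling. The following is a minimal sketch, not part of the Nanonets tooling: the helper names `extract_tagged` and `checkbox_states` and the sample string are hypothetical, and the code assumes the tags are well formed and not nested.

```python
import re

# Tag names come from the feature list and the prompt; the helpers are hypothetical.
TAGS = ("img", "signature", "watermark", "page_number")

def extract_tagged(markdown: str) -> dict[str, list[str]]:
    """Collect the text inside each semantic tag (assumes well-formed, non-nested tags)."""
    found = {}
    for tag in TAGS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [match.strip() for match in pattern.findall(markdown)]
    return found

def checkbox_states(markdown: str) -> list[tuple[str, str]]:
    """Return (state, label) pairs for lines that begin with a checkbox symbol."""
    states = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}
    pairs = []
    for line in markdown.splitlines():
        line = line.strip()
        if line and line[0] in states:
            pairs.append((states[line[0]], line[1:].strip()))
    return pairs

# Made-up sample output, for illustration only.
sample = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "☑ I agree to the terms\n"
    "☐ Subscribe to updates\n"
    "<signature>J. Doe</signature>\n"
    "<page_number>9/22</page_number>\n"
)
tags = extract_tagged(sample)
boxes = checkbox_states(sample)
print(tags)
print(boxes)
```

A regular expression is enough here because the tags carry no attributes in the examples given in the prompt; output with nested or malformed tags would need a real parser.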
Read the full announcement | Hugging Face Space demo
🚀 Quick start
Using the transformers library
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

# attn_implementation="flash_attention_2" requires the flash-attn package to be installed.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Render the chat template to text, then tokenize it together with the image.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding for deterministic output.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]


image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
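Since the prompt asks for tables in HTML, the returned text can be turned into row lists with the standard library alone. This is a rough sketch under stated assumptions rather than an official post-processing step: the class `TableRows`, the function `parse_tables`, and the sample string are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of every <table> as a list of rows (nested tables not handled)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (ordinary Markdown around the table) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

def parse_tables(text):
    """Return every table found in the text as a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(text)
    parser.close()
    return parser.tables

# Made-up sample standing in for the `result` string returned by the model.
sample = "Totals<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Pens</td><td>3</td></tr></table>"
rows = parse_tables(sample)
print(rows)
```

In practice the `result` string from the call above would be passed to `parse_tables` in place of the sample.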
Using vLLM
- Start the vLLM server.
vllm serve nanonets/Nanonets-OCR-s
- Run predictions with the model.
from openai import OpenAI
import base64

# The vLLM server exposes an OpenAI-compatible API; the api_key here is a placeholder value.
client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR-s"


def encode_image(image_path):
    # Read the image file and return its contents as a base64 string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # The data URL declares image/png; adjust the MIME type if the file is in another format.
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content


test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
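The example above labels the payload as `image/png` while the sample path ends in `.jpg`. One possible way to keep the declared type in line with the file is to guess the MIME type from the extension. This is only a sketch: `image_data_url` and its `default_mime` parameter are hypothetical names, and whether the server needs the declared type to match is an assumption that has not been verified here.

```python
import base64
import mimetypes
import os
import tempfile

def image_data_url(image_path: str, default_mime: str = "image/png") -> str:
    """Build a base64 data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime  # fall back when the extension is unknown
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"

# Demonstration with a throwaway file; the bytes are a placeholder, not a real image.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"abc")
url = image_data_url(tmp.name)
os.remove(tmp.name)
print(url)
```

The returned string could be passed as the `url` value of the `image_url` entry in place of the hard-coded `data:image/png` prefix.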
Using docext
pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s
See GitHub for more details.
📚 Detailed documentation
BibTeX citation
@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}