---
library_name: transformers
license: mit
language:
- vi
- en
- zh
base_model:
- OpenGVLab/InternVL2_5-1B
pipeline_tag: image-text-to-text
---
# Vintern-1B-v3.5 ❄️
We introduce Vintern-1B-v3.5, the latest model in the Vintern series, which improves significantly over v2 on all evaluation benchmarks. The model is fine-tuned from InternVL2.5-1B, which already performs well on Vietnamese 🇻🇳 tasks because the InternVL 2.5 team used Viet-ShareGPT-4o-Text-VQA data during their fine-tuning.

To further improve Vietnamese performance while retaining its capabilities on existing English datasets, Vintern-1B-v3.5 was fine-tuned on a large amount of Vietnamese-specific data. As a result, the model is particularly strong at text recognition, OCR, and understanding documents with Vietnamese-specific characteristics.
## Highlights 🌟

- **Top-tier quality for Vietnamese text:** Vintern-1B-v3.5 is one of the best choices in its class (1B parameters) for understanding and processing Vietnamese documents.
- **Stronger information extraction and comprehension:** The model handles complex documents such as invoices, legal texts, handwriting, and tables well.
- **Improved prompt understanding:** Compared with v2, it follows more complex instruction prompts, making it easier to use.
- **Runs on affordable hardware:** A Google Colab T4 GPU is enough; no expensive equipment is required.
- **Easy to fine-tune:** It can be adapted to specific tasks with minimal effort.
## 🤗 HF Demo 🤗
## Benchmarks 📈

| Benchmark | InternVL2_5-1B | Vintern-1B-v2 | Vintern-1B-v3.5 |
|---|---|---|---|
| vi-MTVQA | 24.8 | 37.4 | 41.9 |
| DocVQA<sub>test</sub> | 84.8 | 72.5 | 78.8 |
| InfoVQA<sub>test</sub> | 56.0 | 38.9 | 46.4 |
| TextVQA<sub>val</sub> | 72.0 | 64.0 | 68.2 |
| ChartQA<sub>test</sub> | 75.9 | 34.1 | 65.7 |
| OCRBench | 785 | 628 | 706 |
## Examples
## Quickstart

The snippet below shows how to load the tokenizer and the model and generate content. The model can also be run from the Colab inference notebook.
```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

# Standard ImageNet normalization statistics expected by the vision encoder.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
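# Illustrative check (not part of the original snippet): the transform converts
# any PIL image into a normalized float tensor of shape [3, input_size, input_size]:
#   build_transform(448)(Image.new('RGB', (640, 480))).shape == (3, 448, 448)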
# Pick the tiling grid (i, j) whose aspect ratio best matches the input image.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
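# Example (illustrative): an 800x600 image has aspect ratio ~1.33; among grids
# (i, j) with i * j <= 12 the exact match is (4, 3), so it is tiled on a 4x3 grid.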
# Split the image into up to max_num square tiles that match its aspect ratio,
# optionally appending a downscaled thumbnail of the whole image.
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
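# Example (illustrative): a 896x448 image (aspect ratio 2.0) maps to the (2, 1)
# grid -> two 448x448 tiles, plus one global thumbnail when use_thumbnail=True,
# so dynamic_preprocess returns 3 images in total.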
# Load an image file and stack its preprocessed tiles into a single tensor.
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
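# For a multi-tile image, pixel_values has shape [num_tiles + 1, 3, input_size,
# input_size] (the +1 is the global thumbnail); a single-tile image yields
# shape [1, 3, input_size, input_size].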
model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-1B-v3_5",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-1B-v3_5", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'
pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)
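# Note (assumption about the intent of these settings): do_sample=False with
# num_beams=3 gives deterministic beam search, and the high repetition_penalty
# of 2.5 helps curb repeated lines when transcribing dense documents.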
question = '<image>\nExtract the main information in the image and return it in markdown format.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
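Because `model.chat` returns the running conversation as `history`, follow-up questions about the same image can reuse it. A minimal sketch (the follow-up prompt is illustrative, not from the original card):

```python
# Follow-up turn reusing the history returned by the first call.
question = 'Summarize the extracted information in one sentence.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```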
## Citation
```bibtex
@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese},
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480},
}
```
## References
[1] Z. Chen et al., 'Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling', arXiv preprint arXiv:2412.05271, 2024.