---
license: other
license_name: hyperclovax-seed
license_link: LICENSE
library_name: transformers
---

## Overview

HyperCLOVAX-SEED-Vision-Instruct-3B is a multimodal model developed by NAVER, post-trained from an in-house backbone model. It understands text and images and generates text, and it uses a lightweight architecture designed for compute efficiency. On the vision side it handles tasks such as visual question answering (VQA) and chart understanding, and it is tuned toward a Pareto-optimal balance for Korean-language scenarios. Compared with models of similar size, it consumes fewer visual tokens at inference time while delivering competitive performance. As the first openly released image-text understanding model from Korea, its Korean-language capability is significantly stronger than that of comparable open-source models, and it is expected to make an important contribution to Korea's sovereign AI capabilities.
## Basic Information

- Model architecture: LLaVA-based vision-language model
- LLM module: Transformer-based dense model
- Vision encoder: SigLIP, with a 378x378-pixel input resolution per grid
- Vision-language connector: C-Abstractor with an AnyRes mechanism, handling up to 9 grids for a total of about 1.29 million pixels (see the sketch after this list)
- Parameters: 3.2B (language module) + 0.43B (vision module)
- Input/output format: text + image + video in / text out
- Context length: 16k tokens
- Knowledge cutoff: training data collected before August 2024
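
To make the pixel budget above concrete, here is a minimal sketch of the arithmetic behind the AnyRes grid tiling. The `grid_layout` helper and its selection heuristic are illustrative assumptions, not the model's actual processor logic:

```python
import math

# Assumptions from the spec above: up to 9 grids, each 378x378 px.
GRID_SIZE = 378
MAX_GRIDS = 9

max_pixels = MAX_GRIDS * GRID_SIZE * GRID_SIZE
print(f"max pixels per image: {max_pixels:,}")  # 1,285,956 ~= 1.29M

def grid_layout(width: int, height: int, max_grids: int = MAX_GRIDS) -> tuple[int, int]:
    """Pick a (cols, rows) tiling that covers the image without exceeding the
    grid budget. Illustrative heuristic only; the released processor may use a
    different AnyRes selection rule."""
    cols = max(1, math.ceil(width / GRID_SIZE))
    rows = max(1, math.ceil(height / GRID_SIZE))
    while cols * rows > max_grids:
        # Shrink the larger dimension first until the budget is met.
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(grid_layout(1920, 1080))  # a full-HD image fits in a (3, 3) tiling
```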
## Training Process

### Text Module

Using an automated verification system powered by HyperCLOVA X, we overcame the cost, expertise, and error-tolerance limitations of human annotation and significantly improved performance on verifiable tasks such as math and coding. The model starts from HyperCLOVAX-SEED-Text-Base-3B and is trained with supervised fine-tuning (SFT) combined with RLHF based on the GRPO online reinforcement learning algorithm.
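
For context, GRPO estimates advantages group-relatively: several responses are sampled per prompt, scored by a reward signal (for example, an automated verifier on math/coding tasks), and each reward is normalized against its group's statistics instead of a learned value function. Below is a minimal sketch of that advantage step only; the full policy-update loop, reward model, and hyperparameters are not specified here:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_prompts, group_size) scores for sampled responses.
    Each response's advantage is its reward normalized by the mean/std of
    its own group (all responses to the same prompt).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 verifier-scored responses each (1 = passed, 0 = failed).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```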
### Vision Module

Carefully designed architectural extensions add capabilities such as visual question answering and chart understanding while preserving the base model's language ability. This lightweight 3B model is specifically optimized for video token efficiency, using dynamic frame sampling for efficient video understanding. Vision-specific V-RLHF data is introduced during the RLHF stage, enabling OCR-free processing.
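
The benchmark table below reports a budget of 1856 tokens over 108 frames per video, which suggests frame sampling against a fixed per-video budget. Here is a rough sketch of budget-driven frame sampling; the uniform-sampling rule and the standalone 108-frame cap are assumptions, not the released processor's exact dynamic sampling policy:

```python
def sample_frame_indices(num_frames: int, max_frames: int = 108) -> list[int]:
    """Select frame indices under a fixed per-video frame budget.
    Short clips keep every frame; long clips are sampled uniformly.
    Illustrative only, not the model's actual policy."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# Example: a 60 s clip at 30 fps (1800 frames) is reduced to 108 frames.
indices = sample_frame_indices(1800)
print(len(indices), indices[:5])  # 108 [0, 16, 33, 50, 66]
```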
## Benchmarks

### Text

| Model | KMMLU (5-shot, acc) | HAE-RAE (5-shot, acc) | CLIcK (5-shot, acc) | KoBEST (5-shot, acc) |
| --- | --- | --- | --- | --- |
| HyperCLOVAX-SEED-Text-Base-3B | 0.4847 | 0.7635 | 0.6386 | 0.7792 |
| HyperCLOVAX-SEED-Vision-Instruct-3B | 0.4422 | 0.6499 | 0.5599 | 0.7180 |
| Qwen2.5-3B-instruct | 0.4451 | 0.6031 | 0.5649 | 0.7053 |
| gemma-3-4b-it | 0.3895 | 0.6059 | 0.5303 | 0.7262 |
### Vision

| Model | Max tokens per video | VideoMME (Ko) | NAVER-TV-CLIP (Ko) | VideoChatGPT (Ko) | PerceptionTest (En) | ActivityNet-QA (En) | KoNet (Ko) | MMBench-Val (En) | TextVQA-Val (En) | Korean VisIT-Bench (Ko) | Image (4 benchmarks) | Video (5 benchmarks) | Overall (9 benchmarks) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HyperCLOVAX-SEED-Vision-Instruct-3B | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 69.2 | 81.8 | 79.2 | 37.0 | 46.68 | 53.70 | 59.54 |
| HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR) | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 36.6 | 80.7 | 76.0 | 43.5 | 56.74 | 53.70 | 55.05 |
| Qwen-2.5-VL-3B | 24576 tokens, 768 frames | 55.1 | 48.3 | 45.6 | 66.9 | 55.7 | 58.3 | 84.3 | 79.6 | 81.5 | 59.35 | 54.31 | 56.55 |
| Qwen-2.5-VL-3B (limited to 2000 tokens) | 2000 tokens, 128 frames | 50.3 | 43.9 | 44.3 | 58.3 | 54.2 | 58.5 | 84.3 | 79.3 | 15.7 | 59.50 | 50.18 | 54.33 |
| Qwen-2.5-VL-7B | 24576 tokens, 768 frames | 60.6 | 66.7 | 51.8 | 70.5 | 56.6 | 68.4 | 88.3 | 84.9 | 85.6 | 69.34 | 61.23 | 64.84 |
| Gemma-3-4B | 4096 tokens, 16 frames | 45.4 | 36.8 | 57.1 | 50.6 | 46.3 | 25.0 | 79.2 | 58.9 | 32.3 | 48.91 | 47.24 | 47.98 |
| GPT4V (gpt-4-turbo-2024-04-09) | original image, 8 frames | 49.1 | 75.0 | 55.5 | 57.4 | 45.7 | 38.7 | 84.2 | 60.4 | 52.0 | 58.88 | 51.59 | 54.83 |
| GPT4o (gpt-4o-2024-08-06) | resized to 512, 128 frames | 61.6 | 66.6 | 61.8 | 50.2 | 41.7 | 60.6 | 84.2 | 73.2 | 50.5 | 67.15 | 56.42 | 61.19 |
| InternV-2-2B | 4096 tokens, 16 frames | 28.9 | 21.1 | 40.2 | 50.5 | 50.3 | 3.3 | 79.3 | 75.1 | 51.1 | 39.74 | 38.19 | 38.88 |
| InternV-2-4B | 4096 tokens, 16 frames | 33.8 | 36.0 | 22.8 | 54.2 | 52.0 | 22.7 | 83.0 | 76.9 | 51.6 | 46.11 | 39.75 | 42.58 |
| InternV-2-8B | 4096 tokens, 16 frames | 43.7 | 41.2 | 32.4 | 58.5 | 53.2 | 28.5 | 86.6 | 79.0 | 97.0 | 50.32 | 45.79 | 47.81 |
## Dependencies
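
The original dependency list is not reproduced here. At a minimum, the usage example below needs `torch` and a recent `transformers` (the model relies on custom code loaded via `trust_remote_code=True`); the remote processing code may require additional image/video decoding packages, so check the model repository for the authoritative list.

```
pip install torch transformers
```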
## Usage Example
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"

# The model ships custom modeling/processing code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Text-only (LLM) example ---
chat = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": "Hi, how have you been doing lately?"},
    {"role": "assistant", "content": "I'm doing well. What can I do for you today?"},
    {"role": "user", "content": "I'd like to demonstrate how the chat template works!"},
]

input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True)
input_ids = input_ids.to(device="cuda")
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("LLM example")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)

# --- Multimodal (VLM) example ---
# Each message carries a typed content dict; image entries may optionally include
# OCR text and entity-recognition (Lens) keywords to improve grounding.
vlm_chat = [
    {"role": "system", "content": {"type": "text", "text": "System prompt"}},
    {"role": "user", "content": {"type": "text", "text": "User text 1"}},
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff_sota.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
            "ocr": "List the words in the image in raster order. Even if the reading order is unnatural, it can be processed as long as raster order is followed.",
            "lens_keywords": "Gucci Ophidia, cross bag, small GG Supreme shoulder bag",
            "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
        }
    },
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
        }
    },
    {"role": "assistant", "content": {"type": "text", "text": "Assistant reply 1"}},
    {"role": "user", "content": {"type": "text", "text": "User text 2"}},
    {
        "role": "user",
        "content": {
            "type": "video",
            "filename": "rolling-mist-clouds.mp4",
            "video": "freenaturestock-rolling-mist-clouds.mp4",
        }
    },
    {"role": "user", "content": {"type": "text", "text": "User text 3"}},
]

# Resolve image/video sources, then build the multimodal inputs for generation.
new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
    new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)
output_ids = model.generate(
    input_ids=input_ids.to(device="cuda"),
    max_new_tokens=8192,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
    **preprocessed,
)
print(tokenizer.batch_decode(output_ids)[0])
```
- For the best image-understanding performance, provide auxiliary inputs such as OCR results and entity-recognition (Lens) keywords. The example code includes these enriched inputs by default; supplying such structured data in practice significantly improves output quality.