---
license: other
license_name: hyperclovax-seed
license_link: LICENSE
library_name: transformers
---

## Overview

HyperCLOVAX-SEED-Vision-Instruct-3B is a multimodal model developed by NAVER, post-trained from an in-house backbone model. It understands text and images and generates text, and it uses a lightweight architecture designed for compute efficiency. On the vision side it handles tasks such as visual question answering (VQA) and chart understanding, and it is tuned toward a Pareto-optimal balance for Korean-language scenarios. Compared with models of similar size, it consumes fewer visual tokens at inference time while delivering competitive performance. As the first openly released image-text understanding model from Korea, its Korean-language capability is significantly stronger than that of comparable open-source models, and it is expected to make an important contribution to Korea's sovereign AI capabilities.
## Basic Information

- Model architecture: LLaVA-based vision-language model
- LLM module: Transformer-based dense model
- Vision encoder: SigLIP, with a 378x378-pixel input resolution per grid
- Vision-language connector: C-Abstractor with an AnyRes mechanism, handling up to 9 grids for a total of about 1.29 million pixels (see the sketch after this list)
- Parameters: 3.2B (language module) + 0.43B (vision module)
- Input/output format: text + image + video in / text out
- Context length: 16k tokens
- Knowledge cutoff: training data collected before August 2024
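
To make the pixel budget above concrete, here is a minimal sketch of the arithmetic behind the AnyRes grid tiling. The `grid_layout` helper and its selection heuristic are illustrative assumptions, not the model's actual processor logic:

```python
import math

# Assumptions from the spec above: up to 9 grids, each 378x378 px.
GRID_SIZE = 378
MAX_GRIDS = 9

max_pixels = MAX_GRIDS * GRID_SIZE * GRID_SIZE
print(f"max pixels per image: {max_pixels:,}")  # 1,285,956 ~= 1.29M

def grid_layout(width: int, height: int, max_grids: int = MAX_GRIDS) -> tuple[int, int]:
    """Pick a (cols, rows) tiling that covers the image without exceeding the
    grid budget. Illustrative heuristic only; the released processor may use a
    different AnyRes selection rule."""
    cols = max(1, math.ceil(width / GRID_SIZE))
    rows = max(1, math.ceil(height / GRID_SIZE))
    while cols * rows > max_grids:
        # Shrink the larger dimension first until the budget is met.
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(grid_layout(1920, 1080))  # a full-HD image fits in a (3, 3) tiling
```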
## Training Process

### Text Module

Using an automated verification system powered by HyperCLOVA X, we overcame the cost, expertise, and error-tolerance limitations of human annotation and significantly improved performance on verifiable tasks such as math and coding. The model starts from HyperCLOVAX-SEED-Text-Base-3B and is trained with supervised fine-tuning (SFT) combined with RLHF based on the GRPO online reinforcement learning algorithm.
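
For context, GRPO estimates advantages group-relatively: several responses are sampled per prompt, scored by a reward signal (for example, an automated verifier on math/coding tasks), and each reward is normalized against its group's statistics instead of a learned value function. Below is a minimal sketch of that advantage step only; the full policy-update loop, reward model, and hyperparameters are not specified here:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_prompts, group_size) scores for sampled responses.
    Each response's advantage is its reward normalized by the mean/std of
    its own group (all responses to the same prompt).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 verifier-scored responses each (1 = passed, 0 = failed).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```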
### Vision Module

Carefully designed architectural extensions add capabilities such as visual question answering and chart understanding while preserving the base model's language ability. This lightweight 3B model is specifically optimized for video token efficiency, using dynamic frame sampling for efficient video understanding. Vision-specific V-RLHF data is introduced during the RLHF stage, enabling OCR-free processing.
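
The benchmark table below reports a budget of 1856 tokens over 108 frames per video, which suggests frame sampling against a fixed per-video budget. Here is a rough sketch of budget-driven frame sampling; the uniform-sampling rule and the standalone 108-frame cap are assumptions, not the released processor's exact dynamic sampling policy:

```python
def sample_frame_indices(num_frames: int, max_frames: int = 108) -> list[int]:
    """Select frame indices under a fixed per-video frame budget.
    Short clips keep every frame; long clips are sampled uniformly.
    Illustrative only, not the model's actual policy."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# Example: a 60 s clip at 30 fps (1800 frames) is reduced to 108 frames.
indices = sample_frame_indices(1800)
print(len(indices), indices[:5])  # 108 [0, 16, 33, 50, 66]
```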
## Benchmarks

### Text

| Model | KMMLU (5-shot, acc) | HAE-RAE (5-shot, acc) | CLIcK (5-shot, acc) | KoBEST (5-shot, acc) |
| --- | --- | --- | --- | --- |
| HyperCLOVAX-SEED-Text-Base-3B | 0.4847 | 0.7635 | 0.6386 | 0.7792 |
| HyperCLOVAX-SEED-Vision-Instruct-3B | 0.4422 | 0.6499 | 0.5599 | 0.7180 |
| Qwen2.5-3B-instruct | 0.4451 | 0.6031 | 0.5649 | 0.7053 |
| gemma-3-4b-it | 0.3895 | 0.6059 | 0.5303 | 0.7262 |
### Vision

| Model | Max tokens per video | VideoMME (Ko) | NAVER-TV-CLIP (Ko) | VideoChatGPT (Ko) | PerceptionTest (En) | ActivityNet-QA (En) | KoNet (Ko) | MMBench-Val (En) | TextVQA-Val (En) | Korean VisIT-Bench (Ko) | Image (4 benchmarks) | Video (5 benchmarks) | Overall (9 benchmarks) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HyperCLOVAX-SEED-Vision-Instruct-3B | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 69.2 | 81.8 | 79.2 | 37.0 | 46.68 | 53.70 | 59.54 |
| HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR) | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 36.6 | 80.7 | 76.0 | 43.5 | 56.74 | 53.70 | 55.05 |
| Qwen-2.5-VL-3B | 24576 tokens, 768 frames | 55.1 | 48.3 | 45.6 | 66.9 | 55.7 | 58.3 | 84.3 | 79.6 | 81.5 | 59.35 | 54.31 | 56.55 |
| Qwen-2.5-VL-3B (limited to 2000 tokens) | 2000 tokens, 128 frames | 50.3 | 43.9 | 44.3 | 58.3 | 54.2 | 58.5 | 84.3 | 79.3 | 15.7 | 59.50 | 50.18 | 54.33 |
| Qwen-2.5-VL-7B | 24576 tokens, 768 frames | 60.6 | 66.7 | 51.8 | 70.5 | 56.6 | 68.4 | 88.3 | 84.9 | 85.6 | 69.34 | 61.23 | 64.84 |
| Gemma-3-4B | 4096 tokens, 16 frames | 45.4 | 36.8 | 57.1 | 50.6 | 46.3 | 25.0 | 79.2 | 58.9 | 32.3 | 48.91 | 47.24 | 47.98 |
| GPT4V (gpt-4-turbo-2024-04-09) | original image, 8 frames | 49.1 | 75.0 | 55.5 | 57.4 | 45.7 | 38.7 | 84.2 | 60.4 | 52.0 | 58.88 | 51.59 | 54.83 |
| GPT4o (gpt-4o-2024-08-06) | resized to 512, 128 frames | 61.6 | 66.6 | 61.8 | 50.2 | 41.7 | 60.6 | 84.2 | 73.2 | 50.5 | 67.15 | 56.42 | 61.19 |
| InternV-2-2B | 4096 tokens, 16 frames | 28.9 | 21.1 | 40.2 | 50.5 | 50.3 | 3.3 | 79.3 | 75.1 | 51.1 | 39.74 | 38.19 | 38.88 |
| InternV-2-4B | 4096 tokens, 16 frames | 33.8 | 36.0 | 22.8 | 54.2 | 52.0 | 22.7 | 83.0 | 76.9 | 51.6 | 46.11 | 39.75 | 42.58 |
| InternV-2-8B | 4096 tokens, 16 frames | 43.7 | 41.2 | 32.4 | 58.5 | 53.2 | 28.5 | 86.6 | 79.0 | 97.0 | 50.32 | 45.79 | 47.81 |
## Dependencies
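
The original dependency list is not reproduced here. At a minimum, the usage example below needs `torch` and a recent `transformers` (the model relies on custom code loaded via `trust_remote_code=True`); the remote processing code may require additional image/video decoding packages, so check the model repository for the authoritative list.

```
pip install torch transformers
```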
## Usage Example
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"

# The model ships custom modeling/processing code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Text-only (LLM) example ---
chat = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": "Hi, how have you been doing lately?"},
    {"role": "assistant", "content": "I'm doing well. What can I do for you today?"},
    {"role": "user", "content": "I'd like to demonstrate how the chat template works!"},
]

input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True)
input_ids = input_ids.to(device="cuda")
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("LLM example")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)

# --- Multimodal (VLM) example ---
# Each message carries a typed content dict; image entries may optionally include
# OCR text and entity-recognition (Lens) keywords to improve grounding.
vlm_chat = [
    {"role": "system", "content": {"type": "text", "text": "System prompt"}},
    {"role": "user", "content": {"type": "text", "text": "User text 1"}},
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff_sota.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
            "ocr": "List the words in the image in raster order. Even if the reading order is unnatural, it can be processed as long as raster order is followed.",
            "lens_keywords": "Gucci Ophidia, cross bag, small GG Supreme shoulder bag",
            "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
        }
    },
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
        }
    },
    {"role": "assistant", "content": {"type": "text", "text": "Assistant reply 1"}},
    {"role": "user", "content": {"type": "text", "text": "User text 2"}},
    {
        "role": "user",
        "content": {
            "type": "video",
            "filename": "rolling-mist-clouds.mp4",
            "video": "freenaturestock-rolling-mist-clouds.mp4",
        }
    },
    {"role": "user", "content": {"type": "text", "text": "User text 3"}},
]

# Resolve image/video sources, then build the multimodal inputs for generation.
new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
    new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)
output_ids = model.generate(
    input_ids=input_ids.to(device="cuda"),
    max_new_tokens=8192,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
    **preprocessed,
)
print(tokenizer.batch_decode(output_ids)[0])
```
- For the best image-understanding performance, provide auxiliary inputs such as OCR results and entity-recognition (Lens) keywords. The example code includes these enriched inputs by default; supplying such structured data in practice significantly improves output quality.