Heron-NVILA-Lite-1B open-source model: image-text interaction in Japanese and English

Heron NVILA Lite 1B

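Developed by turing-motors

A Japanese vision language model trained on the NVILA-Lite architecture, supporting image-text interaction in Japanese and English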

Image-to-Text

Safetensors

Multilingual · License: Apache-2.0 · #Japanese-VQA #Lightweight-multimodal #Conversational-AI

Downloads: 460

Released: 3/24/2025

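Model overview

Heron-NVILA-Lite-1B is a lightweight vision language model that takes image and text input and generates natural-language responses. It is optimized for Japanese and also supports English.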

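Model features

Lightweight architecture

Uses an efficient design of roughly 1B parameters, balancing performance against compute requirements

Multimodal understanding

Processes image and text input together and relates the two

Japanese optimization

Trained and optimized specifically for Japanese

Conversational interaction

Supports multi-turn image-text dialogue and keeps context consistent across turns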

Model capabilities

Image captioning

Visual question answering

Multimodal dialogue

Cross-lingual understanding

Image content comparison

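Use cases

Intelligent customer service

Product image inquiries

Users upload a product photo and receive product information and purchase suggestions

Education support

Visual learning

Generates explanatory text from textbook images

Content moderation

Image content analysis

Identifies and describes sensitive content in images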

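🚀 Heron-NVILA-Lite-1B

Heron-NVILA-Lite-1B is a vision language model based on the NVILA-Lite architecture and trained for Japanese. It processes both images and text for multimodal interaction.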

🚀 Quick start

Environment setup

# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may work as well, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
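
Because only three Transformers versions are reported as working, it can help to check the installed version before loading the model. The following is a minimal sketch using only the standard library; the helper names `is_tested_version`, `installed_transformers_version`, and `describe` are hypothetical and not part of the model's code.

```python
from importlib import metadata
from typing import Optional

# Versions the setup notes above report as working.
TESTED_VERSIONS = ("4.45.0", "4.46.0", "4.49.0")


def is_tested_version(version: str) -> bool:
    """Return True if `version` is one of the versions reported as working."""
    return version in TESTED_VERSIONS


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def describe(version: Optional[str]) -> str:
    """Summarize whether the given version matches a tested one."""
    if version is None:
        return "transformers is not installed"
    if is_tested_version(version):
        return f"transformers {version} is a tested version"
    return f"transformers {version} is untested and may or may not work"


print(describe(installed_transformers_version()))
```

An untested version is not necessarily broken; the check only flags that it falls outside the three versions listed above.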

Code example

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-1B"

# You can build the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# or load it directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Show the chat template
print(model.tokenizer.chat_template)

# Example: text-only generation
response = model.generate_content(["こんにちは"])
print(response)
print("---" * 40)

# Example: text + image generation
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])
print(response)
print("---" * 40)

# Example: generation with a GenerationConfig
from transformers import GenerationConfig
generation_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.5,
    do_sample=True,
)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config
)
print(response)
print("---" * 40)

# Example: text + image + text + image + text generation
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",
    images[1],
    "これはオーストリアの画像です",
    "各画像の違いを説明して"])
print(response)
print("---" * 40)
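
The last example passes `generate_content` a flat list that alternates images and captions and ends with a question. The sketch below shows one way to assemble such a list from (image, caption) pairs. `build_comparison_prompt` is a hypothetical helper, not part of the model's API, and placeholder strings stand in for PIL images so the snippet runs without downloading anything.

```python
from typing import Any, List, Sequence, Tuple


def build_comparison_prompt(
    captioned_images: Sequence[Tuple[Any, str]],
    question: str,
) -> List[Any]:
    """Flatten (image, caption) pairs plus a final question into one list."""
    if not captioned_images:
        raise ValueError("at least one (image, caption) pair is required")
    content: List[Any] = []
    for image, caption in captioned_images:
        content.append(image)    # the image comes first
        content.append(caption)  # then the text describing it
    content.append(question)     # the question closes the prompt
    return content


# Placeholder strings stand in for PIL.Image objects in this sketch.
content = build_comparison_prompt(
    [
        ("<image 0>", "これは日本の画像です"),
        ("<image 1>", "これはオーストリアの画像です"),
    ],
    "各画像の違いを説明して",
)
print(content)
# With a loaded model and real images: response = model.generate_content(content)
```

Keeping each caption directly after its image mirrors the ordering used in the example above.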

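✨ Key features

Multilingual support: supports Japanese and English.
Multimodal processing: handles image and text input for tasks such as image captioning and visual question answering.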

📚 Detailed documentation

Model overview

Property	Details
Developer	Turing Inc.
Vision encoder	paligemma-siglip-so400m-patch14-448
Projector	mlp_downsample_2x2_fix
LLM	Qwen2.5-0.5B-Instruct
Supported languages	Japanese, English
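
The component names hint at how many visual tokens reach the language model. The sketch below works through that arithmetic under stated assumptions: that `patch14-448` means a 448 x 448 pixel input split into 14 x 14 pixel patches, and that `mlp_downsample_2x2` merges each 2 x 2 group of patches into one token. Both readings are inferred from the names, not confirmed by the source, and the count ignores any multi-scale or tiling scheme the model may apply.

```python
def visual_tokens_per_tile(
    image_size: int = 448,  # assumed from "448" in the encoder name
    patch_size: int = 14,   # assumed from "patch14" in the encoder name
    downsample: int = 2,    # assumed from "2x2" in the projector name
) -> int:
    """Estimate visual tokens for one square tile after spatial downsampling."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    if patches_per_side % downsample != 0:
        raise ValueError("patch grid must be divisible by the downsample factor")
    tokens_per_side = patches_per_side // downsample
    return tokens_per_side * tokens_per_side


# 448 / 14 = 32 patches per side; 32 / 2 = 16 tokens per side; 16 * 16 = 256
print(visual_tokens_per_tile())  # 256
```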

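Training summary

Stage	Trained components	Data sources	Samples
Stage 1	Projector	Japanese image-text pairs, LLaVA pretraining data	1.1M
Stage 2	Projector, LLM	Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05)	13M
		Japanese image-text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), Japanese Wikipedia, Japanese LLaVA pretraining data, stair_captions	20M
Stage 3	Vision encoder, projector, LLM	llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, Japanese photo conversations, Japanese visual question answering, synthdog-ja (subset), ai2d, synthdog-en, sherlock	1.1M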
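
The staged schedule can be written down as data to make the progression easier to see. The sketch below encodes the trained components and rounded sample counts from the table. It assumes that any component not listed for a stage is frozen during that stage, which the table implies but does not state. The `Stage` class and `frozen_components` helper are illustrative names only.

```python
from dataclasses import dataclass
from typing import Tuple

ALL_COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trained: Tuple[str, ...]  # components updated in this stage
    samples: int              # rounded sample count from the table


STAGES = (
    Stage("Stage 1", ("projector",), 1_100_000),
    # Stage 2 has two data rows in the table: 13M + 20M.
    Stage("Stage 2", ("projector", "llm"), 13_000_000 + 20_000_000),
    Stage("Stage 3", ("vision_encoder", "projector", "llm"), 1_100_000),
)


def frozen_components(stage: Stage) -> Tuple[str, ...]:
    """Components not listed as trained (assumed frozen in this sketch)."""
    return tuple(c for c in ALL_COMPONENTS if c not in stage.trained)


total_samples = sum(stage.samples for stage in STAGES)
for stage in STAGES:
    print(stage.name, "trains", stage.trained, "freezes", frozen_components(stage))
print("total samples (rounded):", total_samples)
```

Read this way, the schedule unfreezes one more component at each stage: the projector first, then the language model, then the vision encoder.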

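Evaluation

I used llm-jp-eval-mm for evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with "gpt-4o-2024-05-13" as the judge model. Sarashina2-Vision-14B was evaluated on its official blog with "gpt-4o-2024-08-06"; because the evaluation conditions differ, treat its results here as reference only.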

Model	LLM size	Heron-Bench overall LLM score (%)	JA-VLM-Bench-In-the-Wild LLM score (out of 5)	JA-VG-VQA-500 LLM score (out of 5)
Heron-NVILA-Lite-1B	0.5B	45.9	2.92	3.16
Heron-NVILA-Lite-2B	1.5B	52.8	3.52	3.50
Heron-NVILA-Lite-15B	14B	59.6	4.2	3.82
LLaVA-CALM2-SigLIP	7B	43.3	3.15	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	2.92	2.96
VILA-jp	13B	57.2	3.69	3.62
Asagi-14B	13B	55.8	3.44	3.84
Sarashina2-Vision-14B	13B	50.9	4.1	3.43
Qwen2-VL 7B Instruct	7B	55.5	3.61	3.6
GPT-4o	-	87.6	3.85	3.58
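
To compare models on a single benchmark, the scores in the table above can be loaded into a small dictionary and sorted. This is a minimal sketch: the values are copied from the table, while the `rank_by` helper and the column indices are illustrative choices.

```python
from typing import Dict, List, Tuple

# (Heron-Bench %, JA-VLM-Bench-In-the-Wild /5, JA-VG-VQA-500 /5), from the table.
SCORES: Dict[str, Tuple[float, float, float]] = {
    "Heron-NVILA-Lite-1B": (45.9, 2.92, 3.16),
    "Heron-NVILA-Lite-2B": (52.8, 3.52, 3.50),
    "Heron-NVILA-Lite-15B": (59.6, 4.2, 3.82),
    "LLaVA-CALM2-SigLIP": (43.3, 3.15, 3.21),
    "Llama-3-EvoVLM-JP-v2": (39.3, 2.92, 2.96),
    "VILA-jp": (57.2, 3.69, 3.62),
    "Asagi-14B": (55.8, 3.44, 3.84),
    "Sarashina2-Vision-14B": (50.9, 4.1, 3.43),
    "Qwen2-VL 7B Instruct": (55.5, 3.61, 3.6),
    "GPT-4o": (87.6, 3.85, 3.58),
}

HERON_BENCH, JA_VLM_WILD, JA_VG_VQA = 0, 1, 2


def rank_by(metric: int) -> List[str]:
    """Return model names sorted from highest to lowest score on one metric."""
    return sorted(SCORES, key=lambda name: SCORES[name][metric], reverse=True)


for position, name in enumerate(rank_by(HERON_BENCH), start=1):
    print(position, name, SCORES[name][HERON_BENCH])
```

The ranking differs by benchmark, so `rank_by(JA_VLM_WILD)` and `rank_by(JA_VG_VQA)` are worth checking alongside the Heron-Bench ordering. The Sarashina2-Vision-14B caveat about evaluation conditions still applies.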