Skywork-VL-Reward-7B: An Open-Source Multimodal Reward Model Based on the Qwen2.5-VL Architecture

Skywork VL Reward 7B

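Developed by Skywork

Skywork-VL-Reward-7B is a 7B-parameter multimodal reward model. It is built on the Qwen2.5-VL-7B-Instruct architecture, with an added value head used for reward model training.

Multimodal Fusion

Transformers

License: MIT #Multimodal reward model #Vision-language understanding #Reinforcement learning optimization

Downloads: 30

Release date: 4/25/2025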

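Model Overview

An efficient reward model for multimodal understanding and reasoning, designed to support multimodal reinforcement learning.

Model Features

Multimodal understanding

Processes image and text inputs together for multimodal understanding and reasoning.

Strong performance

Achieves state-of-the-art overall accuracy on VL-RewardBench (73.1) and the highest RewardBench score (90.1) among the multimodal reward models compared in the evaluation tables.

Open-source contribution

Provides the open-source community with a strong multimodal reward model.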

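Model Capabilities

Multimodal understanding

Image-text analysis

Reward model training

Use Cases

Multimodal reinforcement learning

Multimodal reward model training

Supplies reward signals for training multimodal reinforcement learning models.

Scores 73.1 on VL-RewardBench, a state-of-the-art result.

Image-text understanding

Image-text analysis

Analyzes combined image and text information to support understanding and reasoning.

Scores 90.1 on RewardBench.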

🚀 Skywork-VL-Reward

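Skywork-VL-Reward is an open-source 7B multimodal reward model. It is built on the Qwen2.5-VL-7B-Instruct architecture and trained with an added value head. The model scores 73.1 overall accuracy on VL-RewardBench and 90.1 on RewardBench, and is intended to supply reward signals for multimodal reinforcement learning.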
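
To make "value head" concrete: in TRL's `AutoModelForCausalLMWithValueHead` (the wrapper used in the quick-start code), the head is a linear layer that maps each token's hidden state to one scalar. The following is a minimal plain-Python sketch of that projection. The hidden states, weights, and bias are made-up toy numbers, not values from the released checkpoint.

```python
from typing import List


def value_head(hidden_states: List[List[float]], weight: List[float], bias: float) -> List[float]:
    """Project each token's hidden vector to a single scalar: w . h + b."""
    return [sum(w * h for w, h in zip(weight, vec)) + bias for vec in hidden_states]


# Toy example: 3 tokens, hidden size 2 (hypothetical numbers).
hidden = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = value_head(hidden, weight=[0.5, -1.0], bias=0.25)
print(values)  # one scalar per token
```

The real model produces one such scalar per token. The inference script then keeps only the scalar at the last non-padding token as the reward.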

🚀 Quick Start

Environment setup

conda create -n vl-reward python=3.11
conda activate vl-reward
bash setup.sh

Run the inference code

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import AutoModelForCausalLMWithValueHead
from qwen_vl_utils import process_vision_info
from transformers.utils import cached_file
from safetensors import safe_open


processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Skywork/Skywork-VL-Reward-7B", min_pixels=min_pixels, max_pixels=max_pixels)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Skywork/Skywork-VL-Reward-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# pip install flash-attn --no-build-isolation
#
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Skywork/Skywork-VL-Reward-7B",
#     device_map="auto",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
# )

model = AutoModelForCausalLMWithValueHead.from_pretrained(model)
vhead_file = cached_file(
    path_or_repo_id="Skywork/Skywork-VL-Reward-7B", filename="value_head.safetensors"
)
with safe_open(vhead_file, framework="pt", device="cpu") as f:
    vhead_params = {key: f.get_tensor(key) for key in f.keys()}
model.load_state_dict(vhead_params, strict=False)
model.requires_grad_(False)
model.eval()

# score: 23.89
# if you use flash_attention_2 the score will be 23.76
demo_image = "demo.jpg"
demo_question = "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.\nQuestion: Is Purple the highest value?\nChoices:\n(A) no\n(B) yes"
demo_answer = "The answer is: B"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": demo_image,
            },
            {
                "type": "text",
                "text": demo_question,
            },
        ],
    },
    {
        "role": "assistant",
        "content": demo_answer,
    },
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
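# Move inputs to the GPU (assumes the model was placed on a CUDA device).
inputs = inputs.to("cuda")
# The wrapped model returns (lm_logits, loss, value); take the per-token values.
values = model(**inputs, return_dict=True, use_cache=False)[-1]
# Select the value at the last non-padding token of each sequence as its reward.
scores = values.gather(
    dim=-1, index=(inputs["attention_mask"].sum(dim=-1, keepdim=True) - 1)
)
score = scores[0].item()
print("Reward Score is: ", score)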
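
The `gather` call at the end of the script picks, for each sequence, the per-token value at index `attention_mask.sum() - 1`, that is, the last non-padding token. Below is a plain-Python sketch of the same index arithmetic on toy lists (no PyTorch). The numbers are hypothetical, and the sketch assumes padding is on the right, which is an assumption made here rather than something the source states.

```python
from typing import List


def last_token_scores(values: List[List[float]], attention_mask: List[List[int]]) -> List[float]:
    """Return the value at the last attended position of each sequence.

    Assumes right padding: the mask is 1 for real tokens, then 0 for padding.
    """
    scores = []
    for row_values, row_mask in zip(values, attention_mask):
        last_index = sum(row_mask) - 1  # same arithmetic as attention_mask.sum(-1) - 1
        scores.append(row_values[last_index])
    return scores


# Two toy sequences padded to length 4 (hypothetical per-token values).
values = [[0.1, 0.4, 2.5, 0.0], [0.3, 1.2, 0.7, 3.1]]
mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
print(last_token_scores(values, mask))
```

With left padding the same index would point into the middle of the sequence, so the padding side is worth checking before batching several samples.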

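✨ Key Features

Open-source multimodal reward model: Skywork-VL-Reward is an open-source 7B multimodal reward model that provides reward signals for multimodal reinforcement learning.
Based on the Qwen2.5-VL-7B-Instruct architecture: the model adds a value head to Qwen2.5-VL-7B-Instruct and is trained as a reward model.
Strong evaluation results: Skywork-VL-Reward scores 73.1 overall accuracy on VL-RewardBench and 90.1 on RewardBench.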
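
One common way to use a scalar reward such as the one printed by the inference script is best-of-N selection: score several candidate answers to the same prompt and keep the highest-scoring one. The sketch below assumes a hypothetical `score_fn(question, answer)` callable that wraps the model call. Here it is replaced by a toy stand-in with made-up scores so the example runs without a GPU.

```python
from typing import Callable, List, Tuple


def best_of_n(question: str, candidates: List[str],
              score_fn: Callable[[str, str], float]) -> Tuple[str, float]:
    """Score every candidate and return the (answer, score) pair with the highest reward."""
    scored = [(answer, score_fn(question, answer)) for answer in candidates]
    return max(scored, key=lambda pair: pair[1])


def toy_score_fn(question: str, answer: str) -> float:
    """Hypothetical stand-in for the reward model: looks scores up in a fixed table."""
    toy_scores = {"The answer is: A": -3.0, "The answer is: B": 5.0}
    return toy_scores[answer]


best_answer, best_score = best_of_n(
    "Is Purple the highest value?",
    ["The answer is: A", "The answer is: B"],
    toy_score_fn,
)
print(best_answer, best_score)
```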

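📚 Documentation

🔥 News

May 12, 2025: Our technical report is now available on arXiv. Citations are welcome: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
April 24, 2025: We released Skywork-VL-Reward-7B, a multimodal reward model that performs strongly on VL-RewardBench, and published the technical report in the R1V GitHub repository.

Technical Report

Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Evaluation Results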

VL-RewardBench

Model Name	Model Size	General	Hallucination	Reasoning	Overall Accuracy	Macro Average
Proprietary models
Claude-3.5-Sonnet (2024-06-22)	-	43.4	55.0	62.3	55.3	53.6
Gemini-1.5-Flash (2024-09-24)	-	47.8	59.6	58.4	57.6	55.3
GPT-4o (2024-08-06)	-	49.1	67.6	70.5	65.8	62.4
Gemini-1.5-Pro (2024-09-24)	-	50.8	72.5	64.2	67.2	62.5
Gemini-2.0-flash-exp (2024-12)	-	50.8	72.6	70.1	68.8	64.5
Open-source models
Qwen2-VL-7B-Instruct	7B	31.6	19.1	51.1	28.3	33.9
MAmmoTH-VL-8B	8B	36.0	40.0	52.0	42.2	42.7
Qwen2.5-VL-7B-Instruct	7B	43.4	42.0	63.0	48.0	49.5
InternVL3-8B	8B	60.6	44.0	62.3	57.0	55.6
IXC-2.5-Reward-7B	7B	80.3	65.3	60.4	66.3	68.6
Qwen2-VL-72B-Instruct	72B	38.1	32.8	58.0	39.5	43.0
Molmo-72B-0924	72B	33.9	42.3	54.9	44.1	43.7
QVQ-72B-Preview	72B	41.8	46.2	51.2	46.4	46.4
Qwen2.5-VL-72B-Instruct	72B	47.8	46.8	63.5	51.6	52.7
InternVL3-78B	78B	67.8	52.5	64.5	63.3	61.6
Skywork-VL Reward (ours)	7B	66.0	80.0	61.0	73.1	69.0
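
The "Macro Average" column appears to be the unweighted mean of the three category scores (General, Hallucination, Reasoning), whereas overall accuracy is computed over all samples and so depends on how many samples each category contains. This reading is inferred from the numbers in the table, not stated by the source. A short check against two rows:

```python
def macro_average(general: float, hallucination: float, reasoning: float) -> float:
    """Unweighted mean of the three category accuracies, rounded to one decimal."""
    return round((general + hallucination + reasoning) / 3, 1)


# Category scores copied from the VL-RewardBench table above.
skywork = macro_average(66.0, 80.0, 61.0)
gpt4o = macro_average(49.1, 67.6, 70.5)
print(skywork, gpt4o)
```

Both values match the table (69.0 and 62.4). Some other rows differ by 0.1, presumably because the published category scores are already rounded.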

RewardBench

Model Name	Chat	Chat Hard	Safety	Reasoning	Score
Language-only reward models
InternLM2-7B-Reward	99.2	69.5	87.2	94.5	87.6
Skywork-Reward-Llama3.1-8B	95.8	87.3	90.8	96.2	92.5
Skywork-Reward-Llama-3.1-8B-v0.2	94.7	88.4	92.7	96.7	93.1
QRM-Llama3.1-8B-v2	96.4	86.8	92.6	96.8	93.1
Multimodal reward models
Qwen2-VL-7B-Instruct	65.1	50.9	55.8	68.3	60.0
InternVL3-8B	97.2	50.4	83.6	83.9	78.8
Qwen2.5-VL-7B-Instruct	94.3	63.8	84.1	86.2	82.1
IXC-2.5-Reward-7B	90.8	83.8	87.8	90.0	88.1
Skywork-VL Reward (ours)	90.0	87.5	91.1	91.8	90.1
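
Benchmarks of this kind are typically scored as pairwise preference accuracy: each item has a chosen and a rejected response, and the reward model is counted correct when it scores the chosen one higher. The following is a minimal sketch of that metric. It assumes this protocol and uses hypothetical score pairs rather than real benchmark data.

```python
from typing import List, Tuple


def pairwise_accuracy(score_pairs: List[Tuple[float, float]]) -> float:
    """Percentage of (chosen, rejected) pairs where the chosen response scores higher."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return 100.0 * correct / len(score_pairs)


# Hypothetical reward scores: (chosen, rejected) for four preference pairs.
pairs = [(23.9, 4.1), (1.5, 2.0), (7.3, -0.8), (0.2, -5.6)]
accuracy = pairwise_accuracy(pairs)
print(f"{accuracy:.1f}")
```

Only the ordering of the two scores matters for this metric, so the absolute scale of the reward (for example, 23.89 in the demo) does not affect the result.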

📄 License

This project is released under the MIT License.

📝 Citation

If you use this work in your research, please cite:

@misc{wang2025skyworkvlrewardeffectivereward,
      title={Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning}, 
      author={Xiaokun Wang and Chris and Jiangbo Pei and Wei Shen and Yi Peng and Yunzhuo Hao and Weijie Qiu and Ai Jian and Tianyidan Xie and Xuchen Song and Yang Liu and Yahui Zhou},
      year={2025},
      eprint={2505.07263},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.07263}, 
}