🚀 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
This repository contains the OpenVLA-OFT checkpoint for LIBERO-Object described in the paper Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. OpenVLA-OFT improves substantially over the base OpenVLA model by adopting an optimized fine-tuning recipe.
Project page: https://openvla-oft.github.io/
Code repository: https://github.com/openvla-oft/openvla-oft
Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft
🚀 Quick Start
This example shows how to generate an action chunk with a pretrained OpenVLA-OFT checkpoint. Make sure you have set up the conda environment as described in the GitHub README.
Basic Usage
```python
import pickle

from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate the evaluation config used to load the policy
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)

# Load the OpenVLA-OFT policy and its input processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load the continuous action head and the proprioception projector
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load a sample LIBERO-Spatial observation (images, proprioceptive state, and task description)
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate a chunk of robot actions for this observation
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```
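The chunk returned by `get_vla_action` is a short sequence of consecutive low-level actions that is meant to be executed open-loop before the policy is queried again (its length is `NUM_ACTIONS_CHUNK`, matching `num_open_loop_steps` in the config above). Below is a minimal, hypothetical rollout sketch that continues from the snippet above; the `env` object and its `reset`/`step` interface are illustrative assumptions, not part of this repository's example, and any observation you pass to `get_vla_action` must use the same format as the sample pickle.

```python
import numpy as np

# Hypothetical open-loop rollout sketch (continues from the snippet above).
# `env` stands in for your own simulator or robot interface; its reset()/step()
# methods are assumptions here, and the observations it returns must match the
# format of the sample pickle expected by get_vla_action.
def run_episode(env, task_description, max_steps=300):
    observation = env.reset()
    for _ in range(0, max_steps, NUM_ACTIONS_CHUNK):
        # Query the policy once, then execute the whole action chunk open-loop.
        actions = get_vla_action(
            cfg, vla, processor, observation, task_description,
            action_head, proprio_projector,
        )
        for act in actions:
            observation, reward, done, info = env.step(np.asarray(act))
            if done:
                return True
    return False
```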
📄 License
This project is released under the MIT License.
📚 Citation
```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```
| Attribute | Details |
| --- | --- |
| Pipeline tag | Robotics |
| Library name | transformers |
| License | MIT |