OpenVLA v0.1 7B Open-Source Model - A Vision-Language-Action Tool for Controlling Multiple Robots

Developed by openvla

OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset. It supports control of multiple robots.

Transformers

Language: English | License: MIT | #RobotActionControl #MultimodalVisionLanguage #ZeroShotGeneralization

Downloads: 30

Release date: 6/10/2024

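Model Introduction

OpenVLA v0.1 7B is a vision-language-action model that takes a language instruction and a camera image as input and generates robot actions. It can control multiple robots out of the box and can be quickly adapted to new robot domains through fine-tuning.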

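Model Features

Multi-robot support

Controls, out of the box, the robots already included in the pretraining data

Efficient fine-tuning

Can be fine-tuned efficiently on a small amount of demonstration data to adapt to new tasks and robot setups

Open source

All checkpoints and the training codebase are released under the MIT License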

Model Capabilities

Robot action prediction

Vision-language understanding

Multimodal input processing

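Use Cases

Robot control

Zero-shot robot control

Executes instructions zero-shot on robot setups included in the pretraining data

Can control robots present in the pretraining data, such as the Widow-X robot

Adapting to new domains

Adapts quickly to new robot domains through fine-tuning

Requires collecting a demonstration dataset on the target setup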

🚀 OpenVLA v0.1 7B

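OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset. The model takes a language instruction and a camera image as input and generates robot actions. It can control multiple robots out of the box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

Notes

OpenVLA v0.1 is an early model that we trained for development purposes; for our best model, see [openvla/openvla-7b](https://huggingface.co/openvla/openvla-7b).

All OpenVLA checkpoints, as well as our training codebase, are released under the MIT License. For full details, please read our paper and see our project page.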

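🚀 Quick Start

OpenVLA v0.1 7B can be used out of the box to control multiple robots in the domains covered by the pretraining mixture. The example below loads openvla-v01-7b for zero-shot instruction following in the BridgeV2 environment with a Widow-X robot: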

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt (note inclusion of system prompt due to Vicuña base model)
# `get_from_camera` is a placeholder for your own camera interface.
image: Image.Image = get_from_camera(...)
instruction = "<INSTRUCTION>"  # replace with the task text that completes "...take to"
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {instruction}? ASSISTANT:"

# Predict Action (7-DoF; un-normalize for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute... (`robot` is a placeholder for your own robot interface)
robot.act(action, ...)

For more examples, including scripts for fine-tuning OpenVLA models on your own robot demonstration datasets, see our training repository.
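
The prompt format used in the example above can be wrapped in a small helper so that every instruction is formatted the same way. This is only a minimal sketch: the function name `build_prompt`, the whitespace stripping, and the sample instruction are illustrative assumptions, not part of the OpenVLA codebase.

```python
# System prompt copied from the quick-start example (needed for the Vicuna-based model).
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(instruction: str) -> str:
    """Format a task instruction in the same pattern as the quick-start example.

    `build_prompt` is a hypothetical helper name; only the string pattern
    comes from the example above.
    """
    instruction = instruction.strip()  # assumption: drop stray surrounding whitespace
    return f"{SYSTEM_PROMPT} USER: What action should the robot take to {instruction}? ASSISTANT:"


# Hypothetical instruction, for illustration only.
print(build_prompt("pick up the red block"))
```

The returned string can be passed to the processor in place of the hand-written `prompt`.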

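✨ Key Features

Multi-robot control: controls multiple robots out of the box.
Fast adaptation: adapts quickly to new robot domains via (parameter-efficient) fine-tuning.
Zero-shot use: performs zero-shot robot control for the specific combinations of embodiments and domains seen in the Open-X pretraining mixture.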

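📚 Detailed Documentation

Model Overview

Developed by: the OpenVLA team, with members from Stanford University, UC Berkeley, Google DeepMind, and the Toyota Research Institute.
Model type: vision-language-action (language, image => robot actions)
Language(s) (NLP): English
License: MIT
Fine-tuned from: [siglip-224px](https://github.com/TRI-ML/prismatic-vlms), a vision-language model trained from:
- Vision backbone: SigLIP ViT-So400M/14
- Language model: Vicuna v1.5
Pretraining dataset: [Open X-Embodiment](https://robotics-transformer-x.github.io/); the specific component datasets are documented separately.
Repository: https://github.com/openvla/openvla
Paper: OpenVLA: An Open-Source Vision-Language-Action Model
Project page and videos: https://openvla.github.io/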

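Usage

OpenVLA models take a language instruction and a camera image of the robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To be executed on a real robot platform, actions must be un-normalized using statistics computed per robot and per dataset. See our repository for more information.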
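
To make the un-normalization step concrete, the sketch below maps a normalized 7-DoF action back to robot units. It assumes each dimension was normalized linearly to [-1, 1] between per-dataset lower and upper bounds, and that some dimensions (here the gripper) may be passed through unchanged. The function name, the bounds, and the mask are hypothetical values chosen for illustration; the statistics actually used by OpenVLA are defined in its repository.

```python
from typing import List, Optional, Sequence


def unnormalize_action(
    normalized: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    mask: Optional[Sequence[bool]] = None,
) -> List[float]:
    """Map each normalized value in [-1, 1] linearly onto [low, high].

    Dimensions whose mask entry is False are returned unchanged.
    Assumes a linear min/max-style normalization (an assumption, see lead-in).
    """
    if mask is None:
        mask = [True] * len(normalized)
    if not (len(normalized) == len(low) == len(high) == len(mask)):
        raise ValueError("normalized, low, high and mask must have equal length")
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo if m else float(a)
        for a, lo, hi, m in zip(normalized, low, high, mask)
    ]


# Hypothetical per-dataset bounds (NOT real BridgeV2 statistics).
LOW = [-0.03, -0.03, -0.03, -0.08, -0.08, -0.08, 0.0]
HIGH = [0.03, 0.03, 0.03, 0.08, 0.08, 0.08, 1.0]
MASK = [True, True, True, True, True, True, False]  # leave the gripper value as-is

normalized_action = [1.0, -1.0, 0.0, 0.5, -0.5, 0.0, 1.0]
x, y, z, roll, pitch, yaw, gripper = unnormalize_action(normalized_action, LOW, HIGH, MASK)
print(x, y, z, roll, pitch, yaw, gripper)
```

In the quick-start example this step happens inside `predict_action` when `unnorm_key` is given, so a helper like this is only needed when working with raw normalized outputs.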

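OpenVLA models can be used zero-shot to control robots for the specific combinations of embodiments and domains seen in the Open-X pretraining mixture (e.g., [BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). Given minimal demonstration data, they can also be fine-tuned efficiently for new tasks and robot setups; see our repository for details.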
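
To give a rough sense of why parameter-efficient fine-tuning is cheap, the sketch below counts trainable parameters for a low-rank adapter (LoRA-style) on a single square weight matrix. LoRA is used here only as one common parameter-efficient technique; this page does not state which method OpenVLA uses. The layer size of 4096 (the hidden size of a 7B Llama-style language model) and the rank of 32 are illustrative assumptions.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only (assumptions, not taken from the OpenVLA configuration).
hidden = 4096
rank = 32

trainable = lora_trainable_params(hidden, hidden, rank)
total = full_params(hidden, hidden)
ratio = trainable / total
print(f"trainable: {trainable}, full: {total}, fraction: {ratio:.4f}")
```

Under these assumptions the adapter trains 1/64 of the weights of the layer it wraps, which is why a small demonstration dataset and modest hardware can be enough.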

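Out-of-Scope Use

OpenVLA models do not generalize zero-shot to new (unseen) robot embodiments or to setups that are not covered by the pretraining mixture. In these cases, we recommend collecting a demonstration dataset on the desired setup and fine-tuning the OpenVLA model on it.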

📄 License

This project is released under the MIT License.

📖 Citation

@article{kim24openvla,
    title={OpenVLA: An Open-Source Vision-Language-Action Model},
    author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    journal = {arXiv preprint arXiv:2406.09246},
    year={2024}
}