LLaVAction-0.5B: an open-source multimodal large language model for efficient action recognition

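LLaVAction 0.5B

Developed by MLAdaptiveIntelligence

LLaVAction is a multimodal large language model for action recognition. It is built on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset.

Video-to-text

Transformers

English

#Egocentric action recognition #Multimodal video question answering #Long video understanding

Downloads: 215

Released: 3/24/2025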

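Model overview

The model targets video action recognition: it interprets actions in egocentric (first-person) video and is suited to analyzing footage similar to EPIC-KITCHENS-100.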

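Model features

Multimodal understanding

Combines visual and language information to interpret video content and generate descriptions of it

Egocentric action recognition

Specifically targets hand-object interactions in first-person video

Large context window

Supports a 32K-token context window, which suits long video content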

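Model capabilities

Video content understanding

Action recognition

Multimodal question answering

Video frame analysis

Temporal information processing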

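Use cases

Smart home

Kitchen activity analysis

Identifies the operations a user performs in the kitchen

Recognizes common kitchen actions such as chopping vegetables and cooking

Behavioral research

Daily activity analysis

Studies patterns and habits in everyday human activity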

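🚀 LLaVAction-0.5B

LLaVAction-0.5B is a multimodal large language model for action recognition. Built on the Qwen2 language model, it handles video-to-text tasks and is aimed at action recognition applications.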

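🚀 Quick start

The LLaVAction-0.5B model is based on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset, with a context window of 32K tokens.

Project page: https://mmathislab.github.io/llavaction/
Paper: for more details, see our paper (arxiv.org/abs/2503.18712)
Code repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Contact: Mackenzie Mathis
Supported language: English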

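✨ Key features

Multimodal processing: takes video and text as input and generates text.
Action recognition: focuses on action recognition and can describe the actions in a video in detail.
Built on a strong language model: based on Qwen2, with a 32K-token context window.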
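
To make the 32K-token context window more concrete, the sketch below estimates how many video frames could fit in it once room is set aside for the prompt and the generated answer. It is a back-of-the-envelope calculation, not a description of the model's internals: `TOKENS_PER_FRAME` is a hypothetical placeholder (the real number depends on the vision encoder and any pooling), and reading "32K" as 32,768 tokens is also an assumption.

```python
def max_frames_in_context(context_tokens: int, tokens_per_frame: int, reserved_tokens: int) -> int:
    """Return how many frames fit after reserving tokens for text."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = context_tokens - reserved_tokens
    # Never report a negative frame count when the reservation exceeds the window.
    return max(0, available // tokens_per_frame)


CONTEXT_TOKENS = 32 * 1024      # "32K" read as 32,768 tokens (assumption)
TOKENS_PER_FRAME = 196          # hypothetical placeholder, not taken from the model card
RESERVED_TOKENS = 4096 + 256    # generated answer plus prompt text (hypothetical split)

frames = max_frames_in_context(CONTEXT_TOKENS, TOKENS_PER_FRAME, RESERVED_TOKENS)
print(f"Roughly {frames} frames fit under these assumptions")
```

With these placeholder numbers the budget comes out at 144 frames; a different per-frame token count changes the result proportionally.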

💻 Usage example

Basic usage

!pip install llavaction

from llavaction.model.builder import load_pretrained_model
from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llavaction.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")

#Your video (it assumes an egocentric view point)
video_path = "XXXX"

#These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame_time, video_time) for up to max_frames_num frames."""
    if max_frames_num == 0:
        # Keep the three-value return shape that the caller unpacks.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Stride that yields roughly `fps` frames per second (at least 1).
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or forced): sample max_frames_num indices uniformly.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamp in seconds of each sampled frame; defined on both branches.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device, dtype=torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
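
The frame selection and the time instruction in the script above can be tried without a GPU or a video file. The sketch below is a pure-Python approximation of that logic, assuming only a frame count and an average frame rate as inputs; `sample_frame_indices` and `build_time_instruction` are hypothetical helper names introduced here for illustration, and integer arithmetic stands in for `np.linspace(..., dtype=int)`.

```python
def sample_frame_indices(total_frames, avg_fps, max_frames, target_fps=1.0, force_sample=False):
    """Pick frame indices: fixed stride first, uniform spacing if too many."""
    if total_frames <= 0 or max_frames <= 0:
        return [], []
    # Stride that gives roughly target_fps frames per second (at least 1).
    step = max(1, round(avg_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames or force_sample:
        if max_frames == 1:
            indices = [0]
        else:
            span = total_frames - 1
            # Evenly spaced indices from the first to the last frame, truncated to int.
            indices = [i * span // (max_frames - 1) for i in range(max_frames)]
    # Timestamp in seconds for each selected frame.
    times = [idx / avg_fps for idx in indices]
    return indices, times


def build_time_instruction(video_seconds, num_frames):
    """Format the sentence that tells the model about duration and frame count."""
    return (f"The video lasts for {video_seconds:.2f} seconds, "
            f"and {num_frames} frames are uniformly sampled from it. ")


# A 10-second clip at 30 fps, forced down to 4 uniformly spaced frames.
indices, times = sample_frame_indices(300, 30.0, 4, force_sample=True)
print(indices)
print(build_time_instruction(300 / 30.0, len(indices)))
```

Forcing uniform sampling, as the script does with `force_sample=True`, always yields exactly `max_frames` frames, so the frame count reported in the prompt stays fixed regardless of the clip length.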

🔧 Technical details

Training details

For training specifics, see Ye et al. (2025): arxiv.org/abs/2503.18712

Model information

Attribute	Details
Model architecture	SO400M + Qwen2
Initialized from	lmms-lab/llava-onevision-qwen2-0.5b-ov
Training data	EPIC-KITCHENS-100-MQA, 2 epochs, full model
Precision	bfloat16
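
As a rough guide to what bfloat16 precision means for memory, the sketch below estimates the size of the weights alone. The parameter counts are nominal values read off the names "SO400M" and "0.5B", which is an assumption; the estimate leaves out the projector, activations, and the KV cache, so actual usage will be higher.

```python
BYTES_PER_DTYPE = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1024**3


language_params = 0.5e9  # nominal, inferred from "0.5B" (assumption)
vision_params = 0.4e9    # nominal, inferred from "SO400M" (assumption)
total_params = language_params + vision_params

print(f"bfloat16: {weight_memory_gib(total_params):.2f} GiB")
print(f"float32:  {weight_memory_gib(total_params, 'float32'):.2f} GiB")
```

Under these nominal counts the bfloat16 weights take about 1.7 GiB, half of what float32 would need.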

Hardware and software

GPUs: 32 × Nvidia GH-200 (used to train the whole model series)
Orchestration: HuggingFace Trainer
Neural network framework: PyTorch

📄 License

This project is released under the CC-BY-NC-SA 4.0 license.

📚 Citation

arXiv: arxiv.org/abs/2503.18712

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint arXiv:2503.18712},
  year={2025}
}