LLaVAction-7B: Open-Source Action Recognition Model with Egocentric Video Understanding

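LLaVAction 7B

Developed by MLAdaptiveIntelligence

LLaVAction is a framework for evaluating and training multimodal large language models for action recognition. It is built on the Qwen2 language model architecture and supports egocentric (first-person) video understanding.

Video-to-Text

Transformers

English #Egocentric action understanding #64-frame long-video processing #Multimodal question answering

Downloads: 149

Released: 3/24/2025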

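Model Overview

LLaVAction-7B focuses on understanding human actions in egocentric (first-person) video. It accepts up to 64 video frames as input and reports competitive accuracy on several video-understanding benchmarks.

Model Features

Egocentric understanding

Optimized for first-person video, so it can interpret actions and interactions as seen from the camera wearer's viewpoint.

Long-video processing

Accepts up to 64 video frames per input, which lets it cover longer video content.

Multimodal fusion

Combines visual and language information for video understanding and question answering.

Benchmark performance

Reports competitive results on several video-understanding benchmarks, for example EgoSchema (59%) and MVBench (61.1%).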

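Model Capabilities

Video content understanding

Action recognition

Multimodal question answering

Long-video analysis

Egocentric understanding

Use Cases

Smart home

Kitchen activity analysis

Analyzes a user's cooking activities in the kitchen.

Recognizes actions such as chopping vegetables and cooking.

Behavioral research

Daily activity analysis

Studies patterns in everyday human activity.

Recognizes and classifies a variety of daily activities.

Assistive technology

Action guidance

Provides action guidance for users with special needs.

Understands specific actions and guides users through them.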

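🚀 LLaVAction-7B

LLaVAction-7B is a multimodal large language model for action recognition. It is trained from the Qwen2 language model, processes up to 64 video frames, and reaches 58.6% to 82.8% accuracy across the seven benchmark settings reported in the evaluation table.

🚀 Quick Start

LLaVAction-7B is based on the Qwen2 language model and trained on the EPIC-KITCHENS-100-MQA dataset. It has a context window of 32K tokens and supports up to 64 video frames.

Project page: https://mmathislab.github.io/llavaction/
Paper: for more details, see our paper (arxiv.org/abs/2503.18712)
Code repository: https://github.com/AdaptiveMotorControlLab/LLaVAction
Contact: Mackenzie Mathis
Supported language: English

✨ Key Features

Based on the Qwen2 language model, with a 32K-token context window.
Processes up to 64 video frames.
Evaluated on multiple multimodal datasets, with accuracy between 58.6% and 82.8%.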
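
As a rough sanity check on how the two limits above interact, the arithmetic below computes the largest per-frame token count that would still let 64 frames fit into the context window. It assumes that "32K" means 32,768 tokens and, in the second call, reserves a hypothetical 1,024 tokens for the text prompt and the answer. The source does not state how many tokens each frame actually consumes, so the result is only an upper bound under these assumptions, not a property of the model.

```python
CONTEXT_WINDOW = 32 * 1024  # assumption: "32K" is read as 32,768 tokens
MAX_FRAMES = 64             # frame limit stated in the model card


def max_tokens_per_frame(context_window, num_frames, reserved_text_tokens=0):
    """Largest whole number of tokens per frame that fits in the window.

    reserved_text_tokens is a hypothetical allowance for the prompt and answer.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    available = context_window - reserved_text_tokens
    if available < 0:
        raise ValueError("reserved_text_tokens exceeds the context window")
    return available // num_frames


if __name__ == "__main__":
    print(max_tokens_per_frame(CONTEXT_WINDOW, MAX_FRAMES))        # 512
    print(max_tokens_per_frame(CONTEXT_WINDOW, MAX_FRAMES, 1024))  # 496
```

If the real per-frame token count of the vision encoder is known, comparing it against this bound shows how much room is left for text.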

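📦 Installation

Install the llavaction library before use (the leading ! runs the command from a notebook cell; omit it in a regular shell):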

!pip install llavaction
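
Before running the usage example, it can help to confirm that the packages it relies on are importable. The sketch below uses only the standard library. The module list (llavaction, decord, numpy, torch) is an assumption inferred from the identifiers used in the usage example, and the helper missing_modules is hypothetical, not part of llavaction.

```python
import importlib.util

# Assumed top-level module names, inferred from the usage example.
REQUIRED_MODULES = ["llavaction", "decord", "numpy", "torch"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Note that a module name is not always the same as the name of the pip package that provides it, so a missing module may need a differently named package.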

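💻 Usage Examples

Basic usage

# Imports used by the example below. The llavaction module paths are assumed to
# mirror the LLaVA-NeXT package layout; adjust them if your installed version differs.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu

from llavaction.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llavaction.conversation import conv_templates
from llavaction.mm_utils import tokenizer_image_token
from llavaction.model.builder import load_pretrained_model

# Path to your video (the model assumes an egocentric viewpoint)
video_path = "XXXX"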

#These are the prompts we trained with, but you can test others:
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."

def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame timestamps in seconds, video duration in seconds)."""
    if max_frames_num == 0:
        # Return three values here too, because callers unpack three.
        return np.zeros((1, 336, 336, 3)), [], 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Step between kept frames so that roughly `fps` frames per second are sampled.
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Over budget (or sampling forced): take max_frames_num uniformly spaced indices.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamps of the selected frames, defined on every code path.
    frame_time = [i / avg_fps for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "MLAdaptiveIntelligence/LLaVAction-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
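
The frame-selection step inside load_video can be studied without a video file. The sketch below reproduces only the index arithmetic with NumPy: take every k-th frame to approximate a target sampling rate, then fall back to uniformly spaced indices when that yields more frames than the budget allows. The function name sample_frame_indices and its parameters are illustrative and not part of llavaction.

```python
import numpy as np


def sample_frame_indices(total_frames, native_fps, target_fps=1.0,
                         max_frames=64, force_sample=False):
    """Return (frame indices, timestamps in seconds) for a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return [], []
    # Keep every `step`-th frame to approximate target_fps.
    step = max(1, round(native_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames or force_sample:
        # Over budget (or forced): spread max_frames indices uniformly over the clip.
        indices = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    times = [i / native_fps for i in indices]
    return indices, times


if __name__ == "__main__":
    # 10-second clip at 30 fps: 10 candidate frames, under the 64-frame budget.
    idx, times = sample_frame_indices(300, 30.0)
    print(len(idx), idx[:3], times[:3])

    # 5-minute clip at 30 fps: 300 candidates exceed the budget, so 64 are sampled.
    idx, _ = sample_frame_indices(9000, 30.0)
    print(len(idx), idx[0], idx[-1])
```

With force_sample=True on a clip shorter than max_frames, the uniform spacing repeats some indices, so the frame count always equals max_frames in that mode.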

📚 Documentation

Model

Architecture: SO400M + Qwen2
Initialized from: lmms-lab/LLaVA-Video-7B-Qwen2
Data: a mixture of the LLaVA-178K and EPIC-KITCHENS-100-MQA datasets, 2 epochs, full-model training
Precision: bfloat16
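
Because bfloat16 stores each parameter in 2 bytes, a back-of-the-envelope estimate of the memory needed just to hold the weights is straightforward. The sketch assumes roughly 7 billion parameters, taken from the "7B" in the model name. The exact parameter count, and the additional memory for activations, the KV cache, and video frames, are not given in the source.

```python
# Bytes used per parameter for common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Memory in GiB needed to store num_params weights in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    approx_params = 7e9  # assumption: "7B" read as about 7 billion parameters
    print(f"bfloat16: {weight_memory_gib(approx_params):.1f} GiB")             # 13.0 GiB
    print(f"float32:  {weight_memory_gib(approx_params, 'float32'):.1f} GiB")  # 26.1 GiB
```

This is a lower bound on GPU memory for inference, since it counts the weights only.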

Hardware and Software

GPUs: 32 × Nvidia GH-200 (used to train the full model series)
Orchestration: HuggingFace Trainer
Neural network framework: PyTorch

Evaluation Metrics

Dataset	Accuracy (%)
EgoSchema	59
MVBench	61.1
NextQA	82.8
PercepTest	70.2
LongVideoBench	58.6
VideoMME	63.9
VideoMME (w-subs)	71.4

🔧 Technical Details

For technical details on LLaVAction-7B, see Ye et al. (2025): arxiv.org/abs/2503.18712.

📄 License

This project is released under the CC-BY-NC-SA-4.0 license.

📚 Citation

@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint},
  year={2025}
}