Video-LLaVA-7B-hf: Open-Source Multimodal Model for Interleaved Image and Video Understanding

Video LLaVA 7B Hf

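Developed by LanguageBind

Video-LLaVA is an open-source multimodal model trained by fine-tuning a large language model on multimodal instruction-following data. It accepts interleaved images and videos in a prompt and generates text.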

Video-Text-to-Text

Transformers

#multimodal video understanding #interleaved image and video #unified visual representation

Downloads: 13.24k

Release date: 5/9/2024

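Model Overview

Video-LLaVA is an autoregressive language model based on the Transformer architecture. It takes images and videos as multimodal input and generates text output.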

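Model Features

Multimodal processing

Handles interleaved images and videos, even though the training data contains no image-video pairs.

Unified visual representation

Image and video features are aligned by their encoders before projection, which gives the language model a single, unified visual representation.

Strong performance

The authors report a clear advantage over models designed specifically for images or specifically for videos.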

Model Capabilities

Image understanding

Video understanding

Multimodal instruction following

Text generation

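Use Cases

Content understanding

Video content analysis

Analyze a video and answer questions about it.

Example: 'Why is this video funny?'

Image content analysis

Analyze an image and answer questions about it.

Example: 'How many cats are there in the image?'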
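
Questions like these are passed to the model in the plain-text chat format used by the quick-start code later in this card (`USER: <video>... ASSISTANT:`). The sketch below builds such prompts and strips the prompt back off a decoded generation. The helper names `build_prompt` and `extract_answer` are hypothetical and not part of `transformers`, and the template is inferred from the quick-start strings, so treat it as an assumption.

```python
# Hypothetical helpers; the prompt template is inferred from the quick-start
# examples in this card, not taken from an official API.
PLACEHOLDERS = {"image": "<image>", "video": "<video>"}


def build_prompt(question: str, modality: str = "video") -> str:
    """Return a single-turn prompt with one visual placeholder."""
    if modality not in PLACEHOLDERS:
        raise ValueError(f"modality must be one of {sorted(PLACEHOLDERS)}")
    return f"USER: {PLACEHOLDERS[modality]}{question.strip()} ASSISTANT:"


def extract_answer(decoded: str) -> str:
    """Return the text after the last 'ASSISTANT:' marker in decoded output."""
    marker = "ASSISTANT:"
    _, found, answer = decoded.rpartition(marker)
    # If the marker is missing, fall back to the whole string.
    return answer.strip() if found else decoded.strip()


video_prompt = build_prompt("Why is this video funny?", modality="video")
image_prompt = build_prompt("How many cats are there in the image?", modality="image")
print(video_prompt)
print(image_prompt)
print(extract_answer("USER:  Why is this video funny? ASSISTANT: The baby is reading a book."))
```

Splitting on the last `ASSISTANT:` marker keeps the helper usable if a question itself happens to contain that word earlier in the prompt.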

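🚀 About Video-LLaVA

Video-LLaVA is an open-source multimodal model trained by fine-tuning a large language model (LLM) on multimodal instruction-following data. It is an autoregressive language model based on the Transformer architecture that handles interleaved images and videos within a single prompt.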

🚀 Quick Start

Use the following code to get started with the model:

from PIL import Image
import requests
import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# Uniformly sample 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
# Example output:
# 'USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on the bed and reading a book, which is an unusual and amusing sight.Ъ'

# Generate from a mixed batch of one image prompt and one video prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image> How many cats are there in the image? ASSISTANT:",
    "USER: <video>Why is this video funny? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=50)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
# Example output:
# ['USER:   How many cats are there in the image? ASSISTANT: There are two cats in the image.\nHow many cats are sleeping on the couch?\nThere are', 'USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on the bed and reading a book, which is an unusual and amusing']
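
The quick-start code samples eight frames with `np.arange(0, total_frames, total_frames / 8)`. A minimal sketch of the same uniform sampling as a reusable function follows. `sample_frame_indices` is a hypothetical helper name, and the guard against a non-positive frame count is an assumption added here because a container's stream metadata may not always report a usable frame count.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> np.ndarray:
    """Return num_frames evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive; count frames by decoding instead")
    # endpoint=False keeps every index strictly below total_frames.
    return np.linspace(0, total_frames, num_frames, endpoint=False).astype(int)


indices = sample_frame_indices(80)
print(indices.tolist())  # [0, 10, 20, 30, 40, 50, 60, 70]
```

For a clip with fewer frames than `num_frames`, the returned indices repeat, so `read_video_pyav` as written above would return fewer than `num_frames` frames.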

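✨ Key Features

Multimodal processing: the model handles interleaved images and videos, even though the training data contains no image-video pairs.
Unified visual representation: image and video features are aligned by their encoders before projection, which yields a single, unified visual representation.
Strong performance: extensive experiments show that the two modalities complement each other, with a clear advantage over models designed specifically for images or specifically for videos.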
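
To make "alignment before projection" more concrete, here is a toy NumPy sketch of the idea as this card describes it: image and video features that are assumed to already live in one aligned feature space pass through a single projection and are then placed in one token sequence. This is an illustration only, not the Video-LLaVA implementation. Every dimension, the random weights, the shared projection matrix, and the function name `project_visual_tokens` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for readability (not the real model's dimensions).
VISION_DIM, LLM_DIM = 16, 32
TOKENS_PER_FRAME, NUM_FRAMES = 4, 8

# One projection matrix shared by both modalities (assumption of this sketch).
shared_projection = rng.normal(size=(VISION_DIM, LLM_DIM))


def project_visual_tokens(features: np.ndarray) -> np.ndarray:
    """Flatten (..., tokens, VISION_DIM) features and project them to LLM_DIM."""
    flat = features.reshape(-1, VISION_DIM)
    return flat @ shared_projection


# Stand-ins for encoder outputs that are assumed to be pre-aligned.
image_features = rng.normal(size=(TOKENS_PER_FRAME, VISION_DIM))
video_features = rng.normal(size=(NUM_FRAMES, TOKENS_PER_FRAME, VISION_DIM))

image_tokens = project_visual_tokens(image_features)
video_tokens = project_visual_tokens(video_features)

# Both modalities end up in the same embedding space and can share a sequence.
sequence = np.concatenate([image_tokens, video_tokens], axis=0)
print(image_tokens.shape, video_tokens.shape, sequence.shape)
```

Because both modalities go through the same matrix in this sketch, nothing downstream needs to know whether a token came from an image or a video frame.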

📦 Model Details

Property	Details
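Model type	Video-LLaVA is an open-source multimodal model trained by fine-tuning a large language model (LLM) on multimodal instruction-following data. It is an autoregressive language model based on the Transformer architecture. Base LLM: lmsys/vicuna-7b-v1.5.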
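Model description	The model handles interleaved images and videos, even though the training data contains no image-video pairs. Video-LLaVA uses encoders that are trained for a unified visual representation through alignment before projection. Extensive experiments show that the modalities complement each other, giving the model a clear advantage over models designed specifically for images or specifically for videos.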

Video-LLaVA example, taken from the original paper.

Paper or resources for more information: https://github.com/PKU-YuanGroup/Video-LLaVA

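🗝️ Training Datasets

The image pretraining dataset is from LLaVA.
The image fine-tuning dataset is from LLaVA.
The video pretraining dataset is from Valley.
The video fine-tuning dataset is from Video-ChatGPT.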

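👍 Acknowledgements

LLaVA: the codebase we built upon, an efficient large language and vision assistant.
Video-ChatGPT: a major contribution to the evaluation code and dataset.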

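📄 License

The majority of this project is released under the Apache 2.0 license, as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to the LLaMA license, the terms of use for data generated by OpenAI, and the ShareGPT privacy policy. Please contact us if you find any potential violation.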

📚 Citation

If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil:.

@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}

@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}