License: MIT
Datasets:
- TIGER-Lab/VideoFeedback
Language:
- English
Metrics:
- accuracy/SPCC
Library name: transformers
Pipeline tag: video-text-to-text
📃Paper | 🌐Website | 💻Github | 🛢️Datasets | 🤗Model (VideoScore) | 🤗Demo

Introduction
Evaluation Results
We evaluate VideoScore-v1.1 on VideoFeedback-test, using the Spearman correlation between model outputs and human scores as the metric, averaged over all evaluation dimensions.
The evaluation results are shown below:
| Metric           | VideoFeedback-test |
|------------------|--------------------|
| VideoScore-v1.1  | **74.0**           |
| Gemini-1.5-Pro   | 22.1               |
| Gemini-1.5-Flash | 20.8               |
| GPT-4o           | <u>23.1</u>        |
| CLIP-sim         | 8.9                |
| DINO-sim         | 7.5                |
| SSIM-sim         | 13.4               |
| CLIP-Score       | -7.2               |
| LLaVA-1.5-7B     | 8.5                |
| LLaVA-1.6-7B     | -3.1               |
| X-CLIP-Score     | -1.9               |
| PIQE             | -10.1              |
| BRISQUE          | -20.3              |
| Idefics2         | 6.5                |
| MSE-dyn          | -5.5               |
| SSIM-dyn         | -12.9              |
The best result from the VideoScore series is shown in bold, and the best result among the baselines is underlined.
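The reported numbers are Spearman correlations averaged over the five evaluation dimensions. The following is a minimal sketch of how such a number could be computed; it is illustrative rather than the official benchmark script, and it assumes model_scores and human_scores are arrays of shape (num_videos, 5):

import numpy as np
from scipy.stats import spearmanr

def avg_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    # One Spearman correlation per evaluation dimension, then the mean across dimensions.
    per_dim = [
        spearmanr(model_scores[:, d], human_scores[:, d]).correlation
        for d in range(model_scores.shape[1])
    ]
    return float(np.mean(per_dim))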
Usage
Installation
pip install git+https://github.com/TIGER-AI-Lab/VideoScore.git
# or
# pip install mantis-vl
Inference
cd VideoScore/examples
import av
import numpy as np
from typing import List
from PIL import Image
import torch
from transformers import AutoProcessor
from mantis.models.idefics2 import Idefics2ForSequenceClassification

def _read_video_pyav(
    container,
    indices,
):
    # Decode the video with the PyAV decoder and keep only the sampled frame indices.
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
ROUND_DIGIT = 3
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency: the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree: the degree of dynamic changes
(4) text-to-video alignment: the alignment between the text prompt and the video content
(5) factual consistency: the consistency of the video content with common-sense and factual knowledge

For each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video).
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of the video are as follows:
"""
MAX_NUM_FRAMES=48
model_name="TIGER-Lab/VideoScore-v1.1"
video_path="video1.mp4"
video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."
# Load the processor and the VideoScore regression model.
processor = AutoProcessor.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = Idefics2ForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Uniformly sample at most MAX_NUM_FRAMES frames from the video.
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)
frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]

# Build the evaluation prompt and append one <image> placeholder per frame.
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)

flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]

inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# The model outputs one regression logit per evaluation dimension.
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)
"""
模型在视觉质量、时间一致性、动态程度、
文本到视频对齐、事实一致性上的输出分别如下
VideoScore:
[2.297, 2.469, 2.906, 2.766, 2.516]
VideoScore-v1.1:
[2.328, 2.484, 2.562, 1.969, 2.594]
"""
Training
See VideoScore/training for details.
Evaluation
See VideoScore/benchmark for details.
Citation
@article{he2024videoscore,
  title   = {VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation},
  author  = {He, Xuan and Jiang, Dongfu and Zhang, Ge and Ku, Max and Soni, Achint and Siu, Sherman and Chen, Haonan and Chandra, Abhranil and Jiang, Ziyan and Arulraj, Aaran and Wang, Kai and Do, Quy Duc and Ni, Yuansheng and Lyu, Bohan and Narsupalli, Yaswanth and Fan, Rongqi and Lyu, Zhiheng and Lin, Yuchen and Chen, Wenhu},
  journal = {ArXiv},
  year    = {2024},
  volume  = {abs/2406.15252},
  url     = {https://arxiv.org/abs/2406.15252},
}