License: MIT
Datasets:
- TIGER-Lab/VideoFeedback
Language:
- English
Metrics:
- accuracy/SPCC
Library name: transformers
Pipeline tag: video-text-to-text
📃Paper | 🌐Website | 💻Github | 🛢️Datasets | 🤗Model (VideoScore) | 🤗Demo

Introduction
Evaluation Results
We evaluate VideoScore-v1.1 on VideoFeedback-test, using the Spearman correlation between model outputs and human scores as the metric, averaged over all evaluation dimensions.
The evaluation results are shown below:
| Metric           | VideoFeedback-test |
|------------------|--------------------|
| VideoScore-v1.1  | **74.0**           |
| Gemini-1.5-Pro   | 22.1               |
| Gemini-1.5-Flash | 20.8               |
| GPT-4o           | <u>23.1</u>        |
| CLIP-sim         | 8.9                |
| DINO-sim         | 7.5                |
| SSIM-sim         | 13.4               |
| CLIP-Score       | -7.2               |
| LLaVA-1.5-7B     | 8.5                |
| LLaVA-1.6-7B     | -3.1               |
| X-CLIP-Score     | -1.9               |
| PIQE             | -10.1              |
| BRISQUE          | -20.3              |
| Idefics2         | 6.5                |
| MSE-dyn          | -5.5               |
| SSIM-dyn         | -12.9              |
The best result from the VideoScore series is shown in bold, and the best result among the baselines is underlined.
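The reported numbers are Spearman correlations averaged over the five evaluation dimensions. The following is a minimal sketch of how such a number could be computed; it is illustrative rather than the official benchmark script, and it assumes model_scores and human_scores are arrays of shape (num_videos, 5):

import numpy as np
from scipy.stats import spearmanr

def avg_spearman(model_scores: np.ndarray, human_scores: np.ndarray) -> float:
    # One Spearman correlation per evaluation dimension, then the mean across dimensions.
    per_dim = [
        spearmanr(model_scores[:, d], human_scores[:, d]).correlation
        for d in range(model_scores.shape[1])
    ]
    return float(np.mean(per_dim))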
Usage
Installation
pip install git+https://github.com/TIGER-AI-Lab/VideoScore.git
# or
# pip install mantis-vl
Inference
cd VideoScore/examples
import av
import numpy as np
from typing import List
from PIL import Image
import torch
from transformers import AutoProcessor
from mantis.models.idefics2 import Idefics2ForSequenceClassification

def _read_video_pyav(
    container,
    indices,
):
    # Decode the video with the PyAV decoder and keep only the sampled frame indices.
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
ROUND_DIGIT = 3
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency: the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree: the degree of dynamic changes
(4) text-to-video alignment: the alignment between the text prompt and the video content
(5) factual consistency: the consistency of the video content with common-sense and factual knowledge

For each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score,
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video).
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8

For this video, the text prompt is "{text_prompt}",
all the frames of the video are as follows:
"""
MAX_NUM_FRAMES=48
model_name="TIGER-Lab/VideoScore-v1.1"
video_path="video1.mp4"
video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."
# Load the processor and the VideoScore regression model.
processor = AutoProcessor.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = Idefics2ForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Uniformly sample at most MAX_NUM_FRAMES frames from the video.
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)
frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]

# Build the evaluation prompt and append one <image> placeholder per frame.
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)

flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)
flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]

inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# The model outputs one regression logit per evaluation dimension.
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
num_aspects = logits.shape[-1]

aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(), ROUND_DIGIT))
print(aspect_scores)
"""
模型在视觉质量、时间一致性、动态程度、
文本到视频对齐、事实一致性上的输出分别如下
VideoScore:
[2.297, 2.469, 2.906, 2.766, 2.516]
VideoScore-v1.1:
[2.328, 2.484, 2.562, 1.969, 2.594]
"""
Training
See VideoScore/training for details.
Evaluation
See VideoScore/benchmark for details.
Citation
@article{he2024videoscore,
  title   = {VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation},
  author  = {He, Xuan and Jiang, Dongfu and Zhang, Ge and Ku, Max and Soni, Achint and Siu, Sherman and Chen, Haonan and Chandra, Abhranil and Jiang, Ziyan and Arulraj, Aaran and Wang, Kai and Do, Quy Duc and Ni, Yuansheng and Lyu, Bohan and Narsupalli, Yaswanth and Fan, Rongqi and Lyu, Zhiheng and Lin, Yuchen and Chen, Wenhu},
  journal = {ArXiv},
  year    = {2024},
  volume  = {abs/2406.15252},
  url     = {https://arxiv.org/abs/2406.15252},
}