🚀 EMOVA-Qwen-2.5-7B-HF
EMOVA (Emotionally Omni-modal Voice Assistant) is a novel end-to-end omni-modal large language model that can see, hear, and speak without relying on external models. Given omni-modal inputs (text, visual, and speech), it generates both text and speech responses with vivid emotional control by combining a speech decoder with a style encoder. The model offers general omni-modal understanding and generation capabilities, excelling at advanced vision-language understanding, emotional spoken dialogue, and spoken dialogue with structural data understanding.
🚀 Quick Start

This repo contains the EMOVA-Qwen2.5-7B checkpoint organized in the HuggingFace format, so it can be loaded directly with the transformers Auto APIs.
```python
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Prepare the model and processor
model = AutoModel.from_pretrained(
    "Emova-ollm/emova-qwen-2-5-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained("Emova-ollm/emova-qwen-2-5-7b-hf", trust_remote_code=True)

# The speech tokenizer is only needed for speech inputs/outputs
speech_tokenizer = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()
processor.set_speech_tokenizer(speech_tokenizer)

# Prepare inputs: Examples 1-3 below show different input combinations.
# Keep only the one you need, since each assignment overwrites `inputs`.

# Example 1: image-text inputs
inputs = dict(
    text=[
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What's shown in this image?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "This image shows a red stop sign."}]},
        {"role": "user", "content": [{"type": "text", "text": "Describe the image in more details."}]},
    ],
    images=Image.open('path/to/image')
)

# Example 2: audio inputs
inputs = dict(
    text=[{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}],
    audios='path/to/audio'
)

# Example 3: image-audio inputs
inputs = dict(
    text=[{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}],
    images=Image.open('path/to/image'),
    audios='path/to/audio'
)

# Run the processor and move tensors to the model device
has_speech = 'audios' in inputs.keys()
inputs = processor(**inputs, return_tensors="pt")
inputs = inputs.to(model.device)

# Generation arguments; speech_kwargs is only used when speech outputs are expected
gen_kwargs = {"max_new_tokens": 4096, "do_sample": False}
speech_kwargs = {"speaker": "female", "output_wav_prefix": "output"} if has_speech else {}

# Run generation, drop the prompt tokens, and decode the responses
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(processor.batch_decode(outputs, skip_special_tokens=True, **speech_kwargs))
```
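For convenience, the snippet above can be wrapped into a small helper. The sketch below reuses only the calls shown in the quick start and assumes `model` and `processor` (with the speech tokenizer mounted) have already been loaded as above; the `emova_chat` name and its signature are hypothetical and not part of the released API.

```python
# Minimal convenience wrapper around the quick-start snippet.
# `emova_chat` is a hypothetical helper, not part of the released API.
import torch

def emova_chat(model, processor, text, images=None, audios=None,
               speaker="female", output_wav_prefix="output"):
    """Run one omni-modal turn and return the decoded responses."""
    raw_inputs = {"text": text}
    if images is not None:
        raw_inputs["images"] = images
    if audios is not None:
        raw_inputs["audios"] = audios
    has_speech = "audios" in raw_inputs

    # Tokenize/encode all modalities and move them to the model device
    inputs = processor(**raw_inputs, return_tensors="pt").to(model.device)
    gen_kwargs = {"max_new_tokens": 4096, "do_sample": False}
    speech_kwargs = ({"speaker": speaker, "output_wav_prefix": output_wav_prefix}
                     if has_speech else {})

    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]  # drop prompt tokens
    return processor.batch_decode(outputs, skip_special_tokens=True, **speech_kwargs)
```

A call such as `emova_chat(model, processor, text=[...], audios='path/to/audio')` would then cover the spoken-dialogue case from Example 2 above.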
✨ Key Features

- State-of-the-art omni-modality performance: EMOVA achieves state-of-the-art comparable results on both vision-language and speech benchmarks simultaneously. Its best-performing model, EMOVA-72B, even surpasses commercial models including GPT-4o and Gemini Pro 1.5.
- Emotional spoken dialogue: a semantic-acoustic disentangled speech tokenizer and a lightweight style control module enable seamless omni-modal alignment and diverse speech style controllability. EMOVA supports bilingual (Chinese and English) spoken dialogue with 24 speech style controls (i.e., 2 speakers, 3 pitches, and 4 emotions); see the sketch after this list.
- Diverse configurations: 3 configurations, EMOVA-3B/7B/72B, are open-sourced to support omni-modal usage under different computational budgets. Check the Model Zoo to find the best fit for your computational device!
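As a reference for the 24 speech styles above, the sketch below simply enumerates the 2 × 3 × 4 speaker/pitch/emotion combinations and shows how they might be plugged into the `speech_kwargs` dictionary from the quick start. Only the `speaker` and `output_wav_prefix` keys appear in the official snippet; the `pitch` and `emotion` keys and the concrete style names are hypothetical placeholders, so check the project's demo code for the actual control interface.

```python
# Enumerate the 24 advertised speech styles (2 speakers x 3 pitches x 4 emotions).
# The "pitch"/"emotion" keys and the style names are ASSUMPTIONS for illustration;
# only "speaker" and "output_wav_prefix" are shown in the quick start.
from itertools import product

speakers = ["female", "male"]                    # 2 speakers ("female" is documented)
pitches = ["normal", "high", "low"]              # 3 pitches (hypothetical names)
emotions = ["neutral", "happy", "sad", "angry"]  # 4 emotions (hypothetical names)

for speaker, pitch, emotion in product(speakers, pitches, emotions):
    speech_kwargs = {
        "speaker": speaker,
        "pitch": pitch,        # hypothetical key
        "emotion": emotion,    # hypothetical key
        "output_wav_prefix": f"output_{speaker}_{pitch}_{emotion}",
    }
    # processor.batch_decode(outputs, skip_special_tokens=True, **speech_kwargs)
    print(speech_kwargs)
```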
📚 Documentation
Model Information

| Attribute | Details |
|-----------|---------|
| Library name | transformers |
| Tags | Omni-modal LLM, Multi-modal LLM, Emotional spoken dialogue |
| License | apache-2.0 |
| Datasets | Emova-ollm/emova-alignment-7m, Emova-ollm/emova-sft-4m, Emova-ollm/emova-sft-speech-231k |
| Languages | English, Chinese |
| Base models | Emova-ollm/qwen2vit600m, Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip |
Model Performance

| Benchmarks | EMOVA-3B | EMOVA-7B | EMOVA-72B | GPT-4o | VITA 8x7B | VITA 1.5 | Baichuan-Omni |
|---|---|---|---|---|---|---|---|
| MME | 2175 | 2317 | 2402 | 2310 | 2097 | 2311 | 2187 |
| MMBench | 79.2 | 83.0 | 86.4 | 83.4 | 71.8 | 76.6 | 76.2 |
| SEED-Image | 74.9 | 75.5 | 76.6 | 77.1 | 72.6 | 74.2 | 74.1 |
| MM-Vet | 57.3 | 59.4 | 64.8 | - | 41.6 | 51.1 | 65.4 |
| RealWorldQA | 62.6 | 67.5 | 71.0 | 75.4 | 59.0 | 66.8 | 62.6 |
| TextVQA | 77.2 | 78.0 | 81.4 | - | 71.8 | 74.9 | 74.3 |
| ChartQA | 81.5 | 84.9 | 88.7 | 85.7 | 76.6 | 79.6 | 79.6 |
| DocVQA | 93.5 | 94.2 | 95.9 | 92.8 | - | - | - |
| InfoVQA | 71.2 | 75.1 | 83.2 | - | - | - | - |
| OCRBench | 803 | 814 | 843 | 736 | 678 | 752 | 700 |
| ScienceQA-Img | 92.7 | 96.4 | 98.2 | - | - | - | - |
| AI2D | 78.6 | 81.7 | 85.8 | 84.6 | 73.1 | 79.3 | - |
| MathVista | 62.6 | 65.5 | 69.9 | 63.8 | 44.9 | 66.2 | 51.9 |
| MathVerse | 31.4 | 40.9 | 50.0 | - | - | - | - |
| LibriSpeech (WER↓) | 5.4 | 4.1 | 2.9 | - | 3.4 | 8.1 | - |
Model Index

- Name: emova-qwen-2-5-7b-hf
- Results:
  - Task type: Multimodal
    - Dataset: AI2D, type: ai2d, metric: accuracy 81.7%
    - Dataset: ChartQA, type: chartqa, metric: accuracy 84.9%
    - Dataset: DocVQA, type: docvqa, metric: accuracy 94.2%
    - Dataset: InfoVQA, type: infovqa, metric: accuracy 75.1%
    - Dataset: MathVerse, type: mathverse, metric: accuracy 40.9%
    - Dataset: MathVista, type: mathvista, metric: accuracy 65.5%
    - Dataset: MMBench, type: mmbench, metric: accuracy 83%
    - Dataset: MME, type: mme, metric: score 2317
    - Dataset: MMVet, type: mmvet, metric: accuracy 59.4%
    - Dataset: OCRBench, type: ocrbench, metric: accuracy 814
    - Dataset: RealWorldQA, type: realworldqa, metric: accuracy 67.5%
    - Dataset: Seed-Bench-Image, type: seed-bench-image, metric: accuracy 75.5%
    - Dataset: Science-QA, type: science-qa, metric: accuracy 96.4%
    - Dataset: TextVQA, type: textvqa, metric: accuracy 78%
  - Task name: Automatic Speech Recognition, type: automatic-speech-recognition
    - Dataset: LibriSpeech (clean), type: librispeech_asr, config: clean, split: test, args: language en, metric: Test WER 4.1%
📄 License
This project is released under the apache-2.0 license.
📖 Citation
```bibtex
@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}
```
Project Links