---
license: other
pipeline_tag: visual-question-answering
---
# InternLM-XComposer-2.5-OL

InternLM-XComposer2.5-OL, a comprehensive multimodal system for long-term streaming video and audio interactions.
## Import from Transformers

Use the following code to load the base LLM with Transformers:
```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# `model_dir='base'` selects the base LLM weights inside the repository
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
    torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer
```
Use the following code to load the base audio model with MS-Swift:
```python
import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

# `model_dir='audio'` selects the audio model weights inside the repository
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_dir='audio', model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
```
## Quickstart

The examples below show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers. Please refer to the full guide for more details.

### Audio Understanding
```python
import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path,
                                       model_dir='audio', model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# The <audio> tag marks where the audio clip is injected into the prompt
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
```
### Image Understanding
```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
    torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detailed manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
```
## Citation

If you find InternLM-XComposer-2.5-OL helpful in your research or applications, please cite it using the following BibTeX:
```bibtex
@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
      title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions},
      author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2024},
      eprint={2412.09596},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09596},
}
```
## License & Contact

The code is licensed under Apache-2.0, while the model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English/Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.