---
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
tags:
- chat
- cogvlm2
inference: false
---
# CogVLM2

👋 WeChat · 💡 Online Demo · 🎈 GitHub · 📑 Paper

📍 Experience larger-scale CogVLM models on the ZhipuAI Open Platform.

## Model Introduction
We have launched the new generation of CogVLM2 models and open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series brings the following improvements:

- Significant gains on multiple benchmarks such as TextVQA and DocVQA.
- Support for an 8K context length.
- Support for image input at resolutions up to 1344 * 1344.
- An open-source variant that supports both Chinese and English.
The table below lists the details of the CogVLM2 open-source models:

| Model Name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B |
|---|---|---|
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English |
| Model Size | 19B | 19B |
| Task | Image understanding, dialogue model | Image understanding, dialogue model |
| Text Length | 8K | 8K |
| Image Resolution | 1344 * 1344 | 1344 * 1344 |
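As a rough guide to hardware requirements (not an official figure from this card), weight memory for the 19B parameters can be estimated from the parameter count and the bytes per parameter; activations, the KV cache for the 8K context, and vision features add further overhead. A minimal sketch of that back-of-envelope calculation:

```python
# Hypothetical helper: rough weight-only memory estimate for a 19B-parameter model.
# Ignores activations, the KV cache, and vision features, so treat it as a lower bound.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1024 ** 3

params = 19e9
print(f"bf16/fp16 weights: ~{weight_memory_gb(params, 2):.0f} GiB")    # ~35 GiB
print(f"int4 weights:      ~{weight_memory_gb(params, 0.5):.0f} GiB")  # ~9 GiB
```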
## Benchmarks

Our open-source models achieve strong results on multiple leaderboards compared with the previous generation of open-source CogVLM models, and their performance rivals that of some closed-source models, as shown in the table below:
| Model | Open Source | LLM Size | TextVQA | DocVQA | ChartQA | OCRbench | VCR_EASY | VCR_HARD | MMMU | MMVet | MMBench |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CogVLM1.1 | ✅ | 7B | 69.7 | - | 68.3 | 590 | 73.9 | 34.6 | 37.3 | 52.0 | 65.8 |
| LLaVA-1.5 | ✅ | 13B | 61.3 | - | - | 337 | - | - | 37.0 | 35.4 | 67.7 |
| Mini-Gemini | ✅ | 34B | 74.1 | - | - | - | - | - | 48.0 | 59.3 | 80.6 |
| LLaVA-NeXT-LLaMA3 | ✅ | 8B | - | 78.2 | 69.5 | - | - | - | 41.7 | - | 72.1 |
| LLaVA-NeXT-110B | ✅ | 110B | - | 85.7 | 79.7 | - | - | - | 49.1 | - | 80.5 |
| InternVL-1.5 | ✅ | 20B | 80.6 | 90.9 | 83.8 | 720 | 14.7 | 2.0 | 46.8 | 55.4 | 82.3 |
| QwenVL-Plus | ❌ | - | 78.9 | 91.4 | 78.1 | 726 | - | - | 51.4 | 55.7 | 67.0 |
| Claude3-Opus | ❌ | - | - | 89.3 | 80.8 | 694 | 63.85 | 37.8 | 59.4 | 51.7 | 63.3 |
| Gemini Pro 1.5 | ❌ | - | 73.5 | 86.5 | 81.3 | - | 62.73 | 28.1 | 58.5 | - | - |
| GPT-4V | ❌ | - | 78.0 | 88.4 | 78.5 | 656 | 52.04 | 25.8 | 56.8 | 67.7 | 75.0 |
| CogVLM2-LLaMA3 | ✅ | 8B | 84.2 | 92.3 | 81.0 | 756 | 83.3 | 38.0 | 44.3 | 60.4 | 80.5 |
| CogVLM2-LLaMA3-Chinese | ✅ | 8B | 85.0 | 88.4 | 74.7 | 780 | 79.9 | 25.1 | 42.8 | 60.5 | 78.9 |
All evaluations were performed without any external OCR tools ("pixel only").
## Quick Start

Below is a simple example of chatting with the CogVLM2 model. For more use cases, please visit our GitHub:
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Use bfloat16 on GPUs with compute capability >= 8 (Ampere or newer), float16 otherwise.
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

# Prompt template used for text-only conversations (no image provided).
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('No image path entered; continuing with a text-only conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("User: ")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                # Prepend the previous turns before the new text-only query.
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)

        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            # Drop the prompt tokens and decode only the newly generated tokens.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
```
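If a single GPU cannot hold the full bf16/fp16 weights, 4-bit quantized loading may be an option. The snippet below is a minimal sketch, assuming `bitsandbytes` is installed and that this repository's custom model code works with transformers' standard 4-bit loading path; it is not an official recipe from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"

# Assumption: 4-bit NF4 quantization via bitsandbytes, with bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

# Reuse the chat loop above, but skip the explicit .to(DEVICE) call:
# quantized weights are already placed on the GPU at load time.
```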
## License

This model is released under the CogVLM2 License. For models built with Meta Llama 3, please also comply with the LLAMA3 License.
## Citation

If you find our work helpful, please consider citing the following papers:
```
@misc{hong2024cogvlm2,
      title={CogVLM2: Visual Language Models for Image and Video Understanding},
      author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
      year={2024},
      eprint={2408.16500},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models},
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```