---
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B-int4/blob/main/LICENSE
language:
- zh
pipeline_tag: text-generation
tags:
- chat
- cogvlm2
inference: false
---
CogVLM2
👋 WeChat · 💡 Online Demo · 🎈 GitHub Homepage
📍 Try the larger-scale CogVLM model on the ZhipuAI Open Platform.
Model Introduction
We have launched the new-generation CogVLM2 series of models and open-sourced two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series brings the following improvements:
- Significant improvements on multiple benchmarks such as TextVQA and DocVQA.
- Support for 8K context length.
- Support for image resolutions up to 1344 * 1344 (see the resizing sketch after this list).
- An open-source model version that supports both Chinese and English.
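If your inputs exceed the 1344 * 1344 limit, one option is to downscale them before inference. Below is a minimal, optional sketch using Pillow; the file name is a placeholder, and the model's own preprocessing may also resize internally, so this step is purely illustrative.

```python
# Optional pre-resize sketch: cap an image at the 1344 * 1344 resolution noted above.
# "example.jpg" is a placeholder path; thumbnail() preserves aspect ratio and never upscales.
from PIL import Image

MAX_SIDE = 1344  # maximum resolution supported by CogVLM2

image = Image.open("example.jpg").convert("RGB")
image.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)
print(image.size)  # both dimensions are now <= 1344
```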
The CogVLM2 Int4 model requires 16 GB of GPU memory and must run on a Linux system with an Nvidia GPU.
| Model Name | cogvlm2-llama3-chat-19B-int4 | cogvlm2-llama3-chat-19B |
|------------|------------------------------|-------------------------|
| GPU Memory Required | 16 GB | 42 GB |
| System Requirement | Linux (Nvidia GPU required) | Linux (Nvidia GPU required) |
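Before downloading a checkpoint, it can help to confirm that the local GPU meets the memory requirement above. Here is a minimal sketch; the thresholds come from the table, and a single GPU at index 0 is assumed.

```python
# Check that GPU 0 has enough memory for the chosen checkpoint (values from the table above).
import torch

REQUIRED_GB = 16  # 16 GB for cogvlm2-llama3-chat-19B-int4, 42 GB for the full model

assert torch.cuda.is_available(), "An Nvidia GPU on Linux is required."
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 total memory: {total_gb:.1f} GB")
if total_gb < REQUIRED_GB:
    print("Warning: the selected checkpoint may not fit in GPU memory.")
```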
Benchmarks
Our open-source models achieve excellent results on multiple leaderboards compared with the previous generation of open-source CogVLM models, and their performance rivals that of some closed-source models, as shown in the table below:
| Model | Open Source | LLM Size | TextVQA | DocVQA | ChartQA | OCRbench | MMMU | MMVet | MMBench |
|-------|-------------|----------|---------|--------|---------|----------|------|-------|---------|
| CogVLM1.1 | ✅ | 7B | 69.7 | - | 68.3 | 590 | 37.3 | 52.0 | 65.8 |
| LLaVA-1.5 | ✅ | 13B | 61.3 | - | - | 337 | 37.0 | 35.4 | 67.7 |
| Mini-Gemini | ✅ | 34B | 74.1 | - | - | - | 48.0 | 59.3 | 80.6 |
| LLaVA-NeXT-LLaMA3 | ✅ | 8B | - | 78.2 | 69.5 | - | 41.7 | - | 72.1 |
| LLaVA-NeXT-110B | ✅ | 110B | - | 85.7 | 79.7 | - | 49.1 | - | 80.5 |
| InternVL-1.5 | ✅ | 20B | 80.6 | 90.9 | 83.8 | 720 | 46.8 | 55.4 | 82.3 |
| QwenVL-Plus | ❌ | - | 78.9 | 91.4 | 78.1 | 726 | 51.4 | 55.7 | 67.0 |
| Claude3-Opus | ❌ | - | - | 89.3 | 80.8 | 694 | 59.4 | 51.7 | 63.3 |
| Gemini Pro 1.5 | ❌ | - | 73.5 | 86.5 | 81.3 | - | 58.5 | - | - |
| GPT-4V | ❌ | - | 78.0 | 88.4 | 78.5 | 656 | 56.8 | 67.7 | 75.0 |
| CogVLM2-LLaMA3 (Ours) | ✅ | 8B | 84.2 | 92.3 | 81.0 | 756 | 44.3 | 60.4 | 80.5 |
| CogVLM2-LLaMA3-Chinese (Ours) | ✅ | 8B | 85.0 | 88.4 | 74.7 | 780 | 42.8 | 60.5 | 78.9 |
All evaluations were conducted without any external OCR tools ("pixels only").
Quick Start
Below is a simple example of using the CogVLM2 model for a conversation. For more use cases, please visit our GitHub.
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B-int4"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# bfloat16 requires compute capability >= 8 (Ampere or newer); otherwise fall back to float16.
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
).eval()

text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('No image path entered; the following will be a text-only conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human: ")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                # Wrap the first text-only query in the chat template.
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                # For follow-up text-only queries, rebuild the prompt from the history.
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)
        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            # Drop the prompt tokens and keep only the newly generated ones.
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("<|end_of_text|>")[0]
            print("\nCogVLM2:", response)
        history.append((query, response))
```
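For scripted (non-interactive) use, the same `build_conversation_input_ids` and `generate` calls can be wrapped in a small helper. The sketch below reuses the `model`, `tokenizer`, `DEVICE`, and `TORCH_TYPE` objects defined above; the image path and question in the usage comment are placeholders.

```python
def ask(image_path: str, question: str) -> str:
    """Single-turn visual question answering using the objects created above."""
    image = Image.open(image_path).convert('RGB')
    input_by_model = model.build_conversation_input_ids(
        tokenizer,
        query=question,
        history=[],
        images=[image],
        template_version='chat'
    )
    inputs = {
        'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
        'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
        'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
        'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]],
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=2048, pad_token_id=128002)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
    return tokenizer.decode(outputs[0]).split("<|end_of_text|>")[0]

# Example usage (replace with a real image path and question):
# print(ask("demo.jpg", "Describe this image."))
```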
License
This model is released under the CogVLM2 LICENSE. For models built on Meta Llama 3, please also comply with the LLAMA3_LICENSE.
Citation
If you find our work helpful, please consider citing the following paper:
```
@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models},
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```