🚀 ChatRex
ChatRex is a multimodal large language model (MLLM) with strong perception capabilities. It can answer questions while grounding the answers to the objects it refers to, making it suitable for object detection, image understanding, and other scenarios that require fine-grained perception.
Paper: https://arxiv.org/abs/2411.18363

🚀 Quick Start
ChatRex is a multimodal large language model (MLLM) designed to seamlessly integrate fine-grained object perception with strong language understanding. By adopting a decoupled architecture with a retrieval-based approach to object detection, and by leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset, which provides diverse image-region-text annotations. ChatRex can be applied to a variety of scenarios that require fine-grained perception, such as object detection, grounded conversation, grounded image captioning, and region understanding.

✨ Key Features
- Fine-grained perception fused with language understanding: ChatRex seamlessly integrates fine-grained object perception with strong language understanding, addressing key challenges in perception tasks.
- Diverse application scenarios: it can be applied to object detection, grounded conversation, grounded image captioning, region understanding, and other scenarios that require fine-grained perception.
- Backed by a rich dataset: it is powered by the Rexverse-2M dataset, which provides diverse image-region-text annotations.
📦 Installation
Environment Setup
```bash
conda create -n chatrex python=3.9
conda activate chatrex
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/IDEA-Research/ChatRex.git
cd ChatRex
pip install -v -e .
# install deformable attention for the universal proposal network
cd chatrex/upn/ops
pip install -v -e .
```
Download the Pre-trained UPN Model
We provide checkpoints for both the Universal Proposal Network (UPN) and the ChatRex model. The UPN checkpoint can be downloaded with the following commands (the ChatRex weights are pulled automatically from the Hugging Face Hub by the example code below):
```bash
mkdir -p checkpoints/upn_checkpoints
# download the UPN checkpoint
wget -O checkpoints/upn_checkpoints/upn_large.pth https://github.com/IDEA-Research/ChatRex/releases/download/upn-large/upn_large.pth
```
Verify the Installation
Verify the Universal Proposal Network (UPN) installation
Run the following command:
```bash
python tests/test_upn_install.py
```
If the installation is successful, you will find two visualization images in the tests folder, one with fine-grained proposals and one with coarse-grained proposals.
Verify the ChatRex model installation
Run the following command:
```bash
python tests/test_chatrex_install.py
```
If the installation is successful, you will see output like:
```text
prediction: <obj0> shows a brown dog lying on a bed. The dog is resting comfortably, possibly sleeping, and is positioned on the left side of the bed
```
💻 Usage Examples
UPN for Object Proposal Generation
The Universal Proposal Network (UPN), part of ChatRex, is a robust object proposal model designed for comprehensive, accurate object detection across granularities and domains. Built on T-Rex2, UPN is a DETR-based model with a dual-granularity prompt tuning strategy that combines fine-grained (e.g., part-level) and coarse-grained (e.g., instance-level) detection.

UPN Example Code
```python
import torch
from PIL import Image

from chatrex.tools.visualize import plot_boxes_to_image
from chatrex.upn import UPNWrapper

ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
test_image_path = "tests/images/test_upn.jpeg"

model = UPNWrapper(ckpt_path)

# fine-grained prompt
fine_grained_proposals = model.inference(
    test_image_path, prompt_type="fine_grained_prompt"
)
# filter by score (default: 0.3) and apply non-maximum suppression (default: 0.8)
fine_grained_filtered_proposals = model.filter(
    fine_grained_proposals, min_score=0.3, nms_value=0.8
)
## the output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format, shape (B, N, 4)
## - "scores": list of per-box scores, shape (B, N)

# coarse-grained prompt
coarse_grained_proposals = model.inference(
    test_image_path, prompt_type="coarse_grained_prompt"
)
coarse_grained_filtered_proposals = model.filter(
    coarse_grained_proposals, min_score=0.3, nms_value=0.8
)
## the output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format, shape (B, N, 4)
## - "scores": list of per-box scores, shape (B, N)
```
UPN Visualization Example Code
```python
from chatrex.tools.visualize import plot_boxes_to_image

image = Image.open(test_image_path)

# draw the fine-grained proposals
fine_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    fine_grained_filtered_proposals["original_xyxy_boxes"][0],
    fine_grained_filtered_proposals["scores"][0],
)
fine_grained_vis_image.save("tests/test_image_fine_grained.jpeg")
print("fine-grained proposals saved to tests/test_image_fine_grained.jpeg")

# draw the coarse-grained proposals
coarse_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
    coarse_grained_filtered_proposals["scores"][0],
)
coarse_grained_vis_image.save("tests/test_image_coarse_grained.jpeg")
print("coarse-grained proposals saved to tests/test_image_coarse_grained.jpeg")
```
Using ChatRex
ChatRex takes three inputs: an image, a text prompt, and box inputs. For the box inputs, you can either use the object proposals generated by UPN or provide your own boxes (e.g., user-drawn boxes). We have wrapped the ChatRex model in the Hugging Face Transformers format for easy use. ChatRex can be applied to a variety of tasks; example code for each task is provided below.
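Both box sources use the same xyxy format, as the examples in the following sections show. A minimal sketch of the two options (the coordinates in option 2 are placeholder values):
```python
# Option 1: boxes proposed by UPN (xyxy format)
bbox = fine_grained_filtered_proposals["original_xyxy_boxes"][0]

# Option 2: user-drawn boxes in pixel coordinates (xyxy format, placeholder values)
bbox = [[73.9, 56.6, 227.7, 216.3]]
```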
ChatRex for Object Detection, Grounding, and Referring
Example prompts for detection, grounding, and referring tasks:
```text
# single-object detection
Please detect dog in this image. Answer the question with object indexes.
Please detect the man in yellow shirt in this image. Answer the question with object indexes.

# multiple-object detection, separate objects with ;
Please detect person; pigeon in this image. Answer the question with object indexes.
Please detect person in the car; cat below the table in this image. Answer the question with object indexes.
```
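Since multi-object prompts differ only in how the object phrases are joined, a small helper can compose them. This is a hypothetical convenience function, not part of the ChatRex API; it simply reproduces the prompt format shown above:
```python
def build_detection_prompt(object_phrases: list[str]) -> str:
    """Compose a detection prompt; object phrases are separated by ';'."""
    return (
        f"Please detect {'; '.join(object_phrases)} in this image. "
        "Answer the question with object indexes."
    )

print(build_detection_prompt(["person", "pigeon"]))
# -> Please detect person; pigeon in this image. Answer the question with object indexes.
```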
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_detection.jpg"

    # get UPN proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please detect person; pigeon in this image. Answer the question with object indexes.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_detection.jpeg")
    print("prediction visualization saved to tests/test_chatrex_detection.jpeg")
```
The LLM output is as follows:
```text
<ground>person</ground><objects><obj10><obj14><obj15><obj27><obj28><obj32><obj33><obj35><obj38><obj47><obj50></objects>
<ground>pigeon</ground><objects><obj0><obj1><obj2><obj3><obj4><obj5><obj6><obj7><obj8><obj9><obj11><obj12><obj13><obj16><obj17><obj18><obj19><obj20><obj21><obj22><obj23><obj24><obj25><obj26><obj29><obj31><obj37><obj39><obj40><obj41><obj44><obj49></objects>
```
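The `<ground>...</ground><objects>...</objects>` spans can be mapped back to the input boxes by index. Below is an illustrative parser (a hypothetical helper, not part of the ChatRex API), assuming `prediction` is the output string and `boxes` is the proposal list passed to the processor:
```python
import re

def parse_grounded_output(prediction: str, boxes):
    """Pair each grounded phrase with the proposal boxes it references."""
    results = []
    for label, objs in re.findall(
        r"<ground>(.*?)</ground><objects>(.*?)</objects>", prediction
    ):
        indexes = [int(i) for i in re.findall(r"<obj(\d+)>", objs)]
        results.append((label, [boxes[i] for i in indexes]))
    return results

# e.g. with the detection example above:
# parse_grounded_output(prediction, fine_grained_filtered_proposals["original_xyxy_boxes"][0])
```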
The visualized output is saved to tests/test_chatrex_detection.jpeg.

ChatRex for Region Caption
Example prompts for the region caption task:
```text
# region caption for a single object
## with a category name
What is the category name of <obji>? Answer the question with its category name in free format.
## with a short phrase
Can you provide me with a short phrase to describe <obji>? Answer the question with a short phrase.
## in referring style
Can you provide me with a brief description of <obji>? Answer the question with brief description.
## with a one sentence description
Can you provide me with a one sentence description of <obji>? Answer the question with a one sentence description.

# region caption for multiple objects, separate objects with ;
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    test_image_path = "tests/images/test_chatrex_install.jpg"

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Can you provide a one sentence description of <obj0> in the image? Answer the question with a one sentence description.",
        bbox=[[73.88417, 56.62228, 227.69223, 216.34338]],  # user-provided box in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        [[73.88417, 56.62228, 227.69223, 216.34338]],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_region_caption.jpeg")
    print("prediction visualization saved to tests/test_chatrex_region_caption.jpeg")
```
The LLM output is as follows:
```text
<ground>A brown dog is lying on a bed, appearing relaxed and comfortable</ground><objects><obj0></objects>
```
The visualized output is saved to tests/test_chatrex_region_caption.jpeg.

ChatRex for Grounded Image Caption
Example prompts for the grounded image caption task:
```text
# brief grounded image caption
Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.

# detailed grounded image caption
Please provide a detailed description of the image and detect all the mentioned objects. Answer the question with grounded object indexes.
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_grounded_caption.jpg"

    # get UPN proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_grounded_image_caption.jpeg")
    print("prediction visualization saved to tests/test_chatrex_grounded_image_caption.jpeg")
```
The LLM output is as follows:
```text
The image depicts a cozy living room with a <ground>plaid couch,</ground><objects><obj2></objects> a <ground>wooden TV stand</ground><objects><obj3></objects>holding a <ground>black television,</ground><objects><obj1></objects> a <ground>red armchair,</ground><objects><obj4></objects> and a <ground>whiteboard</ground><objects><obj0></objects>with writing on the wall, accompanied by a <ground>framed poster</ground><objects><obj6></objects>of a <ground>couple.</ground><objects><obj9><obj11></objects>
```
The visualized output is saved to tests/test_chatrex_grounded_image_caption.jpeg.

ChatRex for Grounded Conversation
Example prompt for the grounded conversation task (prepend the instruction to your question, as in the code below):
```text
Answer the question in grounded format. [your question]
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_grounded_conversation.jpg"

    # get UPN proposals (coarse-grained, i.e. instance-level, for conversation)
    coarse_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="coarse_grained_prompt"
    )
    coarse_grained_filtered_proposals = model_upn.filter(
        coarse_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Answer the question in grounded format. This is a photo of my room, and can you tell me what kind of person I am?",
        bbox=coarse_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=30,
        draw_width=10,
    )
    vis_image.save("tests/test_chatrex_grounded_conversation.jpeg")
    print("prediction visualization saved to tests/test_chatrex_grounded_conversation.jpeg")
```
The LLM output is as follows:
```text
Based on the items in the image, it can be inferred that the <ground>person</ground><objects><obj1></objects> who owns this room has an interest in fitness and possibly enjoys reading. The presence of the <ground>dumbbell</ground><objects><obj2></objects> suggests a commitment to physical activity, while the <ground>book</ground><objects><obj3></objects> indicates a liking for literature or reading. The <ground>sneaker</ground><objects><obj0></objects>s and the <ground>plush toy</ground><objects><obj1></objects> add a personal touch, suggesting that the <ground>person</ground><objects><obj1></objects> might also value comfort and perhaps has a playful or nostalgic side. However, without more context, it is not possible to accurately determine the individual's specific traits or <ground>person</ground><objects><obj1></objects>ality.
```
The visualized output is saved to tests/test_chatrex_grounded_conversation.jpeg.

📄 License
ChatRex is licensed under the IDEA License 1.0, Copyright (c) IDEA. All rights reserved. Note that this project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses, including but not limited to:
- The OpenAI Terms of Use for the dataset.
- The large language model used in this project is lmsys/vicuna-7b-v1.5, which is licensed under the Llama 2 Community License Agreement.
- The high-resolution vision encoder is laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg, licensed under the MIT License.
- The low-resolution vision encoder is openai/clip-vit-large-patch14, licensed under the MIT License.
📚 BibTeX Citation
```bibtex
@misc{jiang2024chatrextamingmultimodalllm,
      title={ChatRex: Taming Multimodal LLM for Joint Perception and Understanding},
      author={Qing Jiang and Gen Luo and Yuqin Yang and Yuda Xiong and Yihao Chen and Zhaoyang Zeng and Tianhe Ren and Lei Zhang},
      year={2024},
      eprint={2411.18363},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.18363},
}
```