🚀 ChatRex
ChatRex is a multimodal large language model (MLLM) with strong perception capabilities. It can answer questions while grounding the answers to the objects it refers to, making it suitable for object detection, image understanding, and other scenarios that require fine-grained perception.
Paper: https://arxiv.org/abs/2411.18363

🚀 Quick Start
ChatRex is a multimodal large language model (MLLM) designed to seamlessly integrate fine-grained object perception with strong language understanding. By adopting a decoupled architecture with a retrieval-based approach to object detection, and by leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset, which provides diverse image-region-text annotations. ChatRex can be applied to a variety of scenarios that require fine-grained perception, such as object detection, grounded conversation, grounded image captioning, and region understanding.

✨ Key Features
- Fine-grained perception fused with language understanding: ChatRex seamlessly integrates fine-grained object perception with strong language understanding, addressing key challenges in perception tasks.
- Diverse application scenarios: it can be applied to object detection, grounded conversation, grounded image captioning, region understanding, and other scenarios that require fine-grained perception.
- Backed by a rich dataset: it is powered by the Rexverse-2M dataset, which provides diverse image-region-text annotations.
📦 Installation
Environment Setup
```bash
conda create -n chatrex python=3.9
conda activate chatrex
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
git clone https://github.com/IDEA-Research/ChatRex.git
cd ChatRex
pip install -v -e .
# install deformable attention for the universal proposal network
cd chatrex/upn/ops
pip install -v -e .
```
Download the Pre-trained UPN Model
We provide checkpoints for both the Universal Proposal Network (UPN) and the ChatRex model. The UPN checkpoint can be downloaded with the following commands (the ChatRex weights are pulled automatically from the Hugging Face Hub by the example code below):
```bash
mkdir -p checkpoints/upn_checkpoints
# download the UPN checkpoint
wget -O checkpoints/upn_checkpoints/upn_large.pth https://github.com/IDEA-Research/ChatRex/releases/download/upn-large/upn_large.pth
```
Verify the Installation
Verify the Universal Proposal Network (UPN) installation
Run the following command:
```bash
python tests/test_upn_install.py
```
If the installation is successful, you will find two visualization images in the tests folder, one with fine-grained proposals and one with coarse-grained proposals.
Verify the ChatRex model installation
Run the following command:
```bash
python tests/test_chatrex_install.py
```
If the installation is successful, you will see output like:
```text
prediction: <obj0> shows a brown dog lying on a bed. The dog is resting comfortably, possibly sleeping, and is positioned on the left side of the bed
```
💻 Usage Examples
UPN for Object Proposal Generation
The Universal Proposal Network (UPN), part of ChatRex, is a robust object proposal model designed for comprehensive, accurate object detection across granularities and domains. Built on T-Rex2, UPN is a DETR-based model with a dual-granularity prompt tuning strategy that combines fine-grained (e.g., part-level) and coarse-grained (e.g., instance-level) detection.

UPN Example Code
```python
import torch
from PIL import Image

from chatrex.tools.visualize import plot_boxes_to_image
from chatrex.upn import UPNWrapper

ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
test_image_path = "tests/images/test_upn.jpeg"

model = UPNWrapper(ckpt_path)

# fine-grained prompt
fine_grained_proposals = model.inference(
    test_image_path, prompt_type="fine_grained_prompt"
)
# filter by score (default: 0.3) and apply non-maximum suppression (default: 0.8)
fine_grained_filtered_proposals = model.filter(
    fine_grained_proposals, min_score=0.3, nms_value=0.8
)
## the output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format, shape (B, N, 4)
## - "scores": list of per-box scores, shape (B, N)

# coarse-grained prompt
coarse_grained_proposals = model.inference(
    test_image_path, prompt_type="coarse_grained_prompt"
)
coarse_grained_filtered_proposals = model.filter(
    coarse_grained_proposals, min_score=0.3, nms_value=0.8
)
## the output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format, shape (B, N, 4)
## - "scores": list of per-box scores, shape (B, N)
```
UPN Visualization Example Code
```python
from chatrex.tools.visualize import plot_boxes_to_image

image = Image.open(test_image_path)

# draw the fine-grained proposals
fine_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    fine_grained_filtered_proposals["original_xyxy_boxes"][0],
    fine_grained_filtered_proposals["scores"][0],
)
fine_grained_vis_image.save("tests/test_image_fine_grained.jpeg")
print("fine-grained proposals saved to tests/test_image_fine_grained.jpeg")

# draw the coarse-grained proposals
coarse_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
    coarse_grained_filtered_proposals["scores"][0],
)
coarse_grained_vis_image.save("tests/test_image_coarse_grained.jpeg")
print("coarse-grained proposals saved to tests/test_image_coarse_grained.jpeg")
```
Using ChatRex
ChatRex takes three inputs: an image, a text prompt, and box inputs. For the box inputs, you can either use the object proposals generated by UPN or provide your own boxes (e.g., user-drawn boxes). We have wrapped the ChatRex model in the Hugging Face Transformers format for easy use. ChatRex can be applied to a variety of tasks; example code for each task is provided below.
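Both box sources use the same xyxy format, as the examples in the following sections show. A minimal sketch of the two options (the coordinates in option 2 are placeholder values):
```python
# Option 1: boxes proposed by UPN (xyxy format)
bbox = fine_grained_filtered_proposals["original_xyxy_boxes"][0]

# Option 2: user-drawn boxes in pixel coordinates (xyxy format, placeholder values)
bbox = [[73.9, 56.6, 227.7, 216.3]]
```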
ChatRex for Object Detection, Grounding, and Referring
Example prompts for detection, grounding, and referring tasks:
```text
# single-object detection
Please detect dog in this image. Answer the question with object indexes.
Please detect the man in yellow shirt in this image. Answer the question with object indexes.

# multiple-object detection, separate objects with ;
Please detect person; pigeon in this image. Answer the question with object indexes.
Please detect person in the car; cat below the table in this image. Answer the question with object indexes.
```
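Since multi-object prompts differ only in how the object phrases are joined, a small helper can compose them. This is a hypothetical convenience function, not part of the ChatRex API; it simply reproduces the prompt format shown above:
```python
def build_detection_prompt(object_phrases: list[str]) -> str:
    """Compose a detection prompt; object phrases are separated by ';'."""
    return (
        f"Please detect {'; '.join(object_phrases)} in this image. "
        "Answer the question with object indexes."
    )

print(build_detection_prompt(["person", "pigeon"]))
# -> Please detect person; pigeon in this image. Answer the question with object indexes.
```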
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_detection.jpg"

    # get UPN proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please detect person; pigeon in this image. Answer the question with object indexes.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_detection.jpeg")
    print("prediction visualization saved to tests/test_chatrex_detection.jpeg")
```
The LLM output is as follows:
```text
<ground>person</ground><objects><obj10><obj14><obj15><obj27><obj28><obj32><obj33><obj35><obj38><obj47><obj50></objects>
<ground>pigeon</ground><objects><obj0><obj1><obj2><obj3><obj4><obj5><obj6><obj7><obj8><obj9><obj11><obj12><obj13><obj16><obj17><obj18><obj19><obj20><obj21><obj22><obj23><obj24><obj25><obj26><obj29><obj31><obj37><obj39><obj40><obj41><obj44><obj49></objects>
```
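The `<ground>...</ground><objects>...</objects>` spans can be mapped back to the input boxes by index. Below is an illustrative parser (a hypothetical helper, not part of the ChatRex API), assuming `prediction` is the output string and `boxes` is the proposal list passed to the processor:
```python
import re

def parse_grounded_output(prediction: str, boxes):
    """Pair each grounded phrase with the proposal boxes it references."""
    results = []
    for label, objs in re.findall(
        r"<ground>(.*?)</ground><objects>(.*?)</objects>", prediction
    ):
        indexes = [int(i) for i in re.findall(r"<obj(\d+)>", objs)]
        results.append((label, [boxes[i] for i in indexes]))
    return results

# e.g. with the detection example above:
# parse_grounded_output(prediction, fine_grained_filtered_proposals["original_xyxy_boxes"][0])
```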
The visualized output is saved to tests/test_chatrex_detection.jpeg.

ChatRex for Region Caption
Example prompts for the region caption task:
```text
# region caption for a single object
## with a category name
What is the category name of <obji>? Answer the question with its category name in free format.
## with a short phrase
Can you provide me with a short phrase to describe <obji>? Answer the question with a short phrase.
## in referring style
Can you provide me with a brief description of <obji>? Answer the question with brief description.
## with a one sentence description
Can you provide me with a one sentence description of <obji>? Answer the question with a one sentence description.

# region caption for multiple objects, separate objects with ;
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    test_image_path = "tests/images/test_chatrex_install.jpg"

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Can you provide a one sentence description of <obj0> in the image? Answer the question with a one sentence description.",
        bbox=[[73.88417, 56.62228, 227.69223, 216.34338]],  # user-provided box in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        [[73.88417, 56.62228, 227.69223, 216.34338]],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_region_caption.jpeg")
    print("prediction visualization saved to tests/test_chatrex_region_caption.jpeg")
```
The LLM output is as follows:
```text
<ground>A brown dog is lying on a bed, appearing relaxed and comfortable</ground><objects><obj0></objects>
```
The visualized output is saved to tests/test_chatrex_region_caption.jpeg.

ChatRex for Grounded Image Caption
Example prompts for the grounded image caption task:
```text
# brief grounded image caption
Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.

# detailed grounded image caption
Please provide a detailed description of the image and detect all the mentioned objects. Answer the question with grounded object indexes.
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_grounded_caption.jpg"

    # get UPN proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_grounded_image_caption.jpeg")
    print("prediction visualization saved to tests/test_chatrex_grounded_image_caption.jpeg")
```
The LLM output is as follows:
```text
The image depicts a cozy living room with a <ground>plaid couch,</ground><objects><obj2></objects> a <ground>wooden TV stand</ground><objects><obj3></objects>holding a <ground>black television,</ground><objects><obj1></objects> a <ground>red armchair,</ground><objects><obj4></objects> and a <ground>whiteboard</ground><objects><obj0></objects>with writing on the wall, accompanied by a <ground>framed poster</ground><objects><obj6></objects>of a <ground>couple.</ground><objects><obj9><obj11></objects>
```
The visualized output is saved to tests/test_chatrex_grounded_image_caption.jpeg.

ChatRex for Grounded Conversation
Example prompt for the grounded conversation task (prepend the instruction to your question, as in the code below):
```text
Answer the question in grounded format. [your question]
```
Example Code
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading ChatRex model...")
    # load the ChatRex model
    model = AutoModelForCausalLM.from_pretrained(
        "IDEA-Research/ChatRex-7B",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the UPN model
    print("loading UPN model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_grounded_conversation.jpg"

    # get UPN proposals (coarse-grained, i.e. instance-level, for conversation)
    coarse_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="coarse_grained_prompt"
    )
    coarse_grained_filtered_proposals = model_upn.filter(
        coarse_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Answer the question in grounded format. This is a photo of my room, and can you tell me what kind of person I am?",
        bbox=coarse_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # run inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=30,
        draw_width=10,
    )
    vis_image.save("tests/test_chatrex_grounded_conversation.jpeg")
    print("prediction visualization saved to tests/test_chatrex_grounded_conversation.jpeg")
```
The LLM output is as follows:
```text
Based on the items in the image, it can be inferred that the <ground>person</ground><objects><obj1></objects> who owns this room has an interest in fitness and possibly enjoys reading. The presence of the <ground>dumbbell</ground><objects><obj2></objects> suggests a commitment to physical activity, while the <ground>book</ground><objects><obj3></objects> indicates a liking for literature or reading. The <ground>sneaker</ground><objects><obj0></objects>s and the <ground>plush toy</ground><objects><obj1></objects> add a personal touch, suggesting that the <ground>person</ground><objects><obj1></objects> might also value comfort and perhaps has a playful or nostalgic side. However, without more context, it is not possible to accurately determine the individual's specific traits or <ground>person</ground><objects><obj1></objects>ality.
```
The visualized output is saved to tests/test_chatrex_grounded_conversation.jpeg.

📄 License
ChatRex is licensed under the IDEA License 1.0, Copyright (c) IDEA. All rights reserved. Note that this project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses, including but not limited to:
- The OpenAI Terms of Use for the dataset.
- The large language model used in this project is lmsys/vicuna-7b-v1.5, which is licensed under the Llama 2 Community License Agreement.
- The high-resolution vision encoder is laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg, licensed under the MIT License.
- The low-resolution vision encoder is openai/clip-vit-large-patch14, licensed under the MIT License.
📚 BibTeX Citation
```bibtex
@misc{jiang2024chatrextamingmultimodalllm,
      title={ChatRex: Taming Multimodal LLM for Joint Perception and Understanding},
      author={Qing Jiang and Gen Luo and Yuqin Yang and Yuda Xiong and Yihao Chen and Zhaoyang Zeng and Tianhe Ren and Lei Zhang},
      year={2024},
      eprint={2411.18363},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.18363},
}
```