Cephalo-Idefics-2-vision-8b-alpha open-source model: multimodal materials science for advanced human-AI interaction


Cephalo Idefics 2 Vision 8b Alpha

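Developed by lamm-mit

Cephalo is a series of vision large language models (V-LLMs) focused on multimodal materials science. The models integrate visual and linguistic data to support advanced understanding and interaction in human-AI or multi-agent AI frameworks.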

Image-to-Text

Transformers

License: Apache-2.0 #multimodal materials science #vision-language understanding #scientific image analysis

Downloads: 150

Released: 5/23/2024

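Model Overview

Cephalo can interpret complex visual scenes, generate contextually accurate language descriptions, and answer queries. The model was developed to handle diverse inputs, including images and text, and supports a wide range of applications such as image captioning, visual question answering, and multimodal content generation.

Model Features

Multimodal materials science understanding

Integrates visual and linguistic data, with a particular focus on advanced understanding and interaction in materials science.

Novel dataset generation method

Uses an algorithm that detects and separates images and their corresponding text descriptions from complex PDF documents, so that the training data is high quality and contextually relevant.

Complex visual scene interpretation

Interprets complex visual scenes, generates contextually accurate language descriptions, and answers queries.

Multi-agent AI framework support

Designed to support advanced understanding and interaction in human-AI or multi-agent AI frameworks.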

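Model Capabilities

Image captioning

Visual question answering

Multimodal content generation

Visual analysis for materials science

Multi-agent AI interaction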

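Use Cases

Materials science

Material microstructure analysis

Analyzes 2D and 3D renderings of material microstructures to provide input for additive manufacturing methods.

Provides accurate visual descriptions and analysis to assist materials design.

Bio-inspired applications

Analyzes behavior observed in nature (such as ants climbing) to inspire materials design and the development of multi-agent AI systems.

Provides bio-inspired ideas for the design of efficient and adaptive locomotion systems.

Multi-agent AI

Multi-agent collaborative systems

Analyzes collaborative behavior in nature (such as ant colony behavior) to inform the design of multi-agent AI systems.

Provides visual understanding and language descriptions of collaborative behavior to assist AI system design.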

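🚀 Cephalo Model

Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science. The models integrate visual and linguistic data to enable advanced understanding and interaction between humans and AI or within multi-agent AI frameworks. They accept several kinds of input, including images and text, and apply to image captioning, visual question answering, and multimodal content generation.

🚀 Quick Start

Environment setup

Make sure the required libraries are installed, such as torch, transformers, Pillow, and requests.

Example code

The following code shows how to get started quickly on a GPU: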

import torch
import requests
from PIL import Image
from tqdm.notebook import tqdm
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

DEVICE = 'cuda:0'

model_id = 'lamm-mit/Cephalo-Idefics-2-vision-8b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # if your GPU allows
    _attn_implementation="flash_attention_2",  # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

# do_image_splitting=True splits each image into sub-images (more detail, more memory)
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True
)

For more on model optimization, including quantization, see the later sections.

Simple inference example

from transformers.image_utils import load_image

image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
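Because the example above decodes the full generated sequence, each returned string typically contains the prompt as well as the reply, in the form "User: ... \nAssistant: ...". The following is a minimal sketch of how the reply could be separated from the prompt. The helper name extract_assistant_reply is hypothetical and is not part of transformers; it assumes the reply itself does not contain the literal text "Assistant:".

```python
def extract_assistant_reply(decoded_text):
    """Return the text after the last 'Assistant:' marker (hypothetical helper).

    If the marker is absent, the whole string is returned stripped.
    """
    marker = "Assistant:"
    _, found, reply = decoded_text.rpartition(marker)
    return reply.strip() if found else decoded_text.strip()


# Illustrative input shaped like a decoded Idefics2 conversation (not real model output)
sample = "User: What is shown in this image? \nAssistant: A group of ants."
reply = extract_assistant_reply(sample)
print(reply)
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, which is what the convenience function in the next section does with generated_ids[:, inputs["input_ids"].size(1):].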

Convenience inference function

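from IPython.display import HTML, Markdown, display

# ensure_list, is_url, and format_conversation are helper functions that
# must be defined before this function is called.
def ask_about_image(model, processor, question,
                    images_input=None,
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=None,
                    images=None,
                    use_Markdown=False,
                    ):
    # None defaults avoid sharing conversation state between separate calls
    images_input = [] if images_input is None else images_input
    messages = [] if messages is None else messages
    images = [] if images is None else images

    query = question
    images_input = ensure_list(images_input)
    if len(images) == 0:
        if len(images_input) > 0:
            # Load each input (URL or already-loaded image) into the images list
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)

                if show_image:
                    display(image)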
    if len (messages)==0:
       
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                # Image messages will be added dynamically here
                {"type": "text", "text": query}
            ]
        }
        
        # Ensure the images_input is a list
        images_input = ensure_list(images_input)
        
        # Add image messages dynamically
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # Insert image messages before the last text message
        
        # Append the constructed message to messages list
        messages.append(base_message)

    else:
        messages.append (
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query
                    }
                ]
            }
        )
    if verbatim:
        print (messages)
        
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)
     
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)

    messages.append (
        {
            "role": "assistant",
            "content": [ {"type": "text", "text": generated_texts[0]}, ]
        }
    )
    formatted_conversation = format_conversation(messages, images)
    
    # Display the formatted conversation, e.g. in Jupyter Notebook
    if show_conversation:
     
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
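The function above calls three helpers, ensure_list, is_url, and format_conversation, that are not defined on this page. The following is a minimal sketch of what they might look like, so that the function can run. All three bodies are assumptions written for illustration; the original implementations may differ, in particular the HTML layout produced by format_conversation.

```python
from html import escape
from urllib.parse import urlparse


def ensure_list(obj):
    """Wrap a single item in a list; pass lists and tuples through as lists."""
    if obj is None:
        return []
    if isinstance(obj, (list, tuple)):
        return list(obj)
    return [obj]


def is_url(value):
    """Return True only for strings that look like http(s) URLs."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def format_conversation(messages, images):
    """Render the conversation as simple HTML, one paragraph per message.

    `images` is accepted only to match the call in ask_about_image; this
    sketch shows an "[image]" placeholder instead of embedding pictures.
    """
    rows = []
    for message in messages:
        role = escape(message["role"].capitalize())
        parts = []
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(escape(item["text"]))
            elif item["type"] == "image":
                parts.append("[image]")
        rows.append(f"<p><b>{role}:</b> {' '.join(parts)}</p>")
    return "\n".join(rows)
```

Because Python resolves these names when ask_about_image is called rather than when it is defined, the helpers only need to exist before the first call.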

question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."

url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg" 

response, messages,images= ask_about_image ( model, processor, question, 
                                             images_input=[url1,],
                                             temperature=0.1,
                                             system= '', init_instr='You carefully study the image, and respond accurately, but succinctly. Think step-by-step.\n\n', 
                                             show_conversation=True,
                                             max_new_tokens=512, messages=[], images=[])

Example output

Image credit: Vaishakh Manohar

The image depicts a group of ants moving in a coordinated manner to climb a vertical surface. This behavior is known as cooperative climbing and involves the use of multiple agents working together to achieve a common goal. The relevance for materials design lies in the potential application of multi-agent AI in developing new materials with improved properties through the collaboration of multiple agents.

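✨ Key Features

Multimodal fusion: integrates visual and linguistic data to understand and interact with complex scenes.
Novel dataset generation: extracts images and text descriptions from complex PDF documents to produce high-quality image-text pairs for training.
Broad applicability: usable for image captioning, visual question answering, multimodal content generation, and related tasks.
Flexible input: accepts several kinds of input, including images and text.

📦 Installation

Make sure a Python environment is installed, then install the required libraries with the following command: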

pip install torch transformers Pillow requests

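💻 Usage Examples

Basic usage

The code in the Quick Start section above shows how to load the model, prepare inputs, and run inference.

Advanced usage

Setting the chat template manually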

from transformers import AutoTokenizer

IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
# base_model_id is not defined on this page; set it to the ID of the checkpoint whose tokenizer you want to load
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
# Attach the tokenizer (with the template) to the existing processor
processor.tokenizer = tokenizer

Chat format examples

Single-turn conversation:

User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:

Multi-turn conversation:

User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
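To make the chat format above concrete, here is a minimal pure-Python sketch that mirrors the logic of the chat template shown earlier: the capitalized role, a colon (followed by a space unless the first content item is an image), the text and `<image>` items in order, and an `<end_of_utterance>` terminator per message. The function name render_idefics2_chat is hypothetical. In practice processor.apply_chat_template does this work; the sketch only illustrates what string it is expected to produce.

```python
def render_idefics2_chat(messages, add_generation_prompt=True):
    """Re-implement the Idefics2 chat template in plain Python (illustration only)."""
    parts = []
    for message in messages:
        content = message["content"]
        # The template omits the space after the colon when an image comes first
        separator = ":" if content[0]["type"] == "image" else ": "
        body = "".join(
            item["text"] if item["type"] == "text" else "<image>"
            for item in content
            if item["type"] in ("text", "image")
        )
        role = message["role"].capitalize()
        parts.append(f"{role}{separator}{body}<end_of_utterance>\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Think step-by-step.\n"},
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = render_idefics2_chat(messages)
print(prompt)
```

Appending an assistant message and a new user message to the list and rendering again yields the multi-turn layout shown above.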

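📚 Detailed Documentation

Model overview

Cephalo is a series of multimodal vision large language models focused on materials science. It combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks. The model is based on HuggingFaceM4/idefics2-8b-chatty and was trained on scientific text-image data extracted from Wikipedia and scientific papers.

Dataset generation

The dataset used to train the vision model is generated by an algorithm that detects and separates images and their corresponding text descriptions from complex PDF documents. The steps are: extract images and captions from the PDFs, use large language models (LLMs) to process the text and create plausible image-text pairs, then refine and validate those pairs with further LLM-based processing so that the training data is high quality and contextually relevant.

Model architecture

The architecture combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks.

Model applications

Cephalo can be used in a range of scenarios, such as:

Image captioning: generate an accurate text description of an image.
Visual question answering: answer questions about an image.
Multimodal content generation: generate related content from image and text inputs.

Model optimization

Half-precision inference

If your GPU supports it, you can load the model and run inference in half precision (torch.float16 or torch.bfloat16):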

import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.float16,  # or torch.bfloat16
).to(DEVICE)

Vision encoder efficiency

If GPU memory is limited, you can take the following measures:

Disable image splitting: pass do_image_splitting=False when initializing the processor (AutoProcessor.from_pretrained).

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=False
)

Lower the maximum image resolution: pass size={"longest_edge": 448, "shortest_edge": 378} when initializing the processor. Adjust longest_edge as needed (the default is 980); multiples of 14 are recommended.

processor = AutoProcessor.from_pretrained(
    model_id,
    size={"longest_edge": 448, "shortest_edge": 378}
)
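Since the recommendation above is to use multiples of 14 for the edge lengths, a small helper can round a desired resolution down to the nearest valid value before it is passed to the processor. This is a minimal sketch; the names snap_down and make_size are hypothetical and are not part of transformers.

```python
def snap_down(value, multiple=14):
    """Round value down to the nearest positive multiple of `multiple`."""
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}")
    return (value // multiple) * multiple


def make_size(longest_edge, shortest_edge=378, multiple=14):
    """Build a `size` dict whose edges are both multiples of `multiple`."""
    return {
        "longest_edge": snap_down(longest_edge, multiple),
        "shortest_edge": snap_down(shortest_edge, multiple),
    }


size = make_size(450)
print(size)  # {'longest_edge': 448, 'shortest_edge': 378}
```

The resulting dictionary can then be passed as the size argument of AutoProcessor.from_pretrained, as in the example above.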

Speeding up generation with Flash Attention 2

Make sure flash-attn is installed, then pass _attn_implementation="flash_attention_2" when loading the model:

model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to(DEVICE)

4-bit quantization

Use the bitsandbytes library for 4-bit quantization. Make sure accelerate and bitsandbytes are installed:

from transformers import BitsAndBytesConfig

# NF4 weights with double quantization; matrix multiplications run in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForVision2Seq.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-alpha",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
).to(DEVICE)
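To give a feel for what half precision and 4-bit quantization save, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes a nominal 8 billion parameters (taken from the "8b" in the model name, not a measured count) and ignores activations, the KV cache, and the small overhead of quantization constants, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: nominal size implied by "8b" in the model name

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)  # full precision
bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # float16 / bfloat16
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp32: {fp32_gb:.0f} GB, bf16: {bf16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB")
```

Under these assumptions the weights shrink by roughly a factor of four when moving from half precision to 4-bit storage.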

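🔧 Technical Details

Model architecture

Cephalo combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding tasks. The vision encoder processes the image input, and the autoregressive transformer generates the language output.

Dataset generation

The dataset generation process extracts images and text descriptions from complex PDF documents. The steps are:

Image and caption extraction: extract images and their corresponding captions from the PDFs.
Natural language processing: use large language models (LLMs) to process the extracted text and create image-text pairs.
Refinement and validation: refine and validate the image-text pairs with LLM-based processing so that the data is high quality and contextually relevant.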
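The extraction step above can be illustrated with a toy example. This page does not describe the actual algorithm, so the following is only a hypothetical heuristic, not the authors' method: it finds caption lines of the form "Figure N." or "Fig. N:" in text already extracted from a page and pairs the N-th image on that page with the caption numbered N. The function name, the regular expression, and the file names are all assumptions made for illustration.

```python
import re

# Matches lines such as "Figure 1. Caption text" or "Fig. 2: Caption text"
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)[.:]\s*(.+)$", re.MULTILINE)


def pair_images_with_captions(page_text, image_names):
    """Pair the i-th image with the caption numbered i; skip images without one."""
    captions = {int(num): text.strip() for num, text in CAPTION_RE.findall(page_text)}
    pairs = []
    for index, name in enumerate(image_names, start=1):
        caption = captions.get(index)
        if caption:
            pairs.append({"image": name, "text": caption})
    return pairs


page_text = (
    "Introductory paragraph.\n"
    "Figure 1. Ants forming a bridge.\n"
    "More body text.\n"
    "Fig. 2: SEM image of a fracture surface.\n"
)
pairs = pair_images_with_captions(page_text, ["img_1.png", "img_2.png", "img_3.png"])
print(pairs)
```

In the pipeline described above, pairs like these would then go through LLM-based refinement and validation; that stage is not sketched here.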

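Training data

The model was trained on scientific text-image data extracted from Wikipedia and scientific papers.

Model optimization

The available optimizations are half-precision inference, vision encoder efficiency settings, Flash Attention 2 for faster generation, and 4-bit quantization. They reduce memory use and improve inference speed.

📄 License

This project is released under the Apache 2.0 license.

📖 Citation

Please cite this model as follows: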

@article{Buehler_Cephalo_2024,
  title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
  author={Markus J. Buehler},
  journal={arXiv preprint arXiv:2405.19076},
  year={2024}
}