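Cephalo-Idefics-2-vision-10b-alpha open-source model: supporting human-AI interaction in multimodal materials science

Cephalo Idefics 2 Vision 10b Alpha

Developed by lamm-mit

Cephalo is a series of vision large language models (V-LLMs) focused on multimodal materials science. The models integrate visual and linguistic data to support advanced understanding and interaction in human-AI or multi-agent AI frameworks.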

Image-to-Text

Transformers

License: Apache-2.0 #multimodal materials science #vision-language understanding #bio-inspired design

Downloads: 137

Released: 5/28/2024

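Model overview

Cephalo can interpret complex visual scenes, generate contextually accurate language descriptions, and answer queries. The model was developed to handle diverse inputs, including images and text, and supports applications such as image captioning, visual question answering, and multimodal content generation.

Model features

Multimodal understanding

Integrates visual and linguistic data and supports joint processing of images and text.

Advanced visual scene interpretation

Interprets complex visual scenes and generates contextually accurate language descriptions.

Novel dataset generation method

Uses an algorithm that extracts images and their text descriptions from PDF documents, with the aim of keeping the training data high-quality and contextually relevant.

Materials science applications

Focuses on the materials science domain and supports generating 2D and 3D renderings of material microstructures.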

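Model capabilities

Image captioning

Visual question answering

Multimodal content generation

Materials science analysis

Multi-agent AI interaction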

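Use cases

Materials science

Material microstructure analysis

Analyze images of material microstructures and generate detailed descriptions and analysis reports.

Improves the efficiency and accuracy of materials design.

Multi-agent AI system design

Design multi-agent AI systems based on observations of nature, such as ant behavior.

Applies to efficient, adaptive locomotion systems in robotics and materials science.

Education

Science education support

Generate explanations of scientific images and teaching materials.

Helps students better understand complex scientific concepts.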

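🚀 Cephalo vision large language model

Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks.

🚀 Quick start

Cephalo models can process a variety of inputs, including images and text, and suit applications such as image captioning, visual question answering, and multimodal content generation. The following example code shows how to get started on a GPU: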

import torch
import requests
from PIL import Image
from tqdm.notebook import tqdm
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

DEVICE = 'cuda:0'

model_id = 'lamm-mit/Cephalo-Idefics-2-vision-10b-alpha'

# Load the model weights and move them to the GPU
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # if your GPU allows
    _attn_implementation="flash_attention_2",  # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

# The processor handles both tokenization and image preprocessing
processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)

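✨ Key features

Multimodal integration: combines visual and linguistic data for advanced understanding and interaction.
Novel dataset generation: uses an algorithm that accurately extracts images and their corresponding text descriptions from complex PDF documents to create high-quality image-text pairs.
Broad range of applications: suited to image captioning, visual question answering, multimodal content generation, and more.
Architecture: combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding.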

📦 Installation

To use Cephalo models, install the required libraries with the following command:

pip install torch transformers pillow requests tqdm

To speed up generation with Flash Attention 2, also install flash-attn:

pip install flash-attn

For 4-bit quantization, install accelerate and bitsandbytes:

pip install accelerate bitsandbytes
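
The page lists accelerate and bitsandbytes for 4-bit quantization but does not show how to enable it. The following is a minimal sketch, assuming the standard `BitsAndBytesConfig` route in Transformers. The helper names `quantization_settings` and `load_quantized` and the specific settings (NF4, double quantization, float16 compute) are illustrative assumptions, not settings confirmed by the model authors.

```python
def quantization_settings():
    """Return keyword arguments for BitsAndBytesConfig (assumed, illustrative values)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
        "bnb_4bit_compute_dtype": "float16",  # resolved to a torch dtype at load time
    }


def load_quantized(model_id="lamm-mit/Cephalo-Idefics-2-vision-10b-alpha"):
    """Load the model with 4-bit weights. Needs a CUDA GPU, accelerate and bitsandbytes."""
    # Heavy imports stay inside the function so the settings can be inspected without a GPU
    import torch
    from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

    settings = quantization_settings()
    settings["bnb_4bit_compute_dtype"] = getattr(torch, settings["bnb_4bit_compute_dtype"])
    quant_config = BitsAndBytesConfig(**settings)

    # A quantized model is placed by device_map, so no .to(DEVICE) call is made here
    return Idefics2ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `model = load_quantized()` would then stand in for the full-precision `from_pretrained` call. Output quality under quantization is not discussed on this page and would need to be checked separately.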

💻 Usage examples

Basic usage

from transformers.image_utils import load_image

image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
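
In the basic example, `generated_ids` includes the prompt tokens, so each decoded string contains the prompt followed by the answer. The sketch below shows one way to keep only the answer, assuming the decoded text follows the `User: ... Assistant: ...` layout of the chat format described on this page. The helper name `extract_reply` is hypothetical.

```python
def extract_reply(decoded_text, marker="Assistant:"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    head, sep, tail = decoded_text.rpartition(marker)
    return tail.strip() if sep else decoded_text.strip()


example = "User: What is shown in this image? \nAssistant: Ants climbing a vertical surface."
print(extract_reply(example))  # Ants climbing a vertical surface.
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, as the advanced example does with `generated_ids[:, inputs["input_ids"].size(1):]`.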

Advanced usage

from IPython.display import display, Markdown, HTML

# Note: ensure_list, is_url and format_conversation are helper functions that are
# not defined on this page and must be provided separately.

def ask_about_image(model, processor, question,
                    images_input=None,
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=None,
                    images=None,
                    use_Markdown=False,
                    ):
    # None defaults avoid sharing one list object between separate calls
    images_input = ensure_list(images_input if images_input is not None else [])
    messages = messages if messages is not None else []
    images = images if images is not None else []

    query = question

    # Load images only when none were passed in (i.e. on the first turn)
    if len(images) == 0:
        for image in tqdm(images_input):
            if is_url(image):
                image = load_image(image)
            images.append(image)

            if show_image:
                display(image)

    if len(messages) == 0:
        # First turn: instructions, one image placeholder per image, then the question
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                {"type": "text", "text": query},
            ],
        }
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # insert image placeholders before the question
        messages.append(base_message)
    else:
        # Follow-up turn: append only the new question
        messages.append({"role": "user", "content": [{"type": "text", "text": query}]})

    if verbatim:
        print(messages)

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):], skip_special_tokens=True)

    messages.append({"role": "assistant", "content": [{"type": "text", "text": generated_texts[0]}]})
    formatted_conversation = format_conversation(messages, images)

    # Display the formatted conversation, e.g. in a Jupyter notebook
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images

question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."

url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg" 

response, messages,images= ask_about_image ( model, processor, question, 
                                             images_input=[url1,],
                                             temperature=0.1,
                                             system= '', init_instr='You carefully study the image and provide detailed answers. Think step-by-step.\n\n', 
                                             show_conversation=True,
                                             max_new_tokens=512, messages=[], images=[])
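
The `ask_about_image` function above calls `ensure_list`, `is_url` and `format_conversation`, none of which are defined on this page. The following is a minimal sketch of what such helpers might look like. These are assumptions written for illustration, not the authors' implementations, and they would need to be defined before the function is called.

```python
import html
from urllib.parse import urlparse


def ensure_list(obj):
    """Wrap a single item in a list; return lists unchanged and None as an empty list."""
    if obj is None:
        return []
    return obj if isinstance(obj, list) else [obj]


def is_url(value):
    """Return True if value is a string that looks like an http(s) URL."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def format_conversation(messages, images=None):
    """Render the message list as simple HTML, one paragraph per turn.

    The images argument is accepted to match the call in ask_about_image,
    but images are shown only as an "[image]" placeholder in this sketch.
    """
    parts = []
    for message in messages:
        text = "".join(
            item["text"] if item["type"] == "text" else "[image] "
            for item in message["content"]
        )
        parts.append(f"<p><b>{message['role'].capitalize()}:</b> {html.escape(text)}</p>")
    return "\n".join(parts)
```

To continue the conversation, the returned `messages` and `images` can be passed back into `ask_about_image` with a new question, which takes the follow-up branch of the function.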

📚 Documentation

Chat format

The lamm-mit/Cephalo-Idefics-2-vision-10b-alpha model works with one or more image inputs, using prompts in the following chat format:

User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:

For multi-turn conversations, format the prompt as follows:

User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:

To set the chat template manually:

IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
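from transformers import AutoTokenizer

# base_model_id is not defined on this page; set it to the model ID or local path whose tokenizer should be loaded
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)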
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
processor.tokenizer = tokenizer
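
To see what `IDEFICS2_CHAT_TEMPLATE` produces without loading a tokenizer, the template logic can be mirrored in plain Python. This is a sketch for inspecting prompts only; the function name `render_chat` is hypothetical, and the processor's `apply_chat_template` remains the call to use in practice.

```python
def render_chat(messages, add_generation_prompt=True):
    """Pure-Python mirror of IDEFICS2_CHAT_TEMPLATE, for inspecting prompts."""
    out = []
    for message in messages:
        content = message["content"]
        # No space follows the colon when the turn starts with an image
        sep = ":" if content[0]["type"] == "image" else ": "
        body = "".join(
            item["text"] if item["type"] == "text" else "<image>"
            for item in content
            if item["type"] in ("text", "image")
        )
        out.append(f"{message['role'].capitalize()}{sep}{body}<end_of_utterance>\n")
    if add_generation_prompt:
        out.append("Assistant:")
    return "".join(out)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Think step-by-step.\n"},
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
print(render_chat(messages))
```

Each turn ends with `<end_of_utterance>` and a newline, and the trailing `Assistant:` is added only when `add_generation_prompt` is true, which matches the single-turn and multi-turn prompts shown above.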

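🔧 Technical details

Cephalo models were developed with a novel dataset generation method that uses an algorithm to accurately detect and separate images and their corresponding text descriptions in complex PDF documents. Images and captions are extracted from the PDFs to form image-text pairs, which are then refined and validated through LLM-based natural language processing to keep the training data high-quality and contextually relevant.

The model architecture combines a vision encoder model with an autoregressive transformer to handle complex natural language understanding. This version, lamm-mit/Cephalo-Idefics-2-vision-10b-alpha, is a merged expansion of https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta and HuggingFaceM4/idefics2-8b-chatty. The merge increases model depth, with a focus on learning more complex representations and associations.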

The model was trained in several stages:

Fine-tune the HuggingFaceM4/idefics2-8b-chatty model to obtain https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta.
Combine the https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta decoder with the last 8 layers of the HuggingFaceM4/idefics2-8b-chatty decoder.
Fine-tune the merged model, which now has 40 decoder layers and 10 billion parameters in total.
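
The layer arithmetic in the stages above can be sketched as follows. This only illustrates the layer count. It assumes that every decoder layer of the fine-tuned model is kept, that both source decoders have the same depth (40 - 8 = 32 layers), and that the 8 extra layers are stacked at the end. The page does not state the stacking order, and the function name `merged_layer_plan` is hypothetical.

```python
def merged_layer_plan(total_layers=40, appended_layers=8):
    """List (source, layer_index) pairs for the merged decoder stack.

    Assumes all layers of the fine-tuned model are kept and the last
    `appended_layers` layers of the base model are stacked after them.
    """
    kept_layers = total_layers - appended_layers  # 40 - 8 = 32
    plan = [("Cephalo-Idefics-2-vision-8b-beta", i) for i in range(kept_layers)]
    # With an assumed base depth of 32, the last 8 layers are indices 24..31
    plan += [
        ("idefics2-8b-chatty", i)
        for i in range(kept_layers - appended_layers, kept_layers)
    ]
    return plan


plan = merged_layer_plan()
print(len(plan))  # 40
print(plan[31], plan[32])
```

The plan only lists which layer would come from which model. Copying the actual weights and the final fine-tuning stage are outside the scope of this sketch.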

The model was trained on a combination of scientific text-image data extracted from Wikipedia and scientific papers.

📄 License

This project is licensed under the Apache 2.0 license.

Citation

Please cite as follows:

@article{Buehler_Cephalo_2024,
  title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
  author={Markus J. Buehler},
  journal={arXiv preprint arXiv:2405.19076},
  year={2024}
}