🚀 Cephalo Vision Large Language Models
Cephalo is a series of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI and multi-agent AI frameworks.
🚀 Quick Start
Cephalo models handle diverse inputs, including images and text, and support a broad range of applications such as image captioning, visual question answering, and multimodal content generation. Example code to get started quickly on a GPU:
import torch
import requests
from PIL import Image
from tqdm.notebook import tqdm
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

DEVICE = 'cuda:0'

model_id = 'lamm-mit/Cephalo-Idefics-2-vision-10b-alpha'

model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # if your GPU allows
    _attn_implementation="flash_attention_2",  # make sure Flash Attention 2 is installed
    trust_remote_code=True,
).to(DEVICE)

processor = AutoProcessor.from_pretrained(
    model_id,
    do_image_splitting=True,
)
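If bfloat16 or Flash Attention 2 is not available on your GPU, the model can still be loaded; a minimal fallback variant of the call above (the float16 choice here is an assumption, not from the original card):

# Fallback load without Flash Attention 2 (assumption: float16 fits your GPU)
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # bfloat16 needs Ampere or newer; float16 as a fallback
    trust_remote_code=True,     # default attention is used when _attn_implementation is omitted
).to(DEVICE)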
✨ Key Features
- Multimodal integration: combines visual and linguistic data for advanced understanding and interaction.
- Innovative dataset generation: uses advanced algorithms to accurately extract images and their corresponding text descriptions from complex PDF documents, creating high-quality image-text pairs.
- Broad applicability: suited to image captioning, visual question answering, multimodal content generation, and more.
- Strong architecture: combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding.
📦 Installation
To use the Cephalo models, install the required dependencies:
pip install transformers pillow requests tqdm
To accelerate generation with Flash Attention 2, also install flash-attn:
pip install flash-attn
For 4-bit quantization, install accelerate and bitsandbytes:
pip install accelerate bitsandbytes
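With those installed, the model can be loaded in 4 bits. The card does not show this call for the 10b model, so the configuration below is a hedged sketch using the standard transformers BitsAndBytesConfig API, with illustrative default settings:

import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

# Sketch: 4-bit NF4 quantization via bitsandbytes (settings are illustrative defaults)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    'lamm-mit/Cephalo-Idefics-2-vision-10b-alpha',
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",  # accelerate handles device placement for quantized weights
)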
💻 Usage Examples
Basic Usage
from transformers.image_utils import load_image
image = load_image("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg")
# Create inputs
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# Get inputs using the processor
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
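Note that batch_decode here returns the full sequence, prompt included. To keep only the newly generated answer, slice off the prompt tokens first, the same pattern used in the advanced example below:

# Decode only the tokens generated after the prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer)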
Advanced Usage
from IPython.display import display, Markdown, HTML

# ensure_list, is_url, and format_conversation are small helper utilities
# assumed to be defined elsewhere, as in the original notebook.

def ask_about_image(model, processor, question,
                    images_input=[],
                    verbatim=False,
                    temperature=0.1,
                    show_image=False,
                    system="You are a biomaterials scientist who responds accurately. ",
                    init_instr="",
                    show_conversation=True,
                    max_new_tokens=256,
                    messages=[],
                    images=[],
                    use_Markdown=False,
                    ):
    query = question
    images_input = ensure_list(images_input)

    # Load images from URLs if no preloaded images were passed in
    if len(images) == 0:
        if len(images_input) > 0:
            for image in tqdm(images_input):
                if is_url(image):
                    image = load_image(image)
                images.append(image)
                if show_image:
                    display(image)

    if len(messages) == 0:
        base_message = {
            "role": "user",
            "content": [
                {"type": "text", "text": system + init_instr},
                # Image messages will be added dynamically here
                {"type": "text", "text": query}
            ]
        }
        # Ensure images_input is a list
        images_input = ensure_list(images_input)

        # Add image messages dynamically
        image_messages = [{"type": "image"} for _ in images_input]
        base_message["content"][1:1] = image_messages  # insert image messages before the final text message

        # Append the constructed message to the messages list
        messages.append(base_message)
    else:
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query}
                ]
            }
        )

    if verbatim:
        print(messages)

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[text.strip()], images=images, return_tensors="pt", padding=True).to(DEVICE)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                   temperature=temperature, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt
    generated_texts = processor.batch_decode(generated_ids[:, inputs["input_ids"].size(1):],
                                             skip_special_tokens=True)

    messages.append(
        {
            "role": "assistant",
            "content": [{"type": "text", "text": generated_texts[0]}, ]
        }
    )

    formatted_conversation = format_conversation(messages, images)

    # Display the formatted conversation, e.g. in a Jupyter notebook
    if show_conversation:
        if use_Markdown:
            display(Markdown(formatted_conversation))
        else:
            display(HTML(formatted_conversation))

    return generated_texts, messages, images
question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."
url1 = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg"

response, messages, images = ask_about_image(model, processor, question,
                                             images_input=[url1,],
                                             temperature=0.1,
                                             system='',
                                             init_instr='You carefully study the image and provide detailed answers. Think step-by-step.\n\n',
                                             show_conversation=True,
                                             max_new_tokens=512, messages=[], images=[])
📚 Documentation
Chat Format
The lamm-mit/Cephalo-Idefics-2-vision-10b-alpha model works with one or more image inputs, using prompts in the following chat format:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant:
For multi-turn conversations, format the prompt as follows:
User: You carefully study the image, and respond accurately, but succinctly. Think step-by-step.
<image>What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI.<end_of_utterance>
Assistant: The image depicts ants climbing a vertical surface using their legs and claws. This behavior is observed in nature and can inspire the design of multi-agent AI systems that mimic the coordinated movement of these insects. The relevance lies in the potential application of such systems in robotics and materials science, where efficient and adaptive movement is crucial.<end_of_utterance>
User: How could this be used to design a fracture resistant material?<end_of_utterance>
Assistant:
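The same multi-turn exchange can be built programmatically and rendered by the processor; a minimal sketch that reuses the model, processor, and image from the examples above (the elided assistant text stands in for the full first answer shown above):

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "You carefully study the image, and respond accurately, but succinctly. Think step-by-step."},
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "The image depicts ants climbing a vertical surface using their legs and claws. ..."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "How could this be used to design a fracture resistant material?"},
    ]},
]

# Render the multi-turn prompt and generate the next assistant turn
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=500)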
To set the chat template manually:
from transformers import AutoTokenizer

IDEFICS2_CHAT_TEMPLATE = "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"

base_model_id = model_id  # assumption: the tokenizer comes from the same repo as the model above
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
tokenizer.chat_template = IDEFICS2_CHAT_TEMPLATE
processor.tokenizer = tokenizer
🔧 Technical Details
The development of the Cephalo models features an innovative dataset-generation method: advanced algorithms accurately detect and separate images and their corresponding text descriptions from complex PDF documents. Images and captions are extracted from PDFs to create well-matched image-text pairs, which are then refined and validated with LLM-based natural-language processing, ensuring high-quality, contextually relevant training data.
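The card does not publish the extraction pipeline itself; the following is a minimal illustrative sketch of the general idea using PyMuPDF (the library choice and the naive "caption line on the same page" heuristic are assumptions, not the authors' method):

import fitz  # PyMuPDF

def extract_image_caption_pairs(pdf_path):
    """Naive sketch: pair each embedded image with a caption-like line on its page."""
    pairs = []
    doc = fitz.open(pdf_path)
    for page in doc:
        images = page.get_images(full=True)
        if not images:
            continue
        text = page.get_text()
        # Heuristic: lines that look like figure captions on the same page
        captions = [line for line in text.splitlines()
                    if line.strip().lower().startswith(("figure", "fig."))]
        for (xref, *_), caption in zip(images, captions):
            image_bytes = doc.extract_image(xref)["image"]
            pairs.append((image_bytes, caption))
    return pairs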
The model architecture combines a vision encoder model with an autoregressive transformer to handle complex natural-language understanding. This version, lamm-mit/Cephalo-Idefics-2-vision-10b-alpha, is based on a merged and expanded combination of the https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta and HuggingFaceM4/idefics2-8b-chatty models, with added model depth focused on learning more complex representations and associations.
Model training proceeds in several stages:
- Fine-tune the HuggingFaceM4/idefics2-8b-chatty model to train https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta.
- Merge the https://huggingface.co/lamm-mit/Cephalo-Idefics-2-vision-8b-beta decoder with the last 8 layers of the HuggingFaceM4/idefics2-8b-chatty decoder (a conceptual sketch of such a merge appears below).
- Fine-tune the merged model, which now has 40 decoder layers and a total of 10 billion parameters.
The model was trained on a combination of scientific text-image data extracted from Wikipedia and scientific papers.
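The exact merging procedure is not given in the card; below is a conceptual sketch of appending the donor's last 8 decoder layers using transformers (module paths follow the public Idefics2 implementation; treat this as illustrative, not the authors' script):

import torch
from transformers import Idefics2ForConditionalGeneration

base = Idefics2ForConditionalGeneration.from_pretrained(
    "lamm-mit/Cephalo-Idefics-2-vision-8b-beta", torch_dtype=torch.bfloat16)
donor = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b-chatty", torch_dtype=torch.bfloat16)

# Append the donor's last 8 decoder layers to the base text model's layer stack
layers = base.model.text_model.layers  # nn.ModuleList of decoder layers
for layer in donor.model.text_model.layers[-8:]:
    layers.append(layer)

# Keep the config consistent with the new depth (40 layers after the merge);
# per-layer indices used for KV caching may also need re-numbering
base.config.text_config.num_hidden_layers = len(layers)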
📄 License
This project is licensed under the Apache 2.0 license.
Citation
Please cite as follows:
@article{Buehler_Cephalo_2024,
title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
author={Markus J. Buehler},
journal={arXiv preprint arXiv:2405.19076},
year={2024}
}








