license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- SAIL
【📖 GitHub】
【📜 Paper】
【🚀 Quick Start】
## Introduction

SAIL is a single-transformer model designed for vision and language. As a unified multimodal large language model (MLLM), it integrates raw pixel encoding and language decoding within a single architecture. Without relying on a pretrained vision encoder, SAIL delivers strong performance across a wide range of vision-language tasks, and its visual representations are competitive with state-of-the-art vision models on tasks such as semantic segmentation.
## Models

| Model Name | Hugging Face Link |
|---|---|
| SAIL-7B | ❤️ Link |
## Quick Start

We provide example code for running SAIL:
```python
import copy

import torch
from transformers import DynamicCache, GenerationConfig

# Helper utilities (model loading, image patching, prompt assembly, masks)
# come from the example script shipped with the repository.
from example import *

NON_VISION_TOKEN_ID = -1

PATH_TO_MODEL = "path/to/model"
PATH_TO_TOKENIZER = "path/to/tokenizer"
IMAGE_PATH = "path/to/image"
PROMPT = "your prompt here"
# Load the model and tokenizer, then move the model to the GPU
model, tokenizer = get_transformer_and_tokenizer(
    PATH_TO_MODEL,
    PATH_TO_TOKENIZER
)
model = model.cuda()

# Convert an image file into a grid of raw pixel patches of size vision_patch_size
image_processor = lambda x: convert_image_base64_to_patches(load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None)
# Build the text prompt and the image token sequence for an nh x nw patch grid
prompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)
image_path = IMAGE_PATH
image_patches = image_processor(image_path)
nh, nw = image_patches.shape[:2]
image_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)

# Tokenize the concatenated image + text sequence
input_tokens = image_tokens + prompt_inp
input_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors="pt").input_ids

# Map each vision-patch token position to its patch index; text tokens keep NON_VISION_TOKEN_ID
vision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)
vision_patches = image_patches.view(nh * nw, -1)
assert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)
assert (input_ids >= tokenizer.vis_beg_tok_id).sum() == image_tokens_len
vision_patch_indices[input_ids == tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))

# Prefix attention mask over the image tokens and multimodal position ids
attention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)
position_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1)
# Move everything to the GPU with the expected dtypes
input_ids = input_ids.long().cuda()
vision_patch_indices = vision_patch_indices.long().cuda()
vision_patches = vision_patches.to(torch.bfloat16).cuda()
position_ids = position_ids.long().cuda()
attention_mask = attention_mask.cuda()
padding_attention_mask = torch.ones_like(input_ids).cuda()

# Full-sequence inputs used for generation
inputs = dict(
    input_ids=input_ids,
    position_ids=position_ids,
    attention_mask=padding_attention_mask,
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices,
    use_cache=True
)

# Image-prefix-only inputs used to pre-fill the KV cache
cached_inputs = dict(
    input_ids=input_ids[:, :image_tokens_len],
    position_ids=position_ids[:, :, :image_tokens_len],
    attention_mask=attention_mask[:, :, :image_tokens_len, :image_tokens_len],
    vision_patches=vision_patches,
    vision_patch_indices=vision_patch_indices[:, :image_tokens_len],
    use_cache=True
)
# Pre-fill the KV cache with the image-prefix pass
prefix_cache = DynamicCache()
with torch.no_grad():
    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values

# Generate from a copy of the prefix cache so the original can be reused
past_key_values = copy.deepcopy(prefix_cache)
generate_config = GenerationConfig(
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_attentions=False
)
generated = model.generate(
    **inputs,
    past_key_values=past_key_values,
    generation_config=generate_config
)

# Decode only the newly generated tokens
generated_ids = generated['sequences'][:, input_ids.size(1):]
response = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f"\nModel Response: ===\n{response}\n===")
```
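A note on the structure of the example: the image tokens are treated as a prefix. `create_single_prefix_mask` builds a prefix-style attention mask over the first `image_tokens_len` positions, and that prefix is run through the model once to pre-fill a `DynamicCache`. Because the cache is deep-copied before being handed to `generate`, `prefix_cache` itself is left untouched, so (as the example suggests) it can be reused for additional prompts about the same image without re-running the image prefix.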
## Citation

If you use this project in your research, please consider citing:
```bibtex
@article{lei2025sail,
  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},
  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},
  journal={arXiv preprint arXiv:2504.10462},
  year={2025}
}
```