Llama JoyCaption Alpha Two FP8-Compressed Open-Source Model - Free to Deploy, Efficient to Use with Compatible Frameworks

Llama Joycaption Alpha Two Hf Llava FP8 Dynamic

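Developed by JKCHSTR

This is an FP8-compressed version of the Llama JoyCaption Alpha Two model developed by fancyfeast. It was produced with the llm-compressor tool and is compatible with the vllm framework.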

Image-to-Text

Safetensors

Language: English | License: MIT | #FP8 dynamic quantization #Image captioning #Llava architecture

Downloads: 248

Released: 2/18/2025

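Model Overview

This model is an FP8-compressed version of Llama JoyCaption Alpha Two. It is intended mainly for image captioning and can generate detailed descriptions of images.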

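Model Features

FP8 dynamic quantization

An FP8-compressed version built with the llm-compressor tool, which reduces model size while maintaining performance

vllm compatibility

Compatible with the vllm framework after compression, which makes deployment and use easier

Image captioning

Generates detailed, professional descriptions of input images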
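
To make "FP8 dynamic quantization" more concrete, here is a minimal pure-Python sketch of the idea. It assumes the common E4M3 flavour of FP8 (3 mantissa bits, largest finite value 448) and a per-row scale computed at run time. The helper names `round_to_e4m3`, `quantize_dynamic`, and `dequantize` are illustrative only; real implementations run on tensors inside GPU kernels and may differ in detail.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format (assumed format)


def round_to_e4m3(v: float) -> float:
    """Round a float to the nearest FP8 E4M3 value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), FP8_E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_dynamic(row):
    """Quantize one activation row using a scale derived from that row itself."""
    amax = max(abs(x) for x in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(x / scale) for x in row], scale


def dequantize(quantized, scale):
    """Map FP8 values back to the original range."""
    return [q * scale for q in quantized]


row = [0.02, -1.5, 3.0, 12.0]
quantized, scale = quantize_dynamic(row)
restored = dequantize(quantized, scale)
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(row, restored))
print(f"scale={scale:.6f}, max relative error={max_rel_err:.4%}")
```

"Dynamic" here means the scale is recomputed for every input row instead of being fixed ahead of time from calibration data, so the largest value in each row always maps to the top of the FP8 range.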

Model Capabilities

Image analysis

Text generation

Multimodal processing

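Use Cases

Content generation

Image caption generation

Generates formal, detailed descriptions of images

Produces caption text that meets the stated requirements

Assistive tools

Visual assistance

Provides descriptions of image content for visually impaired users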

🚀 Llama Joycaption Alpha Two hf Llava FP8 Dynamic

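This is an FP8-compressed version of the Llama JoyCaption Alpha Two model made by fancyfeast. It was produced with the llm-compressor tool and is compatible with vllm.

I have tested this model myself. Unfortunately, my testing method may not be rigorous, but the model works well for my use case.

All credit goes to fancyfeast. See the official model page for more details.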
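
This page does not document the recipe used for the compression, so the following is only a sketch of how an FP8 dynamic checkpoint of a Llava-style model might be produced with llm-compressor. The upstream repository id, the output directory, and the choice of layers to skip are all assumptions, not the author's confirmed settings.

```python
# Sketch only: the recipe actually used for this checkpoint is not documented here.
SOURCE_MODEL = "fancyfeast/llama-joycaption-alpha-two-hf-llava"  # assumed upstream repo id
SAVE_DIR = "llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"  # hypothetical output directory

# Assumed choice: quantize the Linear layers of the language model and leave the
# output head, vision tower, and multimodal projector in their original precision.
RECIPE_KWARGS = {
    "targets": "Linear",
    "scheme": "FP8_DYNAMIC",
    "ignore": ["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
}


def compress(source_model: str = SOURCE_MODEL, save_dir: str = SAVE_DIR) -> None:
    """Load the original model, apply FP8 dynamic quantization, and save it."""
    # Heavy imports stay inside the function so this file can be imported
    # without the quantization stack installed.
    from transformers import AutoProcessor, LlavaForConditionalGeneration
    from llmcompressor.modifiers.quantization import QuantizationModifier

    try:
        from llmcompressor import oneshot  # import path in newer releases
    except ImportError:
        from llmcompressor.transformers import oneshot  # import path in older releases

    model = LlavaForConditionalGeneration.from_pretrained(source_model, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(source_model)

    # The FP8_DYNAMIC scheme needs no calibration dataset: weight scales are derived
    # from the weights, and activation scales are computed at run time.
    oneshot(model=model, recipe=QuantizationModifier(**RECIPE_KWARGS))

    model.save_pretrained(save_dir)
    processor.save_pretrained(save_dir)
```

Calling `compress()` downloads the full-precision model, so it needs enough memory to hold it. The resulting directory can then be loaded the same way as the published checkpoint.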

🚀 Quick Start

Usage is the same as for the Llama JoyCaption Alpha Two model:

You need the compressed-tensors library to run the following code in FP8 mode.
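
Before running the usage example, it can help to confirm that the required packages are importable. The sketch below is an optional convenience and not part of the original instructions; the helper name `missing_packages` is hypothetical, and the list of import names is an assumption based on the imports the example uses.

```python
import importlib.util


def missing_packages(import_names=("compressed_tensors", "transformers", "torch", "PIL")):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Note that the import name `compressed_tensors` corresponds to the `compressed-tensors` package on PyPI, and `PIL` is provided by the `pillow` package.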

💻 Usage Examples

Basic usage

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"


# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load image
    image = Image.open(IMAGE_PATH)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of processor.apply_chat_template() and processor() works,
    # but if using other combinations, always inspect the final input_ids to ensure they are correct. Often you will end up with multiple <bos> tokens
    # if not careful, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the captions
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.6,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
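
The example above uses transformers. Because this checkpoint is described as compatible with vllm, the following is a hedged sketch of how the same prompt might be run through vllm's offline chat interface. It has not been verified against this model: the helper names, the `max_model_len` value, and the assumption that the model's chat template places the image correctly under vllm are all illustrative. The sampling settings mirror the transformers example.

```python
import base64
from pathlib import Path

MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"
PROMPT = "Write a long descriptive caption for this image in a formal tone."


def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL."""
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{data}"


def build_messages(image_url: str, prompt: str = PROMPT) -> list:
    """Build an OpenAI-style conversation with one image and one text part."""
    return [
        {"role": "system", "content": "You are a helpful image captioner."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        },
    ]


def caption_with_vllm(image_path: str) -> str:
    """Generate one caption with vllm (requires a working vllm GPU install)."""
    from vllm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(model=MODEL_NAME, max_model_len=4096)  # max_model_len is an assumed value
    params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)
    outputs = llm.chat(build_messages(image_to_data_url(image_path)), params)
    return outputs[0].outputs[0].text.strip()
```

With vllm installed, `print(caption_with_vllm("image.jpg"))` would be the equivalent of the transformers example. If the output looks wrong, inspecting the rendered prompt for duplicated `<bos>` tokens, as the warning in that example suggests, is a reasonable first step.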