Llama-xLAM-2-8b-fc-r-GGUF open-source language model: strong multi-turn conversation and function calling

Llama Xlam 2 8b Fc R GGUF

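Developed by Mungert

A GGUF build of Llama-xLAM-2-8b-fc-r, a language model that performs well on multi-turn conversation and function-calling tasks.

Large language model

Transformers

English #Optimized for multi-turn conversation #Function-calling specialist #High-precision quantization

Downloads: 586

Released: 6/21/2025

Model overview

The Llama-xLAM-2-8b-fc-r GGUF model is a fine-tuned and optimized language model for multi-turn conversation and function-calling tasks, released to support researchers.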

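Model features

Multi-turn conversation

Keeps the interaction coherent across multiple turns, identifies user intent accurately, and replies appropriately.

Function-calling optimization

Fine-tuned for function calling, so it can invoke specific tools to complete the task a user asks for.

High-precision quantization

Uses a newer quantization method that raises the precision of key layers, which noticeably improves accuracy at a given quantization level.

Broad compatibility

Fully compatible with vLLM and Transformers-based inference frameworks, so it can be used in different environments.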

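Model capabilities

Multi-turn conversation

Function calling

Text generation

Use cases

Dialogue systems

Customer-service conversations

Multi-turn conversations in customer-service settings: understand the user's problem and offer a solution.

Keeps the conversation coherent and identifies user intent accurately.

Tool calling

Weather lookup

Calls a weather tool in response to a user request and returns the weather for the specified location.

Calls the tool accurately and returns the result.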

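🚀 Llama-xLAM-2-8b-fc-r GGUF model

The Llama-xLAM-2-8b-fc-r GGUF model performs well on multi-turn conversation and function-calling tasks. It has been fine-tuned and optimized, is intended for research and experimentation, and is released to support researchers working in these areas.

🚀 Quick start

This model can be used for multi-turn conversation and function calling. The preparation and basic steps are:

Install the required framework versions, such as Transformers 4.46.1 (or later) and PyTorch 2.5.1+cu124 (or later).
Follow the code in the usage examples to run inference with the Hugging Face chat template or with vLLM.

✨ Key features

Multi-turn conversation: keeps the interaction coherent across multiple turns, identifies user intent accurately, and replies appropriately.
Function-calling optimization: fine-tuned for function calling, so it can invoke specific tools to complete the task a user asks for.
High-precision quantization: uses a newer quantization method that raises the precision of key layers. This increases the model file size but noticeably improves accuracy at a given quantization level.
Broad compatibility: fully compatible with vLLM and Transformers-based inference frameworks, so it can be used in different environments.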

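📦 Installation

Framework versions

Transformers 4.46.1 (or later)
PyTorch 2.5.1+cu124 (or later)
Datasets 3.1.0 (or later)
Tokenizers 0.20.3 (or later)

Install each framework as described in its official documentation. For example, to install a suitable version of Transformers with pip (the quotes stop the shell from treating >= as a redirection):

pip install "transformers>=4.46.1"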
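
Before loading the model, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch, not part of the original model card: the helper names `version_tuple` and `check_environment` are hypothetical, and the comparison only looks at the leading numeric components of each version, so build tags such as `+cu124` are ignored.

```python
import re
from importlib import metadata

# Minimum versions taken from the list in this guide.
MINIMUM_VERSIONS = {
    "transformers": "4.46.1",
    "torch": "2.5.1",
    "datasets": "3.1.0",
    "tokenizers": "0.20.3",
}


def version_tuple(version: str) -> tuple:
    """Turn '2.5.1+cu124' into (2, 5, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment(minimums=MINIMUM_VERSIONS) -> dict:
    """Return a status string per package: 'ok', 'missing', or 'too old (...)'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```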

💻 Usage examples

Basic usage

Interacting with the model through the Hugging Face chat template:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID matches the Llama-xLAM-2-8b-fc-r entry in the model table below
tokenizer = AutoTokenizer.from_pretrained("Salesforce/Llama-xLAM-2-8b-fc-r")
model = AutoModelForCausalLM.from_pretrained("Salesforce/Llama-xLAM-2-8b-fc-r", torch_dtype=torch.bfloat16, device_map="auto")

# Example conversation with a tool call
messages = [
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "Thanks. I am doing well. How can I help you?"},
    {"role": "user", "content": "What's the weather like in London?"},
]

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit of temperature to return"}
            },
            "required": ["location"]
        }
    }
]

print("====== prompt after applying chat template ======")
print(tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False))

inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
input_ids_len = inputs["input_ids"].shape[-1] # Get the length of the input tokens
inputs = {k: v.to(model.device) for k, v in inputs.items()}
print("====== model response ======")
outputs = model.generate(**inputs, max_new_tokens=256)
generated_tokens = outputs[:, input_ids_len:] # Slice the output to get only the newly generated tokens
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
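
The decoded text from the example above is a plain string, so a caller still has to decide whether it is a tool call or an ordinary reply. The sketch below shows one way to do that. It assumes the model emits tool calls as a JSON list of objects with `name` and `arguments` keys; that output shape is an assumption to verify against real generations, and `parse_tool_calls` is a hypothetical helper, not part of Transformers.

```python
import json


def parse_tool_calls(generated_text: str):
    """Return a list of {"name", "arguments"} dicts, or None for a plain-text reply."""
    try:
        payload = json.loads(generated_text.strip())
    except json.JSONDecodeError:
        return None  # Not JSON: treat it as a normal assistant message.

    if isinstance(payload, dict):
        payload = [payload]  # Tolerate a single call that is not wrapped in a list.
    if not isinstance(payload, list):
        return None

    calls = []
    for item in payload:
        # Every entry must at least name the function to call.
        if not (isinstance(item, dict) and isinstance(item.get("name"), str)):
            return None
        calls.append({"name": item["name"], "arguments": item.get("arguments", {})})
    return calls or None


if __name__ == "__main__":
    sample = '[{"name": "get_weather", "arguments": {"location": "London"}}]'
    print(parse_tool_calls(sample))
    print(parse_tool_calls("I am doing well, thanks!"))
```

Returning `None` for anything that is not a well-formed call keeps the calling code simple: a list means "run these tools", `None` means "show the text to the user".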

Advanced usage

Running inference with vLLM:

Install vLLM

pip install "vllm>=0.6.5"

Download the tool-parser plugin

wget https://huggingface.co/Salesforce/xLAM-2-1b-fc-r/raw/main/xlam_tool_call_parser.py

Start an OpenAI API-compatible endpoint

vllm serve Salesforce/xLAM-2-1b-fc-r \
  --enable-auto-tool-choice \
  --tool-parser-plugin ./xlam_tool_call_parser.py \
  --tool-call-parser xlam \
  --tensor-parallel-size 1

Test tool use

import openai
import json

# Configure the client to use your local vLLM endpoint
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # Default vLLM server URL
    api_key="empty"  # Can be any string
)

# Define a tool/function
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The unit of temperature to return"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Create a chat completion
response = client.chat.completions.create(
    model="Salesforce/xLAM-2-1b-fc-r",  # Model name doesn't matter, vLLM uses the served model
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can use tools."},
        {"role": "user", "content": "What's the weather like in San Francisco?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Print the response
print("Assistant's response:")
print(json.dumps(response.model_dump(), indent=2))
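
The request above only prints what the model asked for. To finish the exchange, the requested tool has to be run locally and its result sent back as a `tool` message. The following is a minimal sketch of that step using plain dictionaries in the OpenAI chat-completions message shape. The `get_weather` body, the `LOCAL_TOOLS` registry, and `run_tool_calls` are hypothetical illustrations; a real `get_weather` would query a weather service.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"location": location, "unit": unit, "temperature": None}


# Maps the tool names advertised to the model onto local Python callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict, registry=LOCAL_TOOLS) -> list:
    """Execute each requested tool and return the messages to append to the chat."""
    follow_up = [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        # The API delivers arguments as a JSON-encoded string.
        arguments = json.loads(call["function"]["arguments"] or "{}")
        if name in registry:
            result = registry[name](**arguments)
        else:
            result = {"error": f"unknown tool: {name}"}
        follow_up.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return follow_up


if __name__ == "__main__":
    message = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "San Francisco, CA"}',
                },
            }
        ],
    }
    print(json.dumps(run_tool_calls(message), indent=2))
```

With the client from the example above, the assistant message could be obtained with `response.choices[0].message.model_dump(exclude_none=True)`, the returned list appended to the original `messages`, and `client.chat.completions.create` called again so the model can phrase the final answer.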

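📚 Detailed documentation

Model series

Property	Details
Model type	Llama-xLAM-2-70b-fc-r, Llama-xLAM-2-8b-fc-r, xLAM-2-32b-fc-r, xLAM-2-3b-fc-r, xLAM-2-1b-fc-r, and others
Training data	Salesforce/APIGen-MT-5k, Salesforce/xlam-function-calling-60k

Model parameters and download links

Model	Total parameters	Context length	Category	Download model	Download GGUF files
Llama-xLAM-2-70b-fc-r	70B	128k	Multi-turn conversation, function calling	🤗 Link	NA
Llama-xLAM-2-8b-fc-r	8B	128k	Multi-turn conversation, function calling	🤗 Link	🤗 Link
xLAM-2-32b-fc-r	32B	32k (max 128k)*	Multi-turn conversation, function calling	🤗 Link	NA
xLAM-2-3b-fc-r	3B	32k (max 128k)*	Multi-turn conversation, function calling	🤗 Link	🤗 Link
xLAM-2-1b-fc-r	1B	32k (max 128k)*	Multi-turn conversation, function calling	🤗 Link	🤗 Link

*Note: the Qwen-2.5-based models have a default context length of 32k, but techniques such as YaRN can extend it to a maximum of 128k. For details, see the linked reference.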
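
As an illustration of the YaRN note, the sketch below adds a `rope_scaling` entry to a model configuration dictionary. The key names and the factor of 4.0 follow the Qwen2.5 long-context guidance as I understand it and should be checked against the linked reference; `with_yarn_scaling` is a hypothetical helper. It applies only to the Qwen-2.5-based models in the table, not to the Llama-based ones, which already list a 128k context.

```python
import copy


def with_yarn_scaling(config: dict, factor: float = 4.0, original_context: int = 32768) -> dict:
    """Return a copy of a config dict with a YaRN rope_scaling entry added.

    factor * original_context gives the extended context length,
    e.g. 4.0 * 32768 = 131072 tokens (128k).
    """
    patched = copy.deepcopy(config)
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_context,
    }
    return patched


if __name__ == "__main__":
    base = {"max_position_embeddings": 32768}
    print(with_yarn_scaling(base))
```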

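Model suffixes

The -fc suffix means the model has been fine-tuned for function-calling tasks.
The -r suffix means this is a research release.

Compatibility

✅ All models are fully compatible with vLLM and Transformers-based inference frameworks.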

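🔧 Technical details

Model generation

This model was generated using llama.cpp at commit 6adc3c3e.

Quantization method

Testing showed that standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, the --tensor-type option in llama.cpp is used to manually raise the precision of important layers. The implementation is linked here: Layer bumping with llama.cpp. This approach increases the model file size, but it noticeably improves accuracy at a given quantization level.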
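
To make the layer-bumping idea concrete, the sketch below assembles a `llama-quantize` command line that applies a base quantization type and overrides selected tensors with `--tensor-type`. It only builds the argument list and does not run anything. The file names, the `attn_v=q6_k` override, and the `build_quantize_command` helper are hypothetical; the source does not say which layers were actually bumped or to what type, so treat every value here as a placeholder.

```python
import shlex


def build_quantize_command(source_gguf, output_gguf, base_type, tensor_overrides,
                           imatrix=None, binary="llama-quantize"):
    """Build an argv list for llama-quantize with per-tensor type overrides."""
    command = [binary]
    if imatrix:
        command += ["--imatrix", imatrix]
    # Each override keeps the matching tensors at a higher-precision type
    # than the base quantization level.
    for tensor_pattern, quant_type in tensor_overrides.items():
        command += ["--tensor-type", f"{tensor_pattern}={quant_type}"]
    command += [source_gguf, output_gguf, base_type]
    return command


if __name__ == "__main__":
    argv = build_quantize_command(
        "model-bf16.gguf",        # hypothetical input file
        "model-q4_k_m.gguf",      # hypothetical output file
        "Q4_K_M",
        {"attn_v": "q6_k"},       # hypothetical override
        imatrix="imatrix.dat",    # hypothetical importance-matrix file
    )
    print(shlex.join(argv))
```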

📄 License

Other notes

Benchmark results

Berkeley Function-Calling Leaderboard (BFCL v3)

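[Figure: BFCL results] Performance comparison of different models on the BFCL leaderboard. The ranking is based on overall accuracy, which is a weighted average across the evaluation categories. "FC" denotes function-calling mode, as opposed to using a customized "prompt" to extract the function calls.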

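τ-bench benchmark

[Figure: τ-bench results] Success rate (pass@1) on the τ-bench benchmark, averaged over at least 5 trials. Our xLAM-2-70b-fc-r model reaches an overall success rate of 56.2% on τ-bench, well above the base Llama 3.1 70B Instruct model (38.2%) and other open-source models such as DeepSeek v3 (40.6%). Notably, our best model even outperforms proprietary models such as GPT-4o (52.9%) and approaches the performance of Claude 3.5 Sonnet (new) (60.1%).

[Figure: Pass^k curves] Pass^k curves measure the probability that a given task succeeds in all 5 independent trials, averaged over all tasks in the τ-retail (left) and τ-airline (right) domains. Higher values indicate better model consistency.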
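
For readers who want to reproduce a consistency curve like the one described above, the sketch below computes pass^k from per-task trial outcomes. It assumes the combinatorial estimator commonly used for this metric, C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; the function names are hypothetical and the trial data in the example is made up for illustration, not taken from the reported results.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from `trials` all succeeded."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results, k: int) -> float:
    """Average pass^k over tasks; `results` holds one list of booleans per task."""
    scores = [pass_hat_k(sum(task), len(task), k) for task in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative outcomes for two tasks with 5 trials each (not real data).
    outcomes = [
        [True, True, True, True, True],
        [True, True, True, True, False],
    ]
    for k in range(1, 6):
        print(k, mean_pass_hat_k(outcomes, k))
```

With k = 1 the metric reduces to the ordinary success rate, and with k equal to the number of trials it is 1 only for tasks that succeeded every time, which is why the curve falls as k grows for inconsistent models.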

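Ethical considerations

This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly affect people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

Citation

If you use our model or dataset in your work, please cite our paper: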

@article{prabhakar2025apigen,
  title={APIGen-MT: Agentic PIpeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay},
  author={Prabhakar, Akshara and Liu, Zuxin and Zhu, Ming and Zhang, Jianguo and Awalgaonkar, Tulika and Wang, Shiyu and Liu, Zhiwei and Chen, Haolin and Hoang, Thai and others},
  journal={arXiv preprint arXiv:2504.03601},
  year={2025}
}

Please also check out our other related work on the xLAM series and consider citing it:

@article{zhang2025actionstudio,
  title={ActionStudio: A Lightweight Framework for Data and Training of Action Models},
  author={Zhang, Jianguo and Hoang, Thai and Zhu, Ming and Liu, Zuxin and Wang, Shiyu and Awalgaonkar, Tulika and Prabhakar, Akshara and Chen, Haolin and Yao, Weiran and Liu, Zhiwei and others},
  journal={arXiv preprint arXiv:2503.22673},
  year={2025}
}

@article{zhang2024xlam,
  title={xLAM: A Family of Large Action Models to Empower AI Agent Systems},
  author={Zhang, Jianguo and Lan, Tian and Zhu, Ming and Liu, Zuxin and Hoang, Thai and Kokane, Shirley and Yao, Weiran and Tan, Juntao and Prabhakar, Akshara and Chen, Haolin and others},
  journal={arXiv preprint arXiv:2409.03215},
  year={2024}
}

@article{liu2024apigen,
  title={Apigen: Automated pipeline for generating verifiable and diverse function-calling datasets},
  author={Liu, Zuxin and Hoang, Thai and Zhang, Jianguo and Zhu, Ming and Lan, Tian and Tan, Juntao and Yao, Weiran and Liu, Zhiwei and Feng, Yihao and RN, Rithesh and others},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={54463--54482},
  year={2024}
}

@article{zhang2024agentohana,
  title={AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning},
  author={Zhang, Jianguo and Lan, Tian and Murthy, Rithesh and Liu, Zhiwei and Yao, Weiran and Tan, Juntao and Hoang, Thai and Yang, Liangwei and Feng, Yihao and Liu, Zuxin and others},
  journal={arXiv preprint arXiv:2402.15506},
  year={2024}
}

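Invitation to test the models

If you find these models useful, please help test the AI-powered Quantum Network Monitor Assistant with quantum-ready security checks: 👉 Quantum Network Monitor

The full open-source code for the Quantum Network Monitor service is available in the GitHub repositories whose names contain NetworkMonitor: Quantum Network Monitor source code. If you want to quantize models yourself, the code for that is in GGUFModelBuilder.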

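How to test

Choose an AI assistant type:
- TurboLLM (GPT-4.1-mini)
- HugLLM (Hugging Face open-source models)
- TestLLM (experimental, CPU-only)

What is being tested

The goal is to push the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services.
How small a model can be while still handling:
- Automated Nmap security scans.
- Quantum-readiness checks.
- Network monitoring tasks.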

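Assistant characteristics

🟡 TestLLM: the current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):
- ✅ Zero-configuration setup.
- ⏳ 30-second load time (inference is slow, but there are no API costs). Because the cost is low, there is no token limit.
- 🔧 Help wanted! If you are interested in edge-device AI, let's collaborate!
🟢 TurboLLM: uses gpt-4.1-mini:
- Performs very well, but OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on the Quantum Network Monitor agents.
- Real-time network diagnostics and monitoring.
- Security audits.
- Penetration testing (Nmap/Metasploit).
🔵 HugLLM: the latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs well with the latest models hosted on Novita.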

Example test commands

"Give me info on my website's SSL certificate"
"Check if my server is using quantum safe encryption for communication"
"Run a comprehensive security audit on my server"
"Create a cmd processor to .. (whatever you want)" Note that you need to install a Quantum Network Monitor agent to run the .NET code. This is a very flexible and powerful feature, so use it with caution!