finance-Llama3-8B: an open-source finance model, free to deploy, that matches or even exceeds Llama3-70B on finance tasks


Finance Llama3 8B

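Developed by instruction-pretrain

A finance-domain model built on Llama3-8B. The instruction pre-training framework strengthens its domain adaptation, so that on finance tasks it matches or even exceeds Llama3-70B.

Large language model

Transformers

English #finance-domain-adaptation #instruction-augmented-pretraining #multitask-learning

Downloads: 1,200

Released: 6/18/2024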

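Model overview

The model is continually pre-trained under the instruction pre-training framework, in which the raw corpus is augmented with synthesized instruction-response pairs, to specialize it for finance-domain tasks.

Model features

Instruction pre-training framework

Augmenting large raw corpora with synthesized instruction-response pairs markedly improves performance on domain tasks.

Domain adaptation

After continued pre-training on finance-domain data, the 8B-parameter model can match or even exceed the original 70B model on finance tasks.

Multi-stage training

The framework applies to both pre-training from scratch and continued pre-training, and outperforms vanilla pre-training in both settings.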

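Model capabilities

Financial text understanding

Financial question answering

Financial data analysis

Financial report generation

Use cases

Financial analysis

Securities information lookup

Parses the securities registration information of listed companies and answers complex queries about debt securities.

Accurately maps trading symbols to security types.

Financial report analysis

Understands and extracts key figures from financial reports.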

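🚀 Instruction Pre-Training: Language Models are Supervised Multitask Learners (EMNLP 2024)

This repo contains the finance model developed from Llama3-8B in our paper Instruction Pre-Training: Language Models are Supervised Multitask Learners. We explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. Instruction Pre-Training outperforms vanilla pre-training in both general pre-training from scratch and domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.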

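✨ Key features

Proposes the Instruction Pre-Training framework, which scalably augments raw corpora with instruction-response pairs.
Instruction-response pairs are generated by an efficient instruction synthesizer.
Outperforms vanilla pre-training in both general pre-training and domain-adaptive continual pre-training.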

📦 Installation

To evaluate any Huggingface LM on domain-specific tasks, set up the dependencies as follows:

git clone https://github.com/microsoft/LMOps
cd LMOps/adaptllm
pip install -r requirements.txt

💻 Usage examples

Basic usage

Chat with the finance-Llama3-8B model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("instruction-pretrain/finance-Llama3-8B")
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/finance-Llama3-8B")

# Put your input here, NO prompt template is required
user_input = '''Use this fact to answer the question: Title of each class Trading Symbol(s) Name of each exchange on which registered
Common Stock, Par Value $.01 Per Share MMM New York Stock Exchange
MMM Chicago Stock Exchange, Inc.
1.500% Notes due 2026 MMM26 New York Stock Exchange
1.750% Notes due 2030 MMM30 New York Stock Exchange
1.500% Notes due 2031 MMM31 New York Stock Exchange

Which debt securities are registered to trade on a national securities exchange under 3M's name as of Q2 of 2023?'''

inputs = tokenizer(user_input, return_tensors="pt", add_special_tokens=True).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=400)[0]

answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(pred)
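
The example input above follows a simple layout: a fact block introduced by "Use this fact to answer the question:", a blank line, then the question. A minimal sketch of a helper that assembles inputs in that layout might look like the following. Note that `build_prompt` is a hypothetical name, not part of the released code, and the layout is inferred only from the single example above.

```python
def build_prompt(fact: str, question: str) -> str:
    """Assemble a plain-text input in the layout of the example above.

    Hypothetical helper: no chat or prompt template is applied, because
    the model card states that none is required.
    """
    return (
        "Use this fact to answer the question: "
        f"{fact.strip()}\n\n{question.strip()}"
    )


prompt = build_prompt(
    "1.500% Notes due 2026 MMM26 New York Stock Exchange",
    "Which exchange lists the 2026 notes?",
)
print(prompt)
```

The resulting string can be passed as `user_input` in the snippet above.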

Advanced usage

Evaluate any Huggingface LM on domain-specific tasks (💡 new!):

# Select the domain from ['biomedicine', 'finance']
DOMAIN='finance'
  
# Specify any Huggingface LM name (Not applicable to models requiring specific prompt templates)
MODEL='instruction-pretrain/finance-Llama3-8B'
  
# Model parallelization:
# - Set MODEL_PARALLEL=False if the model fits on a single GPU. 
#   We observe that LMs smaller than 10B always meet this requirement.
# - Set MODEL_PARALLEL=True if the model is too large and encounters OOM on a single GPU.
MODEL_PARALLEL=False
  
# Choose the number of GPUs from [1, 2, 4, 8]
N_GPU=1
  
# Whether to add a BOS token at the beginning of the prompt input:
# - Set to False for AdaptLLM.
# - Set to True for instruction-pretrain models.
# If unsure, we recommend setting it to False, as this is suitable for most LMs.
add_bos_token=True

# Run the evaluation script
bash scripts/inference.sh ${DOMAIN} ${MODEL} ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}
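
When the evaluation is launched from Python rather than from a shell, the same positional arguments can be assembled programmatically. The sketch below only builds and checks the argument list; `build_inference_command` is a hypothetical helper, and the allowed values are taken from the comments in the script above rather than from the script's source.

```python
import shlex

# Allowed values as listed in the comments of the shell snippet above.
ALLOWED_DOMAINS = ("biomedicine", "finance")
ALLOWED_N_GPU = (1, 2, 4, 8)


def build_inference_command(domain, model, add_bos_token, model_parallel, n_gpu):
    """Return the argv list for scripts/inference.sh (hypothetical helper).

    Argument order mirrors the shell call:
    DOMAIN MODEL add_bos_token MODEL_PARALLEL N_GPU
    """
    if domain not in ALLOWED_DOMAINS:
        raise ValueError(f"domain must be one of {ALLOWED_DOMAINS}")
    if n_gpu not in ALLOWED_N_GPU:
        raise ValueError(f"n_gpu must be one of {ALLOWED_N_GPU}")
    return [
        "bash",
        "scripts/inference.sh",
        domain,
        model,
        str(bool(add_bos_token)),   # "True" / "False", as in the shell variables
        str(bool(model_parallel)),
        str(n_gpu),
    ]


cmd = build_inference_command(
    "finance", "instruction-pretrain/finance-Llama3-8B", True, False, 1
)
print(shlex.join(cmd))
```

The list could then be handed to `subprocess.run(cmd, check=True)` from inside the cloned `LMOps/adaptllm` directory, assuming the dependencies from the installation step are in place.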

📚 Detailed documentation

Resources

🤗 We share our data and models with example usages; feel free to open any discussions on this page! 🤗

Thanks to the demo [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) for implementing our approach.
Context-based instruction synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
Fine-tuning data for the synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
General models pre-trained from scratch (on 100B tokens):
- [InstructLM-500M](https://huggingface.co/instruction-pretrain/InstructLM-500M)
- [InstructLM-1.3B](https://huggingface.co/instruction-pretrain/InstructLM-1.3B)
Domain-specific models pre-trained from Llama3-8B:
- [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
- [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
General instruction-augmented corpora: [general-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/general-instruction-augmented-corpora)
Domain-specific instruction-augmented corpora (no finance data, to avoid ethical issues): [medicine-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/medicine-instruction-augmented-corpora)

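Domain-adaptive continued pre-training

Following [AdaptLLM](https://huggingface.co/AdaptLLM/finance-chat), we augment the domain-specific raw corpora with instruction-response pairs generated by our [context-based instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer).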
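
To make the augmentation step concrete, here is a rough sketch of attaching instruction-response pairs to a raw text. It is illustrative only: `augment_raw_text` is a hypothetical function, the "Question:/Answer:" layout is a placeholder rather than the authors' template, and the pair is written by hand here, whereas in the described pipeline it would come from the instruction synthesizer.

```python
from typing import List, Tuple


def augment_raw_text(raw_text: str, pairs: List[Tuple[str, str]]) -> str:
    """Append instruction-response pairs to a raw text (hypothetical sketch).

    The "Question:/Answer:" layout is a placeholder, not the template
    used by the authors.
    """
    blocks = [raw_text.strip()]
    for instruction, response in pairs:
        blocks.append(f"Question: {instruction.strip()}\nAnswer: {response.strip()}")
    # Separate the raw text and each pair with a blank line.
    return "\n\n".join(blocks)


raw = "3M's common stock trades on the New York Stock Exchange under the symbol MMM."
# Hand-written pair standing in for synthesizer output.
pairs = [("Under which symbol does 3M's common stock trade?", "MMM")]

augmented = augment_raw_text(raw, pairs)
print(augmented)
```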

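FAQ

Q1: Do you use the official Llama3 instruction prompt for pre-training? No. The provided Llama3 instruction prompt is designed for the [instruction-tuned model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), whereas our continual pre-training is conducted on the [pre-trained base model](https://huggingface.co/meta-llama/Meta-Llama-3-8B), where only the BOS (<|begin_of_text|>) and EOS (<|end_of_text|>) tokens are required.

Q2: For the general instructions from OpenOrca, do you concatenate each instruction with its output using '\n'? No. As mentioned in the pre-training suggestions, we use a simple whitespace to concatenate each question with its response for the general instruction data from OpenOrca. OpenOrca's data is already templated with diverse natural-language templates (such as those containing \n), so a whitespace is sufficient to formulate the data. Note that when using our templated instruction-augmented texts, you don't need to add any concatenation.

Q3: What about the system prompts in OpenOrca? We simply discard them.

To put it all together, the text before tokenization looks like this: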

general_instruction_response_text = "<|begin_of_text|>{question} {response}<|end_of_text|>"

instruction_augmented_text = "<|begin_of_text|>{instruction augmented text}<|end_of_text|>"

Then, for tokenization, you don't need to add the BOS and EOS token ids. The tokenization code looks like this:

text_ids = tokenizer(text, add_special_tokens=False, **kwargs).input_ids
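
The answers to Q1 through Q3 can be combined into two small string builders. This is a sketch under the assumptions stated in the FAQ above (a single space as the joiner, system prompts dropped, BOS and EOS written into the text); the function names are hypothetical and do not come from the released code.

```python
BOS = "<|begin_of_text|>"
EOS = "<|end_of_text|>"


def general_instruction_text(question: str, response: str, system_prompt: str = "") -> str:
    """Format one OpenOrca-style example (hypothetical helper).

    The system prompt is accepted only to show that it is discarded (Q3);
    question and response are joined by a single space (Q2).
    """
    del system_prompt  # intentionally unused
    return f"{BOS}{question} {response}{EOS}"


def instruction_augmented_text(augmented_text: str) -> str:
    """Wrap an already-templated instruction-augmented text; no joiner is added."""
    return f"{BOS}{augmented_text}{EOS}"


example = general_instruction_text(
    "What is a bond?", "A bond is a debt security.", system_prompt="You are helpful."
)
print(example)
```

Because BOS and EOS are already part of the string, the text is then tokenized with `add_special_tokens=False`, as in the line above.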

📄 License

This project is released under the Llama3 license.

📖 Citation

If you find our work helpful, please cite us: Instruction Pre-Training (EMNLP 2024)

@article{cheng2024instruction,
  title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
  author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
  journal={arXiv preprint arXiv:2406.14491},
  year={2024}
}

Adapt LLM to Domains (ICLR 2024)

@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}

🔧 Technical details

Model type: a finance model developed from Llama3-8B.
Training data: Open-Orca/OpenOrca, GAIR/lima, and WizardLM/WizardLM_evol_instruct_V2_196k, among other datasets.

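🌟 Changelog

2024/11/30: Released the multimodal version of the instruction synthesizer: [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)
2024/9/20: Our paper was accepted to the EMNLP 2024 main conference 🎉
2024/9/11: Updated the [FAQ on continual pre-training from Llama3](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
2024/8/29: Updated the [guidelines on evaluating any 🤗Huggingface model on domain-specific tasks](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
2024/7/31: Updated the pre-training suggestions in the Advanced usage section of the [instruction synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
2024/7/15: We scaled up the pre-training tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M. [Figure: performance trend on downstream tasks throughout the pre-training process]
2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)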