🚀 Llama3-8b Reward Model
This is a reward model based on Llama3-8b, trained with OpenRLHF. OpenRLHF is an efficient Reinforcement Learning from Human Feedback (RLHF) framework; related work is published in the paper REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models.
The model was trained on the OpenLLMAI/preference_700K preference dataset.
Base supervised fine-tuned (SFT) model: OpenRLHF/Llama-3-8b-sft-mixture
🚀 Quick Start
You can use Hugging Face's transformers library to score the quality of a response generated for a given prompt with this model. The input format should match the format the model was trained with (for example, complete conversation turns formatted with the Llama 3 chat template).
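If you want to confirm the exact text the chat template produces before scoring, you can render it as a plain string first. The snippet below is a minimal sketch assuming the tokenizer ships with a chat template (the usage example below relies on the same apply_chat_template call):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenRLHF/Llama-3-8b-rm-mixture")
messages = [
    {"role": "user", "content": "Write a short poem about a cat."},
    {"role": "assistant", "content": "A feline friend, soft and sleek."},
]
# tokenize=False returns the formatted conversation as a string instead of
# token ids, which makes it easy to inspect the exact input layout.
print(tokenizer.apply_chat_template(messages, tokenize=False))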
💻 Usage Example
Basic Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "OpenRLHF/Llama-3-8b-rm-mixture"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

prompt = "Write a short poem about a cat."
response_good = (
    "A feline friend, soft and sleek,\n"
    "Curled up warm, a purring peek.\n"
    "Through sunlit naps and playful chase,\n"
    "Graceful paws in every space."
)
response_bad = "Cats are okay. They sit sometimes. Dog is better."

# Build complete conversation turns so the input matches the training format.
messages_good = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response_good},
]
messages_bad = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response_bad},
]

# apply_chat_template renders each conversation with the model's chat template.
input_ids_good = tokenizer.apply_chat_template(
    messages_good, return_tensors="pt", add_generation_prompt=False
).to(model.device)
input_ids_bad = tokenizer.apply_chat_template(
    messages_bad, return_tensors="pt", add_generation_prompt=False
).to(model.device)

# The reward is the scalar logit the model assigns to the full sequence.
with torch.no_grad():
    score_good = model(input_ids_good).logits.item()
    score_bad = model(input_ids_bad).logits.item()

print(f"Score for good response: {score_good:.2f}")
print(f"Score for bad response: {score_bad:.2f}")
🔧 Technical Details
Training Configuration
Scheduler: cosine
Learning rate: 9e-6
Warmup ratio: 0.03
Batch size: 256
Epochs: 1
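As a rough illustration of how these hyperparameters fit together, the sketch below wires a cosine schedule with 3% warmup to an AdamW optimizer at learning rate 9e-6. It is not the actual OpenRLHF training code; the optimizer choice and the step count derived from ~700K preference pairs at batch size 256 over one epoch are assumptions made purely for illustration.

import torch
from transformers import get_cosine_schedule_with_warmup

# Assumed: ~700K preference pairs (per the dataset name), 1 epoch, batch size 256.
num_training_steps = 700_000 // 256
num_warmup_steps = int(0.03 * num_training_steps)  # warmup ratio 0.03

# Reusing `model` from the usage example above, purely to illustrate the setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-6)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)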
📄 License
This project is licensed under CC BY-NC 4.0.