Model Details

SFT: based on meta-llama/Llama-2-7b-hf, trained on merged Alpaca datasets
DPO: trained on top of the SFT model as a LoRA adapter, with merged hh-rlhf data
PPO: trained on top of the DPO model and a reward model, using multiple adapters, with PKU-SafeRLHF data for further RLHF
Trained with DeepSpeed ZeRO-1 + TRL + QLoRA + Flash-Attention 2
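
As an illustration of the DeepSpeed ZeRO-1 part of this stack, the sketch below builds a minimal DeepSpeed config with ZeRO stage 1 enabled. It is not the configuration used for this model: the helper name `build_zero1_config`, the bf16/fp16 choice, and the "auto" values (which rely on the Hugging Face Trainer integration to fill them in) are all assumptions.

```python
import json


def build_zero1_config(use_bf16: bool = True) -> dict:
    """Return a minimal DeepSpeed config dict with ZeRO stage 1.

    ZeRO stage 1 shards only the optimizer states across data-parallel ranks.
    Every value here is an illustrative assumption, not the card's real config.
    """
    return {
        "zero_optimization": {"stage": 1},
        # Enable exactly one mixed-precision mode (assumed; pick what the GPU supports).
        "bf16": {"enabled": use_bf16},
        "fp16": {"enabled": not use_bf16},
        # "auto" lets the Hugging Face Trainer integration fill these in.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }


if __name__ == "__main__":
    # Print the config as JSON so it can be saved to a file and passed to the trainer.
    print(json.dumps(build_zero1_config(), indent=2))
```

The printed JSON could be saved to a file and referenced from the training arguments; the file name and the remaining training settings are left to the reader.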

Model and Training Details

Finetuned from model: meta-llama/Llama-2-7b-hf
Dataset:
- SFT (mixed train):
  - yahma/alpaca-cleaned
  - vicgalle/alpaca-gpt4
- DPO (mixed train):
  - Anthropic/hh-rlhf
  - Unified-Language-Model-Alignment/Anthropic_HH_Golden
- PPO:

Training Results

Evaluation

Reward and toxicity scores are computed on the PKU-Alignment/PKU-SafeRLHF-30K data and compared across the SFT, DPO, and PPO models.

Model	Toxicity	Reward
SFT_v0.1	0.0698	-0.2828
DPO_v0.1	0.0356	-0.2633
PPO_v0.1	0.0321	0.38
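
To make the differences in the table easier to read, the following sketch computes each model's relative toxicity reduction and absolute reward change against the SFT baseline. The scores are copied from the table; the function and variable names are illustrative, and the code assumes that lower toxicity and higher reward are better.

```python
# Scores copied from the evaluation table above.
scores = {
    "SFT_v0.1": {"toxicity": 0.0698, "reward": -0.2828},
    "DPO_v0.1": {"toxicity": 0.0356, "reward": -0.2633},
    "PPO_v0.1": {"toxicity": 0.0321, "reward": 0.38},
}


def compare_to_baseline(scores: dict, baseline: str = "SFT_v0.1") -> dict:
    """Return relative toxicity reduction and reward gain versus the baseline."""
    base = scores[baseline]
    result = {}
    for name, s in scores.items():
        if name == baseline:
            continue
        result[name] = {
            # Fraction of the baseline toxicity that was removed.
            "toxicity_reduction": (base["toxicity"] - s["toxicity"]) / base["toxicity"],
            # Absolute change in mean reward.
            "reward_gain": s["reward"] - base["reward"],
        }
    return result


if __name__ == "__main__":
    for name, r in compare_to_baseline(scores).items():
        print(f"{name}: toxicity -{r['toxicity_reduction']:.1%}, reward {r['reward_gain']:+.4f}")
    # DPO_v0.1: toxicity -49.0%, reward +0.0195
    # PPO_v0.1: toxicity -54.0%, reward +0.6628
```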

Compute Infrastructure

The model was trained on 8 × RTX-3090-24GB / A100-PCIE-40GB GPUs.

Inference

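import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed to be this card's repository id; change it if you load from another location.
model_name = "renyiyu/llama-2-7b-ppo-lora-v0.1"
# Placeholder: set this to the EOS token used during training ("</s>" is the Llama-2 default).
DEFINE_EOS_TOKEN = "</s>"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Reuse the tokenizer's EOS token for padding, then set the EOS token that stops generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.eos_token = DEFINE_EOS_TOKEN
model.config.eos_token_id = tokenizer.eos_token_id

def format_prompt(question):
    # Prompt template expected by the model.
    return f"###Question: {question}\n###Answer: "

instruction = "Your text here"
prompt = format_prompt(instruction)
# Tokenize and move the tensors to the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; passing **inputs also forwards the attention mask.
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output = tokenizer.decode(output[0], skip_special_tokens=True)
print(output)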
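
Because `generate` returns the prompt tokens followed by the completion, the decoded text contains the question as well as the answer. The sketch below shows one possible way to keep only the answer by splitting on the answer marker from the prompt template. The helper `extract_answer` and the constant `ANSWER_MARKER` are hypothetical names and not part of the original code.

```python
ANSWER_MARKER = "###Answer:"


def format_prompt(question: str) -> str:
    """Prompt template from the inference example."""
    return f"###Question: {question}\n###Answer: "


def extract_answer(decoded: str) -> str:
    """Return the text after the first answer marker.

    Falls back to the whole (stripped) string if the marker is missing.
    """
    _, marker, answer = decoded.partition(ANSWER_MARKER)
    return answer.strip() if marker else decoded.strip()


if __name__ == "__main__":
    # Simulated decoded output: the prompt followed by a completion.
    decoded = format_prompt("What is the capital of France?") + "Paris."
    print(extract_answer(decoded))
```

Splitting on the marker avoids relying on the decoded prompt matching the input string character for character. An alternative is to slice off the prompt tokens from the generated ids before decoding.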

Model Card Authors

Yiyu (Michael) Ren

Model Card Contact

Email: [email protected]

Framework versions

PEFT 0.8.2

renyiyu/llama-2-7b-ppo-lora-v0.1