---
library_name: transformers
tags:
- dpo
- rlhf
- trl
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
# Llama3-8B-SuperNova-Spectrum-Hermes-DPO
This model is a **DPO fine-tuned** version of my `DARE_TIES` merged model [`yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties`](https://huggingface.co/yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties), trained on the [yuvraj17/chatml-OpenHermes2.5-dpo-binarized-alpha-2k](https://huggingface.co/datasets/yuvraj17/chatml-OpenHermes2.5-dpo-binarized-alpha-2k) dataset.
## DPO (Direct Preference Optimization):
Direct Preference Optimization (DPO) is a fine-tuning technique that aligns a model's responses with human preference (ranking) data directly, without requiring a separate reinforcement learning step as in RLHF.
<figure>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66137d95e8d2cda230ddcea6/kHcU5dkcSVqxEIWt_GRUB.png" width="1000" height="768">
<figcaption> DPO vs RLHF <a href="https://arxiv.org/abs/2305.18290">Reference</a> </figcaption>
</figure>
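For intuition, the DPO objective reduces to a classification-style loss over (chosen, rejected) response pairs. The snippet below is a minimal sketch of that loss; the function name and tensor shapes are illustrative assumptions (the `beta=0.1` default mirrors the setting used here), not the actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Simplified DPO loss over per-example sequence log-probabilities.

    Each argument is a 1-D tensor of summed log-probs for chosen/rejected
    responses; beta controls how far the policy may drift from the reference.
    """
    # Implicit rewards: log-ratio of policy vs. frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```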
## Training:
- Trained on **1x A40 (48 GB VRAM)** using [Hugging Face TRL](https://huggingface.co/docs/trl/index).
- **QLoRA** (`4-bit` precision) for 1 epoch; a sketch of a 4-bit loading config follows the LoRA settings below.
```python
# LoRA configuration (adapters on all attention and MLP projections)
from peft import LoraConfig

peft_config = LoraConfig(
    r=32,              # rank of the LoRA update matrices
    lora_alpha=16,     # scaling factor for the LoRA updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
```
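As referenced above, QLoRA loads the base model in 4-bit precision before attaching the LoRA adapters. A minimal sketch of such a setup with `bitsandbytes`, assuming the standard NF4 configuration rather than the exact settings of this run:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization (QLoRA-style); these values are assumptions
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the merged base model that DPO starts from
base_model = AutoModelForCausalLM.from_pretrained(
    "yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties",
    quantization_config=bnb_config,
    device_map="auto",
)
```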
### Training Params
The following hyperparameters were used during training (a sketch of how they map onto TRL's `DPOConfig` follows the list):
- learning_rate: 5e-05
- beta=0.1
- num_devices: 1
- gradient_accumulation_steps: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
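A minimal sketch of how these hyperparameters plug into TRL, assuming a recent TRL release where `beta` lives on `DPOConfig`; the `output_dir`, batch size, and the `model`/`dataset`/`tokenizer` variables are illustrative assumptions, not the exact training script:

```python
from trl import DPOConfig, DPOTrainer

# Hyperparameters from the list above; other arguments are assumptions
dpo_args = DPOConfig(
    output_dir="llama3-8b-supernova-dpo",
    learning_rate=5e-5,
    beta=0.1,
    per_device_train_batch_size=1,   # assumption: not stated in the card
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=base_model,            # 4-bit base model (see the QLoRA sketch above)
    ref_model=None,              # with PEFT adapters, TRL derives the reference model
    args=dpo_args,
    train_dataset=dataset,       # chatml-OpenHermes2.5-dpo-binarized-alpha-2k
    tokenizer=tokenizer,
    peft_config=peft_config,     # LoRA configuration shown above
)
trainer.train()
```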
### Training Time: **1:57:00** (hh:mm:ss)
### Weights & Biases Report
[Report-Link](https://api.wandb.ai/links/my-sft-team/d211juao)
## 💻 Usage
```python
# Install dependencies first: pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "yuvraj17/Llama3-8B-SuperNova-Spectrum-Hermes-DPO"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build a chat-formatted prompt using the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
## 🏆 Evaluation Scores
Coming soon.