---
library_name: transformers
tags:
- dpo
- rlhf
- trl
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---

# Llama3-8B-SuperNova-Spectrum-Hermes-DPO

This model is a **DPO fine-tuned** version of my `DARE_TIES`-merged model [`yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties`](https://huggingface.co/yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties), trained on the [yuvraj17/chatml-OpenHermes2.5-dpo-binarized-alpha-2k](https://huggingface.co/datasets/yuvraj17/chatml-OpenHermes2.5-dpo-binarized-alpha-2k) dataset.

## DPO (Direct Preference Optimization):

Direct Preference Optimization (DPO) is a fine-tuning technique that aligns a model's responses with human preference (ranking) data directly, without the separate reward model and reinforcement-learning loop required by RLHF.

<figure>

  <img src="https://cdn-uploads.huggingface.co/production/uploads/66137d95e8d2cda230ddcea6/kHcU5dkcSVqxEIWt_GRUB.png" width="1000" height="768">
  <figcaption> DPO vs RLHF <a href="https://arxiv.org/abs/2305.18290">Reference</a> </figcaption>

</figure>
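
For reference, the DPO objective from the [original paper](https://arxiv.org/abs/2305.18290) trains the policy $\pi_\theta$ against a frozen reference model $\pi_\text{ref}$ on preference pairs $(x, y_w, y_l)$, where $\sigma$ is the logistic function; the $\beta$ here is the same `beta=0.1` listed under training below:

$$
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]
$$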

## Training:

- Trained on **1x NVIDIA A40 (48 GB VRAM)** using the [Hugging Face TRL](https://huggingface.co/docs/trl/index) library.
- **QLoRA** (`4-bit` precision) for 1 epoch, with the LoRA configuration below (a typical 4-bit quantization config is sketched after the list):
  ```python
  # LoRA configuration
  from peft import LoraConfig

  peft_config = LoraConfig(
      r=32,
      lora_alpha=16,
      lora_dropout=0.05,
      bias="none",
      task_type="CAUSAL_LM",
      target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
  )
  ```
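
The exact 4-bit quantization settings were not recorded in this card; a typical QLoRA setup loads the base model with a `BitsAndBytesConfig` along these lines (illustrative values, not necessarily the ones used):

```python
# Hedged sketch: typical 4-bit (QLoRA) quantization config, not the exact settings used here
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bfloat16
    bnb_4bit_use_double_quant=True,         # nested quantization of the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties",
    quantization_config=bnb_config,
    device_map="auto",
)
```
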
### Training Params

The following hyperparameters were used during training (a hedged sketch of how they might be passed to TRL's `DPOTrainer` follows the list):
- learning_rate: 5e-05
- beta: 0.1
- num_devices: 1
- gradient_accumulation_steps: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 1
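
A minimal sketch of how these hyperparameters and the LoRA config above might be wired into TRL's `DPOTrainer`; argument names vary slightly across TRL versions, and the batch size, precision flag, and dataset handling here are assumptions:

```python
# Hedged sketch: wiring the hyperparameters above into TRL's DPOTrainer (illustrative only)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "yuvraj17/Llama3-8B-SuperNova-Spectrum-dare_ties"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)  # 4-bit loading (see sketch above) omitted for brevity
dataset = load_dataset("yuvraj17/chatml-OpenHermes2.5-dpo-binarized-alpha-2k", split="train")

training_args = DPOConfig(
    output_dir="Llama3-8B-SuperNova-Spectrum-Hermes-DPO",
    beta=0.1,                        # DPO temperature
    learning_rate=5e-5,
    per_device_train_batch_size=1,   # assumption: not listed above
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
    bf16=True,                       # assumption
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # with a PEFT adapter, the frozen base model serves as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,             # `processing_class` in newer TRL releases
    peft_config=peft_config,         # the LoraConfig defined above
)
trainer.train()
```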

### Training Time: **1:57:00** (hh:mm:ss)

### Weights & Biases Report

[Report-Link](https://api.wandb.ai/links/my-sft-team/d211juao)

## 💻 Usage

```python
# Install dependencies (notebook/Colab syntax)
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "yuvraj17/Llama3-8B-SuperNova-Spectrum-Hermes-DPO"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the chat messages with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Build a text-generation pipeline in half precision
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample a response
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## 🏆 Evaluation Scores

Coming Soon