JW17 committed on
Commit 24c1172
1 Parent(s): 97c073c

Update README.md

Files changed (1)
  1. README.md +103 -0
README.md CHANGED
---
language:
- en
license: mit
base_model:
- mistralai/Mistral-7B-v0.1
datasets:
- argilla/distilabel-capybara-dpo-7k-binarized
pipeline_tag: text-generation
model-index:
- name: Mistral-ORPO-Capybara-7k
  results:
  - task:
      type: text-generation
    dataset:
      name: AlpacaEval 2 (LC)
      type: AlpacaEval
    metrics:
    - type: AlpacaEval 2.0
      value: 15.88%
      name: Win Rate
    source:
      url: https://tatsu-lab.github.io/alpaca_eval/
      name: self-reported
  - task:
      type: text-generation
    dataset:
      name: MT-Bench
      type: MT-Bench
    metrics:
    - type: MT-Bench
      value: 7.444
      name: Score
    source:
      url: https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/
      name: self-reported
---
# **Mistral-ORPO-Capybara-7k (7B)**

**Mistral-ORPO** is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) trained with *[odds ratio preference optimization (ORPO)](https://arxiv.org/abs/2403.07691)*. With ORPO, the model learns preferences directly from paired responses, without a separate supervised fine-tuning warm-up phase.

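To make the objective concrete, here is a minimal, illustrative sketch of the ORPO loss as described in the paper: a standard NLL term on the chosen response plus an odds-ratio term that contrasts the chosen and rejected responses. It is not the authors' training code; the function name, the `lam` default, and the use of length-normalized log-probabilities as inputs are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    """Illustrative ORPO objective (not the released implementation).

    chosen_logps / rejected_logps: length-normalized (average per-token)
    log-likelihoods of the chosen and rejected responses under the policy.
    chosen_nll: negative log-likelihood (SFT) loss on the chosen response.
    lam: weight on the odds-ratio term (placeholder value).
    """
    # odds(y|x) = p / (1 - p), so log odds = log p - log(1 - p)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: -log sigmoid of the log-odds gap between chosen and rejected
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Full objective: SFT loss on the chosen response plus the weighted odds-ratio term
    return (chosen_nll + lam * l_or).mean()
```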
**Mistral-ORPO-Capybara-7k** was fine-tuned for **2.5 hours on four A100s**, exclusively on the **7k** instances of [argilla/distilabel-capybara-dpo-7k-binarized](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized), a distilled, paired multi-turn Capybara conversation dataset from [Argilla](https://huggingface.co/argilla).

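To inspect the preference pairs the model was trained on, the dataset can be pulled from the Hub as sketched below (the `train` split name is an assumption):

```python
from datasets import load_dataset

# Load the Capybara preference pairs used for ORPO fine-tuning
# (assumes the dataset exposes a "train" split).
capybara = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

print(capybara)            # features and number of rows
print(capybara[0].keys())  # fields of a single preference pair
```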
- **GitHub Repository**: https://github.com/xfactlab/orpo

## 👍 **Model Performance**

### 1) AlpacaEval & MT-Bench

|Model Name|Size|Align|MT-Bench|AlpacaEval 2.0 (LC)|
|:--------|:--------------:|:-------------------:|:------------:|:------------:|
|**Mistral-<tt>ORPO</tt>-Capybara-7k**|7B|<tt>ORPO</tt>|7.44|15.9|
|**Mistral-<tt>ORPO</tt>-β**|7B|<tt>ORPO</tt>|7.32|14.7|
|Zephyr β|7B|DPO|7.34|13.2|
|TULU-2-DPO|13B|DPO|7.00|11.6|
|Llama-2-Chat|7B|RLHF|6.27|5.4|
|Llama-2-Chat|13B|RLHF|6.65|8.4|

### 2) IFEval

| **Model Type** | **Prompt-Strict** | **Prompt-Loose** | **Inst-Strict** | **Inst-Loose** |
|--------------------|:-----------------:|:----------------:|:---------------:|:--------------:|
| **Mistral-ORPO-Capybara-7k** | 0.5083 | 0.5083 | 0.5827 | 0.6127 |
| **Mistral-ORPO-α** | 0.5009 | 0.5083 | 0.5995 | 0.6163 |
| **Mistral-ORPO-β** | 0.5287 | 0.5564 | 0.6355 | 0.6619 |

## 🗺️ **MT-Bench by Category**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6415c043486c7c9a5d151583/pmR91-0dpERqVvPqZ_IQg.png)

## 🖥️ **Inference**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kaist-ai/mistral-orpo-capybara-7k")
tokenizer = AutoTokenizer.from_pretrained("kaist-ai/mistral-orpo-capybara-7k")

# Apply chat template
query = [{'role': 'user', 'content': 'Hi! How are you doing?'}]
prompt = tokenizer.apply_chat_template(query, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt')

# Generation with specific configurations
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7
)
response = tokenizer.batch_decode(output)
#<|user|>
#Hi! How are you doing?</s>
#<|assistant|>
#I'm doing well, thank you! How are you?</s>
```

## 📎 **Citation**

```
@misc{hong2024orpo,
      title={ORPO: Monolithic Preference Optimization without Reference Model},
      author={Jiwoo Hong and Noah Lee and James Thorne},
      year={2024},
      eprint={2403.07691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```