---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen1.5-14B
---

## Model Card for Firefly-Qwen1.5-14B-En-Alpha

[firefly-qwen1.5-en-14b-alpha](https://huggingface.co/YeungNLP/firefly-qwen1.5-en-14b-alpha) is a preview version of our new model.
It outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and on the single-turn tasks of [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge).

**Note: more importantly, it is trained with neither SFT nor RLHF; we may share our method later.**

What's exciting is that our experimental method already achieves good performance, even though it is still at a very preliminary stage.

Although our model is trained on English data, you can also try chatting with it in Chinese, because Qwen1.5 is also strong at Chinese. However, we have not evaluated
its performance in Chinese yet.

We advise you to install transformers>=4.37.0.

**Because this is a validation experiment and our training resources are limited, we trained this model with QLoRA on top of [Qwen1.5-14B](https://huggingface.co/Qwen/Qwen1.5-14B) with a maximum sequence length of 1024, which may limit its performance.**
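
For readers unfamiliar with QLoRA, the general recipe is to quantize the frozen base model to 4-bit NF4 with bitsandbytes and train low-rank LoRA adapters on top of it via peft. The sketch below only illustrates that setup; the rank, alpha, dropout, and target modules are hypothetical placeholders, not the configuration actually used for this release.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# LoRA adapters on the attention and MLP projections.
# r / lora_alpha / lora_dropout are illustrative values only.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Training sequences would then be truncated to the 1024-token limit mentioned above.
```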

## Performance
We automatically evaluate the models on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), using **gpt-4o** as the judge.

On [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) (805 questions), our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat)
with a win rate of **52.17% : 47.83%**.

| Task          | Ours wins | Qwen1.5-14B-Chat wins |
|---------------|-----------|-----------------------|
| helpful_base  | **67**    | 62                    |
| koala         | **80**    | 76                    |
| oasst         | **100**   | 88                    |
| selfinstruct  | **127**   | 125                   |
| vicuna        | **46**    | 34                    |
| total         | **420**   | 385                   |
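
The overall win rate quoted above follows directly from the totals in this table:

```python
ours, qwen_chat = 420, 385           # total wins from the table above
total = ours + qwen_chat             # 805 AlpacaEval 2.0 questions
print(f"{ours / total:.2%} : {qwen_chat / total:.2%}")  # 52.17% : 47.83%
```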

We also evaluate the models on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Although the overall score of our model is not as good as that of [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat),
**our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) on almost all single-turn tasks**, while it is worse on almost all multi-turn tasks.
We conjecture that this is caused by the short training length (1024 tokens), and we will dive into this phenomenon later.

Overall performance on MT-Bench:

| Task              | Ours     | Qwen1.5-14B-Chat  |
|-------------------|----------|-------------------|
| Avg Score         | 7.03     | **7.21**          |
| Single-turn Score | **8.01** | 7.66              |
| Multi-turn Score  | 6.05     | **6.75**          |
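
The average score here is simply the mean of the single-turn and multi-turn scores:

```python
# Our model and Qwen1.5-14B-Chat, from the table above.
print(f"{(8.01 + 6.05) / 2:.2f}")  # 7.03
print(f"{(7.66 + 6.75) / 2:.2f}")  # 7.21
```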

Performance on MT-Bench single-turn tasks:

| Task       | Ours    | Qwen1.5-14B-Chat |
|------------|---------|------------------|
| writing    | **9.1** | 8.9              |
| roleplay   | **8.5** | 8.3              |
| extraction | **8.6** | 8.2              |
| stem       | **8.8** | 8.5              |
| humanities | **9.0** | 8.8              |
| reasoning  | **6.8** | 5.3              |
| math       | **7.5** | 7.1              |
| coding     | 5.8     | **6.2**          |

Performance on MT-Bench multi-turn tasks:

| Task       | Ours    | Qwen1.5-14B-Chat |
|------------|---------|------------------|
| writing    | 6.5     | **7.7**          |
| roleplay   | 7.7     | **8.3**          |
| extraction | 5.1     | **6.7**          |
| stem       | 6.3     | **6.9**          |
| humanities | 8.3     | **8.8**          |
| reasoning  | 4.7     | **5.7**          |
| math       | 4.9     | **5.5**          |
| coding     | **4.9** | 4.4              |


## Usage
The chat template of our model is the same as that of the official Qwen1.5-14B-Chat:
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello, who are you?<|im_end|>
<|im_start|>assistant
I am an AI program developed by Firefly<|im_end|>
```
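
The same prompt layout can be produced with plain string formatting, which is handy if you serve the model outside of transformers. `build_chatml_prompt` below is a hypothetical helper written for illustration; when you use transformers, `tokenizer.apply_chat_template` (as in the code further down) builds this string for you.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    # Format a single-turn prompt in the ChatML layout shown above,
    # ending with the assistant prefix so the model continues from there.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_chatml_prompt("You are a helpful assistant.", "hello, who are you?"))
```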

You can use the [chat script in Firefly](https://github.com/yangjianxin1/Firefly/blob/master/script/chat/chat.py) for inference.

You can also use the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-qwen1.5-en-14b-alpha"

# Load the model in float16 and let accelerate place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Build the ChatML prompt shown above and append the assistant prefix.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response; <|im_end|> marks the end of the assistant turn.
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1500,
    top_p=0.8,
    temperature=0.6,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False)
)

# Keep only the newly generated tokens, dropping the prompt.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
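
If you prefer to stream the answer token by token instead of waiting for the full generation, transformers' `TextStreamer` can be passed to `generate`. A minimal sketch that reuses `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as soon as they are generated,
# skipping the prompt and special tokens such as <|im_end|>.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    model_inputs.input_ids,
    max_new_tokens=1500,
    top_p=0.8,
    temperature=0.6,
    streamer=streamer,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False),
)
```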