---
language:
- en
- vi
- es
- pt
- de
- it
- fr
- ko
- zh
license: other
license_name: ghost-open-llms
license_link: https://ghost-x.org/ghost-open-llms-license
tags:
- ghost
- tools
- chat
- transformers
- unsloth
- llama
pipeline_tag: text-generation
widget:
- text: Why is the sky blue ?
---

<p><img src="https://ghost-x.org/docs/models/ghost-8b-beta/images/logo.jpeg" width="40%" align="center" /></p>

A large language model developed for excellent multilingual support, superior knowledge capabilities, and cost efficiency.

> 🎉 We recommend using the new version for a better model experience here, **[Ghost 8B Beta 1608](https://huggingface.co/ghost-x/ghost-8b-beta-1608)**.

## Introduction

**Ghost 8B Beta** is a large language model developed with goals that include excellent multilingual support, superior knowledge capabilities, and cost-effectiveness. The model comes in two context length versions, 8k and 128k, along with multilingual function tools support by default.

The Ghost 8B Beta model outperforms prominent models such as Llama 3 8B Instruct and GPT-3.5 Turbo on the lc_winrate score. It also outperforms Claude 3 Opus, Claude 3 Sonnet, GPT-4, and Mistral Large on the AlpacaEval 2.0 winrate score, [\*](https://ghost-x.org/docs/models/ghost-8b-beta/#alpacaeval-20).

### Thoughts

We believe that language models of moderate size can be optimized to achieve strong cross-lingual understanding and to solve complex tasks. Models of this size are often described as cost-effective to deploy and operate in production, for large businesses and startups alike. Doing this well helps remove the worry over GPU costs that holds back the development of useful AI ideas and products.

### Haloo

Welcome to the multilingual world! 🌎

**🇺🇸 English:**
Hello there! I'm Ghost 8B Beta, a friendly AI assistant ready to help you navigate the world of language. What's the most popular fast food chain in the US? 🍔 (Because who doesn't love a good burger?)

**🇫🇷 French:**
Bonjour! Je suis Ghost 8B Beta, un assistant IA amical prêt à vous aider à explorer le monde des langues. Que pensez-vous de la cuisine française? 🍽 (Parce que la cuisine française est un art!)

**🇮🇹 Italian:**
Ciao! Sono Ghost 8B Beta, un assistente AI amichevole pronto a aiutarvi a esplorare il mondo delle lingue. Qual è la cosa più divertente da fare a Roma? 🏰 (Perché Roma è una città ricca di storia e cultura!)

**🇪🇸 Spanish:**
Hola! Soy Ghost 8B Beta, un asistente IA amigable listo para ayudarte a explorar el mundo de los idiomas. ¿Qué te parece el flamenco? 🎶 (Porque el flamenco es una forma de arte vibrante y emocional!)

**🇵🇹 Portuguese:**
Olá! Sou Ghost 8B Beta, um assistente IA amigável pronto para ajudar você a explorar o mundo das línguas. O que você acha do futebol brasileiro? 🏟️ (Porque o futebol brasileiro é uma paixão nacional!)

**🇩🇪 German:**
Hallo! Ich bin Ghost 8B Beta, ein freundlicher AI-Assistent, der Ihnen beim Erkunden der Welt der Sprachen helfen möchte. Was ist das beliebteste Getränk in Deutschland? 🍻 (Weil das Bier in Deutschland eine Kultur ist!)

**🇻🇳 Vietnamese:**
Xin chào! Tôi là Ghost 8B Beta, một trợ lý AI thân thiện muốn giúp bạn khám phá thế giới ngôn ngữ. Bạn có thích ăn phở không? 🍜 (Vì phở là một món ăn đặc trưng của Việt Nam!)

**🇰🇷 Korean:**
안녕하세요! 저는 Ghost 8B Beta, 친절한 AI 어시스턴트입니다. 언어 세계를 탐험하는 데 도움을 드릴게요. 한국의 대표적인 음식은 무엇인가요? 🍜 (한국 음식은 맛과 향이 풍부해서 꼭 한번 맛보세요!)

**🇨🇳 Chinese:**
你好! 我是 Ghost 8B Beta,一位友好的 AI 助手,欢迎您来到语言的世界。您觉得中国菜怎么样? 🍜 (中国菜种类繁多,口味丰富,相信您一定会喜欢!)

### Specifications

- Name: **Ghost 8B Beta (aka: Llama 3 - Ghost 8B Beta)**.
- Version: **disl-0x5 (aka: d0x5)**.
- Model size: 8 billion parameters.
- Context length: 8K (8,192 tokens) / 128K (131,072 tokens).
- Languages: 🇺🇸 English, 🇫🇷 French, 🇮🇹 Italian, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇩🇪 German, 🇻🇳 Vietnamese, 🇰🇷 Korean, 🇨🇳 Chinese.
- Main tasks: pretrained base model, chat, multi-tasking, and function tools.
- License: [Ghost Open LLMs LICENSE](https://ghost-x.org/ghost-open-llms-license), [Llama 3 LICENSE](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE).
- Distributions: Standard (BF16), GGUF, AWQ.
- Developed by: **Ghost X**, [Hieu Lam](https://huggingface.co/lamhieu).

#### Links

- Official website: [Ghost 8B Beta](https://ghost-x.org/docs/models/ghost-8b-beta/).
- Online demo:
  - [Playground with Ghost 8B Beta (β, 8k) on Spaces](https://huggingface.co/spaces/lamhieu/ghost-8b-beta-8k).
  - [Playground with Ghost 8B Beta (β, 128k) on Spaces](https://huggingface.co/spaces/lamhieu/ghost-8b-beta-128k).

### Distributions

We provide several distributions so you can choose the access option that best suits your needs.

| Version | Model card                                                          |
| ------- | ------------------------------------------------------------------- |
| BF16    | [🤗 HuggingFace](https://huggingface.co/ghost-x/ghost-8b-beta)      |
| GGUF    | [🤗 HuggingFace](https://huggingface.co/ghost-x/ghost-8b-beta-gguf) |
| AWQ     | [🤗 HuggingFace](https://huggingface.co/ghost-x/ghost-8b-beta-awq)  |

### License

The Ghost 8B Beta model is released under the [Ghost Open LLMs LICENSE](https://ghost-x.org/ghost-open-llms-license), [Llama 3 LICENSE](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE).

* For individuals, the model is free to use for personal and research purposes.
* For commercial use of Ghost 8B Beta, it's also free, but please contact us for confirmation. You can email us at "lamhieu.vk [at] gmail.com" with a brief introduction of your project. If possible, include your logo so we can feature it as a case study. We will confirm your permission to use the model. For commercial use as a service, no email confirmation is needed, but we'd appreciate a notification so we can keep track and potentially recommend your services to partners using the model.

Additionally, it would be great if you could mention or credit the model when it benefits your work.

### Techniques

The model is further trained based on [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B).

**Ghost 8B Beta** underwent continual pre-training and fine-tuning with a recipe given the playful name "Teach the little boy how to cook Saigon Pho".

This technique is designed to include model training with 3 main stages:

- In stages 1 and 2, the "multilingual buffer" method (for short) is used for fine-tuning. This method achieves efficient language understanding and knowledge sharing in the model at low training cost, without requiring too much sample data.
- In stage 3, the model will be refined based on human feedback.

Note that with this recipe the model can be reproduced exactly: all training source code (both forks of and patches to libraries) is archived and reproducible.

## Templates / Concepts

The Ghost 8B Beta model comes with a built-in "chat template" when used via "transformers", so everything is quite simple. Here we cover what the model supports and how to use it to get the best quality of generated text.

The model supports the following roles:

- "system": This is the system role that sets information about the model or instructs the model to follow commands. It should be placed at the top and set empty if not in use.
- "user": This is the role that represents the user's chat content.
- "assistant": This is the role that represents the model's chat content.
- "refs": This is the role that represents trusted reference data for the model (RAG usage). It must be placed after "system" and after "tool:schemas" if tools are used.
- "tool:schemas": This is a role to enter information about the tools that the model is allowed to use during work. It must be placed after "system" and can be omitted when not in use.
- "tool:execute": This is a role to rewrite the model's tool execution action.
- "tool:results": This is the role to write down the results received after using the tool.
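As an illustration of the role ordering described above, a conversation that uses both tools and reference data might be structured as follows. The tool schema and document contents here are made-up placeholders, not real API data:

```python
# Hypothetical message list exercising the extended roles.
# "tool:schemas" comes right after "system", and "refs" comes after both,
# matching the ordering rules above. Schema and document contents are placeholders.
messages = [
    {"role": "system", "content": ""},
    {"role": "tool:schemas", "content": '[{"name": "get_weather", "parameters": {"city": "string"}}]'},
    {"role": "refs", "content": '{"documents": ["Hanoi is the capital of Vietnam."]}'},
    {"role": "user", "content": "What is the capital of Vietnam?"},
]
```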

<details close>
  <summary>See chat configuration template</summary>
  ~

```jinja
{{ bos_token }}{% for message in messages %}
{% set role = message['role'] %}
{% if role == 'tools' %}
{% set role = 'tool:schemas' %}
{% elif role == 'execute' %}
{% set role = 'tool:execute' %}
{% elif role == 'response' %}
{% set role = 'tool:results' %}
{% endif %}
{% set content = message['content'] | trim + '<|cos|>' %}
{% if role == 'system' %}
{% set content = '<|role:begin|>system<|role:end|>
' + content %}
{% elif role == 'user' %}
{% set content = '<|role:begin|>user<|role:end|>
' + content %}
{% elif role == 'assistant' %}
{% set content = '<|role:begin|>assistant<|role:end|>
' + content %}
{% elif role == 'refs' %}
{% set content = '<|role:begin|>references<|role:end|>
' + content %}
{% elif role == 'tool:schemas' %}
{% set content = '<|role:begin|>tools<|role:end|>
' + content %}
{% elif role == 'tool:execute' %}
{% set content = '<|role:begin|>assistant<|role:end|>
<|tool:execute|>' + content %}
{% elif role == 'tool:results' %}
{% set content = '<|role:begin|>user<|role:end|>
<|tool:results|>' + content %}
{% endif %}
{{ content }}
{% if loop.last and add_generation_prompt %}
{{ '<|role:begin|>assistant<|role:end|>' }}
{% endif %}
{% endfor %}
```

</details>

To understand this better, see the [usage](#usage) details below.
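To make the template concrete, here is a minimal pure-Python sketch of the layout it produces. This is illustration only: real rendering is done by `tokenizer.apply_chat_template`, the `<|bos|>` token here is a placeholder for the model's actual BOS token, and the special `tool:execute` / `tool:results` prefixes are omitted for brevity.

```python
# Sketch (not the tokenizer's implementation) of how the chat template above
# lays out messages. "<|bos|>" is a placeholder BOS token.
ROLE_HEADERS = {
    "system": "system",
    "user": "user",
    "assistant": "assistant",
    "refs": "references",
    "tool:schemas": "tools",
}

def render(messages, bos_token="<|bos|>", add_generation_prompt=True):
    out = bos_token
    for message in messages:
        header = ROLE_HEADERS[message["role"]]
        # Each message: role header line, then trimmed content ended by <|cos|>.
        out += f"<|role:begin|>{header}<|role:end|>\n"
        out += message["content"].strip() + "<|cos|>"
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        out += "<|role:begin|>assistant<|role:end|>"
    return out

prompt = render([
    {"role": "system", "content": ""},
    {"role": "user", "content": "Why is the sky blue ?"},
])
```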

## Usage

This content will be updated soon.

### Directly

To use the model directly, there are several ways to get started; choose one of the following to try it out.

#### Transformers

For direct use with `transformers`, you can easily get started with the following steps.

- Firstly, you need to install transformers via the command below with `pip`.

  ```bash
  pip install -U transformers
  ```

- Right now, you can start using the model directly.

  ```python
  import torch
  from transformers import (
      AutoModelForCausalLM,
      AutoTokenizer,
  )

  base_model = "ghost-x/ghost-8b-beta"
  model = AutoModelForCausalLM.from_pretrained(
      base_model,
      torch_dtype=torch.bfloat16,
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained(base_model)

  messages = [
      {"role": "system", "content": ""},
      {"role": "user", "content": "Why is the sky blue ?"},
  ]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
  inputs = inputs.to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_k=50, top_p=0.95, temperature=0.4)
  results = tokenizer.batch_decode(outputs)[0]
  print(results)
  ```

- Additionally, you can use the model with **4-bit quantization** to minimize the required resources. You can start with the code below.

  ```python
  import torch
  from transformers import (
      AutoModelForCausalLM,
      AutoTokenizer,
      BitsAndBytesConfig,
  )

  base_model = "ghost-x/ghost-8b-beta"
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
      bnb_4bit_use_double_quant=False,
  )
  model = AutoModelForCausalLM.from_pretrained(
      base_model,
      quantization_config=bnb_config,
      device_map="auto",
  )
  tokenizer = AutoTokenizer.from_pretrained(base_model)

  messages = [
      {"role": "system", "content": ""},
      {"role": "user", "content": "Why is the sky blue ?"},
  ]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
  inputs = inputs.to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_k=50, top_p=0.95, temperature=0.4)
  results = tokenizer.batch_decode(outputs)[0]
  print(results)

  ```

#### Unsloth

For direct use with `unsloth`, you can easily get started with the following steps.

- Firstly, you need to install unsloth and its dependencies via the commands below (written for a notebook environment such as Colab).

  ```python
  import torch
  major_version, minor_version = torch.cuda.get_device_capability()
  # Must install separately since Colab has torch 2.2.1, which breaks packages
  !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
  if major_version >= 8:
      # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
      !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
  else:
      # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
      !pip install --no-deps xformers trl peft accelerate bitsandbytes
  pass
  ```

- Initialize and optimize the model before use.

  ```python
  from unsloth import FastLanguageModel
  import os, torch, pprint

  base_model = "ghost-x/ghost-8b-beta"
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name = base_model,
      max_seq_length = 8192,
      dtype = None,
      load_in_4bit = False, # Change to `True` if you want to use 4bit quantization.
  )
  FastLanguageModel.for_inference(model)
  ```

- Right now, you can start using the model directly.
  ```python
  messages = [
      {"role": "system", "content": ""},
      {"role": "user", "content": "Why is the sky blue ?"},
  ]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
  inputs = inputs.to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_k=50, top_p=0.95, temperature=0.4)
  results = tokenizer.batch_decode(outputs)[0]
  print(results)
  ```

### Instructions

Here are specific instructions and explanations for each use case.

#### Dialogue

This is an example of a multi-turns conversation between the assistant and the user.

Let's start with "messages" with system and user questions as follows:

```python
messages = [
  {"role": "system", "content": ""},
  {"role": "user", "content": "What is the significance of the Higgs boson in the Standard Model of particle physics?"},
]
```

Then append the assistant's generated answer, followed by a new question from the user.

```python
messages = [
  {"role": "system", "content": ""},
  {"role": "user", "content": "What is the significance of the Higgs boson in the Standard Model of particle physics?"},
  {"role": "assistant", "content": "..."},
  {"role": "user", "content": "Identify the author of a novel that features a dystopian society where \"Big Brother\" watches over its citizens and the protagonist works for the Ministry of Truth."},
]
```

Done, that's how to create ongoing conversations with chat templates.
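The pattern above can be wrapped in a simple loop. In this sketch, `generate_reply` is a stand-in for whatever generation call you use (for example, `apply_chat_template` plus `model.generate` from the earlier snippets):

```python
# Sketch of a multi-turn chat loop. generate_reply is a placeholder for a
# real model call; here it is stubbed out with a lambda for illustration.
def chat_turn(messages, user_input, generate_reply):
    messages.append({"role": "user", "content": user_input})
    reply = generate_reply(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply

messages = [{"role": "system", "content": ""}]
chat_turn(messages, "Why is the sky blue ?", lambda m: "...")
chat_turn(messages, "Can you explain Rayleigh scattering ?", lambda m: "...")
```

Each call appends both the user turn and the assistant turn, so `messages` always carries the full history into the next generation.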

#### Retrieval-Augmented Generation

For Retrieval-Augmented Generation, the best way to add reference information without changing the system prompt is the "refs" role; see the following example.

```python
import json

refs = json.dumps(
  {
    "instructions": "These are only general documents used for reference to give the most accurate and honest answers possible. Ignore it if it's irrelevant and don't overuse it.",
    "documents": [
      # The content of the reference documents is here.
      # The model will only rely on reference content to answer if relevant.
      "Paracetamol poisoning, also known as acetaminophen poisoning, is caused by excessive use of the medication paracetamol (acetaminophen). Most people have few or non-specific symptoms in the first 24 hours following overdose. These include feeling tired, abdominal pain, or nausea. This is typically followed by a couple of days without any symptoms, after which yellowish skin, blood clotting problems, and confusion occurs as a result of liver failure. Additional complications may include kidney failure, pancreatitis, low blood sugar, and lactic acidosis. If death does not occur, people tend to recover fully over a couple of weeks. Without treatment, death from toxicity occurs 4 to 18 days later.Paracetamol poisoning can occur accidentally or as an attempt to die by suicide. Risk factors for toxicity include alcoholism, malnutrition, and the taking of certain other hepatotoxic medications. Liver damage results not from paracetamol itself, but from one of its metabolites, N-acetyl-p-benzoquinone imine (NAPQI). NAPQI decreases the livers glutathione and directly damages cells in the liver. Diagnosis is based on the blood level of paracetamol at specific times after the medication was taken. These values are often plotted on the Rumack-Matthew nomogram to determine level of concern.Treatment may include activated charcoal if the person seeks medical help soon after the overdose. Attempting to force the person to vomit is not recommended. If there is a potential for toxicity, the antidote acetylcysteine is recommended. The medication is generally given for at least 24 hours. Psychiatric care may be required following recovery. A liver transplant may be required if damage to the liver becomes severe. The need for transplant is often based on low blood pH, high blood lactate, poor blood clotting, or significant hepatic encephalopathy. With early treatment liver failure is rare. Death occurs in about 0.1% of cases.Paracetamol poisoning was first described in the 1960s. 
Rates of poisoning vary significantly between regions of the world. In the United States more than 100,000 cases occur a year. In the United Kingdom it is the medication responsible for the greatest number of overdoses. Young children are most commonly affected. In the United States and the United Kingdom, paracetamol is the most common cause of acute liver failure. Signs and symptoms The signs and symptoms of paracetamol toxicity occur in three phases. The first phase begins within hours of overdose, and consists of nausea, vomiting, a pale appearance, and sweating. However, patients often have no specific symptoms or only mild symptoms in the first 24 hours of poisoning. Rarely, after massive overdoses, patients may develop symptoms of metabolic acidosis and coma early in the course of poisoning.The second phase occurs between 24 hours and 72 hours following overdose and consists of signs of increasing liver damage. In general, damage occurs in liver cells as they metabolize the paracetamol. The individual may experience right upper quadrant abdominal pain. The increasing liver damage also changes biochemical markers of liver function; International normalized ratio (INR) and the liver transaminases ALT and AST rise to abnormal levels. Acute kidney failure may also occur during this phase, typically caused by either hepatorenal syndrome or multiple organ dysfunction syndrome. In some cases, acute kidney failure may be the primary clinical manifestation of toxicity. In these cases, it has been suggested that the toxic metabolite is produced more in the kidneys than in the liver.The third phase follows at 3 to 5 days, and is marked by complications of massive liver necrosis leading to fulminant liver failure with complications of coagulation defects, low blood sugar, kidney failure, hepatic encephalopathy, brain swelling, sepsis, multiple organ failure, and death. 
If the third phase is survived, the liver necrosis runs its course, and liver and kidney function typically return to normal in a few weeks. The severity of paracetamol toxicity varies depending on the dose and whether appropriate treatment is received. Cause The toxic dose of paracetamol is highly variable. In general the recommended maximum daily dose for healthy adults is 4 grams. Higher doses lead to increasing risk of toxicity. In adults, single doses above 10 grams or 200 mg/kg of bodyweight, whichever is lower, have a reasonable likelihood of causing toxicity. Toxicity can also occur when multiple smaller doses within 24 hours exceed these levels. Following a dose of 1 gram of paracetamol four times a day for two weeks, patients can expect an increase in alanine transaminase in their liver to typically about three times the normal value. It is unlikely that this dose would lead to liver failure. Studies have shown significant hepatotoxicity is uncommon in patients who have taken greater than normal doses over 3 to 4 days. In adults, a dose of 6 grams a day over the preceding 48 hours could potentially lead to toxicity, while in children acute doses above 200 mg/kg could potentially cause toxicity. Acute paracetamol overdose in children rarely causes illness or death, and it is very uncommon for children to have levels that require treatment, with chronic larger-than-normal doses being the major cause of toxicity in children.Intentional overdosing (self-poisoning, with suicidal intent) is frequently implicated in paracetamol toxicity. In a 2006 review, paracetamol was the most frequently ingested compound in intentional overdosing.In rare individuals, paracetamol toxicity can result from normal use. This may be due to individual ("idiosyncratic") differences in the expression and activity of certain enzymes in one of the metabolic pathways that handle paracetamol (see paracetamols metabolism). 
Risk factors A number of factors can potentially increase the risk of developing paracetamol toxicity. Chronic excessive alcohol consumption can induce CYP2E1, thus increasing the potential toxicity of paracetamol. In one study of patients with liver injury, 64% reported alcohol intakes of greater than 80 grams a day, while 35% took 60 grams a day or less. Whether chronic alcoholism should be considered a risk factor has been debated by some clinical toxicologists. For chronic alcohol users, acute alcohol ingestion at the time of a paracetamol overdose may have a protective effect. For non-chronic alcohol users, acute alcohol consumption had no protective effect. Fasting is a risk factor, possibly because of depletion of liver glutathione reserves. The concomitant use of the CYP2E1 inducer isoniazid increases the risk of hepatotoxicity, though whether 2E1 induction is related to the hepatotoxicity in this case is unclear. Concomitant use of other drugs that induce CYP enzymes, such as antiepileptics including carbamazepine, phenytoin, and barbiturates, have also been reported as risk factors. Pathophysiology When taken in normal therapeutic doses, paracetamol has been shown to be safe. Following a therapeutic dose, it is mostly converted to nontoxic metabolites via Phase II metabolism by conjugation with sulfate and glucuronide, with a small portion being oxidized via the cytochrome P450 enzyme system. Cytochromes P450 2E1 and 3A4 convert approximately 5% of paracetamol to a highly reactive intermediary metabolite, N-acetyl-p-benzoquinone imine (NAPQI). Under normal conditions, NAPQI is detoxified by conjugation with glutathione to form cysteine and mercapturic acid conjugates.In cases of paracetamol overdose, the sulfate and glucuronide pathways become saturated, and more paracetamol is shunted to the cytochrome P450 system to produce NAPQI. 
As a result, hepatocellular supplies of glutathione become depleted, as the demand for glutathione is higher than its regeneration. NAPQI therefore remains in its toxic form in the liver and reacts with cellular membrane molecules, resulting in widespread hepatocyte damage and death, leading to acute liver necrosis. In animal studies, the livers stores of glutathione must be depleted to less than 70% of normal levels before liver toxicity occurs. Diagnosis A persons history of taking paracetamol is somewhat accurate for the diagnosis. The most effective way to diagnose poisoning is by obtaining a blood paracetamol level. A drug nomogram developed in 1975, called the Rumack-Matthew nomogram, estimates the risk of toxicity based on the serum concentration of paracetamol at a given number of hours after ingestion. To determine the risk of potential hepatotoxicity, the paracetamol level is traced along the nomogram. Use of a timed serum paracetamol level plotted on the nomogram appears to be the best marker indicating the potential for liver injury. A paracetamol level drawn in the first four hours after ingestion may underestimate the amount in the system because paracetamol may still be in the process of being absorbed from the gastrointestinal tract. Therefore, a serum level taken before 4 hours is not recommended.Clinical or biochemical evidence of liver toxicity may develop in one to four days, although, in severe cases, it may be evident in 12 hours. Right-upper-quadrant tenderness may be present and can aid in diagnosis. Laboratory studies may show evidence of liver necrosis with elevated AST, ALT, bilirubin, and prolonged coagulation times, particularly an elevated prothrombin time. After paracetamol overdose, when AST and ALT exceed 1000 IU/L, paracetamol-induced hepatotoxicity can be diagnosed. In some cases, the AST and ALT levels can exceed 10,000 IU/L. 
Detection in body fluids Paracetamol may be quantified in blood, plasma, or urine as a diagnostic tool in clinical poisoning situations or to aid in the medicolegal investigation of suspicious deaths. The concentration in serum after a typical dose of paracetamol usually peaks below 30 mg/L, which equals 200 μmol/L. Levels of 30–300 mg/L (200–2000 μmol/L) are often observed in overdose patients. Postmortem blood levels have ranged from 50 to 400 mg/L in persons dying due to acute overdosage. Automated colorimetric techniques, gas chromatography and liquid chromatography are currently in use for the laboratory analysis of the drug in physiological specimens. Prevention Limitation of availability Limiting the availability of paracetamol tablets has been attempted in some countries. In the UK, sales of over-the-counter paracetamol are restricted to packs of 32 x 500 mg tablets in pharmacies, and 16 x 500 mg tablets in non-pharmacy outlets. Pharmacists may provide up to 100 tablets for those with chronic conditions at the pharmacists discretion. In Ireland, the limits are 24 and 12 tablets, respectively. Subsequent study suggests that the reduced availability in large numbers had a significant effect in reducing poisoning deaths from paracetamol overdose.One suggested method of prevention is to make paracetamol a prescription-only medicine, or to remove it entirely from the market. However, overdose is a relatively minor problem; for example, 0.08% of the UK population (over 50 thousand people) present with paracetamol overdose each year. In contrast, paracetamol is a safe and effective medication that is taken without complications by millions of people. In addition, alternative pain relief medications such as aspirin are more toxic in overdose, whereas non-steroidal anti-inflammatory drugs are associated with more adverse effects following normal use. 
Combination with other agents One strategy for reducing harm done by acetaminophen overdoses is selling paracetamol pre-combined in tablets either with an emetic or an antidote. Paradote was a tablet sold in the UK which combined 500 mg paracetamol with 100 mg methionine, an amino acid formerly used in the treatment of paracetamol overdose. There have been no studies so far on the effectiveness of paracetamol when given in combination with its most commonly used antidote, acetylcysteine.Calcitriol, the active metabolite of vitamin D3, appears to be a catalyst for glutathione production. Calcitriol was found to increase glutathione levels in rat astrocyte primary cultures on average by 42%, increasing glutathione protein concentrations from 29 nmol/mg to 41 nmol/mg, 24 and 48 hours after administration; it continued to have an influence on glutathione levels 96 hours after administration. It has been proposed that co-administration of calcitriol, via injection, may improve treatment outcomes. Paracetamol replacements Paracetamol ester prodrug with L-pyroglutamic acid (PCA), a biosynthetic precursor of glutathione, has been synthesized to reduce paracetamol hepatotoxicity and improve bioavailability. The toxicological studies of different paracetamol esters show that L-5-oxo-pyrrolidine-2-paracetamol carboxylate reduces toxicity after administration of an overdose of paracetamol to mice. The liver glutathione values in mice induced by intraperitoneal injection of the ester are superimposable with the GSH levels recorded in untreated mice control group. The mice group treated with an equivalent dose of paracetamol showed a significative decrease of glutathione of 35% (p<0.01 vs untreated control group). The oral LD50 was found to be greater than 2000 mg kg-1, whereas the intraperitoneal LD50 was 1900 mg kg-1. These results taken together with the good hydrolysis and bioavailability data show that this ester is a potential candidate as a prodrug of paracetamol. 
Treatment Gastric decontamination In adults, the initial treatment for paracetamol overdose is gastrointestinal decontamination. Paracetamol absorption from the gastrointestinal tract is complete within two hours under normal circumstances, so decontamination is most helpful if performed within this timeframe. Gastric lavage, better known as stomach pumping, may be considered if the amount ingested is potentially life-threatening and the procedure can be performed within 60 minutes of ingestion. Activated charcoal is the most common gastrointestinal decontamination procedure as it adsorbs paracetamol, reducing its gastrointestinal absorption. Administering activated charcoal also poses less risk of aspiration than gastric lavage.It appears that the most benefit from activated charcoal is gained if it is given within 30 minutes to two hours of ingestion. Administering activated charcoal later than 2 hours can be considered in patients that may have delayed gastric emptying due to co-ingested drugs or following ingestion of sustained- or delayed-release paracetamol preparations. Activated charcoal should also be administered if co-ingested drugs warrant decontamination. There was reluctance to give activated charcoal in paracetamol overdose, because of the concern that it may also absorb the oral antidote acetylcysteine. Studies have shown that 39% less acetylcysteine is absorbed into the body when they are administered together. There are conflicting recommendations regarding whether to change the dosing of oral acetylcysteine after the administration of activated charcoal, and even whether the dosing of acetylcysteine needs to be altered at all. Intravenous acetylcysteine has no interaction with activated charcoal. Inducing vomiting with syrup of ipecac has no role in paracetamol overdose because the vomiting it induces delays the effective administration of activated charcoal and oral acetylcysteine. 
Liver injury is extremely rare after acute accidental ingestion in children under 6 years of age. Children with accidental exposures do not require gastrointestinal decontamination with either gastric lavage, activated charcoal, or syrup of ipecac. Acetylcysteine Acetylcysteine, also called N-acetylcysteine or NAC, works to reduce paracetamol toxicity by replenishing body stores of the antioxidant glutathione. Glutathione reacts with the toxic NAPQI metabolite so that it does not damage cells and can be safely excreted. NAC was usually given following a treatment nomogram (one for patients with risk factors, and one for those without) but the use of the nomogram is no longer recommended as the evidence base to support the use of risk factors was poor and inconsistent and many of the risk factors are imprecise and difficult to determine with sufficient certainty in clinical practice. Cysteamine and methionine have also been used to prevent hepatotoxicity, although studies show that both are associated with more adverse effects than acetylcysteine. Additionally, acetylcysteine has been shown to be a more effective antidote, particularly in patients presenting greater than 8 hours post-ingestion and for those who present with liver failure symptoms.If the person presents less than eight hours after paracetamol overdose, then acetylcysteine significantly reduces the risk of serious hepatotoxicity and guarantees survival. If acetylcysteine is started more than 8 hours after ingestion, there is a sharp decline in its effectiveness because the cascade of toxic events in the liver has already begun, and the risk of acute liver necrosis and death increases dramatically. Although acetylcysteine is most effective if given early, it still has beneficial effects if given as late as 48 hours after ingestion. If the person presents more than eight hours after the paracetamol overdose, then activated charcoal is not useful, and acetylcysteine is started immediately. 
In earlier presentations, charcoal can be given when the patient arrives and acetylcysteine is initiated while waiting for the paracetamol level results to return from the laboratory.In United States practice, intravenous (IV) and oral administration are considered to be equally effective and safe if given within 8 hours of ingestion. However, IV is the only recommended route in Australasian and British practice. Oral acetylcysteine is given as a 140 mg/kg loading dose followed by 70 mg/kg every four hours for 17 more doses, and if the patient vomits within 1 hour of dose, the dose must be repeated. Oral acetylcysteine may be poorly tolerated due to its unpleasant taste, odor, and its tendency to cause nausea and vomiting. If repeated doses of charcoal are indicated because of another ingested drug, then subsequent doses of charcoal and acetylcysteine should be staggered.Intravenous acetylcysteine is given as a continuous infusion over 20 hours for a total dose 300 mg/kg. Recommended administration involves infusion of a 150 mg/kg loading dose over 15 to 60 minutes, followed by a 50 mg/kg infusion over four hours; the last 100 mg/kg are infused over the remaining 16 hours of the protocol. Intravenous acetylcysteine has the advantage of shortening hospital stay, increasing both doctor and patient convenience, and allowing administration of activated charcoal to reduce absorption of both the paracetamol and any co-ingested drugs without concerns about interference with oral acetylcysteine. Intravenous dosing varies with weight, specifically in children. For patients less than 20 kg, the loading dose is 150 mg/kg in 3 mL/kg diluent, administered over 60 minutes; the second dose is 50 mg/kg in 7 mL/kg diluent over 4 hours; and the third and final dose is 100 mg/kg in 14 mL/kg diluent over 16 hours.The most common adverse effect to acetylcysteine treatment is an anaphylactoid reaction, usually manifested by rash, wheeze, or mild hypotension. 
Adverse reactions are more common in people treated with IV acetylcysteine, occurring in up to 20% of patients. Anaphylactoid reactions are more likely to occur with the first infusion (the loading dose). Rarely, severe life-threatening reactions may occur in predisposed individuals, such as patients with asthma or atopic dermatitis, and may be characterized by respiratory distress, facial swelling, and even death. If an anaphylactoid reaction occurs, the acetylcysteine is temporarily halted or slowed, and antihistamines and other supportive care are administered. For example, a nebulised beta-agonist like salbutamol may be indicated in the event of significant bronchospasm (or prophylactically in patients with a history of bronchospasm secondary to acetylcysteine). It is also important to closely monitor fluids and electrolytes. Liver transplant In people who develop acute liver failure or who are otherwise expected to die from liver failure, the mainstay of management is liver transplantation. Liver transplants are performed in specialist centers. The most commonly used criteria for liver transplant were developed by physicians at King's College Hospital in London. Patients are recommended for transplant if they have an arterial blood pH less than 7.3 after fluid resuscitation or if a patient has Grade III or IV encephalopathy, a prothrombin time greater than 100 seconds, and a serum creatinine greater than 300 mmol/L in a 24-hour period. Other forms of liver support have been used including partial liver transplants. These techniques have the advantage of supporting the patient while their own liver regenerates. Once liver function returns, immunosuppressive drugs are commenced, and the patient must take immunosuppressive medication for the rest of their life. Prognosis The mortality rate from paracetamol overdose increases two days after the ingestion, reaches a maximum on day four, and then gradually decreases. 
Acidosis is the most important single indicator of probable mortality and the need for transplantation. A mortality rate of 95% without transplant was reported in patients who had a documented pH less than 7.30. Other indicators of poor prognosis include chronic kidney disease (stage 3 or worse), hepatic encephalopathy, a markedly elevated prothrombin time, or an elevated blood lactic acid level (lactic acidosis). One study has shown that a factor V level less than 10% of normal indicated a poor prognosis (91% mortality), whereas a ratio of factor VIII to factor V of less than 30 indicated a good prognosis (100% survival). Patients with a poor prognosis are usually identified for likely liver transplantation. Patients that do not die are expected to fully recover and have a normal life expectancy and quality of life. Epidemiology Many over-the-counter and prescription-only medications contain paracetamol. Because of its wide availability paired with comparably high toxicity, (compared to ibuprofen and aspirin) there is a much higher potential for overdose. Paracetamol toxicity is one of the most common causes of poisoning worldwide. In the United States, the United Kingdom, Australia, and New Zealand, paracetamol is the most common cause of drug overdoses. Additionally, in both the United States and the United Kingdom it is the most common cause of acute liver failure.In England and Wales an estimated 41,200 cases of paracetamol poisoning occurred in 1989 to 1990, with a mortality of 0.40%. It is estimated that 150 to 200 deaths and 15 to 20 liver transplants occur as a result of poisoning each year in England and Wales. Paracetamol overdose results in more calls to poison control centers in the US than overdose of any other pharmacological substance, accounting for more than 100,000 calls, as well as 56,000 emergency room visits, 2,600 hospitalizations, and 458 deaths due to acute liver failure per year. 
A study of cases of acute liver failure between November 2000 and October 2004 by the Centers for Disease Control and Prevention in the USA found that paracetamol was the cause of 41% of all cases in adults, and 25% of cases in children. References External links Gerth, Jeff; T. Christian Miller (September 20, 2013). \"Use Only as Directed\". ProPublica. Retrieved October 12, 2013."
    ],
  },
  indent=2,
  ensure_ascii=False,
)

messages = [
  {"role": "system", "content": ""},
  {"role": "refs", "content": refs},
  {"role": "user", "content": "What are the signs of Paracetamol poisoning?"},
]
```

#### Tools

To use function tools, first declare the tools' schemas to Ghost 8B Beta through the "tool:schemas" role. Let's look at an example to make this easier to understand.

```python
tools = json.dumps(
  [
    {
      "type": "function",
      "function": {
        "name": "get_flight_status",
        "description": "Get the status of a specific flight",
        "parameters": {
          "type": "object",
          "properties": {
            "flight_number": {
              "type": "string",
              "description": "The flight number"
            }
          }
        }
      }
    }
  ],
  indent=2,
  ensure_ascii=False,
)

messages = [
  {"role": "system", "content": ""},
  {"role": "tool:schemas", "content": tools},
  {"role": "user", "content": "What is the status of flight AZ123?"},
]
```

In the above example, the model will return the execution information for the tool it wants to use: an object with a structure like the content of the "tool:execute" role in the next example. We need to extract and parse it, call the tool to get its results, and pass those results back through the "tool:results" role. Note that the "tool:execute" message must be passed again. The model's final answer will then be returned.

```python
messages = [
  {"role": "system", "content": ""},
  {"role": "tool:schemas", "content": tools},
  {"role": "user", "content": "What is the status of flight AZ123?"},
  {
    "role": "tool:execute",
    "content": json.dumps({
      "type": "function",
      "name": "get_flight_status",
      "arguments": { "flight_number": "AZ123" }
    }, indent=2, ensure_ascii=False)
  },
  {
    "role": "tool:results",
    "content": json.dumps({
      "type": "function",
      "name": "get_flight_status",
      "results": {
        "flight_number": "AZ123",
        "departure_airport": "FCO",
        "arrival_airport": "JFK",
        "departure_time": "10:00 AM",
        "arrival_time": "1:00 PM",
        "status": "On time",
        "planned_route": [
          { "latitude": 41.8, "longitude": 12.25 },
          { "latitude": 48.8582, "longitude": 2.2945 },
          { "latitude": 40.6413, "longitude": -73.7781 }
        ]
      }
    }, indent=2, ensure_ascii=False)
  },
]
```

Done, that is an example simulating how to use function tools. Note that Ghost 8B Beta supports calling multiple tools at once.
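
Putting the pieces together, a minimal driver loop might look like the sketch below. The `get_flight_status` implementation and the simulated model output are hypothetical placeholders (in practice the "tool:execute" payload comes from the model's response); the sketch only shows how to parse that payload, run the matching function, and append the "tool:execute" and "tool:results" messages before asking the model again.

```python
import json

def get_flight_status(flight_number: str) -> dict:
    # Hypothetical local implementation of the declared tool.
    return {"flight_number": flight_number, "status": "On time"}

# Map declared tool names to their local implementations.
TOOL_REGISTRY = {"get_flight_status": get_flight_status}

def handle_tool_call(messages: list, model_output: str) -> list:
    """Parse a "tool:execute" payload, run the tool, and append both messages."""
    call = json.loads(model_output)          # e.g. {"type": "function", "name": ..., "arguments": ...}
    func = TOOL_REGISTRY[call["name"]]       # look up the declared tool
    results = func(**call["arguments"])      # execute with the model's arguments
    messages.append({"role": "tool:execute", "content": model_output})
    messages.append({
        "role": "tool:results",
        "content": json.dumps(
            {"type": "function", "name": call["name"], "results": results},
            indent=2, ensure_ascii=False,
        ),
    })
    return messages

# Simulated model output for "What is the status of flight AZ123?"
model_output = json.dumps({
    "type": "function",
    "name": "get_flight_status",
    "arguments": {"flight_number": "AZC123"[:2] + "123"},  # "AZ123"
})
messages = handle_tool_call([], model_output)
```

After this, `messages` (together with the original system, "tool:schemas", and user messages) is sent back to the model to obtain the final answer.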

### Deployments

For deployment, we recommend using vLLM. You can deploy the model by following these steps:

- You can install vLLM by running the following command.

  ```bash
  pip install "vllm>=0.4.3"
  ```

- Utilize vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server using the command:

  ```bash
  python -m vllm.entrypoints.openai.api_server --served-model-name ghost-8b-beta --model ghost-x/ghost-8b-beta
  ```

- Try it now:

  ```bash
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "ghost-8b-beta",
    "messages": [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Why is the sky blue?"}
    ]
    }'
  ```
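
The same request can also be issued from Python. The sketch below uses only the standard library and assumes the vLLM server from the previous step is reachable at `localhost:8000`; if it is not running, the request simply fails with a connection error.

```python
import json
import urllib.error
import urllib.request

# Build the same chat-completions payload as the curl example above.
payload = {
    "model": "ghost-8b-beta",
    "messages": [
        {"role": "system", "content": ""},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(request, timeout=30) as response:
        reply = json.load(response)
        # The assistant's answer is in the first choice.
        print(reply["choices"][0]["message"]["content"])
except (urllib.error.URLError, OSError) as exc:
    print(f"Server not reachable: {exc}")
```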

Done, just enjoy it.

## Evaluation

To evaluate whether a model is effective and of high quality, it must go through rigorous evaluations that are trusted by the community, such as AlpacaEval 2.0, MT-Bench, MMLU-Pro, and GSM8K.

We use the [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) library for evaluations to perform benchmarks for common tasks, except for specific tasks such as AlpacaEval 2.0 and MT Bench.

Once again, Ghost 8B Beta is a language model of fairly modest size, with the goals of multilingual capability (9+ of the most popular languages) and function-tool support in those languages. Furthermore, it aims to match or surpass models many times its size.

Note: when running evaluations of the model, set "max_new_tokens" high enough for the model to finish its answer, because it explains its reasoning before drawing a conclusion.
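
As a minimal sketch, with a standard Hugging Face `generate` call the relevant settings look like this; the specific values are illustrative assumptions, not official recommendations.

```python
# Illustrative generation settings for evaluation runs. The exact value of
# "max_new_tokens" is an assumption: raise it until answers stop being truncated.
generation_kwargs = {
    "max_new_tokens": 1024,  # leave room for the explanation before the conclusion
    "do_sample": False,      # deterministic decoding is typical for benchmarks
}

# With a loaded Hugging Face model and tokenized inputs, this would be used as:
# outputs = model.generate(**inputs, **generation_kwargs)
```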

### AlpacaEval 2.0

Overview based on evaluation results from this AlpacaEval 2.0:

- In the length-controlled win rate (lc_winrate), Ghost 8B Beta outperformed outstanding models such as Llama 3 8B Instruct, GPT 3.5 Turbo (06/13), and GPT 3.5 Turbo (11/06). In addition, its score is quite close to larger models such as Mixtral 8x7B v0.1 and Gemini Pro.
- In the raw win rate (winrate), Ghost 8B Beta outperformed Claude 3 Opus (02/29), Claude 3 Sonnet (02/29), GPT-4, GPT-4 (03/14), and Mistral Large (02/24).

Of course, AlpacaEval 2.0 highlights the "Length-controlled (LC) win rates" score, so the raw "win rates" here are for reference only.

| model                    | avg length | lc winrate | winrate   | std error |
| ------------------------ | ---------- | ---------- | --------- | --------- |
| GPT-4 Preview (11/06)    | 2049       | 50.00      | 50.00     | 0.00      |
| Claude 3 Opus (02/29)    | 1388       | 40.51      | 29.11     | 1.39      |
| Claude 3 Sonnet (02/29)  | 1420       | 34.87      | 25.56     | 1.34      |
| Llama 3 70B Instruct     | 1919       | 34.42      | 33.18     | 1.39      |
| Gemini Pro               | 1456       | 24.38      | 18.18     | 1.16      |
| Mixtral 8x7B v0.1        | 1465       | 23.69      | 18.26     | 1.19      |
| **Ghost 8B Beta (d0x5)** | **2430**   | **23.12**  | **29.14** | **1.32**  |
| Llama 3 8B Instruct      | 1899       | 22.92      | 22.57     | 1.26      |
| GPT 3.5 Turbo (06/13)    | 1331       | 22.35      | 14.09     | 0.00      |
| GPT 3.5 Turbo (11/06)    | 796        | 19.30      | 9.17      | 0.00      |
| Mistral 7B Instruct v0.2 | 1676       | 17.11      | 14.72     | 1.08      |

#### A quick talk about AlpacaEval 2.0

[AlpacaEval: An Automatic Evaluator for Instruction-following Language Models](https://github.com/tatsu-lab/alpaca_eval)

![LC AlpacaEval is the most highly correlated benchmark with Chat Arena.](https://github.com/tatsu-lab/alpaca_eval/blob/main/figures/chat_correlations_no_ae.png?raw=true)

AlpacaEval 2.0 with length-controlled win rates (paper) has a Spearman correlation of 0.98 with Chatbot Arena while costing less than $10 in OpenAI credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (< 5min), cheap (< $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks: LC AlpacaEval is the most highly correlated benchmark with Chat Arena.

```tex
@article{dubois2024length,
  title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},
  author={Dubois, Yann and Galambosi, Bal{\'a}zs and Liang, Percy and Hashimoto, Tatsunori B},
  journal={arXiv preprint arXiv:2404.04475},
  year={2024}
}
```

### MT Bench

Based on MT Bench average scores, Ghost 8B Beta scores very close to much larger models such as GPT 3.5 Turbo and Claude v1. In addition, it outperforms larger open models such as Vicuna 33B v1.3, WizardLM 30B, and Llama 2 70B Chat.

**Average Scores**

| model                    | score        |
| ------------------------ | ------------ |
| GPT 4                    | 8.990625     |
| GPT 3.5 Turbo            | 7.943750     |
| Claude v1                | 7.900000     |
| **Ghost 8B Beta (d0x5)** | **7.740506** |
| Vicuna 33B v1.3          | 7.121875     |
| WizardLM 30B             | 7.009375     |
| Llama 2 70B chat         | 6.856250     |

**Plots**

MT Bench's plots reveal some more specific details. In the STEM, writing, and roleplay categories, the model is at the same level as GPT 4; notably, its STEM score is essentially identical to GPT 4's.

![MT Bench, Plots](https://ghost-x.org/docs/models/ghost-8b-beta/images/mt-bench-plots_hu388fe663aadaf41c7aa98d16649cd0a8_103514_971x525_resize_q85_h2_lanczos_3.webp)

#### A quick talk about MT Bench

[MT Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md) is a set of challenging, multi-turn, and open-ended questions for evaluating chat assistants. It uses LLM-as-a-judge to evaluate model responses.

```tex
@misc{zheng2023judging,
      title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
      author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
      year={2023},
      eprint={2306.05685},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### GSM8K

GSM8K is a widely trusted and accepted benchmark for testing a language model's mathematical capabilities. In this evaluation, Ghost 8B Beta was run with "gsm8k_cot_zeroshot", i.e., zero-shot chain-of-thought (answering without any in-context examples); the score achieved was approximately 66%, which is very impressive.

| tasks              | version | filter           | n shot | metric      |  value | std error |
| ------------------ | ------: | ---------------- | -----: | ----------- | -----: | --------: |
| gsm8k_cot_zeroshot |       3 | flexible-extract |      0 | exact_match | 0.6641 |    0.0130 |

**Leaderboard**

The rankings are referenced from this [article](https://klu.ai/glossary/GSM8K-eval).

Based on the leaderboard, Ghost 8B Beta outperforms proprietary models such as xAI Grok 1, OpenAI GPT 3.5, and Mistral Mixtral 8x7B, and is roughly on par with Mistral Medium. Furthermore, Ghost 8B Beta, Claude 2, and Claude 3 are among the few models evaluated with the zero-shot method.

| model                    | accuracy  | methodology                   |
| ------------------------ | --------- | ----------------------------- |
| Claude 3                 | 95%       | Zero shot                     |
| Gemini Ultra             | 94.4%     | Majority Vote, 32 Generations |
| GPT 4                    | 92%       | SFT & 5-shot CoT              |
| Claude 2                 | 88%       | Zero shot                     |
| Gemini Pro               | 86.5%     | Majority Vote, 32 Generations |
| Mistral Large            | 81%       | 5-shot Learning               |
| PaLM 2                   | 80%       | 5-shot Learning               |
| Mistral Medium           | 66.7%     | 5-shot Learning               |
| **Ghost 8B Beta (d0x5)** | **66.4%** | **Zero shot**                 |
| xAI Grok 1               | 62.9%     | 8-shot Learning               |
| Mistral Mixtral 8x7B     | 58.4%     | 5-shot Learning               |
| GPT 3.5                  | 57.1%     | 5-shot Learning               |

<details close>
  <summary>Snapshot of original results</summary>
  ~

```json
{
  "results": {
    "gsm8k_cot_zeroshot": {
      "alias": "gsm8k_cot_zeroshot",
      "exact_match,strict-match": 0.576194086429113,
      "exact_match_stderr,strict-match": 0.013611632008810354,
      "exact_match,flexible-extract": 0.6641394996209249,
      "exact_match_stderr,flexible-extract": 0.01300922471426737
    }
  },
  "model_source": "hf",
  "model_name": "lamhieu/ghost-8b-beta-disl-0x5",
  "model_name_sanitized": "lamhieu__ghost-8b-beta-disl-0x5",
  "start_time": 1272225.578433881,
  "end_time": 1276832.731746671,
  "total_evaluation_time_seconds": "4607.153312789975",
  ...
}
```

</details>

#### A quick talk about GSM8K

[Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168)

```tex
@misc{cobbe2021training,
      title={Training Verifiers to Solve Math Word Problems},
      author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
      year={2021},
      eprint={2110.14168},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

### GPQA

Ghost 8B Beta was evaluated with the task "gpqa_diamond_cot_zeroshot", scoring approximately 27%. To give an overview of the model's capabilities, we have compiled a leaderboard below.

| tasks                     | version | filter           | n shot | metric      |  value | std error |
| ------------------------- | ------: | ---------------- | -----: | ----------- | -----: | --------: |
| gpqa_diamond_cot_zeroshot |       1 | flexible-extract |      0 | exact_match | 0.2676 |    0.0315 |

**Leaderboard**

The rankings are referenced from this [article](https://www.anthropic.com/news/claude-3-family).

| model                    | accuracy  | methodology       |
| ------------------------ | --------- | ----------------- |
| Claude 3 Opus            | 50.4%     | Zero-shot CoT     |
| Claude 3 Sonnet          | 40.4%     | Zero-shot CoT     |
| GPT-4                    | 35.7%     | Zero-shot CoT     |
| Claude 3 Haiku           | 33.3%     | Zero-shot CoT     |
| GPT-3.5                  | 28.1%     | Zero-shot CoT     |
| **Ghost 8B Beta (d0x5)** | **26.7%** | **Zero-shot CoT** |

<details close>
  <summary>Snapshot of original results</summary>
  ~

```json
{
  "results": {
    "gpqa_diamond_cot_zeroshot": {
      "alias": "gpqa_diamond_cot_zeroshot",
      "exact_match,strict-match": 0.005050505050505051,
      "exact_match_stderr,strict-match": 0.005050505050505052,
      "exact_match,flexible-extract": 0.2676767676767677,
      "exact_match_stderr,flexible-extract": 0.03154449888270285
    }
  },
  "model_source": "hf",
  "model_name": "lamhieu/ghost-8b-beta-disl-0x5",
  "model_name_sanitized": "lamhieu__ghost-8b-beta-disl-0x5",
  "start_time": 1290603.02832825,
  "end_time": 1294954.839677157,
  "total_evaluation_time_seconds": "4351.811348907184"
  ...
}
```

</details>

#### A quick talk about GPQA

[GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/abs/2311.12022).

```tex
@misc{rein2023gpqa,
      title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
      author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
      year={2023},
      eprint={2311.12022},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```

### MMLU Pro

Ghost 8B Beta was evaluated with the task "leaderboard_mmlu_pro", scoring about 30%. To give an overview of the model's capabilities, we have compiled a leaderboard below.

| tasks                | version | filter | n shot | metric |  value | std error |
| -------------------- | ------: | ------ | -----: | ------ | -----: | --------: |
| leaderboard_mmlu_pro |     0.1 | none   |      5 | acc    | 0.3042 |    0.0042 |

**Leaderboard**

The rankings are referenced from this [leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

| model                    | accuracy   | methodology |
| ------------------------ | ---------- | ----------- |
| Qwen 2 72B Instruct      | 48.92%     | 5-shot      |
| Llama 3 70B Instruct     | 46.74%     | 5-shot      |
| Deepseek LLM 67B Chat    | 32.72%     | 5-shot      |
| **Ghost 8B Beta (d0x5)** | **30.42%** | **5-shot**  |
| DBRX Instruct 132B       | 29.81%     | 5-shot      |
| Llama 3 8B Instruct      | 29.60%     | 5-shot      |
| SOLAR 10.7B v1.0         | 26.67%     | 5-shot      |
| C4AI Command-R 35B       | 26.33%     | 5-shot      |
| Aya 23 35B               | 26.18%     | 5-shot      |
| Llama 2 70B Chat         | 15.92%     | 5-shot      |

<details close>
  <summary>Snapshot of original results</summary>
  ~

```json
{
  "results": {
    "leaderboard_mmlu_pro": {
      "alias": "leaderboard_mmlu_pro",
      "acc,none": 0.30418882978723405,
      "acc_stderr,none": 0.004194367367612373
    }
  },
  "model_source": "hf",
  "model_name": "lamhieu/ghost-8b-beta-disl-0x5",
  "model_name_sanitized": "lamhieu__ghost-8b-beta-disl-0x5",
  "start_time": 1759569.60272917,
  "end_time": 1761073.963532963,
  "total_evaluation_time_seconds": "1504.3608037931845",
  ...
}
```

</details>

#### A quick talk about MMLU Pro

[MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark](https://arxiv.org/abs/2406.01574).

```tex
@misc{wang2024mmluprorobustchallengingmultitask,
      title={MMLU-Pro: A More Robust and Challenging Multi-Task Language
      Understanding Benchmark},
      author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
      year={2024},
      eprint={2406.01574},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01574},
}
```

### There's more

Other evaluations will be done soon and updated here later.

## Notes

If the model is interesting or helpful to your work, feel free to buy me a beer and let's raise a glass. It will motivate me to improve the model in the future. 🍻

<a href="https://www.buymeacoffee.com/lh0x00" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>

### Thanks

The project extends sincere thanks to friends, including...

#### Meta

<p><img src="https://cdn-avatars.huggingface.co/v1/production/uploads/646cf8084eefb026fb8fd8bc/oCTqufkdTkjyGodsx1vo1.png" width="150px" align="center" /></p>

[Meta](https://ai.meta.com), for providing a great foundational model.

#### Unsloth

<p><img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/made with unsloth.png" width="150px" align="center" /></p>

[Unsloth](https://unsloth.ai), a great tool that helps us easily develop products, at a lower cost than expected.

#### Kaggle

<p><img src="https://cdn-uploads.huggingface.co/production/uploads/600ae38cc92b79f54efd4556/dcbpF6YS6RQhqDd6GZZ2v.png" width="150px" align="center" /></p>

[Kaggle](https://kaggle.com), free GPUs are available to the community for research.

### Contact

If you want to cooperate, contribute, consult or sponsor, please contact us.

Follow **Ghost X** to stay updated with the latest information.

- Twitter/X via [@ghostx_ai](https://twitter.com/ghostx_ai).
- HuggingFace via [@ghost-x](https://huggingface.co/ghost-x).
- Official website [ghost-x.org](https://ghost-x.org/).
- Email: [lamhieu.vk [at] gmail dot com](mailto:[email protected]) (author)

### Cites

If you find our work helpful, feel free to cite it.

```tex
@misc{ghost-8b-beta,
  author = {{Ghost X, Hieu Lam}},
  title = {Ghost 8B Beta},
  url = {https://ghost-x.org/docs/models/ghost-8b-beta},
  year = {2024}
}
```