---
{}
---

# MPT-7b-8k-chat

The original model is released under CC-BY-NC-SA-4.0, and the AWQ framework is MIT-licensed.

Original model can be found at [https://huggingface.co/mosaicml/mpt-7b-8k-chat](https://huggingface.co/mosaicml/mpt-7b-8k-chat).

## ⚡ 4-bit Inference Speed 

Machines were rented from RunPod; speed may vary depending on both the GPU and the CPU.

H100:
- CUDA 12.0, Driver 525.105.17: 92 tokens/s (10.82 ms/token)

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

A6000 (2 different VMs):
- CUDA 12.0, Driver 525.105.17: 61 tokens/s (16.31 ms/token)
- CUDA 12.1, Driver 530.30.02: 46 tokens/s (21.79 ms/token)

## How to run

Install [AWQ](https://github.com/mit-han-lab/llm-awq):

```sh
# Clone the AWQ repo, install the Python package, then build and install the CUDA kernels
git clone https://github.com/mit-han-lab/llm-awq && \
cd llm-awq && \
pip3 install -e . && \
cd awq/kernels && \
python3 setup.py install && \
cd ../.. && \
pip3 install einops  # einops is required by the MPT model code
```
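
To confirm the CUDA kernels built correctly, you can try importing the compiled extension. This assumes the extension name `awq_inference_engine` used by llm-awq's kernel setup at the time of writing; adjust if your checkout differs:

```sh
# Sanity check: import the compiled kernel extension (name assumed from llm-awq's kernels setup.py)
python3 -c "import awq_inference_engine; print('AWQ CUDA kernels OK')"
```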

Run:

```sh
hfuser="casperhansen"
model_name="mpt-7b-8k-chat-awq"
group_size=128
repo_path="$hfuser/$model_name"
model_path="/workspace/llm-awq/$model_name"
quantized_model_path="/workspace/llm-awq/$model_name/$model_name-w4-g$group_size.pt"

git clone https://huggingface.co/$repo_path

python3 tinychat/demo.py --model_type mpt \
    --model_path $model_path \
    --q_group_size $group_size \
    --load_quant $quantized_model_path \
    --precision W4A16
```
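
This repo already ships the quantized `.pt` checkpoint, so the steps above are all you need. If you prefer to regenerate it yourself from the original FP16 model, the upstream llm-awq entry point can run the AWQ search and export real INT4 weights. The sketch below follows the flags documented in the llm-awq README at the time of writing; the FP16 model path is illustrative, so verify both against your checkout before running:

```sh
# Optional: re-quantize from the original FP16 model (not required for the demo above)
fp16_model="/workspace/mpt-7b-8k-chat"   # assumed local clone of mosaicml/mpt-7b-8k-chat
mkdir -p awq_cache quant_cache

# 1) Run the AWQ search and dump the scale/clip results
python3 -m awq.entry --model_path $fp16_model \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/mpt-7b-8k-chat-w4-g128.pt

# 2) Export real INT4 weights usable by TinyChat
python3 -m awq.entry --model_path $fp16_model \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/mpt-7b-8k-chat-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/mpt-7b-8k-chat-w4-g128-awq.pt
```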

## Citation

Please cite this model using the following format:

```
@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-30B: Raising the bar
for open-source foundation models},
    year      = {2023},
    url       = {www.mosaicml.com/blog/mpt-30b},
    note      = {Accessed: 2023-06-22},
    urldate   = {2023-06-22}
}
```