---
license: apache-2.0
inference: false
---

# MistralLite-AWQ Model

MistralLite-AWQ is a version of the [MistralLite](https://huggingface.co/amazon/MistralLite) model that was
quantized with the AWQ method introduced by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
The MistralLite-AWQ model files are approximately **70% smaller** than the original MistralLite files, while maintaining comparable performance.

Please refer to the [original MistralLite model card](https://huggingface.co/amazon/MistralLite) for details about the model
preparation and training processes.

## MistralLite-AWQ Variants

| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|---:|---------------:|--------:|-----------|
| [main](https://huggingface.co/amazon/MistralLite-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MistralLite-AWQ-64g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MistralLite-AWQ-32g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
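
The non-default variants live on separate branches of this repository. To use one, pass the branch name as the model revision; a minimal sketch with vLLM's offline API (branch name taken from the table above):

```python
from vllm import LLM

# Load the 64-group-size variant by pointing vLLM at its branch.
llm = LLM(
    model="amazon/MistralLite-AWQ",
    revision="MistralLite-AWQ-64g-4b-GEMM",
    quantization="awq",
)
```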

## Dependencies
- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MistralLite model.
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking.
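
The pinned versions can be installed from PyPI, for example:

```bash
pip install autoawq==0.2.5 vllm==0.4.2
```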

## Evaluations

### Long Context

The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.

#### Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

| Model Name                                         |   n_topics=05 |   n_topics=10 |   n_topics=15 |   n_topics=20 |   n_topics=25 |
|:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
| _n_tokens_ (approx.) =         | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MistralLite                                        |           100 |           100 |           100 |           100 |            98 |
| **MistralLite-AWQ**           |          **100** |          **100** |          **100**|          **100** |           **98** |
| **MistralLite-AWQ-64g-4b-GEMM**            |          **100** |          **100** |          **100**|          **100** |           **98** |
| **MistralLite-AWQ-32g-4b-GEMM**            |          **100** |          **100** |          **100**|          **100** |           **98** |
| Mistral-7B-Instruct-v0.1                           |            96 |            52 |             2 |             0 |             0 |
| Mistral-7B-Instruct-v0.2                           |           100 |           100 |           100 |           100 |           100 |
| Mixtral-8x7B-v0.1                                  |             0 |             0 |             0 |             0 |             0 |
| Mixtral-8x7B-Instruct-v0.1                         |           100 |           100 |           100 |           100 |           100 |

#### Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

| Model Name                                         |   n_lines=200 |   n_lines=300 |   n_lines=400 |   n_lines=500 |   n_lines=600 |   n_lines=680 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) =         | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ | 
| MistralLite                                        |           100 |            94 |            86 |            82 |            76 |            66 |
| **MistralLite-AWQ**           |           **96**|           **94**|           **88** |           **80** |           **70**|           **62** |
| **MistralLite-AWQ-64g-4b-GEMM**            |           **96**|           **96**|           **90** |           **70** |           **72**|           **60** |
| **MistralLite-AWQ-32g-4b-GEMM**            |           **98**|           **96**|           **84** |           **76** |           **70**|           **62** |
| Mistral-7B-Instruct-v0.1                           |            96 |            56 |            38 |            36 |            30 |            30 |
| Mistral-7B-Instruct-v0.2                           |           100 |           100 |            96 |            98 |            96 |            84 |
| Mixtral-8x7B-v0.1                                  |            54 |            38 |            56 |            66 |            62 |            38 |
| Mixtral-8x7B-Instruct-v0.1                         |           100 |           100 |           100 |           100 |           100 |           100 |

#### Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

| Model Name                               |   n_garbage=12000 |   n_garbage=20000 |   n_garbage=31000 |   n_garbage=38000 |   n_garbage=45000 | n_garbage=60000 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) =         | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MistralLite                              |               100 |               100 |               100 |               100 |               100 | 100|
| **MistralLite-AWQ** |              **100** |             **100**|              **100**|              **100** |              **100**| **100**|
| **MistralLite-AWQ-64g-4b-GEMM**  |              **100** |             **100**|              **100**|              **100** |              **100**| **100**|
| **MistralLite-AWQ-32g-4b-GEMM**  |              **100** |             **100**|              **100**|              **100** |              **100**| **100**|
| Mistral-7B-Instruct-v0.1                            |               100 |                50 |                30 |                20 |                10 |                10 |
| Mistral-7B-Instruct-v0.2                            |               100 |               100 |               100 |               100 |               100 |               100 |
| Mixtral-8x7B-v0.1                                   |               100 |               100 |               100 |               100 |               100 |               100 |
| Mixtral-8x7B-Instruct-v0.1                          |               100 |               100 |               100 |                90 |               100 |               100 |


#### QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

|Model Name| Test set Accuracy | Hard subset Accuracy|
|:----------|-------------:|-------------:|
| MistralLite                              |   56.8 |       74.5 |
| **MistralLite-AWQ** |  **55.3** |      **71.8** |
| **MistralLite-AWQ-64g-4b-GEMM**  |  **55.2** |      **72.9** |
| **MistralLite-AWQ-32g-4b-GEMM**  |  **56.6** |      **72.8** |
| Mistral-7B-Instruct-v0.1                 |   45.2 |       58.9 |
| Mistral-7B-Instruct-v0.2                 |   55.5 |       74   |
| Mixtral-8x7B-v0.1                        |   75   |       74.1 |
| Mixtral-8x7B-Instruct-v0.1               |   68.7 |       83.3 |

## Usage

### Inference via vLLM HTTP Host

#### Launch Host
```bash
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
```

#### Query Host
```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "amazon/MistralLite-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
    }'
```
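
The same endpoint can also be queried with the OpenAI Python client. A minimal sketch, assuming `openai>=1.0` is installed (the vLLM server does not verify the API key by default, so any placeholder value works):

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="amazon/MistralLite-AWQ",
    prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    temperature=0,
    max_tokens=256,
)
print(completion.choices[0].text)
```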

### Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
```python
from vllm import LLM, SamplingParams

prompts = [
   "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```
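
The quantized weights can also be loaded without vLLM, using AutoAWQ together with Hugging Face Transformers. A minimal sketch, assuming a CUDA device and the pinned `autoawq==0.2.5`:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "amazon/MistralLite-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)

prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Greedy decoding, mirroring temperature=0 in the vLLM examples above.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```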

## License

Apache 2.0

## Limitations

Before using the MistralLite-AWQ model, it is important to perform your own
independent assessment and to take measures to ensure that your use complies
with your own specific quality control practices and standards, and with the
local rules, laws, regulations, licenses and terms that apply to you and your
content.