---
license: apache-2.0
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1
inference: false
model_link: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
model_name: mistralai/Mixtral-8x22B-Instruct-v0.1
pipeline_tag: text-generation
quantized_by: FriendliAI
tags:
- pretrained
---

<!-- header start -->
<p align="center">
  <img src="https://i.imgur.com/mNM6Cai.png" width="100%" alt="Friendli Logo">
</p>
<!-- header end -->

# Mixtral-8x22B-Instruct-v0.1 - FP8

- Model creator: [Mistral AI](https://huggingface.co/mistralai)
- Original model: [Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

## Description

This repo contains the Mixtral-8x22B-Instruct-v0.1 model quantized to FP8 by FriendliAI, significantly enhancing its inference efficiency while maintaining high accuracy.
Note that FP8 is only supported by NVIDIA Ada, Hopper, and Blackwell GPU architectures.
Check out the [FriendliAI documentation](https://docs.friendli.ai/) for more details.
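
Because FP8 support depends on the GPU architecture, it can help to verify your hardware before launching. Below is a small sketch, not part of the official workflow and assuming PyTorch is installed, that checks the CUDA compute capability: Ada is 8.9, Hopper is 9.0, and Blackwell is 10.0 and above, so anything at or above (8, 9) qualifies.

```python
import torch

# FP8 requires compute capability 8.9 (Ada) or newer (Hopper 9.0, Blackwell 10.0+).
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 9):
    print(f"Compute capability {major}.{minor}: FP8 is supported on this GPU.")
else:
    print(f"Compute capability {major}.{minor}: FP8 is NOT supported on this GPU.")
```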

## Compatibility

This model is compatible with **[Friendli Container](https://friendli.ai/products/container/)**.

## Prerequisites

- Before you begin, make sure you have signed up for [Friendli Suite](https://suite.friendli.ai/). **You can use Friendli Containers free of charge for four weeks.**
- Prepare a Personal Access Token following [this guide](#preparing-personal-access-token).
- Prepare a Friendli Container Secret following [this guide](#preparing-container-secret).

### Preparing Personal Access Token

A PAT (Personal Access Token) is the user credential for logging into our container registry.

1. Sign in to [Friendli Suite](https://suite.friendli.ai/).
2. Go to **[User Settings > Tokens](https://suite.friendli.ai/user-settings/tokens)** and click **'Create new token'**.
3. Save your created token value.

### Pulling Friendli Container Image

1. Log in to the Docker client using the personal access token created as outlined in [this guide](#preparing-personal-access-token).

   ```sh
   export FRIENDLI_PAT="YOUR PAT"
   docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_PAT
   ```

2. Pull the image.

   ```sh
   docker pull registry.friendli.ai/trial
   ```

## Running Friendli Container

Once you've prepared the Friendli Container image, you can launch it to create a serving endpoint.

```sh
docker run \
  --gpus '"device=0,1,2,3"' \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8
```
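
Once the endpoint is up, you can send requests to it. The sketch below assumes the container exposes an OpenAI-compatible chat completions route on the port mapped above; refer to the [FriendliAI documentation](https://docs.friendli.ai/) for the exact API surface of your container version.

```python
import requests

# Hypothetical request against the endpoint launched above (mapped to port 8000).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What's the weather like in Paris?"}],
        "max_tokens": 256,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```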

### Optimizing Inference Performance with Policy Search

To serve MoE models efficiently, you must first run a policy search to find the optimal execution policy:

```sh
export POLICY_DIR=$PWD/policy

mkdir -p $POLICY_DIR

docker run \
  --gpus '"device=0,1,2,3"' \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```

Once the search completes, the optimal policy is compiled into a policy file and saved at `$POLICY_DIR`.
You can then create an inference endpoint with this optimal policy as follows:

```sh
docker run \
  --gpus '"device=0,1,2,3"' \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name FriendliAI/Mixtral-8x22B-Instruct-v0.1-fp8 \
  --algo-policy-dir /policy
```

---

# Original model card: Mistral AI's Mixtral-8x22B-Instruct-v0.1

# Model Card for Mixtral-8x22B-Instruct-v0.1
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

## Run the model
```python
import torch
from transformers import AutoModelForCausalLM
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.tool_calls import (
    Tool,
    Function,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

device = "cuda"  # the device to load the model onto

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the user's location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris"),
    ],
    model="test",
)

# encode_chat_completion returns token ids as a plain list, so wrap them in a
# batched tensor before moving them to the device.
encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
model_inputs = torch.tensor([encodeds]).to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer
# The SentencePiece tokenizer decodes a list of ids, not a tensor.
decoded = sp_tokenizer.decode(generated_ids[0].tolist())
print(decoded)
```
Alternatively, you can run this example with the Hugging Face tokenizer.
To use this example, you'll need transformers version 4.39.0 or higher.
```console
pip install "transformers>=4.39.0"
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
conversation = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {
        "role": "tool_calls",
        "content": [
            {
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France", "format": "celsius"},
            }
        ]
    },
    {
        "role": "tool_results",
        "content": {"content": 22}
    },
    {"role": "assistant", "content": "The current temperature in Paris, France is 22 degrees Celsius."},
    {"role": "user", "content": "What about San Francisco?"}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "format": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use. Infer this from the user's location.",
                    },
                },
                "required": ["location", "format"],
            },
        },
    }
]

# render the tool use prompt as a string:
tool_use_prompt = tokenizer.apply_chat_template(
    conversation,
    chat_template="tool_use",
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer(tool_use_prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

# Instruct tokenizer
The HuggingFace tokenizer included in this release should match our own. To compare:
`pip install mistral-common`

```py
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="How many experts ?"),
        AssistantMessage(content="8"),
        UserMessage(content="How big ?"),
        AssistantMessage(content="22B"),
        UserMessage(content="Noice 🎉 !"),
    ],
    model="test",
)
hf_messages = mistral_query.model_dump()['messages']

tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1')
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

assert tokenized_hf == tokenized_mistral
```

# Function calling and special tokens
This tokenizer includes additional special tokens related to function calling:
- `[TOOL_CALLS]`
- `[AVAILABLE_TOOLS]`
- `[/AVAILABLE_TOOLS]`
- `[TOOL_RESULTS]`
- `[/TOOL_RESULTS]`

If you want to use this model with function calling, please be sure to apply it similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299).
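
As a quick sanity check, you can confirm these tokens are registered in the Hugging Face tokenizer's vocabulary. The snippet below is an illustrative sketch that looks up the token strings listed above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
vocab = tokenizer.get_vocab()
# Each function-calling special token should map to a vocabulary id.
for token in ["[TOOL_CALLS]", "[AVAILABLE_TOOLS]", "[/AVAILABLE_TOOLS]",
              "[TOOL_RESULTS]", "[/TOOL_RESULTS]"]:
    print(token, "->", vocab.get(token, "not found"))
```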

# The Mistral AI Team
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux,
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault,
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot,
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger,
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona,
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon,
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat,
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang,
Valera Nemychnikova, William El Sayed, William Marshall