mofosyne committed
Commit eac5a32 • Parent(s): 224e4ef

update readme

Files changed (1):
  1. README.md +133 -132
README.md CHANGED
@@ -17,7 +17,7 @@ tags:
17
  ## Description
18
 
19
  * This repo is targeted towards:
20
- People who just want to quickly try out the llamafile technology by running `./TinyLLama-v0-5M-F16.llamafile --cli -p "hello world"` as this llamafile is only 17.6 MB in size!
21
  - Developers who would like a quick demo of the steps to convert an existing model from safetensors format to a GGUF and package it into a llamafile for easy distribution (just run `llamafile-creation.sh` to retrace the steps).
22
  - Researchers who are curious about how far AI models can be shrunk, as the original model came from an attempt to replicate a research paper.
23
 
@@ -41,13 +41,13 @@ Anyway, this conversion to [llamafile](https://github.com/Mozilla-Ocho/llamafile
41
 
42
  ```bash
43
  # if not already executable
44
- chmod +x TinyLLama-v0-5M-F16.llamafile
45
 
46
  # To start the llamafile in web server mode, just call it directly
47
- ./TinyLLama-v0-5M-F16.llamafile
48
 
49
  # To start the llamafile in command-line mode, use this command
50
- ./TinyLLama-v0-5M-F16.llamafile --cli -p "A dog and a cat"
51
  ```
52
 
53
  ## About llamafile
@@ -65,140 +65,148 @@ llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses C
65
  For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo.
66
 
67
  ```
68
- $ llamafile-creation.sh
69
- llamafile-creation.sh: command not found
70
- mofosyne@mofosyne-Z97MX-Gaming-5:~/huggingface/TinyLLama-v0-llamafile$ ./llamafile-creation.sh
71
  == Prep Enviroment ==
72
  == Build and prep the llamafile engine execuable ==
73
- ~/huggingface/TinyLLama-v0-llamafile/llamafile ~/huggingface/TinyLLama-v0-llamafile
74
  make: Nothing to be done for 'all'.
75
  make: Nothing to be done for 'all'.
76
- ~/huggingface/TinyLLama-v0-llamafile
77
  == What is our llamafile name going to be? ==
78
- We will be aiming to generate TinyLLama-v0-5M-F16.llamafile
79
  == Convert from safetensor to gguf ==
80
- INFO:convert:Loading model file maykeye_tinyllama/model.safetensors
81
- INFO:convert:model parameters count : 4621392 (5M)
82
- INFO:convert:params = Params(n_vocab=32000, n_embd=64, n_layer=8, n_ctx=2048, n_ff=256, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('maykeye_tinyllama'))
83
- INFO:convert:Loaded vocab file PosixPath('maykeye_tinyllama/tokenizer.model'), type 'spm'
84
- INFO:convert:Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
85
- INFO:convert:Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens unset>
86
- INFO:convert:Writing maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf, format 1
87
- WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
88
  INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
89
  INFO:gguf.vocab:Setting special token type bos to 1
90
  INFO:gguf.vocab:Setting special token type eos to 2
91
  INFO:gguf.vocab:Setting special token type unk to 0
92
  INFO:gguf.vocab:Setting special token type pad to 0
93
- INFO:convert:[ 1/75] Writing tensor output.weight | size 32000 x 64 | type F16 | T+ 0
94
- INFO:convert:[ 2/75] Writing tensor token_embd.weight | size 32000 x 64 | type F16 | T+ 0
95
- INFO:convert:[ 3/75] Writing tensor blk.0.attn_norm.weight | size 64 | type F32 | T+ 0
96
- INFO:convert:[ 4/75] Writing tensor blk.0.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
97
- INFO:convert:[ 5/75] Writing tensor blk.0.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
98
- INFO:convert:[ 6/75] Writing tensor blk.0.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
99
- INFO:convert:[ 7/75] Writing tensor blk.0.ffn_norm.weight | size 64 | type F32 | T+ 0
100
- INFO:convert:[ 8/75] Writing tensor blk.0.attn_k.weight | size 64 x 64 | type F16 | T+ 0
101
- INFO:convert:[ 9/75] Writing tensor blk.0.attn_output.weight | size 64 x 64 | type F16 | T+ 0
102
- INFO:convert:[10/75] Writing tensor blk.0.attn_q.weight | size 64 x 64 | type F16 | T+ 0
103
- INFO:convert:[11/75] Writing tensor blk.0.attn_v.weight | size 64 x 64 | type F16 | T+ 0
104
- INFO:convert:[12/75] Writing tensor blk.1.attn_norm.weight | size 64 | type F32 | T+ 0
105
- INFO:convert:[13/75] Writing tensor blk.1.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
106
- INFO:convert:[14/75] Writing tensor blk.1.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
107
- INFO:convert:[15/75] Writing tensor blk.1.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
108
- INFO:convert:[16/75] Writing tensor blk.1.ffn_norm.weight | size 64 | type F32 | T+ 0
109
- INFO:convert:[17/75] Writing tensor blk.1.attn_k.weight | size 64 x 64 | type F16 | T+ 0
110
- INFO:convert:[18/75] Writing tensor blk.1.attn_output.weight | size 64 x 64 | type F16 | T+ 0
111
- INFO:convert:[19/75] Writing tensor blk.1.attn_q.weight | size 64 x 64 | type F16 | T+ 0
112
- INFO:convert:[20/75] Writing tensor blk.1.attn_v.weight | size 64 x 64 | type F16 | T+ 0
113
- INFO:convert:[21/75] Writing tensor blk.2.attn_norm.weight | size 64 | type F32 | T+ 0
114
- INFO:convert:[22/75] Writing tensor blk.2.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
115
- INFO:convert:[23/75] Writing tensor blk.2.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
116
- INFO:convert:[24/75] Writing tensor blk.2.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
117
- INFO:convert:[25/75] Writing tensor blk.2.ffn_norm.weight | size 64 | type F32 | T+ 0
118
- INFO:convert:[26/75] Writing tensor blk.2.attn_k.weight | size 64 x 64 | type F16 | T+ 0
119
- INFO:convert:[27/75] Writing tensor blk.2.attn_output.weight | size 64 x 64 | type F16 | T+ 0
120
- INFO:convert:[28/75] Writing tensor blk.2.attn_q.weight | size 64 x 64 | type F16 | T+ 0
121
- INFO:convert:[29/75] Writing tensor blk.2.attn_v.weight | size 64 x 64 | type F16 | T+ 0
122
- INFO:convert:[30/75] Writing tensor blk.3.attn_norm.weight | size 64 | type F32 | T+ 0
123
- INFO:convert:[31/75] Writing tensor blk.3.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
124
- INFO:convert:[32/75] Writing tensor blk.3.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
125
- INFO:convert:[33/75] Writing tensor blk.3.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
126
- INFO:convert:[34/75] Writing tensor blk.3.ffn_norm.weight | size 64 | type F32 | T+ 0
127
- INFO:convert:[35/75] Writing tensor blk.3.attn_k.weight | size 64 x 64 | type F16 | T+ 0
128
- INFO:convert:[36/75] Writing tensor blk.3.attn_output.weight | size 64 x 64 | type F16 | T+ 0
129
- INFO:convert:[37/75] Writing tensor blk.3.attn_q.weight | size 64 x 64 | type F16 | T+ 0
130
- INFO:convert:[38/75] Writing tensor blk.3.attn_v.weight | size 64 x 64 | type F16 | T+ 0
131
- INFO:convert:[39/75] Writing tensor blk.4.attn_norm.weight | size 64 | type F32 | T+ 0
132
- INFO:convert:[40/75] Writing tensor blk.4.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
133
- INFO:convert:[41/75] Writing tensor blk.4.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
134
- INFO:convert:[42/75] Writing tensor blk.4.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
135
- INFO:convert:[43/75] Writing tensor blk.4.ffn_norm.weight | size 64 | type F32 | T+ 0
136
- INFO:convert:[44/75] Writing tensor blk.4.attn_k.weight | size 64 x 64 | type F16 | T+ 0
137
- INFO:convert:[45/75] Writing tensor blk.4.attn_output.weight | size 64 x 64 | type F16 | T+ 0
138
- INFO:convert:[46/75] Writing tensor blk.4.attn_q.weight | size 64 x 64 | type F16 | T+ 0
139
- INFO:convert:[47/75] Writing tensor blk.4.attn_v.weight | size 64 x 64 | type F16 | T+ 0
140
- INFO:convert:[48/75] Writing tensor blk.5.attn_norm.weight | size 64 | type F32 | T+ 0
141
- INFO:convert:[49/75] Writing tensor blk.5.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
142
- INFO:convert:[50/75] Writing tensor blk.5.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
143
- INFO:convert:[51/75] Writing tensor blk.5.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
144
- INFO:convert:[52/75] Writing tensor blk.5.ffn_norm.weight | size 64 | type F32 | T+ 0
145
- INFO:convert:[53/75] Writing tensor blk.5.attn_k.weight | size 64 x 64 | type F16 | T+ 0
146
- INFO:convert:[54/75] Writing tensor blk.5.attn_output.weight | size 64 x 64 | type F16 | T+ 0
147
- INFO:convert:[55/75] Writing tensor blk.5.attn_q.weight | size 64 x 64 | type F16 | T+ 0
148
- INFO:convert:[56/75] Writing tensor blk.5.attn_v.weight | size 64 x 64 | type F16 | T+ 0
149
- INFO:convert:[57/75] Writing tensor blk.6.attn_norm.weight | size 64 | type F32 | T+ 0
150
- INFO:convert:[58/75] Writing tensor blk.6.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
151
- INFO:convert:[59/75] Writing tensor blk.6.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
152
- INFO:convert:[60/75] Writing tensor blk.6.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
153
- INFO:convert:[61/75] Writing tensor blk.6.ffn_norm.weight | size 64 | type F32 | T+ 0
154
- INFO:convert:[62/75] Writing tensor blk.6.attn_k.weight | size 64 x 64 | type F16 | T+ 0
155
- INFO:convert:[63/75] Writing tensor blk.6.attn_output.weight | size 64 x 64 | type F16 | T+ 0
156
- INFO:convert:[64/75] Writing tensor blk.6.attn_q.weight | size 64 x 64 | type F16 | T+ 0
157
- INFO:convert:[65/75] Writing tensor blk.6.attn_v.weight | size 64 x 64 | type F16 | T+ 0
158
- INFO:convert:[66/75] Writing tensor blk.7.attn_norm.weight | size 64 | type F32 | T+ 0
159
- INFO:convert:[67/75] Writing tensor blk.7.ffn_down.weight | size 64 x 256 | type F16 | T+ 0
160
- INFO:convert:[68/75] Writing tensor blk.7.ffn_gate.weight | size 256 x 64 | type F16 | T+ 0
161
- INFO:convert:[69/75] Writing tensor blk.7.ffn_up.weight | size 256 x 64 | type F16 | T+ 0
162
- INFO:convert:[70/75] Writing tensor blk.7.ffn_norm.weight | size 64 | type F32 | T+ 0
163
- INFO:convert:[71/75] Writing tensor blk.7.attn_k.weight | size 64 x 64 | type F16 | T+ 0
164
- INFO:convert:[72/75] Writing tensor blk.7.attn_output.weight | size 64 x 64 | type F16 | T+ 0
165
- INFO:convert:[73/75] Writing tensor blk.7.attn_q.weight | size 64 x 64 | type F16 | T+ 0
166
- INFO:convert:[74/75] Writing tensor blk.7.attn_v.weight | size 64 x 64 | type F16 | T+ 0
167
- INFO:convert:[75/75] Writing tensor output_norm.weight | size 64 | type F32 | T+ 0
168
- INFO:convert:Wrote maykeye_tinyllama/TinyLLama-v0-5M-F16.gguf
169
  == Generating Llamafile ==
170
  == Test Output ==
171
  note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
172
- main: llamafile version 0.8.4
173
- main: seed = 1715571182
174
- llama_model_loader: loaded meta data with 26 key-value pairs and 75 tensors from TinyLLama-v0-5M-F16.gguf (version GGUF V3 (latest))
175
  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
176
  llama_model_loader: - kv 0: general.architecture str = llama
177
  llama_model_loader: - kv 1: general.name str = TinyLLama
178
  llama_model_loader: - kv 2: general.author str = mofosyne
179
- llama_model_loader: - kv 3: general.version str = v0
180
  llama_model_loader: - kv 4: general.url str = https://huggingface.co/mofosyne/TinyL...
181
  llama_model_loader: - kv 5: general.description str = This gguf is ported from a first vers...
182
- llama_model_loader: - kv 6: general.source.url str = https://huggingface.co/Maykeye/TinyLL...
183
- llama_model_loader: - kv 7: general.source.huggingface.repository str = https://huggingface.co/Maykeye/TinyLL...
184
- llama_model_loader: - kv 8: llama.vocab_size u32 = 32000
185
- llama_model_loader: - kv 9: llama.context_length u32 = 2048
186
- llama_model_loader: - kv 10: llama.embedding_length u32 = 64
187
- llama_model_loader: - kv 11: llama.block_count u32 = 8
188
- llama_model_loader: - kv 12: llama.feed_forward_length u32 = 256
189
- llama_model_loader: - kv 13: llama.rope.dimension_count u32 = 4
190
  llama_model_loader: - kv 14: llama.attention.head_count u32 = 16
191
- llama_model_loader: - kv 15: llama.attention.head_count_kv u32 = 16
192
- llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
193
- llama_model_loader: - kv 17: general.file_type u32 = 1
194
- llama_model_loader: - kv 18: tokenizer.ggml.model str = llama
195
- llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
196
- llama_model_loader: - kv 20: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
197
- llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
198
- llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 1
199
- llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 2
200
- llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 0
201
- llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 0
202
  llama_model_loader: - type f32: 17 tensors
203
  llama_model_loader: - type f16: 58 tensors
204
  llm_load_vocab: special tokens definition check successful ( 259/32000 ).
@@ -264,7 +272,7 @@ llama_new_context_with_model: CPU compute buffer size = 62.75 MiB
264
  llama_new_context_with_model: graph nodes = 262
265
  llama_new_context_with_model: graph splits = 1
266
 
267
- system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
268
  sampling:
269
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
270
  top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
@@ -274,21 +282,14 @@ CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
274
  generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1
275
 
276
 
277
- <s> hello world the gruff man said no. The man was very sad and he was too scared to come back.
278
- The man asked the man, "Why are you sad?"
279
- The man said, "I am scared of a big surprise. I will help you."
280
- The man looked at the boy and said, "I can help you. I can make the little boy's wings. The man makes the girl laugh. She was so kind and happy.
281
- The boy said, "You are too mean to me. You can't give out the problem."
282
- The girl said, "I will help you!"
283
- The man stopped and said, "I can help you. I'm sorry for a little girl, but you must tell the boy to be careful. Do you want to be kind."
284
- The boy smiled and said, "Yes, I want to help you. Let's go into the pond and have fun!"
285
- The boy and the man went to the lake to the pond. They had a great time and the man was able to help.</s> [end of text]
286
 
287
 
288
- llama_print_timings: load time = 7.35 ms
289
- llama_print_timings: sample time = 8.40 ms / 218 runs ( 0.04 ms per token, 25958.56 tokens per second)
290
- llama_print_timings: prompt eval time = 2.90 ms / 8 tokens ( 0.36 ms per token, 2760.52 tokens per second)
291
- llama_print_timings: eval time = 372.10 ms / 217 runs ( 1.71 ms per token, 583.18 tokens per second)
292
- llama_print_timings: total time = 427.19 ms / 225 tokens
293
  Log end
294
  ```
 
17
  ## Description
18
 
19
  * This repo is targeted towards:
20
+ - People who just want to quickly try out the llamafile technology by running `./Tinyllama-5M-v0.2-F16.llamafile --cli -p "hello world"` as this llamafile is only 17.6 MB in size!
21
  - Developers who would like a quick demo of the steps to convert an existing model from safetensors format to a GGUF and package it into a llamafile for easy distribution (just run `llamafile-creation.sh` to retrace the steps).
22
  - Researchers who are curious about how far AI models can be shrunk, as the original model came from an attempt to replicate a research paper.
23
 
 
41
 
42
  ```bash
43
  # if not already executable
44
+ chmod +x Tinyllama-5M-v0.2-F16.llamafile
45
 
46
  # To start the llamafile in web server mode, just call it directly
47
+ ./Tinyllama-5M-v0.2-F16.llamafile
48
 
49
  # To start the llamafile in command-line mode, use this command
50
+ ./Tinyllama-5M-v0.2-F16.llamafile --cli -p "A dog and a cat"
51
  ```
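
Once running in web server mode, the llamafile also serves the llama.cpp-style HTTP API that it embeds, so you can script against it instead of using the browser UI. A minimal sketch, assuming the default port 8080 and the `/completion` endpoint (adjust if your build or flags differ):

```bash
# Query the running llamafile server from another terminal.
# Assumes the default llama.cpp-style server on localhost:8080.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A dog and a cat", "n_predict": 64}'
```
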
52
 
53
  ## About llamafile
 
65
  For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo.
66
 
67
  ```
68
+ $ ./llamafile-creation.sh
69
  == Prep Enviroment ==
70
  == Build and prep the llamafile engine execuable ==
71
+ ~/huggingface/TinyLLama-v0-5M-F16-llamafile/llamafile ~/huggingface/TinyLLama-v0-5M-F16-llamafile
72
  make: Nothing to be done for 'all'.
73
  make: Nothing to be done for 'all'.
74
+ ~/huggingface/TinyLLama-v0-5M-F16-llamafile
75
  == What is our llamafile name going to be? ==
76
+ We will be aiming to generate Tinyllama-5M-v0.2-F16.llamafile
77
  == Convert from safetensor to gguf ==
78
+ INFO:hf-to-gguf:Loading model: maykeye_tinyllama
79
+ INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
80
+ INFO:gguf.gguf_writer:gguf: Will write to maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf
81
  INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
82
+ INFO:hf-to-gguf:Set meta model
83
+ INFO:hf-to-gguf:Set model parameters
84
+ INFO:hf-to-gguf:gguf: context length = 2048
85
+ INFO:hf-to-gguf:gguf: embedding length = 64
86
+ INFO:hf-to-gguf:gguf: feed forward length = 256
87
+ INFO:hf-to-gguf:gguf: head count = 16
88
+ INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
89
+ INFO:hf-to-gguf:gguf: file type = 1
90
+ INFO:hf-to-gguf:Set model tokenizer
91
  INFO:gguf.vocab:Setting special token type bos to 1
92
  INFO:gguf.vocab:Setting special token type eos to 2
93
  INFO:gguf.vocab:Setting special token type unk to 0
94
  INFO:gguf.vocab:Setting special token type pad to 0
95
+ INFO:hf-to-gguf:Exporting model to 'maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf'
96
+ INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
97
+ INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {64, 32000}
98
+ INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {64, 32000}
99
+ INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
100
+ INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
101
+ INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
102
+ INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
103
+ INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
104
+ INFO:hf-to-gguf:blk.0.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
105
+ INFO:hf-to-gguf:blk.0.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
106
+ INFO:hf-to-gguf:blk.0.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
107
+ INFO:hf-to-gguf:blk.0.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
108
+ INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
109
+ INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
110
+ INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
111
+ INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
112
+ INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
113
+ INFO:hf-to-gguf:blk.1.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
114
+ INFO:hf-to-gguf:blk.1.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
115
+ INFO:hf-to-gguf:blk.1.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
116
+ INFO:hf-to-gguf:blk.1.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
117
+ INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
118
+ INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
119
+ INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
120
+ INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
121
+ INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
122
+ INFO:hf-to-gguf:blk.2.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
123
+ INFO:hf-to-gguf:blk.2.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
124
+ INFO:hf-to-gguf:blk.2.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
125
+ INFO:hf-to-gguf:blk.2.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
126
+ INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
127
+ INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
128
+ INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
129
+ INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
130
+ INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
131
+ INFO:hf-to-gguf:blk.3.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
132
+ INFO:hf-to-gguf:blk.3.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
133
+ INFO:hf-to-gguf:blk.3.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
134
+ INFO:hf-to-gguf:blk.3.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
135
+ INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
136
+ INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
137
+ INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
138
+ INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
139
+ INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
140
+ INFO:hf-to-gguf:blk.4.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
141
+ INFO:hf-to-gguf:blk.4.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
142
+ INFO:hf-to-gguf:blk.4.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
143
+ INFO:hf-to-gguf:blk.4.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
144
+ INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
145
+ INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
146
+ INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
147
+ INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
148
+ INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
149
+ INFO:hf-to-gguf:blk.5.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
150
+ INFO:hf-to-gguf:blk.5.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
151
+ INFO:hf-to-gguf:blk.5.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
152
+ INFO:hf-to-gguf:blk.5.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
153
+ INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
154
+ INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
155
+ INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
156
+ INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
157
+ INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
158
+ INFO:hf-to-gguf:blk.6.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
159
+ INFO:hf-to-gguf:blk.6.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
160
+ INFO:hf-to-gguf:blk.6.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
161
+ INFO:hf-to-gguf:blk.6.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
162
+ INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.bfloat16 --> F32, shape = {64}
163
+ INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64}
164
+ INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256}
165
+ INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256}
166
+ INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64}
167
+ INFO:hf-to-gguf:blk.7.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64}
168
+ INFO:hf-to-gguf:blk.7.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64}
169
+ INFO:hf-to-gguf:blk.7.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64}
170
+ INFO:hf-to-gguf:blk.7.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64}
171
+ INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {64}
172
+ Writing: 100%|████████████████████████████████████████| 9.24M/9.24M [00:00<00:00, 139Mbyte/s]
173
+ INFO:hf-to-gguf:Model successfully exported to 'maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf'
174
  == Generating Llamafile ==
175
  == Test Output ==
176
  note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
177
+ main: llamafile version 0.8.6
178
+ main: seed = 1717436617
179
+ llama_model_loader: loaded meta data with 29 key-value pairs and 75 tensors from Tinyllama-5M-v0.2-F16.gguf (version GGUF V3 (latest))
180
  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
181
  llama_model_loader: - kv 0: general.architecture str = llama
182
  llama_model_loader: - kv 1: general.name str = TinyLLama
183
  llama_model_loader: - kv 2: general.author str = mofosyne
184
+ llama_model_loader: - kv 3: general.version str = v0.2
185
  llama_model_loader: - kv 4: general.url str = https://huggingface.co/mofosyne/TinyL...
186
  llama_model_loader: - kv 5: general.description str = This gguf is ported from a first vers...
187
+ llama_model_loader: - kv 6: general.license str = apache-2.0
188
+ llama_model_loader: - kv 7: general.source.url str = https://huggingface.co/Maykeye/TinyLL...
189
+ llama_model_loader: - kv 8: general.source.huggingface.repository str = Maykeye/TinyLLama-v0
190
+ llama_model_loader: - kv 9: general.parameter_weight_class str = 5M
191
+ llama_model_loader: - kv 10: llama.block_count u32 = 8
192
+ llama_model_loader: - kv 11: llama.context_length u32 = 2048
193
+ llama_model_loader: - kv 12: llama.embedding_length u32 = 64
194
+ llama_model_loader: - kv 13: llama.feed_forward_length u32 = 256
195
  llama_model_loader: - kv 14: llama.attention.head_count u32 = 16
196
+ llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
197
+ llama_model_loader: - kv 16: general.file_type u32 = 1
198
+ llama_model_loader: - kv 17: llama.vocab_size u32 = 32000
199
+ llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 4
200
+ llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
201
+ llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
202
+ llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
203
+ llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
204
+ llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
205
+ llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1
206
+ llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2
207
+ llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 0
208
+ llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
209
+ llama_model_loader: - kv 28: general.quantization_version u32 = 2
210
  llama_model_loader: - type f32: 17 tensors
211
  llama_model_loader: - type f16: 58 tensors
212
  llm_load_vocab: special tokens definition check successful ( 259/32000 ).
 
272
  llama_new_context_with_model: graph nodes = 262
273
  llama_new_context_with_model: graph splits = 1
274
 
275
+ system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
276
  sampling:
277
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
278
  top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
 
282
  generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1
283
 
284
 
285
+ hello world the gruff man said yes. She was very happy. The man waved goodbye to the little boy and said the little boy. It was the best day ever.
286
+ The little boy was so excited. He took off his special favorite toy and a beautiful dress. He gave it to the little boy and said "thank you" to the little girl. He said "Thank you for being so clever. The man and the little boy both smiled. [end of text]
287
 
288
 
289
+ llama_print_timings: load time = 9.88 ms
290
+ llama_print_timings: sample time = 3.83 ms / 89 runs ( 0.04 ms per token, 23249.74 tokens per second)
291
+ llama_print_timings: prompt eval time = 1.61 ms / 8 tokens ( 0.20 ms per token, 4968.94 tokens per second)
292
+ llama_print_timings: eval time = 214.13 ms / 88 runs ( 2.43 ms per token, 410.96 tokens per second)
293
+ llama_print_timings: total time = 237.74 ms / 96 tokens
294
  Log end
295
  ```
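
For orientation, the replication boils down to: build the llamafile tooling, convert the upstream safetensors checkpoint to an F16 GGUF, then pack the GGUF and default arguments into a single self-contained executable. The sketch below is illustrative only; the paths, script names, and flags are assumptions based on the log above and the upstream llamafile/llama.cpp documentation, and `llamafile-creation.sh` in this repo remains the authoritative version.

```bash
# Rough sketch of the steps traced by llamafile-creation.sh (illustrative paths).

# 1. Build the llamafile tooling (vendored here as the llamafile/ submodule).
make -C llamafile

# 2. Convert the upstream safetensors checkpoint to an F16 GGUF with the
#    llama.cpp converter (the "hf-to-gguf" logger seen in the output above).
python3 convert-hf-to-gguf.py maykeye_tinyllama \
    --outfile maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf --outtype f16

# 3. Package the GGUF into a llamafile: start from a llamafile engine binary,
#    record default arguments in a .args file, then append both into the
#    executable with the zipalign tool that ships with llamafile.
cp /usr/local/bin/llamafile Tinyllama-5M-v0.2-F16.llamafile   # assumes an installed llamafile binary
printf -- '-m\nTinyllama-5M-v0.2-F16.gguf\n' > .args
zipalign -j0 Tinyllama-5M-v0.2-F16.llamafile \
    maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf .args

# 4. Smoke test.
./Tinyllama-5M-v0.2-F16.llamafile --cli -p "hello world the gruff man said"
```
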