|
--- |
|
base_model: Maykeye/TinyLLama-v0 |
|
language: |
|
- en |
|
license: apache-2.0 |
|
tags: |
|
- llamafile |
|
- model-conversion |
|
- text-generation |
|
- gguf |
|
--- |
|
|
|
# TinyLLama-v0 - llamafile |
|
- Model creator: [Maykeye](https://huggingface.co/Maykeye) |
|
- Original model: [TinyLLama-v0](https://huggingface.co/Maykeye/TinyLLama-v0) |
|
|
|
## Description |
|
|
|
* This repo is targeted towards: |
|
  - People who just want to quickly try out the llamafile technology by running `./Tinyllama-5M-v0.2-F16.llamafile --cli -p "hello world"`, as this llamafile is only 17.6 MB in size!
|
  - Developers who would like a quick demo of the steps to convert an existing model from safetensors format to GGUF and package it into a llamafile for easy distribution (just run `llamafile-creation.sh` to retrace the steps).
|
  - Researchers who are curious about the state of AI technology in terms of shrinking AI models, as the original model comes from an attempt to replicate a research paper.
|
|
|
This repo contains [llamafile](https://github.com/Mozilla-Ocho/llamafile) format model files for [Maykeye/TinyLLama-v0](https://huggingface.co/Maykeye/TinyLLama-v0), a recreation of [roneneldan/TinyStories-1M](https://huggingface.co/roneneldan/TinyStories-1M), which was part of the very interesting research paper [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) by Ronen Eldan and Yuanzhi Li.
|
|
|
This is the abstract from the paper:
|
|
|
> Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). |
|
|
|
> In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. |
|
|
|
> We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency.
|
|
|
> We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs. |
|
|
|
While Maykeye's replication effort didn't reduce the model down to 1M parameters, it did get down to 5M parameters, which is still quite an achievement among known replication efforts.
|
|
|
Anyway, this conversion to [llamafile](https://github.com/Mozilla-Ocho/llamafile) should give you an easy way to try out this model, and the wider [llamafile](https://github.com/Mozilla-Ocho/llamafile) ecosystem in general, as it's quite small compared to larger chat-capable models. As this is primarily a text generation model, it will open a web server as part of the llamafile process, but it will not engage in chat as you might expect. Instead, you give it a story prompt and it generates a story for you. Don't expect any great stories at this size, but it's an interesting demo of how small you can squeeze an AI model and still have it generate recognisable English.
|
|
|
## Usage In Linux |
|
|
|
```bash |
|
# if not already executable
|
chmod +x Tinyllama-5M-v0.2-F16.llamafile |
|
|
|
# To start the llamafile in web server mode, just call it directly
|
./Tinyllama-5M-v0.2-F16.llamafile |
|
|
|
# To run the llamafile in command line mode, use this command
|
./Tinyllama-5M-v0.2-F16.llamafile --cli -p "A dog and a cat" |
|
``` |
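
When run in web server mode, the llamafile bundles llama.cpp's built-in HTTP server, so you can also query it programmatically. Below is a minimal sketch, assuming the default port 8080 and the standard llama.cpp `/completion` endpoint; the flags and port are illustrative and may differ on your setup:

```bash
# Start the server without opening a browser tab
# (by default it listens on http://localhost:8080)
./Tinyllama-5M-v0.2-F16.llamafile --server --nobrowser &

# Ask for a short story continuation via the llama.cpp-style /completion endpoint
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A dog and a cat", "n_predict": 64}'
```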
|
|
|
## About llamafile |
|
|
|
llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64. |
|
|
|
## Replication Steps Assumptions
|
|
|
* You have already pulled in all the submodules, including Maykeye's model in safetensors format (see the sketch after this list).

* Your git has LFS configured correctly, otherwise you will hit [this llama.cpp issue](https://github.com/ggerganov/llama.cpp/issues/1994) where the safetensors file doesn't download properly (only a small pointer file is downloaded).

* Within the llama.cpp repo, a [PR](https://github.com/ggerganov/llama.cpp/pull/4858) has already been merged that adds metadata-override support to convert.py (used here to add some missing authorship information).
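
For reference, the setup those assumptions imply looks roughly like this. This is a hedged sketch using standard git commands; the actual submodule layout is defined by this repo, not by these commands:

```bash
# Enable git-lfs first, otherwise model.safetensors arrives as a tiny pointer file
# instead of the real weights (see the llama.cpp issue linked above)
git lfs install

# Pull in this repo's submodules (e.g. the llamafile engine and Maykeye's original model)
git submodule update --init --recursive

# If the safetensors file still looks like a pointer file, fetch the real content
git lfs pull
```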
|
|
|
## Replication Steps |
|
|
|
For the most current replication steps, refer to the bash script `llamafile-creation.sh` in this repo. |
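
In outline, the script converts the safetensors checkpoint to GGUF with llama.cpp's converter (using the metadata override mentioned above) and then packs the GGUF plus a default argument list into the llamafile engine executable. The sketch below is simplified and illustrative only; script names, paths and flags vary across llama.cpp and llamafile revisions, and `llamafile-creation.sh` remains the authoritative version:

```bash
# Convert Maykeye's safetensors checkpoint to an F16 GGUF, overriding metadata
# to fill in missing authorship information (converter script name varies by
# llama.cpp revision; file names here are illustrative)
python3 llama.cpp/convert_hf_to_gguf.py maykeye_tinyllama \
    --outtype f16 \
    --metadata maykeye_tinyllama-metadata.json \
    --outfile maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf

# Package the GGUF and a default argument list into the llamafile engine binary
# (zipalign is built as part of the llamafile repo; older releases used plain zip)
cp path/to/llamafile Tinyllama-5M-v0.2-F16.llamafile
echo "-m Tinyllama-5M-v0.2-F16.gguf" > .args
zipalign -j0 Tinyllama-5M-v0.2-F16.llamafile Tinyllama-5M-v0.2-F16.gguf .args
```

The recorded output of an actual run of the script is shown below.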
|
|
|
``` |
|
$ ./llamafile-creation.sh |
|
== Prep Enviroment == |
|
== Build and prep the llamafile engine execuable == |
|
~/huggingface/TinyLLama-v0-5M-F16-llamafile/llamafile ~/huggingface/TinyLLama-v0-5M-F16-llamafile |
|
make: Nothing to be done for 'all'. |
|
make: Nothing to be done for 'all'. |
|
~/huggingface/TinyLLama-v0-5M-F16-llamafile |
|
== What is our llamafile name going to be? == |
|
We will be aiming to generate Tinyllama-5M-v0.2-F16.llamafile |
|
== Convert from safetensor to gguf == |
|
INFO:hf-to-gguf:Loading model: maykeye_tinyllama |
|
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors' |
|
INFO:gguf.gguf_writer:gguf: Will write to maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf |
|
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only |
|
INFO:hf-to-gguf:Set meta model |
|
INFO:hf-to-gguf:Set model parameters |
|
INFO:hf-to-gguf:gguf: context length = 2048 |
|
INFO:hf-to-gguf:gguf: embedding length = 64 |
|
INFO:hf-to-gguf:gguf: feed forward length = 256 |
|
INFO:hf-to-gguf:gguf: head count = 16 |
|
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06 |
|
INFO:hf-to-gguf:gguf: file type = 1 |
|
INFO:hf-to-gguf:Set model tokenizer |
|
INFO:gguf.vocab:Setting special token type bos to 1 |
|
INFO:gguf.vocab:Setting special token type eos to 2 |
|
INFO:gguf.vocab:Setting special token type unk to 0 |
|
INFO:gguf.vocab:Setting special token type pad to 0 |
|
INFO:hf-to-gguf:Exporting model to 'maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf' |
|
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors' |
|
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {64, 32000} |
|
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {64, 32000} |
|
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.0.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.0.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.0.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.0.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.0.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.0.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.0.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.1.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.1.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.1.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.1.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.1.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.1.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.1.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.1.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.1.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.2.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.2.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.2.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.2.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.2.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.2.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.2.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.2.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.2.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.3.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.3.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.3.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.3.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.3.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.3.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.3.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.3.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.3.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.4.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.4.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.4.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.4.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.4.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.4.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.4.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.4.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.4.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.5.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.5.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.5.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.5.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.5.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.5.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.5.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.5.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.5.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.6.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.6.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.6.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.6.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.6.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.6.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.6.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.6.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.6.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.7.attn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.7.ffn_down.weight, torch.bfloat16 --> F16, shape = {256, 64} |
|
INFO:hf-to-gguf:blk.7.ffn_gate.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.7.ffn_up.weight, torch.bfloat16 --> F16, shape = {64, 256} |
|
INFO:hf-to-gguf:blk.7.ffn_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
INFO:hf-to-gguf:blk.7.attn_k.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.7.attn_output.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.7.attn_q.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:blk.7.attn_v.weight, torch.bfloat16 --> F16, shape = {64, 64} |
|
INFO:hf-to-gguf:output_norm.weight, torch.bfloat16 --> F32, shape = {64} |
|
Writing: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 9.24M/9.24M [00:00<00:00, 139Mbyte/s] |
|
INFO:hf-to-gguf:Model successfully exported to 'maykeye_tinyllama/Tinyllama-5M-v0.2-F16.gguf' |
|
== Generating Llamafile == |
|
== Test Output == |
|
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading |
|
main: llamafile version 0.8.6 |
|
main: seed = 1717436617 |
|
llama_model_loader: loaded meta data with 29 key-value pairs and 75 tensors from Tinyllama-5M-v0.2-F16.gguf (version GGUF V3 (latest)) |
|
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. |
|
llama_model_loader: - kv 0: general.architecture str = llama |
|
llama_model_loader: - kv 1: general.name str = TinyLLama |
|
llama_model_loader: - kv 2: general.author str = mofosyne |
|
llama_model_loader: - kv 3: general.version str = v0.2 |
|
llama_model_loader: - kv 4: general.url str = https://huggingface.co/mofosyne/TinyL... |
|
llama_model_loader: - kv 5: general.description str = This gguf is ported from a first vers... |
|
llama_model_loader: - kv 6: general.license str = apache-2.0 |
|
llama_model_loader: - kv 7: general.source.url str = https://huggingface.co/Maykeye/TinyLL... |
|
llama_model_loader: - kv 8: general.source.huggingface.repository str = Maykeye/TinyLLama-v0 |
|
llama_model_loader: - kv 9: general.parameter_weight_class str = 5M |
|
llama_model_loader: - kv 10: llama.block_count u32 = 8 |
|
llama_model_loader: - kv 11: llama.context_length u32 = 2048 |
|
llama_model_loader: - kv 12: llama.embedding_length u32 = 64 |
|
llama_model_loader: - kv 13: llama.feed_forward_length u32 = 256 |
|
llama_model_loader: - kv 14: llama.attention.head_count u32 = 16 |
|
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000001 |
|
llama_model_loader: - kv 16: general.file_type u32 = 1 |
|
llama_model_loader: - kv 17: llama.vocab_size u32 = 32000 |
|
llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 4 |
|
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama |
|
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default |
|
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<... |
|
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... |
|
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... |
|
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 1 |
|
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 2 |
|
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 0 |
|
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0 |
|
llama_model_loader: - kv 28: general.quantization_version u32 = 2 |
|
llama_model_loader: - type f32: 17 tensors |
|
llama_model_loader: - type f16: 58 tensors |
|
llm_load_vocab: special tokens definition check successful ( 259/32000 ). |
|
llm_load_print_meta: format = GGUF V3 (latest) |
|
llm_load_print_meta: arch = llama |
|
llm_load_print_meta: vocab type = SPM |
|
llm_load_print_meta: n_vocab = 32000 |
|
llm_load_print_meta: n_merges = 0 |
|
llm_load_print_meta: n_ctx_train = 2048 |
|
llm_load_print_meta: n_embd = 64 |
|
llm_load_print_meta: n_head = 16 |
|
llm_load_print_meta: n_head_kv = 16 |
|
llm_load_print_meta: n_layer = 8 |
|
llm_load_print_meta: n_rot = 4 |
|
llm_load_print_meta: n_embd_head_k = 4 |
|
llm_load_print_meta: n_embd_head_v = 4 |
|
llm_load_print_meta: n_gqa = 1 |
|
llm_load_print_meta: n_embd_k_gqa = 64 |
|
llm_load_print_meta: n_embd_v_gqa = 64 |
|
llm_load_print_meta: f_norm_eps = 0.0e+00 |
|
llm_load_print_meta: f_norm_rms_eps = 1.0e-06 |
|
llm_load_print_meta: f_clamp_kqv = 0.0e+00 |
|
llm_load_print_meta: f_max_alibi_bias = 0.0e+00 |
|
llm_load_print_meta: f_logit_scale = 0.0e+00 |
|
llm_load_print_meta: n_ff = 256 |
|
llm_load_print_meta: n_expert = 0 |
|
llm_load_print_meta: n_expert_used = 0 |
|
llm_load_print_meta: causal attn = 1 |
|
llm_load_print_meta: pooling type = 0 |
|
llm_load_print_meta: rope type = 0 |
|
llm_load_print_meta: rope scaling = linear |
|
llm_load_print_meta: freq_base_train = 10000.0 |
|
llm_load_print_meta: freq_scale_train = 1 |
|
llm_load_print_meta: n_yarn_orig_ctx = 2048 |
|
llm_load_print_meta: rope_finetuned = unknown |
|
llm_load_print_meta: ssm_d_conv = 0 |
|
llm_load_print_meta: ssm_d_inner = 0 |
|
llm_load_print_meta: ssm_d_state = 0 |
|
llm_load_print_meta: ssm_dt_rank = 0 |
|
llm_load_print_meta: model type = ?B |
|
llm_load_print_meta: model ftype = F16 |
|
llm_load_print_meta: model params = 4.62 M |
|
llm_load_print_meta: model size = 8.82 MiB (16.00 BPW) |
|
llm_load_print_meta: general.name = TinyLLama |
|
llm_load_print_meta: BOS token = 1 '<s>' |
|
llm_load_print_meta: EOS token = 2 '</s>' |
|
llm_load_print_meta: UNK token = 0 '<unk>' |
|
llm_load_print_meta: PAD token = 0 '<unk>' |
|
llm_load_print_meta: LF token = 13 '<0x0A>' |
|
llm_load_tensors: ggml ctx size = 0.04 MiB |
|
llm_load_tensors: CPU buffer size = 8.82 MiB |
|
.............. |
|
llama_new_context_with_model: n_ctx = 512 |
|
llama_new_context_with_model: n_batch = 512 |
|
llama_new_context_with_model: n_ubatch = 512 |
|
llama_new_context_with_model: flash_attn = 0 |
|
llama_new_context_with_model: freq_base = 10000.0 |
|
llama_new_context_with_model: freq_scale = 1 |
|
llama_kv_cache_init: CPU KV buffer size = 1.00 MiB |
|
llama_new_context_with_model: KV self size = 1.00 MiB, K (f16): 0.50 MiB, V (f16): 0.50 MiB |
|
llama_new_context_with_model: CPU output buffer size = 0.12 MiB |
|
llama_new_context_with_model: CPU compute buffer size = 62.75 MiB |
|
llama_new_context_with_model: graph nodes = 262 |
|
llama_new_context_with_model: graph splits = 1 |
|
|
|
system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | |
|
sampling: |
|
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 |
|
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 |
|
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 |
|
sampling order: |
|
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature |
|
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1 |
|
|
|
|
|
hello world the gruff man said yes. She was very happy. The man waved goodbye to the little boy and said the little boy. It was the best day ever. |
|
The little boy was so excited. He took off his special favorite toy and a beautiful dress. He gave it to the little boy and said "thank you" to the little girl. He said "Thank you for being so clever. The man and the little boy both smiled. [end of text] |
|
|
|
|
|
llama_print_timings: load time = 9.88 ms |
|
llama_print_timings: sample time = 3.83 ms / 89 runs ( 0.04 ms per token, 23249.74 tokens per second) |
|
llama_print_timings: prompt eval time = 1.61 ms / 8 tokens ( 0.20 ms per token, 4968.94 tokens per second) |
|
llama_print_timings: eval time = 214.13 ms / 88 runs ( 2.43 ms per token, 410.96 tokens per second) |
|
llama_print_timings: total time = 237.74 ms / 96 tokens |
|
Log end |
|
``` |
|
|