Maximizing Model Performance for All Quant Types and Full Precision: Samplers, Advanced Samplers and Parameters Guide
This document includes detailed information, references, and notes for general parameters, samplers and advanced samplers to get the most out of your model's abilities, including notes / settings for the most popular AI/LLM apps in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, Sillytavern, Ollama and others).
These settings / suggestions can be applied to all models including GGUF, EXL2, GPTQ, HQQ, AWQ and full source/precision.
It also includes critical settings for Class 3 and Class 4 models at this repo - DavidAU - to enhance and control generation for their specific use case(s) as well as outside use case(s), including role play, chat and others.
The settings discussed in this document can also fix a number of model issues (any model, any repo) such as:
- "Gibberish"
- Generation length (including out of control generation)
- Chat quality / Multi-Turn convos.
- Multi-turn / COT / and other multi prompt/answer generation
- Letter, word, phrase, paragraph repeats
- Coherence
- Instruction following
- Creativeness, or lack thereof, or too much of it (purple prose).
- Low quant (ie q2k, iq1s, iq2s) issues.
- General output quality.
- Role play related issues.
Likewise, ALL the settings (parameters, samplers and advanced samplers) below can also improve model generation and/or general overall "smoothness" / "quality" of model operation:
- all parameters and samplers available via LLAMACPP (and most apps that run / use LLAMACPP - including Lmstudio, Ollama, Sillytavern and others.)
- all parameters (including some not in Llamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Mirostat") in oobabooga/text-generation-webui, including the llamacpp_HF loader (allowing a lot more samplers)
- all parameters (including some not in Llamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Mirostat") in SillyTavern / KoboldCPP (including Anti-slop filters)
Even if you are not using my models, you may find this document useful for any model (any quant / full source / any repo) available online.
If you are currently using model(s) - from my repo and/or others - that are difficult to "wrangle" then you can apply "Class 3" or "Class 4" settings to them.
This document will be updated over time too and is subject to change without notice.
Please use the "community tab" for suggestions / edits / improvements.
IMPORTANT:
Every parameter, sampler and advanced sampler here affects per token generation and overall generation quality.
This effect is cumulative especially with long output generation and/or multi-turn (chat, role play, COT).
Likewise, because of how modern AIs/LLMs operate, the quality of the previously generated tokens affects the next tokens generated too.
With the right settings you will get higher quality operation overall - stronger prose, better answers, and a higher quality adventure.
INDEX
How to Use this document:
Review quant(s) information to select quant(s) to download, then review "Class 1,2,3..." for specific information on models followed by "Source Files...APPS to run LLMs/AIs".
"Quick reference" will state the best parameter settings for each "Class" of model(s) to get the best operation and/or good defaults to use to get started. If you came to this page from a repo card on my repo -DavidAU- the "class" of the model would have been stated just before you came to this page.
The detailed sections about parameters - Section 1 a,b,c and section 2 will help tune the model(s) operation.
The "DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS" section after this covers and links to more information about "tuning" your model(s). These cover theory, hints, tips and tricks, and observations.
All information about parameters, samplers and advanced samplers applies to ALL models, regardless of repo(s) you download them from.
QUANTS:
- QUANTS Detailed information.
- IMATRIX Quants
- ADDITIONAL QUANT INFORMATION
- ARM QUANTS / Q4_0_X_X
- NEO Imatrix Quants / Neo Imatrix X Quants
- CPU ONLY CONSIDERATIONS
Class 1, 2, 3 and 4 model critical notes
SOURCE FILES for my Models / APPS to Run LLMs / AIs:
- TEXT-GENERATION-WEBUI
- KOBOLDCPP
- SILLYTAVERN
- OTHER PROGRAMS
TESTING / Generation Example PARAMETERS AND SAMPLERS
Quick Reference Table - Parameters, Samplers, Advanced Samplers
Section 1a : PRIMARY PARAMETERS - ALL APPS
Section 1b : PENALTY SAMPLERS - ALL APPS
Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS
Section 2: ADVANCED SAMPLERS
DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
- DETAILS on PARAMETERS / SAMPLERS
- General Parameters
- The Local LLM Settings Guide/Rant
- LLAMACPP-SERVER EXE - usage / parameters / samplers
- DRY Sampler
- Samplers
- Creative Writing
- Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
QUANTS:
Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially those of models size 8B parameters or less) may require additional adjustment(s). For these quants you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage of the model.
For models of 20B parameters and higher, generally this is not a major concern as the parameters can make up for compression damage at lower quant levels (IE: Q2K or higher, but preferably at least Q3 ; IQ2 or higher, but preferably at least IQ3).
IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
There is an exception to this ; see "Neo Imatrix" below and "all quants" (cpu only operation).
IMATRIX:
Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
IE: Instead of using a Q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher tokens per second speed and with more VRAM left for context.
Recommended Quants - ALL:
This covers both Imatrix and regular quants.
Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.
This chart shows the order in terms of "BPW" for each quant (mapped below with relative "strength" to one another) with "IQ1_S" with the least, and "Q8_0" (F16 is full precision) with the most:
IQ1_S | IQ1_M
IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M
IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L
Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M
Q5_K_S | Q5_K_M
Q6_K
Q8_0
F16
More BPW means better quality, but higher VRAM requirements (and larger file size) and lower tokens per second. The larger the model in terms of parameters, the smaller the quant you can run with fewer quality losses. Note that "quality losses" refers to both instruction following and output quality.
Quality differences between quants are larger at the lower levels than at the higher levels.
The Imatrix process has NO effect on Q8 or F16 quants.
F16 is full precision, just in GGUF format.
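To make the BPW / file size relationship concrete, here is a rough sketch. It assumes the simple approximation "size = parameters x BPW / 8 bytes"; real GGUF files mix quant types per tensor and carry metadata, so actual sizes will differ. The BPW numbers used in the example are placeholders for illustration only, not measured values for any quant.

```python
def estimate_size_gb(params_billions: float, bpw: float) -> float:
    """Approximate quant file size in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bpw
    return total_bits / 8 / 1e9


# Illustrative only: an 8B model at a hypothetical 4.5 BPW vs 8.5 BPW.
small = estimate_size_gb(8, 4.5)
large = estimate_size_gb(8, 8.5)
print(f"{small:.2f} GB vs {large:.2f} GB")
```

The same arithmetic shows why larger models leave less room for context at a given quant level: the file size scales directly with both parameter count and BPW.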
ADDITIONAL QUANT INFORMATION:
A great write up with charts showing various performances is provided by Artefact2 here
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
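The sizing rule above can be sketched as a small helper. This is only an illustration: the function name, the 2GB headroom default and the file sizes in the example are all made up, so read the real file sizes from the repo's file list.

```python
from typing import Dict, Optional


def pick_quant(file_sizes_gb: Dict[str, float], memory_gb: float,
               headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory minus headroom."""
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in file_sizes_gb.items()
               if size <= budget}
    if not fitting:
        return None  # nothing fits; consider a smaller model
    return max(fitting, key=fitting.get)


# Hypothetical file sizes (GB) for one model.
sizes = {"IQ3_M": 3.8, "Q4_K_M": 4.9, "Q6_K": 6.6, "Q8_0": 8.5}
print(pick_quant(sizes, memory_gb=8))        # speed: 8GB of VRAM only
print(pick_quant(sizes, memory_gb=8 + 16))   # quality: VRAM + system RAM
```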
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are not compatible with Vulkan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
ARM QUANTS / Q4_0_X_X:
These are new quants that are specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non arm" machine/device, the tokens per second speed will be VERY SLOW.
Q4_0_X_X information
These are NOT for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request
To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).
If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed as well:
Benchmarks on an AVX2 system (EPYC7702):
model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
---|---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation
NEO Imatrix Quants / Neo Imatrix X Quants
NEO Imatrix quants are built with specialized, specifically "themed" datasets used to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another, however NEO Imatrix datasets are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible by testing 50+ standard Imatrix datasets, carefully modifying them, and testing the resulting changes to determine the exact format and content which has the maximum effect on a model via the Imatrix process.
Please keep in mind that the Imatrix process (at its strongest) only "tints" a model and/or slightly changes its bias(es).
Here are some Imatrix Neo Models:
[ https://huggingface.co/DavidAU/Command-R-01-Ultra-NEO-DARK-HORROR-V1-V2-35B-IMATRIX-GGUF ]
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]
[ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]
Suggestions for Imatrix NEO quants:
- The LOWER the quant the STRONGER the Imatrix effect is, and therefore the stronger the "tint" so to speak
- Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect with IQ4_XS the most balanced in terms of power and bits.
- Secondaries are Q2s-Q4s. Imatrix effect is still strong in these quants.
- Effects diminish quickly from Q5s and up.
- Q8/F16: there is no change (as the Imatrix process does not affect these quants), and they are therefore not included.
CPU ONLY CONSIDERATIONS:
This section DOES NOT apply to most "Macs" because of the difference in O/S Memory, Vram and motherboard VS other frameworks.
Running quants on CPU will be a lot slower than running them on a video card(s).
In this special case however it may be preferred to run AS SMALL a quant as possible for token per second generation reasons.
On a top, high end (and relatively new) CPU expect tokens per second speeds to be 1/4 (or less) of a standard middle of the road video card.
Older machines/cpus will be a lot slower - but models will STILL run on these as long as you have enough ram.
Here are some rough comparisons:
On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B LLama 3.2 Instruct; CPU speeds are 50-60 tokens per second.
On my much older machine (8 years old, 2 core), speed for the same 1B model is around 10 tokens per second (CPU).
Roughly 8B-12B models are the limit for CPU only operation (in terms of "usable" tokens/second) - at the moment.
This is changing as new cpus come out, designed for AI usage.
Class 1, 2, 3 and 4 model critical notes:
Some of the models at my repo are custom designed / limited use case models. For some of these models, specific settings and/or samplers (including advanced) are recommended for best operation.
As a result I have classified the models as class 1, class 2, class 3 and class 4.
Each model is "classed" on the model card itself for each model.
Generally all models (mine and other repos) fall under class 1 or class 2 and can be used with just about any sampler(s) / parameter(s) and advanced sampler(s).
Class 3 requires a little more adjustment because these models run closer to the ragged edge of stability. The settings for these will help control them better, especially for chat / role play and/or other use case(s). Generally speaking, this helps them behave better overall.
Class 4 models are balanced on the very edge of stability. These models are generally highly creative, built for very narrow use case(s), and closer to "human prose" than other models, and/or operate in ways no other model(s) operate, offering unique generational abilities. With these models, advanced samplers are used to "bring these bad boys" in line, which is especially important for chat and/or role play type use cases AND/OR use case(s) these models were not designed for.
For reference here are some Class 3/4 models:
[ https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF ]
(note Grand Horror Series contain class 2,3 and 4 models)
[ https://huggingface.co/DavidAU/L3-DARKEST-PLANET-16.5B-GGUF ]
(note Dark Planet Series contains Class 1, 2 and Class 3/4 models)
[ https://huggingface.co/DavidAU/MN-DARKEST-UNIVERSE-29B-GGUF ]
(this model has exceptional prose abilities in all areas)
[ https://huggingface.co/DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-23.5B-GGUF ]
(note Grand Gutenberg Madness/Darkness (12B) are class 1 models, but compressed versions of the 23.5B)
Class 3 and Class 4 models will work when used within their specific use case(s) with the standard parameters and settings on the model card. However, I recognize that users want a smoother experience and/or want to use these models for other than their intended use case(s), and that is in part why I created this document.
The goal here is to use parameters to raise/lower the power of the model and samplers to "prune" (and/or in some cases enhance) operation.
With that being said, generation "examples" (at my repo) are created using the "Primary Testing Parameters" (top of this document) settings regardless of the "class" of the model and no advanced settings, parameters, or samplers.
However, for ANY model regardless of "class" or if it is at my repo, you can now take performance to the next level with the information contained in this document.
Side note:
There are no "Class 5" models published... yet.
SOURCE FILES for my Models / APPS to Run LLMs / AIs:
Source files / Source models of my models are located here (also upper right menu on this page):
You will need the config files to use "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]
You can also use the full source in "text-generation-webui" too.
As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.
Parameters, Samplers and Advanced Samplers
Sections 1a, 1b and 1c below list all the LLAMA_CPP parameters and samplers.
I have added notes below each one for adjustment / enhancement(s) for specific use cases.
TEXT-GENERATION-WEBUI
Section 2 covers additional samplers, which become available when using the "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").
The "llamacpp_HF" loader (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from the "source repo" of the model.
(this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
This allows access to very advanced samplers in addition to all the parameters / samplers here.
KOBOLDCPP:
Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).
Note: This program has one of the newest samplers called "Anti-slop" which allows phrase/word banning at the generation level.
SILLYTAVERN:
Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).
For CLASS3 and CLASS4 the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:
https://docs.sillytavern.app/usage/common-settings/
Critical Note:
Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.
You "load" a model in one of these, then connect Silly Tavern to the App via API. This way you can use any model, and Sillytavern becomes the interface between the AI model and you directly. Sillytavern opens an interface in your browser.
In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameter/samplers too and you can save your favorites too.
Currently, at time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.
However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.
NOTE:
It appears that Silly Tavern also supports "DRY" and "XTC" ; but they are not yet in the documentation at the time of writing.
You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:
https://docs.sillytavern.app/usage/api-connections/
OTHER PROGRAMS:
Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas for others (llamacpp only here) you may need to add them to the json file(s) for a model and/or template preset.
In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "Sillytavern", "Ollama", and "LMStudio" (as well as other apps too).
You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
https://github.com/ggerganov/llama.cpp
(scroll down on the main page for more apps/programs to use GGUFs too that connect to / use the LLAMA-CPP package.)
Special note:
It appears the "DRY" / "XTC" samplers have been added to LLAMACPP and SILLYTAVERN.
They are available (Llamacpp) via "server.exe / llama-server.exe". These samplers will likely also become available "downstream" in applications that use LLAMACPP in due time.
[ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
Operating Systems:
Most AI/LLM apps operate on Windows, Mac, and Linux.
Mobile devices (and O/S) are in many cases also supported.
TESTING / Generation Example PARAMETERS AND SAMPLERS
Primary Testing Parameters I use, including use for output generation examples at my repo:
Ranged Parameters:
temperature: 0 to 5 ("temp")
repetition_penalty : 1.02 to 1.15 ("rep pen")
Set parameters:
top_k: 40
min_p: 0.05
top_p: 0.95
repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
I do not set any other settings, parameters or have samplers activated when generating examples.
Everything else is "zeroed" / "disabled".
These parameters/settings are considered both safe and default and in most cases available to all users in all AI/LLM apps.
Note for Class 3/Class 4 models (discussed below) "repeat-last-n" is a CRITICAL setting.
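For anyone driving a model through an API instead of a UI, the testing parameters above could be sent roughly like this. This is a sketch, not a definitive client: it assumes a llama.cpp "llama-server" running locally on its default port with the "/completion" endpoint, and the field names are the ones that server uses as far as I know. Other apps name these differently (IE: "repetition_penalty_range"), so check your app's API docs. The "n_predict" value and the helper names are my own choices for illustration.

```python
import json
import urllib.request


def build_payload(prompt: str, temperature: float = 0.8,
                  repeat_penalty: float = 1.05) -> dict:
    """Request body using the primary testing parameters listed above."""
    return {
        "prompt": prompt,
        "n_predict": 512,                  # assumed output length
        "temperature": temperature,        # ranged: 0 to 5
        "repeat_penalty": repeat_penalty,  # ranged: 1.02 to 1.15
        "top_k": 40,
        "min_p": 0.05,
        "top_p": 0.95,
        "repeat_last_n": 64,
    }


def generate(prompt: str,
             url: str = "http://127.0.0.1:8080/completion") -> str:
    """POST the payload to a locally running server (assumed address)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

Calling `generate("Write a short scene set in a lighthouse.")` only works while a server is running; `build_payload` on its own is enough to see exactly what would be sent.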
Quick Reference Table - Parameters, Samplers, Advanced Samplers
Compiled by: "EnragedAntelope"
https://huggingface.co/EnragedAntelope
https://github.com/EnragedAntelope
This section will get you started - especially with class 3 and 4 models - and the detail section will cover settings / control in more depth below.
Please see sections below this for advanced usage, more details, settings, notes etc etc.
# LLM Parameters Reference Table

| Parameter | Description |
|----------- |-------------|
| Primary Parameters |
| temperature | Controls randomness of outputs (0 = deterministic, higher = more random). Range: 0-5 |
| top-p | Selects tokens with probabilities adding up to this number. Higher = more random results. Default: 0.9 |
| min-p | Discards tokens with probability smaller than this value × probability of most likely token. Default: 0.1 |
| top-k | Selects only top K most likely tokens. Higher = more possible results. Default: 40 |
| Penalty Samplers |
| repeat-last-n | Number of tokens to consider for penalties. Critical for preventing repetition. Default: 64 (Class 3/4 - but see notes) |
| repeat-penalty | Penalizes repeated token sequences. Range: 1.0-1.15. Default: 1.0 |
| presence-penalty | Penalizes token presence in previous text. Range: 0-0.2 for Class 3, 0.1-0.35 for Class 4 |
| frequency-penalty | Penalizes token frequency in previous text. Range: 0-0.25 for Class 3, 0.4-0.8 for Class 4 |
| penalize-nl | Penalizes newline tokens. Generally unused. Default: false |
| Secondary Samplers |
| mirostat | Controls perplexity during sampling. Modes: 0 (off), 1, or 2 |
| mirostat-lr | Mirostat learning rate. Default: 0.1 |
| mirostat-ent | Mirostat target entropy. Default: 5.0 |
| dynatemp-range | Range for dynamic temperature adjustment. Default: 0.0 |
| dynatemp-exp | Exponent for dynamic temperature scaling. Default: 1.0 |
| tfs | Tail free sampling - removes low-probability tokens. Default: 1.0 |
| typical | Selects tokens more likely than random given prior text. Default: 1.0 |
| xtc-probability | Probability of token removal. Range: 0-1 |
| xtc-threshold | Threshold for considering token removal. Default: 0.1 |
| Advanced Samplers |
| dry_multiplier | Controls DRY (Don't Repeat Yourself) intensity. Range: 0.8-1.12+ |
| dry_allowed_length | Allowed length for repeated sequences in DRY. Default: 2 |
| dry_base | Base value for DRY calculations. Range: 1.15-1.75+ for Class 4 |
| smoothing_factor | Quadratic sampling intensity. Range: 1-3 for Class 3, 3-5+ for Class 4 |
| smoothing_curve | Quadratic sampling curve. Range: 1 for Class 3, 1.5-2 for Class 4 |
Notes
- For Class 3 and 4 models, using both DRY and Quadratic sampling is recommended
- Lower quants (Q2K, IQ1s, IQ2s) may require stronger settings due to compression damage
- Parameters interact with each other, so test changes one at a time
- Always test with temperature at 0 first to establish a baseline
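If you keep presets in a script, the Class 3 / Class 4 ranges from the table can be checked mechanically. The sketch below copies a few of the ranges from the table; the dictionary layout and the function name are hypothetical, and the upper bounds marked "+" in the table (IE: "3-5+") are soft limits that this simple check treats as hard ones.

```python
# Suggested (low, high) ranges copied from the table above, per class.
SUGGESTED = {
    3: {"presence_penalty": (0.0, 0.2), "frequency_penalty": (0.0, 0.25),
        "smoothing_factor": (1.0, 3.0), "smoothing_curve": (1.0, 1.0)},
    4: {"presence_penalty": (0.1, 0.35), "frequency_penalty": (0.4, 0.8),
        "smoothing_factor": (3.0, 5.0), "smoothing_curve": (1.5, 2.0)},
}


def out_of_range(model_class: int, settings: dict) -> list:
    """List the settings that fall outside the table's suggested range."""
    unbounded = (float("-inf"), float("inf"))
    flagged = []
    for name, value in settings.items():
        low, high = SUGGESTED[model_class].get(name, unbounded)
        if not low <= value <= high:
            flagged.append(name)
    return flagged


print(out_of_range(4, {"presence_penalty": 0.05, "smoothing_factor": 4.0}))
# prints ['presence_penalty']
```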
Section 1a : PRIMARY PARAMETERS - ALL APPS:
These parameters will have SIGNIFICANT effect on prose, generation, length and content; with temp being the most powerful.
Keep in mind the biggest parameter / random "unknown" is your prompt.
A word change, rephrasing, punctuation, even a comma or semi-colon, can drastically alter the output, even at min temp settings. CAPS affect generation too.
Likewise the size, and complexity of your prompt impacts generation too ; especially clarity and direction.
Special note:
Pre-prompts / system role are not discussed here. Many of the model repo cards (at my repo) have an optional pre-prompt you can use to aid generation (and can impact instruction following too).
Some of my newer models repo cards use a limited form of this called a "prose control" (discussed and shown by example).
Roughly a pre-prompt / system role is embedded during each prompt and can act as a guide and/or set of directives for processing the prompt and/or containing generation instructions.
A prose control is a simplified version of this, which precedes the main prompt(s) - but the idea / effect is relatively the same (pre-prompt/system role does have a slightly higher priority however).
I strongly suggest you research these online, as they are a powerful addition to your generation toolbox.
They are especially potent with newer model archs, because newer model types have stronger instruction following abilities AND increased context too.
PRIMARY PARAMETERS:
temp / temperature
temperature (default: 0.8)
Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
Range 0 to 5. Increment at .1 per change.
Too much temp can affect instruction following in some cases and sometimes not enough = boring generation.
Newer model archs (L3,L3.1,L3.2, Mistral Nemo, Gemma2 etc) many times NEED more temp (1+) to get their best generations.
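To see what temperature does to token probabilities, here is a minimal sketch of the standard "divide the logits by temp, then softmax" formulation. The logits are made-up scores for three candidate tokens, and treating temp 0 as "pick the top token" matches the description above; individual apps may special-case it differently.

```python
import math


def apply_temperature(logits, temperature):
    """Return token probabilities after temperature scaling."""
    if temperature == 0:
        # Deterministic: all probability on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0, 0.8, 1.5):
    print(temp, [round(p, 3) for p in apply_temperature(logits, temp)])
```

Higher temp flattens the distribution, so less likely tokens get picked more often; lower temp sharpens it towards the top token.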
top-p
top-p sampling (default: 0.9, 1.0 = disabled)
If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
Lowering this can simplify word choices, but note that it works in conjunction with "top-k".
I use a default of: .95
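A minimal sketch of top-p (nucleus) filtering follows. It uses the common formulation where tokens are taken from most to least likely until their running total reaches top_p, and the token that crosses the line is kept; that detail is an assumption and apps may differ slightly at the boundary. The probabilities are invented for illustration.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised survivors


probs = [0.5, 0.3, 0.15, 0.05]  # made-up token probabilities
print(sorted(top_p_filter(probs, 0.9)))  # token ids that survive
print(sorted(top_p_filter(probs, 0.7)))
```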
min-p
min-p sampling (default: 0.1, 0.0 = disabled)
Tokens with probability smaller than (min_p) * (probability of the most likely token) are discarded.
I use default: .05 ;
Careful adjustment of this parameter can result in more "wordy" or "less wordy" generation but this works in conjunction with "top-k".
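The min-p rule stated above (discard tokens below min_p times the top token's probability) can be sketched in a few lines. The probabilities are invented for illustration, and the renormalisation step is the usual follow-up, not something specific to any one app.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * (top probability)."""
    threshold = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= threshold}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}  # renormalised survivors


probs = [0.5, 0.3, 0.15, 0.05]  # made-up token probabilities
print(sorted(min_p_filter(probs, 0.05)))  # threshold 0.025: all tokens stay
print(sorted(min_p_filter(probs, 0.2)))   # threshold 0.1: last token dropped
```

Because the threshold scales with the top token, min-p prunes hard when the model is confident and lightly when many tokens are plausible.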
top-k
top-k sampling (default: 40, 0 = disabled)
Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
Bring this up to 80-120 for a lot more word choice, and below 40 for simpler word choices.
As this parameter operates in conjunction with "top-p" and "min-p" all three should be carefully adjusted one at a time.
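For completeness, a minimal top-k sketch: keep only the k most likely tokens and renormalise. The probabilities are invented for illustration; 0 is treated as "disabled" to match the default note above.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens (0 = disabled)."""
    if top_k <= 0:
        return dict(enumerate(probs))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalised survivors


probs = [0.1, 0.5, 0.05, 0.35]  # made-up token probabilities
print(sorted(top_k_filter(probs, 2)))  # ids of the two most likely tokens
```

In practice the three filters run one after another on the same candidate list, which is why changing one of them shifts what the others have left to work with.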
NOTE - "CORE" Testing with "TEMP":
For an interesting test, set "temp" to 0 ; this will give you the SAME generation for a given prompt each time.
Then adjust a word, phrase, sentence etc in your prompt, and generate again to see the differences.
(you should use a "fresh" chat for each generation)
Keep in mind this will show model operation at its LEAST powerful/creative level and should NOT be used to determine if the model works for your use case(s).
Then test your prompt(s) "at temp" to see the model in action. (5-10 generations recommended)
You can also use "temp=0" to test different quants of the same model to see generation differences (roughly minor "BIAS" changes, which reflect math changes due to compression/mixture differences between quants).
Another option is testing different models (at temp=0 AND of the same quant) to see how each handles your prompt(s).
Then test "at temp" with your prompt(s) to see the MODELS in action. (5-10 generations recommended)
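When comparing "temp=0" generations (prompt tweaks, different quants, different models), eyeballing long outputs gets tedious. One possible aid, sketched here with Python's standard "difflib", is a word-level similarity score plus a list of the spans that changed. The function names are my own and the sample sentences are invented.

```python
import difflib


def compare_generations(text_a: str, text_b: str) -> float:
    """Word-level similarity between two generations (1.0 = identical)."""
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return matcher.ratio()


def changed_words(text_a: str, text_b: str):
    """Yield (operation, old_words, new_words) for each differing span."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op != "equal":
            yield op, a[a0:a1], b[b0:b1]


first = "the storm rolled in over the quiet harbor"
second = "the storm crept in over the silent harbor"
print(round(compare_generations(first, second), 2))
print(list(changed_words(first, second)))
```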
Section 1b : PENALTY SAMPLERS - ALL APPS:
These samplers "trim" or "prune" output in real time.
The longer the generation, the stronger overall effect but that all depends on "repeat-last-n" setting.
For creative use cases, these samplers can alter prose generation in interesting ways.
Penalty parameters affect both per token and part of OR entire generation (depending on settings / output length).
CLASS 4: For these models it is important to activate / set all samplers as noted for maximum quality and control.
PRIMARY:
repeat-last-n
last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size) ("repetition_penalty_range" in oobabooga/text-generation-webui , "rp_range" in kobold)
THIS IS CRITICAL.
If set too high, you can get all kinds of issues (repeated words, sentences, paragraphs or "gibberish"), especially with class 3 or 4 models.
Likewise if you change this parameter it will drastically alter the output.
This setting also works in conjunction with all other "rep pens" below.
This parameter is the "RANGE" of tokens looked at for the samplers directly below.
SECONDARIES:
repeat-penalty
penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled) (commonly called "rep pen")
Generally this is set from 1.0 to 1.15 ; smallest increments are best IE: 1.01... 1.02 or even 1.001... 1.002.
This affects the creativity of the model overall, not just how words are penalized.
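Here is a rough sketch of how "repeat-penalty" and "repeat-last-n" work together. It follows the common formulation (positive scores divided by the penalty, negative scores multiplied by it) which, as far as I know, is what llama.cpp-family apps use; details can differ per app. Token ids and scores are invented, and a negative range is treated as "the whole history" as a stand-in for -1 = ctx_size.

```python
def apply_repeat_penalty(logits, history, repeat_penalty=1.1,
                         repeat_last_n=64):
    """Penalise tokens seen in the last `repeat_last_n` tokens of history.

    logits: dict of token id -> raw score; history: list of token ids.
    """
    if repeat_last_n == 0 or repeat_penalty == 1.0:
        return dict(logits)  # disabled
    window = history if repeat_last_n < 0 else history[-repeat_last_n:]
    penalised = dict(logits)
    for token in set(window):
        if token in penalised:
            score = penalised[token]
            # Dividing a positive score, or multiplying a negative one,
            # both make the token less likely.
            if score > 0:
                penalised[token] = score / repeat_penalty
            else:
                penalised[token] = score * repeat_penalty
    return penalised


logits = {1: 2.0, 2: -1.0, 3: 0.5}   # made-up scores
history = [1, 2, 9]                  # made-up previously generated ids
print(apply_repeat_penalty(logits, history, 2.0, repeat_last_n=2))
print(apply_repeat_penalty(logits, history, 2.0, repeat_last_n=3))
```

Note how the range decides which tokens get touched at all: with a range of 2, token 1 has already "aged out" and keeps its full score.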
presence-penalty
repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512-1024 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.05 to .2 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.1 to 0.35 may assist generation BUT SET "repeat-last-n" to 64.
frequency-penalty
repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.25 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.4 to 0.8 may assist generation BUT SET "repeat-last-n" to 64.
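Presence and frequency penalties are usually described as subtractions from a token's score: a flat amount if the token appeared at all in the range (presence), plus an amount per appearance (frequency). The sketch below assumes that common formulation; token ids and scores are invented for illustration.

```python
from collections import Counter


def apply_presence_frequency(logits, history, presence_penalty=0.0,
                             frequency_penalty=0.0, repeat_last_n=64):
    """Subtract presence and frequency penalties for recently seen tokens."""
    window = history[-repeat_last_n:] if repeat_last_n > 0 else []
    counts = Counter(window)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            # Flat hit for being present, plus a hit per occurrence.
            adjusted[token] -= presence_penalty + frequency_penalty * count
    return adjusted


logits = {1: 2.0, 2: 1.0}   # made-up scores
history = [1, 1, 1, 2]      # token 1 was used three times, token 2 once
print(apply_presence_frequency(logits, history, 0.1, 0.5))
```

This also shows why a large "repeat-last-n" combined with these penalties can cause trouble: over a long range, common words rack up high counts and get pushed down hard.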
penalize-nl
penalize newline tokens (default: false)
Generally this is not used.
Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS:
In some AI/LLM apps, these may only be available via JSON file modification and/or API.
For "text-gen-webui" and "Koboldcpp" these are directly accessible (and via Sillytavern IF you use either of these APPS to connect Silly Tavern to their API).
i) OVERALL GENERATION CHANGES (affect per token as well as over all generation):
mirostat
Use Mirostat sampling. "Top K", "Nucleus", "Tail Free" (TFS) and "Locally Typical" (TYPICAL) samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
"mirostat-lr"
Mirostat learning rate, parameter eta (default: 0.1). Named "mirostat_eta" in some apps.
mirostat_eta: 0.1 is a good value.
"mirostat-ent"
Mirostat target entropy, parameter tau (default: 5.0). Named "mirostat_tau" in some apps.
mirostat_tau: 5-8 is a good value.
Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the paper. ( https://arxiv.org/abs/2007.14966 )
This is the big one ; activating this will help with creative generation. It can also help with stability. Also note which samplers are disabled/ignored here, and that "mirostat_eta" is a learning rate.
This is both a sampler (and pruner) and enhancement all in one.
It also has two modes of generation "1" and "2" - test both with 5-10 generations of the same prompt. Make adjustments, and repeat.
CLASS 3: for these models it is suggested to use this to assist with generation (min settings).
CLASS 4: for these models it is highly recommended, with Mirostat 1 or 2 + mirostat_tau @ 6 to 8 and mirostat_eta at .1 to .5
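To show why "tau" is a target and "eta" is a learning rate, here is a rough sketch of one Mirostat 2.0 step in Python, following my reading of the paper linked above. The function name and the dict of probabilities are hypothetical; real implementations work on the full logit array, and "mu" is usually started at 2 x tau.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One Mirostat 2.0 step. `probs` maps token -> probability.
    Returns (chosen_token, updated_mu)."""
    # "Surprise" of each candidate token, in bits.
    surprise = {t: -math.log2(p) for t, p in probs.items() if p > 0}
    # Prune candidates more surprising than the ceiling mu,
    # but always keep the single most likely token.
    best = min(surprise, key=surprise.get)
    kept = [t for t, s in surprise.items() if s <= mu] or [best]
    total = sum(probs[t] for t in kept)
    weights = [probs[t] / total for t in kept]
    chosen = rng.choices(kept, weights=weights, k=1)[0]
    # Observed surprise is measured on the renormalized distribution.
    observed = -math.log2(probs[chosen] / total)
    # Nudge mu so the running surprise drifts toward tau.
    return chosen, mu - eta * (observed - tau)

token, mu = mirostat_v2_step({"a": 0.5, "b": 0.25, "c": 0.25}, mu=10.0)
print(token, mu)
```

A higher tau lets more surprising tokens through; a higher eta makes the ceiling react faster from token to token.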
Dynamic Temperature
"dynatemp-range "
dynamic temperature range (default: 0.0, 0.0 = disabled)
"dynatemp-exp"
dynamic temperature exponent (default: 1.0)
In: oobabooga/text-generation-webui (has on/off, and high / low) :
Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
This allows the model to CHANGE temp during generation. This can greatly affect creativity, dialog, and other contrasts.
For Koboldcpp a converter is available and in oobabooga/text-generation-webui you just enter low/high/exp.
CLASS 4 only: Suggested this is on, with a low/high of .8 to 1.8 (note the range here of "1" between high and low), and the exponent set to 1 (other values, lower or higher, work too).
To set manually (IE: Api, lmstudio, Llamacpp, etc) using "range" and "exp" ; this is a bit more tricky: (example is to set range from .8 to 1.8)
1 - Set the "temp" to 1.3 (the regular temp parameter)
2 - Set the "range" to .500 (this gives you ".8" to "1.8" with "1.3" as the "base")
3 - Set exp to 1 (or as you want).
This is both an enhancement and in some ways fixes issues in a model when too little temp (or too much/too much of the same) affects generation.
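The low/high to temp/range conversion above, plus the entropy scaling itself, can be sketched in a few lines of Python. This assumes the usual implementation (temperature moves between temp - range and temp + range according to the normalized entropy of the candidate tokens); both function names are hypothetical.

```python
import math

def low_high_to_temp_range(low, high):
    """Convert a low/high pair into (temp, dynatemp_range)."""
    return (low + high) / 2.0, (high - low) / 2.0

def dynamic_temperature(probs, temp, dynatemp_range, exponent=1.0):
    """Entropy-scaled temperature for one token's candidate probabilities."""
    low = max(0.0, temp - dynatemp_range)
    high = temp + dynatemp_range
    if len(probs) < 2:
        return low
    # Entropy is 0 when one token dominates, maximal when all are equal.
    entropy = max(0.0, -sum(p * math.log(p) for p in probs if p > 0))
    normalized = entropy / math.log(len(probs))
    return low + (high - low) * normalized ** exponent

temp, rng = low_high_to_temp_range(0.8, 1.8)
print(temp, rng)
print(dynamic_temperature([0.7, 0.1, 0.1, 0.1], temp, rng))
```

When the model is "sure" of the next token the temperature drops toward the low end; when many tokens are equally plausible it rises toward the high end.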
ii) PER TOKEN CHANGES:
tfs
Tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. The closer to 0, the more discarded tokens. ( https://www.trentonbricken.com/Tail-Free-Sampling/ )
typical
Locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
XTC
"xtc-probability"
xtc probability (default: 0.0, 0.0 = disabled)
Probability that the removal will actually happen. 0 disables the sampler. 1 makes it always happen.
"xtc-threshold"
xtc threshold (default: 0.1, 1.0 = disabled)
If 2 or more tokens have probability above this threshold, consider removing all but the last one.
XTC is a new sampler that adds an interesting twist to generation. I suggest you experiment with this one, with other advanced samplers disabled, to see its effects.
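A minimal sketch of the XTC idea in Python, based only on the two descriptions above: with some probability, if two or more tokens are above the threshold, drop all of them except the least likely one. The function name and the dict of probabilities are hypothetical; real implementations work on the sorted candidate list.

```python
import random

def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """Return a copy of `probs` (token -> probability) after XTC."""
    # The removal only happens some of the time.
    if rng.random() >= probability:
        return dict(probs)
    above = [t for t, p in probs.items() if p >= threshold]
    if len(above) < 2:
        return dict(probs)
    # Keep the least likely of the "top choices"; drop the others.
    keep = min(above, key=probs.get)
    kept = {t: p for t, p in probs.items() if t == keep or t not in above}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

probs = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
print(xtc_filter(probs, threshold=0.1, probability=1.0))
```

Because the most obvious choices are the ones removed, the model is pushed toward less predictable wording while the low-probability tail is left alone.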
-l, --logit-bias TOKEN_ID(+/-)BIAS
modifies the likelihood of a token appearing in the completion, i.e. --logit-bias 15043+1 to increase the likelihood of token ' Hello', or --logit-bias 15043-1 to decrease the likelihood of token ' Hello'.
This may or may not be available. This requires a bit more work.
Note: +- range is 0 to 100.
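What a bias actually does to the odds is easy to see with a toy example in Python. This is a sketch only: the three token ids other than 15043 and all the logit values are made up, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Turn a dict of token_id -> logit into token_id -> probability."""
    top = max(logits.values())
    exps = {t: math.exp(v - top) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def apply_logit_bias(logits, bias):
    """Add each bias to the matching token's logit; other tokens are untouched."""
    return {t: v + bias.get(t, 0.0) for t, v in logits.items()}

logits = {15043: 1.0, 100: 1.0, 200: 0.5}  # made-up values
print(softmax(logits)[15043])
print(softmax(apply_logit_bias(logits, {15043: +1.0}))[15043])
print(softmax(apply_logit_bias(logits, {15043: -1.0}))[15043])
```

The bias is added before the probabilities are computed, so a small value shifts the odds while a very large negative value behaves almost like a ban.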
IN "oobabooga/text-generation-webui" there is "TOKEN BANNING":
This is a very powerful pruning method; which can drastically alter output generation.
I suggest you get some "bad outputs" ; get the "tokens" (actual number for the "word" / part word) then use this.
Careful testing is required, as this can have unclear side effects.
SECTION 2: ADVANCED SAMPLERS - "text-generation-webui" / "KOBOLDCPP" / "SillyTavern" (see note 1 below):
Additional Parameters / Samplers, including "DRY", "QUADRATIC" and "ANTI-SLOP".
Note #1 :
You can use these samplers via Sillytavern IF you use either of these APPS (Koboldcpp/Text Generation Webui) to connect Silly Tavern to their API.
Other Notes:
Hopefully ALL these samplers / controls will be added to LLAMACPP and available to all users via AI/LLM apps soon.
"DRY" sampler has been added to Llamacpp as of the time of this writing (and available via SERVER/LLAMA-SERVER.EXE) and MAY appear in other "downstream" apps that use Llamacpp.
INFORMATION ON THESE SAMPLERS:
For more info on what they do / how they affect generation see:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(also see the "Additional Links" section for more info on the parameters/samplers)
ADVANCED SAMPLERS - PART 1:
Keep in mind these parameters/samplers become available (for GGUFs) in "oobabooga/text-generation-webui" when you use the llamacpp_HF loader.
Most of these are also available in KOBOLDCPP too (via settings -> samplers) after start up (no "llamacpp_HF loader" step required).
I am not going to touch on all of the samplers / parameters, just the main ones at the moment.
However, you should also check / test operation of (these are in Text Generation WebUI, and may be available via API / In Sillytavern (when connected to Text Generation Webui)):
a] Affects per token generation:
- top_a
- epsilon_cutoff - see note 4
- eta_cutoff - see note 4
- no_repeat_ngram_size - see note #1.
b] Affects generation including phrase, sentence, paragraph and entire generation:
- no_repeat_ngram_size - see note #1.
- encoder_repetition_penalty "Hallucinations filter" - see note #2.
- guidance_scale (with "Negative prompt" ) => this is like a pre-prompt/system role prompt - see note #3.
- Disabling the BOS TOKEN - this can make the replies more creative.
- Custom stopping strings
Note 1:
"no_repeat_ngram_size" appears in both because it can impact per token OR per phrase depending on settings. This can also drastically affect sentence, paragraph and general flow of the output.
Note 2:
This parameter, if set to LESS than 1, causes the model to "jump" around a lot more, whereas above 1 causes the model to focus more on the immediate surroundings.
If the model is crafting a "scene", a setting of less than 1 causes the model to jump around the room, outside, etc etc ; if more than 1 then it focuses the model more on the moment, the immediate surroundings, the POV character and details in the setting.
Note 3:
This is a powerful method to send instructions / directives to the model on how to process your prompt(s) each time. See [ https://arxiv.org/pdf/2306.17806 ]
Note 4:
These control selection of tokens, in some cases providing more relevant and/or more options. See [ https://arxiv.org/pdf/2210.15191 ]
MAIN ADVANCED SAMPLERS PART 2 (affects per token AND overall generation):
What I will touch on here are special settings for CLASS 3 and CLASS 4 models (for the first TWO samplers).
For CLASS 3 you can use one or both.
For CLASS 4 using BOTH are strongly recommended, or at minimum "QUADRATIC SAMPLING".
These samplers (along with "penalty" settings) work in conjunction to "wrangle" the model / control it and get it to settle down, important for Class 3 but critical for Class 4 models.
For other classes of models, these advanced samplers can enhance operation across the board.
For Class 3 and Class 4 the goal is to use the LOWEST settings to keep the model inline rather than "over prune it".
You may therefore want to experiment with dropping the settings (SLOWLY) for Class 3/4 models from those suggested below.
DRY:
Dry ("Don't Repeat Yourself") affects repetition (and repeat "penalty") at the word, phrase, sentence and even paragraph level. Read about "DRY" in the "Additional Links" section.
Class 3:
dry_multiplier: .8
dry_allowed_length: 2
dry_base: 1
Class 4:
dry_multiplier: .8 to 1.12+
dry_allowed_length: 2 (or less)
dry_base: 1.15 to 1.75+
Dial the "dry_multiplier" up or down to "rein in" or "release the madness" so to speak from the core model.
For Class 4 models this is used to control some of the model's bad habit(s).
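How the three DRY settings combine can be sketched in Python as below. This follows my understanding of the pull request linked in this section (penalty = multiplier x base ^ (match length - allowed length)); it leaves out "sequence breakers" and other details, and the function name is hypothetical.

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2):
    """Map token -> penalty to subtract from that token's logit.
    `context` is the list of token ids seen so far."""
    penalties = {}
    if not context:
        return penalties
    end = len(context) - 1
    last = context[end]
    # Look at every earlier occurrence of the most recent token.
    for i in range(end):
        if context[i] != last:
            continue
        # Count how many tokens before position i match the tokens
        # before the end of the context.
        match_len = 1
        while i - match_len >= 0 and context[i - match_len] == context[end - match_len]:
            match_len += 1
        if match_len < allowed_length:
            continue
        # The token that followed last time would extend the repeat.
        nxt = context[i + 1]
        penalty = multiplier * base ** (match_len - allowed_length)
        penalties[nxt] = max(penalties.get(nxt, 0.0), penalty)
    return penalties

# The run 5, 6, 7 was followed by 9 before, so 9 is penalized now.
print(dry_penalties([5, 6, 7, 9, 5, 6, 7]))
```

Note that with "dry_base" at 1 the exponential term is always 1, so in this sketch every qualifying repeat gets the same penalty (the multiplier) however long it is; a base above 1 makes the penalty grow with the length of the repeat.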
For more information on "DRY":
https://github.com/oobabooga/text-generation-webui/pull/5677
https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
QUADRATIC SAMPLING: AKA "Smoothing"
This sampler alters the "score" of ALL TOKENS at the time of generation and as a result affects the entire generation of the model. See the "Additional Links" section for more information.
Class 3:
smoothing_factor: 1 to 3
smoothing_curve: 1
Class 4:
smoothing_factor: 3 to 5 (or higher)
smoothing_curve: 1.5 to 2.
Dial the "smoothing factor" up or down to "rein in" or "release the madness" so to speak.
In Class 3 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model.
In Class 4 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model AND wrangling in some of the core model's bad habits.
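For the curious, the transformation can be sketched in Python as below. The formula is the one I recall from the text-generation-webui implementation, so treat it as an assumption and check the link in this section; the function name is hypothetical and the logits are made up.

```python
def quadratic_smoothing(logits, smoothing_factor=1.0, smoothing_curve=1.0):
    """Re-score every logit by its distance from the top logit."""
    top = max(logits)
    k = (3.0 - smoothing_curve) / 2.0   # weight of the quadratic term
    s = (smoothing_curve - 1.0) / 2.0   # weight of the cubic term
    out = []
    for x in logits:
        d = x - top                     # always <= 0
        out.append(-(k * smoothing_factor * d ** 2)
                   + s * smoothing_factor * d ** 3 + top)
    return out

print(quadratic_smoothing([2.0, 1.0, 0.0], smoothing_factor=1.0))
print(quadratic_smoothing([2.0, 1.0, 0.0], smoothing_factor=3.0))
```

The top token keeps its score, tokens close to it are pulled closer, and tokens far below it are pushed much further down; a higher "smoothing_factor" makes the drop-off steeper, which is why higher values calm the model down.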
For more information on Quadratic Sampling:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
ANTI-SLOP - KoboldCPP only
Hopefully this powerful sampler will soon appear in all LLM/AI apps.
You can access this in the KoboldCPP app, under "context" -> "tokens" on the main page of the app after start up.
This sampler allows banning words and phrases DURING generation, forcing the model to "make another choice".
This is a game changer in custom real time control of the model.
For more information on ANTI SLOP project (owner runs EQBench):
https://github.com/sam-paech/antislop-sampler
FINAL NOTES:
Keep in mind that these settings/samplers work in conjunction with "penalties" ; which is especially important for operation of CLASS 4 models for chat / role play and/or "smoother operation".
For Class 3 models, "QUADRATIC" will have a slightly stronger effect than "DRY" relatively speaking.
If you use the Mirostat sampler, keep in mind this will interact with these two advanced samplers too.
And...
Smaller quants may require STRONGER settings (all classes of models) due to compression damage, especially for Q2K, and IQ1/IQ2s.
This is also influenced by the parameter size of the model in relation to the quant size.
IE: an 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result will require stronger settings.
DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
Most AI / LLM apps allow saving a "profile" of parameters and samplers - your "favorite" settings.
Text Generation Web Ui, Koboldcpp, Silly Tavern all have this feature and also "presets" (parameters/samplers set already) too.
Other AI/LLM apps also have this feature to varying degrees too.
DETAILS on PARAMETERS / SAMPLERS:
For additional details on these samplers settings (including advanced ones) you may also want to check out:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader in "text-generation-webui" )
Additional Links (on parameters, samplers and advanced samplers):
A Visual Guide of some top parameters / Samplers in action which you can play with and see how they interact:
https://artefact2.github.io/llm-sampling/index.xhtml
General Parameters:
https://arxiv.org/html/2408.13586v1
The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail)
https://rentry.org/llm-settings
LLAMACPP-SERVER EXE - usage / parameters / samplers:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
DRY
- https://github.com/oobabooga/text-generation-webui/pull/5677
- https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
- https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
Samplers:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2
https://huggingface.co/Virt-io/SillyTavern-Presets
Creative Writing :
https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/
Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs
NOTE:
I have also added notes in the sections below for almost all parameters, samplers, and advanced samplers.
OTHER:
Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.
ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
1 - Set temp to 0 (zero) and set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.
2 - If you want to test basic parameter changes, test ONE at a time, then compare output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.
3 - Then start testing TWO parameters at a time, and comparing again. Keep in mind parameters (all) interact with each other.
4 - Samplers -> Reset your basic parameters, (temp still at zero) and test each one of these, one at a time. Then adjust settings, test again.
5 - Once you have an "idea" of how each affects your "test prompt" , now test at "temp" (not zero). It may take five to ten generations to get a rough idea.
Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in - it is worth it.
IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
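The "one at a time" part of the steps above can be organized with a small helper like this Python sketch. It only builds the list of settings to try; the parameter names and values are placeholders, and running each one (in a fresh chat) is left to whatever app or API you use.

```python
def one_at_a_time(baseline, candidates):
    """Yield (label, settings) pairs that each change a single parameter."""
    yield "default", dict(baseline)
    for name, values in candidates.items():
        for value in values:
            if baseline.get(name) == value:
                continue  # same as the default run, skip it
            settings = dict(baseline)
            settings[name] = value
            yield f"{name}={value}", settings

# Placeholder names/values: temp at zero for the "default" generation.
baseline = {"temperature": 0.0, "top_k": 40, "repeat_penalty": 1.0}
candidates = {"repeat_penalty": [1.05, 1.1], "top_k": [20, 100]}

for label, settings in one_at_a_time(baseline, candidates):
    print(label, settings)
```

Keeping the label with each output makes it much easier to compare generations against the "default" afterwards.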
Keep in mind that parameters, samplers and advanced samplers can affect the model on a per token generation basis AND/OR on a multi-token / phrase / sentence / paragraph and even complete generation basis.
Everything is cumulative here, regardless of whether the parameter/sampler works on a per token or multi-token basis, because of how models "look back" to see what was generated in some cases.
And of course... each model will be different too.
All that being said, it is a good idea to have specific generation quality "goals" in mind.
Likewise, at my repo, I post example generations so you can get an idea (but not complete picture) of a model's generation abilities.
The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong instruction following so many times better (or just included!) instructions in your prompts can make a world of difference.
Not sure if the model understands your prompt(s)?
Ask it ->
"Check my prompt below and tell me how to make it clearer?" (prompt after this line)
"For my prompt below, explain the steps you would take to execute it" (prompt after this line)
This will help the model fine tune your prompt so IT understands it.
However, sometimes parameters and/or samplers are required to better "wrangle" the model and get it to perform to its maximum potential and/or fine tune it to your use case(s).