---
language:
  - en
tags:
  - testing
  - llm
  - rp
  - discussion
---

I'll be using this space to post my (very personal) methodology for testing interesting models I come across. In the meantime, you can check this topic to see the first test category I'm releasing (thanks to Lewdiculous for hosting).

## Testing Environment

All models are loaded in Q8_0 (GGUF) with KoboldCPP 1.65 for Windows (CUDA 12), using CuBLAS without MMQ. All layers are offloaded to the GPU (NVIDIA RTX 3060 12GB).

The frontend is the staging version of Silly Tavern.

All models are extended to a 16K context length (auto RoPE from KCPP) with Flash Attention enabled. Response size is capped at 1024 tokens.

Fixed Seed for all tests: 123
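
For reference, here is a minimal sketch of that loading configuration as a KoboldCPP launch command built in Python. The flag names are the standard KoboldCPP CLI options as I understand them (check `--help` for your build); the model path and layer count are placeholders. The seed (123) and the 1024-token response cap are set in the Silly Tavern frontend, not on the command line.

```python
# Sketch of the KoboldCPP launch used for testing (flag names assumed from the
# standard KoboldCPP CLI; model path and layer count are placeholders).
import subprocess

args = [
    "koboldcpp.exe",
    "--model", "some-model.Q8_0.gguf",  # placeholder: Q8_0 GGUF quant
    "--usecublas",                      # CuBLAS backend, MMQ left disabled
    "--gpulayers", "99",                # offload every layer to the RTX 3060
    "--contextsize", "16384",           # 16K context; KCPP applies auto RoPE
    "--flashattention",                 # Flash Attention enabled
]

# Seed 123 and the 1024-token response cap are configured in the Silly Tavern
# frontend rather than here.
subprocess.run(args)
```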

## Instruct Format

All models are tested in whichever instruct format they are supposed to be comfortable with.

However, a note for merge authors (you're the main culprits here): I'm not going to hunt through tons of parent models to figure out which instruct format you're using (nor will most people). If it's not stated on your page, I'll assume L3 Instruct for Llama models and ChatML for Mistral ones. If you're using neither of those, nor Alpaca, I'm not testing your model.
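
For reference, a minimal sketch of the three formats mentioned, written as plain Python template strings. These follow the commonly published token layouts (BOS/EOS handling left to the backend) and are illustrative only, not the exact Silly Tavern presets used in testing.

```python
# Illustrative single-turn prompt templates for the three instruct formats
# mentioned above (commonly published layouts, not the exact presets used here).

LLAMA3_INSTRUCT = (
    "<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

CHATML = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{user}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

ALPACA = "{system}\n\n### Instruction:\n{user}\n\n### Response:\n"

# Example: ALPACA.format(system="You are a helpful dog.", user="Speak!")
```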

## System Prompt

[[TODO: add files to repo]]

## Available Tests

[[TODO: add to discussions]]

- Dog Persona Test - Tests the model's ability to follow a card despite user actions (and the natural inclination of an LLM), and to compartmentalize actions and dialogue.
- Long Context Test - Various tasks to be executed at full context.
- Group Coherence Test - Tests models in group settings.

## Limitations

I'm testing for things I'm interested in. I do not pretend any of this is scientific or accurate. As much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day; other seeds, or the smallest change in settings, are bound to give very different results.

I gave the models a fair shake in more casual settings, regenerating tons of outputs with random seeds (see individual tests); while there are (large) variations, they tend to even out to the results shown in testing.