SerialKicked committed
Commit 3d08f1a · 1 Parent(s): 20c701a
Update README.md
README.md CHANGED
@@ -8,40 +8,48 @@ tags:
 - discussion
 ---

 # Testing Environment

-All models are loaded in Q8_0 (GGUF)
-Fixed Seed for all tests: 123
-#
-- [Dog Persona Test](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13) - Testing the ability for the model to follow a card despite user actions (and natural inclination of a LLM). Ability to compartmentalize actions and dialogs.
-- Long Context Test - Various tasks to be executed at full context
-- Group Coherence Test - Testing models in group settings

 # Limitations

-I'm testing for things I'm interested in. I do not pretend any of this is scientific or accurate
-I
 - discussion
 ---

+# Why? What? TL;DR?
+
+Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on things I'm *personally* looking for in a model, like its ability to obey a character card and a system prompt accurately. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run those tests on their own.

 # Testing Environment

+- All models are loaded in Q8_0 (GGUF) with all layers on the GPU (NVIDIA RTX 3060 12GB).
+- Backend is the latest version of KoboldCPP for Windows using CUDA 12.
+- Using **CuBLAS** but **not using QuantMatMul (mmq)**.
+- All models are extended to **16K context length** (auto rope from KCPP) with **Flash Attention** and **ContextShift** enabled.
+- Frontend is the staging version of SillyTavern.
+- Response size set to 1024 tokens max.
+- Fixed Seed for all tests: **123**
+
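As a rough illustration of the settings listed above, the sketch below records them in one place and maps the per-request ones onto a KoboldAI-style generate request body. `EvalSettings` and `to_generate_payload` are hypothetical names, and the payload keys (`max_context_length`, `max_length`, `sampler_seed`) are assumptions that should be checked against the API documentation of the KoboldCPP build in use. Nothing here is taken from the repository files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSettings:
    """Hypothetical record of the values from the Testing Environment list."""
    quantization: str = "Q8_0"
    context_length: int = 16 * 1024   # "16K context length"
    max_response_tokens: int = 1024   # "Response size set to 1024 tokens max"
    seed: int = 123                   # "Fixed Seed for all tests: 123"
    # Backend-side options, kept only for the record; they are not part
    # of the request body built below.
    flash_attention: bool = True
    context_shift: bool = True
    use_mmq: bool = False             # CuBLAS without QuantMatMul


def to_generate_payload(prompt: str, settings: EvalSettings) -> dict:
    """Build a request body for a KoboldAI-style generate endpoint.

    The key names are assumptions, not confirmed by the README.
    """
    return {
        "prompt": prompt,
        "max_context_length": settings.context_length,
        "max_length": settings.max_response_tokens,
        "sampler_seed": settings.seed,
    }


payload = to_generate_payload("Hello", EvalSettings())
print(payload)
```

Keeping the values in a frozen dataclass makes it harder to change one of them by accident between runs, which matters when the whole point is a fixed, repeatable setup.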

+# System Prompt and Instruct Format

+- The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/).
+- All models are tested in whichever instruct format they are supposed to be comfortable with (as long as it's ChatML or L3 Instruct).


+# Available Tests

+### DoggoEval

+The goal of this test, featuring Rex (a dog) and his master (EsKa), is to determine whether a model is good at obeying a system prompt and character card. The trick is that dogs can't talk, but LLMs love to.

+- [Results and discussions are hosted in this thread](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13)
+- [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
+- TODO: Charts and screenshots

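The README does not say how DoggoEval replies are judged, so the following is only a guess at what an automated first pass could look like: flag any reply in which quoted text contains words other than dog sounds. The allow-list `DOG_SOUNDS` and the functions `spoken_words` and `dog_talks` are all hypothetical and are not part of the DoggoEval files.

```python
import re

# Hypothetical allow-list of dog sounds; not taken from the DoggoEval files.
DOG_SOUNDS = {"woof", "arf", "bark", "grr", "awoo", "ruff", "yip"}

# Matches text inside straight or curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def spoken_words(reply: str) -> list[str]:
    """Return quoted words that are not on the dog-sound allow-list."""
    words = []
    for match in QUOTED.finditer(reply):
        quoted = match.group(1) or match.group(2) or ""
        for word in re.findall(r"[A-Za-z']+", quoted):
            if word.lower() not in DOG_SOUNDS:
                words.append(word)
    return words


def dog_talks(reply: str) -> bool:
    """True if the reply appears to contain human speech in quotes."""
    return bool(spoken_words(reply))


print(dog_talks('*Rex wags his tail.* "Woof! Woof!"'))
print(dog_talks('Rex looks up. "I am hungry, master."'))
```

A check like this only catches the bluntest failures. Replies where the dog "thinks" in full sentences or narrates for the user would still need to be read by hand.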
+### MinotaurEval

+TODO: The goal of this test is to check whether a model is capable of following a very specific prompting method and maintaining situational awareness in the smallest labyrinth in the world.

+- Discussions will be hosted here.
+- Files and cards will be available soon (tm).


 # Limitations

+I'm testing for things I'm interested in. Do not ask for ERP-specific tests. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or even the smallest of changes, are bound to give very different results.

+I usually give the different models I'm testing a fair shake in a more casual setting. I regen tons of outputs with random seeds, and while there are (large) variations, they tend to even out to the results shown in testing. Otherwise I'll make a note of it.