---
language:
- en
tags:
- testing
- llm
- rp
- discussion
---

# Why? What? TL;DR?

Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on the things I'm *personally* looking for in a model, like its ability to accurately obey a character card and a system prompt. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run them on their own.


# Testing Environment

- The frontend is the staging version of Silly Tavern.
- The backend is the latest version of KoboldCPP for Windows, using CUDA 12 (an example launch command is sketched after this list).
- Using **CuBLAS** but **not using QuantMatMul (mmq)**.
- Fixed seed for all tests: **123**
- **7-10B Models:**
  - All models are loaded in Q8_0 (GGUF).
  - **Flash Attention** and **ContextShift** enabled.
  - All models are extended to **16K context length** (auto RoPE from KCPP).
  - Response size set to 1024 tokens max.
- **11-15B Models:**
  - All models are loaded in Q4_K_M or whatever is the highest/closest quant available (GGUF).
  - **Flash Attention** and **8-bit cache compression** are enabled.
  - All models are extended to **12K context length** (auto RoPE from KCPP).
  - Response size set to 512 tokens max.
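
As a rough guide, here is a minimal sketch of how the 7-10B configuration above could map to a KoboldCPP launch. The flag names follow recent KoboldCPP builds, the model filename is a placeholder, and the assumption that the seed and response cap are set from the frontend is mine rather than something stated above.

```python
# Minimal sketch (not the exact setup used for testing): launching KoboldCPP
# with the 7-10B settings described above.
import subprocess

cmd = [
    "koboldcpp.exe",
    "--model", "some-model.Q8_0.gguf",  # placeholder filename
    "--usecublas",                      # CuBLAS backend; 'mmq' sub-option omitted, so QuantMatMul stays off
    "--flashattention",                 # Flash Attention enabled
    "--contextsize", "16384",           # 16K context; KCPP applies its automatic RoPE scaling
    # ContextShift is enabled by default, so no extra flag is needed here.
    # The fixed seed (123) and the 1024-token response cap are assumed to be set
    # from the frontend (Silly Tavern) rather than at launch.
]
subprocess.run(cmd, check=True)
```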


# System Prompt and Instruct Format

- The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main).
- All models are tested in whichever instruct format they are supposed to be comfortable with (as long as it's either ChatML or L3 Instruct).
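
For readers unfamiliar with the two formats, the sketch below shows the standard structure of each template. The exact sequences and system prompt used in testing are the ones in the repository linked above; the example character and message are made up for illustration.

```python
# Standard layout of the two instruct formats mentioned above (sketch only;
# the actual instruct format files used for testing live in the linked repository).
CHATML = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{user}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

LLAMA3_INSTRUCT = (
    "<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Example: building a single-turn prompt in ChatML.
prompt = CHATML.format(system="You are Rex, EsKa's dog. Dogs cannot talk.",
                       user="Rex, fetch the ball!")
```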


# Available Tests

### DoggoEval

The goal of this test, featuring a dog (Rex) and his owner (EsKa), is to determine whether a model is good at obeying a system prompt and character card. The trick is that dogs can't talk, but LLMs love to. A rough sketch of the kind of failure this test looks for follows the links below.

- [Results and discussions are hosted in this thread](https://huggingface.co/SerialKicked/ModelTestingBed/discussions/1) ([old thread here](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13))
- [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
- TODO: Charts and screenshots
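
To make the criterion concrete, here is a purely illustrative Python heuristic. It is not the scoring process used in the thread (which relies on reading the transcripts); it only flags the obvious failure mode where a reply written as Rex contains quoted human speech.

```python
import re

def rex_speaks(reply: str) -> bool:
    """Naive check: does a reply written as Rex, a dog who cannot talk, contain quoted dialogue?

    This will also trigger if the model quotes EsKa instead, so it is only a rough
    signal, not a substitute for reading the output.
    """
    return bool(re.search(r'"[^"]+"|“[^”]+”', reply))

print(rex_speaks("*Rex tilts his head, tail thumping against the floor.*"))  # False
print(rex_speaks('"I missed you so much, EsKa!" Rex barks happily.'))        # True
```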

### MinotaurEval

TODO: The goal of this test is to check whether a model is capable of following a very specific prompting method and maintaining situational awareness in the smallest labyrinth in the world.

- Discussions will be hosted here.
- Files and cards will be available soon (tm).

### TimeEval

TODO: The goal of this test is to see whether the bot can behave properly at 16K context, and recall and summarise "old" information accurately.

- Discussions will be hosted here.
- Files and cards will be available soon (tm).


# Limitations 

I'm testing for things I'm interested in. I don't pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or the smallest change in settings, are bound to give very different results.

I usually give the models I'm testing a fair shake in a more casual setting. I regenerate tons of outputs with random seeds, and while there are (large) variations, things tend to even out to the results shown in testing; otherwise, I'll make a note of it.