SerialKicked
committed on
Commit
•
7e65b3f
1
Parent(s):
f6fde87
Update README.md
README.md
CHANGED
@@ -35,7 +35,7 @@ Simply put, I'm making my methodology to evaluate RP models public. While none o
 
 ### DoggoEval
 
-The goal of this test featuring
+The goal of this test, featuring a dog (Rex) and his owner (EsKa), is to determine if a model is good at obeying a system prompt and character card. The trick being that dogs can't talk, but LLM love to.
 
 - [Results and discussions are hosted in this thread](https://huggingface.co/SerialKicked/ModelTestingBed/discussions/1) ([old thread here](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13))
 - [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
@@ -51,6 +51,6 @@ TODO: The goal of this test is to check if a model is able of following a very s
 
 # Limitations
 
-I'm testing for things I'm interested in.
+I'm testing for things I'm interested in. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the amount of variables, a small LLM is still a small LLM at the end of the day. The results for other seeds, or with the smallest of change, are bound to give very different results.
 
 I usually give the different models I'm testing a fair shake in a more casual settings. I regen tons of outputs with random seeds, and while there are (large) variations, it tends to even out to the results shown in testing. Otherwise I'll make a note of it.