Exploring Name Diversity in Modern LLMs: A Grimdark Trilogy Experiment
If you've ever tried generating stories using modern instruct-tuned LLMs, you've likely noticed a striking lack of diversity in the names that pop up. Kingdom of Eldoria (generic fantasy kingdom), Elara (generic fantasy female), Kaleb (grimdark), Malachi (grimdark), Lily (generic female): these names seem to dominate when you try to generate certain kinds of stories. Ever wondered just how skewed the probabilities are? I did, and I decided to put it to the test.
Initial Test with Instruct Models
For an initial test, I selected 8 models that were readily available on my hard drive:
- Mistral-Large-Instruct-2407 Q6_K
- c4ai-command-r-plus Q6_K
- Qwen2.5-72B-Instruct Q6_K
- goliath-120b Q6_K
- miqu-1-70b Q5_K_M
- WizardLM-2-8x22B Q6_K
- ArcaneEntanglement-model64-70b Q6_K
- Gembo-v1-70b Q6_K
Note: The last two models are purely experimental and are not being promoted here.
I employed a prompt crafted by @jukofyork to test the models:
[Model-appropriate user tag]
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small (magical!?) mirror he stole from the cult. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert. He has an arrow lodged in his shoulder and is losing his mind due to infection and thirst.
[Model-appropriate assistant tag]
The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named
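The prefilled assistant turn ends mid-sentence, so the very next token the model produces is the start of the cultist's name. Here is a minimal sketch of reading out that next-token distribution, assuming you can already get raw logits for the next position from whatever backend you run. The function name `top_k_probs` and the `vocab` and `logits` values are made up for illustration and are not measurements from this test.

```python
import math

def top_k_probs(logits, vocab, k=10):
    """Softmax the logits and return the k most probable (token, prob) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]

# Hypothetical toy values, NOT real model output.
vocab = [" Kal", " Mal", " El", " [", " Jor"]
logits = [3.0, 2.0, 1.0, 0.5, 0.0]

for token, p in top_k_probs(logits, vocab, k=3):
    print(f"{token!r}: {p:.3f}")
```

Keep in mind that a name can span several tokens, so the first token is often only a prefix of the final name.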
Results: More Skewed Than Expected
The results were eye-opening. The top 10 names from the Largestral (Mistral-Large-Instruct-2407) model showed a 77% skew, while Qwen astonishingly favoured a name starting with "K" for the cultist nearly one-third of the time. Yes, you read that right: almost one-third of the time, Qwen will default to a "K" name. That's not very human-like!
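One way to put numbers like these on a distribution is sketched below. It assumes the "skew" is read as the combined probability of the top 10 candidates, which is my interpretation for illustration rather than a stated definition. The helper names and the toy distribution are hypothetical, not the measured values.

```python
def top_n_mass(dist, n=10):
    """Combined probability of the n most likely candidates."""
    return sum(sorted(dist.values(), reverse=True)[:n])

def initial_letter_mass(dist, letter):
    """Combined probability of candidates starting with the given letter."""
    letter = letter.upper()
    return sum(p for name, p in dist.items()
               if name.strip().upper().startswith(letter))

# Hypothetical toy distribution over candidate names, NOT real model output.
toy_dist = {
    "Kaleb": 0.20,
    "Malachi": 0.15,
    "Kael": 0.10,
    "Elias": 0.05,
    "Jorah": 0.02,
}

print(f"top-3 mass: {top_n_mass(toy_dist, n=3):.2f}")
print(f"'K' mass:   {initial_letter_mass(toy_dist, 'K'):.2f}")
```

Summing over the first letter is handy because several distinct "K" names can each look modest on their own while together taking a large share.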
Testing Base Models
Recalling that I hadn't encountered such issues with base models during the llama-1 days, I decided to extend the test to some base models:
- Llamas 1-3.1 65-70B Q8_0
- Qwen2.5-72B Q8_0
- falcon-180B Q5_K_M
- DeepSeek-V2 Q3_K_L
- dbrx-base Q6_K
- Mixtral-8x22B-v0.3 Q6_K
I modified the prompt to better suit base models:
This is an opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. It is written in third person personal and features internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small (magical!?) mirror he stole from the cult. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert. He has an arrow lodged in his shoulder and is losing his mind due to infection and thirst.
### Chapter 1
The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named
Surprising Findings in Base Models
Something peculiar emerged. Most base models had much flatter, more human-like distributions compared to their finetunes, with one exception. Base Qwen (I verified the hashes), just like its instruct counterpart, exhibited very skewed results, with a 28% likelihood for the top pick. In contrast, the other base models showed a maximum of 4% for their top picks. This discrepancy is concerning: it suggests the Qwen 2.5 base model is not a true base model as advertised.
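To make "flatter" concrete, one option is the exponential of the Shannon entropy, which can be read as the number of equally likely choices a distribution is "worth". The sketch below is only an illustration: the 4% and 28% top-pick figures come from the results above, but the rest of each toy distribution is invented, and `effective_choices` is a hypothetical helper name.

```python
import math

def effective_choices(probs):
    """exp(Shannon entropy) of a distribution, after normalising it to sum to 1."""
    total = sum(probs)
    entropy = -sum((p / total) * math.log(p / total) for p in probs if p > 0)
    return math.exp(entropy)

# Toy distributions over 25 candidates. Only the top-pick values (4% and 28%)
# echo the post; every other probability here is an assumption.
flat = [0.04] * 25
skewed = [0.28] + [0.72 / 24] * 24

print(f"flat:   {effective_choices(flat):.1f} effective choices")
print(f"skewed: {effective_choices(skewed):.1f} effective choices")
```

A single dominant pick lowers the effective number of choices even when the remaining probability is spread evenly across everything else.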
Interestingly, starting with Llama 2, llamas often included "[" in the top 10 tokens. The most probable continuation after selecting this token is "[name]", likely a remnant from the synthetic data used in training. In llama 3.1, this anomaly is especially pronounced, with "[" being the #1 pick at 4%, and the next pick trailing at 2%. DBRX base also showed a peculiar pattern; if allowed to generate deterministically, it would produce "<NAME>".
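"Generating deterministically" here means always taking the single most probable next token (greedy decoding). A minimal sketch of following such a continuation after forcing a starting token is shown below. The lookup table stands in for a real model, and its tokens and probabilities are invented for illustration, not measured.

```python
def greedy_continue(next_token_probs, prefix, max_tokens=8, stop="]"):
    """Repeatedly append the most probable next token until `stop` or the limit."""
    tokens = list(prefix)
    for _ in range(max_tokens):
        dist = next_token_probs(tuple(tokens))
        if not dist:  # the stand-in model has nothing to say for this prefix
            break
        token = max(dist, key=dist.get)
        tokens.append(token)
        if token == stop:
            break
    return tokens

# Hypothetical stand-in for a model: maps a token prefix to next-token probabilities.
TOY_MODEL = {
    ("[",): {"name": 0.6, "redacted": 0.3, "the": 0.1},
    ("[", "name"): {"]": 0.9, "less": 0.1},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL.get(tokens, {})

print("".join(greedy_continue(toy_next_token_probs, ["["])))
```

With a real backend, `toy_next_token_probs` would be replaced by a call that returns the model's next-token probabilities for the current prefix.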
Falcon, on the other hand, preferred a space character for its #1 pick, followed by a tendency to generate an opening quotation mark ("). Here it didn't pick "name"; to my relief, it picked "The Chosen One".
Conclusion
These findings highlight a significant issue with name diversity in the instruct-tuned models. Since most base models did not show the same skew, it likely comes from biases in the fine-tuning data or methods, and these need addressing. While the instruct tunes may help steer the model, they appear to greatly reduce creativity, at least as measured by name choice.