Model Has Some Coherence, But Only Uses Single-Letter Tokens?
This model's sample prompt generated one coherent sentence, then later failed to generate another. But on examination of your vocab and tokenizer files, it seems the only useful vocabulary words/tokens in this model are the letters of the alphabet? The config file says `"vocab_size": 341`. Is that correct? What is the purpose of using only single-letter tokenization? Then again, another line in the code says you use Hugging Face's AutoTokenizer, so maybe you are actually using a different tokenization scheme beyond the 341-entry vocab?
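For reference, a quick way to check what the tokenizer actually does (the model id below is a placeholder for this repo's actual name):

```python
from transformers import AutoTokenizer

# "user/model" is a placeholder; substitute this repo's actual id
tok = AutoTokenizer.from_pretrained("user/model")
print(tok.vocab_size)          # should print 341 if it matches the config
print(tok.tokenize("hello"))   # e.g. ['h', 'e', 'l', 'l', 'o'] if character-level
```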
No, you are quite correct: the tokenizer only does the standard lower-case letters plus a 'shift' key, and it doesn't understand upper-case characters.
I did it that way as I couldn't work out any other way to encode easily while staying a valid tokenizer.
The bulk of the vocabulary is hex values and the special start/stop tokens.
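For anyone curious how a shift key can stand in for upper-case letters, here's a minimal sketch; the `<shift>` token name and the helper functions are illustrative, not the actual names in the tokenizer files:

```python
SHIFT = "<shift>"  # illustrative name; the real shift token may differ

def encode_case(text: str) -> list[str]:
    """Lower-case the text, inserting a shift marker before each
    character that was originally upper-case."""
    out: list[str] = []
    for ch in text:
        if ch.isupper():
            out.append(SHIFT)
            out.append(ch.lower())
        else:
            out.append(ch)
    return out

def decode_case(tokens: list[str]) -> str:
    """Inverse of encode_case: a shift marker upper-cases the next token."""
    out, shift = [], False
    for tok in tokens:
        if tok == SHIFT:
            shift = True
        elif shift:
            out.append(tok.upper())
            shift = False
        else:
            out.append(tok)
    return "".join(out)

print(encode_case("Tiny Stories"))
# ['<shift>', 't', 'i', 'n', 'y', ' ', '<shift>', 's', 't', 'o', 'r', 'i', 'e', 's']
print(decode_case(encode_case("Tiny Stories")))  # 'Tiny Stories'
```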
It's an experiment in forcing the model to work out all words from scratch, to see what happens.
So far it's trained on the TinyStories dataset and a similar one I made for longer samples that isn't released yet.
Try starting with only lower-case letters, as the auto casifier/decasifier isn't working.
...
Yet.