Model Has Some Coherence, But Only Uses Single-Letter Tokens?
This model's sample prompt generated one coherent sentence, then later failed to generate another. But on examination of your vocab and tokenizer files, it seems the only useful vocabulary words/tokens in this model are the letters of the alphabet? The config file says `"vocab_size": 341`. Is that correct? What is the purpose of using only single-letter tokenization? Then again, another line in the code says you use Hugging Face's AutoTokenizer, so maybe you are actually using a different tokenization scheme beyond the 341-entry vocab?
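For reference, a quick way to check what the tokenizer actually does (the model id below is a placeholder for this repo's actual name):

```python
from transformers import AutoTokenizer

# "user/model" is a placeholder; substitute this repo's actual id
tok = AutoTokenizer.from_pretrained("user/model")
print(tok.vocab_size)          # should print 341 if it matches the config
print(tok.tokenize("hello"))   # e.g. ['h', 'e', 'l', 'l', 'o'] if character-level
```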
No, you are quite correct: the tokenizer only does the standard lower-case letters plus a 'shift' key, and it doesn't understand upper-case characters.
I did it that way as I couldn't work out any other way to encode easily while staying a valid tokenizer.
The bulk of the vocabulary is hex values and the special start/stop tokens.
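For anyone curious how a shift key can stand in for upper-case letters, here's a minimal sketch; the `<shift>` token name and the helper functions are illustrative, not the actual names in the tokenizer files:

```python
SHIFT = "<shift>"  # illustrative name; the real shift token may differ

def encode_case(text: str) -> list[str]:
    """Lower-case the text, inserting a shift marker before each
    character that was originally upper-case."""
    out: list[str] = []
    for ch in text:
        if ch.isupper():
            out.append(SHIFT)
            out.append(ch.lower())
        else:
            out.append(ch)
    return out

def decode_case(tokens: list[str]) -> str:
    """Inverse of encode_case: a shift marker upper-cases the next token."""
    out, shift = [], False
    for tok in tokens:
        if tok == SHIFT:
            shift = True
        elif shift:
            out.append(tok.upper())
            shift = False
        else:
            out.append(tok)
    return "".join(out)

print(encode_case("Tiny Stories"))
# ['<shift>', 't', 'i', 'n', 'y', ' ', '<shift>', 's', 't', 'o', 'r', 'i', 'e', 's']
print(decode_case(encode_case("Tiny Stories")))  # 'Tiny Stories'
```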
It's an experiment in forcing the model to work out all words from scratch, to see what happens.
So far it's trained on the TinyStories dataset and a similar one I made for longer samples that isn't released yet.
Try starting with only lower-case letters, as the auto casifier/decasifier isn't working.
...
Yet.