Storing Spelling information in LLMs

#2
by MartialTerran - opened

Hi again.
I'm not sure how to direct message you on hf. So, this is just a comment on a topic that might be of interest to you.
Because you were building smallish LLMs with a single-letter token vocabulary, you forced the LLM to encode the spelling of each word using a small set of tokens (e.g., 27 tokens). The largest LLMs (Google Gemini 1.5) also evidently store extensive word-spelling information for each token/word in their vocabulary (or in a portion of their vocabulary). See the evidence in the ReadMeToo.md at my new post:

https://huggingface.co/datasets/MartialTerran/Eval_Counting_Letters_in_Words/tree/main
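
To make the idea concrete, here is a minimal sketch (my own toy illustration, not code from the linked dataset) of a ~27-token single-letter vocabulary; with a tokenizer like this the model sees spelling directly, so letter-level tasks no longer depend on memorized per-token spelling:

```python
# Toy character-level tokenizer: 26 letters + space = 27 tokens.
CHARS = "abcdefghijklmnopqrstuvwxyz "
STOI = {ch: i for i, ch in enumerate(CHARS)}   # char -> token id
ITOS = {i: ch for ch, i in STOI.items()}       # token id -> char

def encode(text: str) -> list[int]:
    """Lower-case the text and map each known character to its token id."""
    return [STOI[ch] for ch in text.lower() if ch in STOI]

def decode(ids: list[int]) -> str:
    """Map token ids back to characters."""
    return "".join(ITOS[i] for i in ids)

if __name__ == "__main__":
    ids = encode("Strawberry")
    print(ids)                     # one id per letter
    print(decode(ids))             # "strawberry"
    print(decode(ids).count("r"))  # 3 -- letter counting is trivial at character level
```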

P.S. A new open-sourced Llama-type LLM called SmolLM2 has been mega-trained on trillions of tokens, yet has fewer than 2B parameters. It is said to have high language coherence. Maybe check it out and evaluate it on HF, then download it for local operation, or fire up a PEFT finetuning setup on your PC to see if you can get it configured to do what you want it to do.
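
If you do fire up a PEFT run, a minimal LoRA setup might look roughly like the sketch below (assuming the transformers and peft packages; the hyperparameters and target modules are illustrative guesses, not a tested recipe):

```python
# Hedged sketch of a local PEFT/LoRA finetuning setup for SmolLM2-1.7B.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolLM2-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the base model with low-rank adapters so only a small fraction of the
# 1.7B parameters is trained, which keeps the job feasible on a local PC/GPU.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check before plugging into a Trainer
```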

Thank you. I have played with SmolLM but not finetuned it, and really should, but work, work, work really gets me down at times.

Hope your projects are leading to results, or satisfaction.

Hi. 
This Dynamically_Reducing_Logit_Computation proposal (comparable to your logit reduction via hard-coded vocabulary reduction before pretraining) is a serious idea that might be compatible with, or relevant to, your small-token-set research: https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs
[As currently described, the method has no impact on input tokens or on the pretrained model's vocabulary size.] It might actually be patentable.
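
For readers skimming that link, here is a toy sketch of the general computational saving (my own illustration of the concept, not the method as written up there): only the output-projection rows for a dynamically chosen candidate subset are scored, instead of the full vocabulary.

```python
# Toy illustration: compute logits only for a candidate subset of the vocabulary
# by slicing the lm_head weight matrix, instead of projecting onto every token.
import torch

def reduced_logits(hidden: torch.Tensor,
                   lm_head_weight: torch.Tensor,
                   candidate_ids: torch.Tensor) -> torch.Tensor:
    """
    hidden:          (d_model,)        final hidden state at the next-token position
    lm_head_weight:  (vocab, d_model)  full output projection matrix
    candidate_ids:   (k,)              indices of the reduced candidate vocabulary
    returns:         (k,)              logits for the candidate tokens only
    """
    sub_weight = lm_head_weight[candidate_ids]   # (k, d_model) with k << vocab
    return sub_weight @ hidden                   # k dot products instead of vocab-many

# Example with a deliberately small toy vocabulary (far below the ~150k discussed above).
vocab, d_model = 50_000, 1024
hidden = torch.randn(d_model)
lm_head = torch.randn(vocab, d_model)
candidates = torch.tensor([11, 42, 1337, 25_000, 49_999])
print(reduced_logits(hidden, lm_head, candidates).shape)  # torch.Size([5])
```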

This Self-Aware_LLM_bootup is currently a "thought experiment" (sci-fi) that I had, and I think it could have some practical applications: https://huggingface.co/MartialTerran/Self-Aware_LLM_bootup_sequence

This is an AI-enhanced thought salad published to inspire (or hinder) those who have the capacity/resources to undertake such a massive AGI build: https://huggingface.co/MartialTerran/Artificial_General_Super_Intelligence_LLM

A spin-off derived from the above post, which has generated plausible Python code (modifying state-of-the-art GPTs), has not been published yet.

MOST OF MY TIME is spent dealing with certain emergent real-world problems, as indicated in this hackathon entry: https://devpost.com/software/ai-decision-clerk1

In terms of developing API LLM apps, I am favoring Google Gemini 1.5 Pro API models and focusing on solving real-world problems, as illustrated in: https://devpost.com/software/ai-decision-clerk1 [But I am also keeping an eye out for downloadable models with sufficient capacity for at least local inference to support my apps.]
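
For reference, a minimal Gemini 1.5 Pro call with the google-generativeai package looks roughly like this (a sketch only; the prompt is a made-up stand-in for the Decision Clerk use case):

```python
# Hedged sketch of a minimal Gemini 1.5 Pro API call; setup and prompt are illustrative.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes the key is set in the environment
model = genai.GenerativeModel("gemini-1.5-pro")

# Toy prompt in the spirit of the AI Decision Clerk entry: summarize a filing.
response = model.generate_content(
    "Summarize the key holdings of the attached municipal board decision in three bullet points."
)
print(response.text)
```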

Compare:  AI Legal Assistant (India)   https://devpost.com/software/ai-legal-assistant
See also https://kowallawgroup.com/should-ai-replace-law-clerks-yes-says-adam-unikowsky/ 
https://adamunikowsky.substack.com/p/in-ai-we-trust-part-ii

AI lawyers could wind up democratizing law and making legal services available to people who otherwise wouldn't have access.
https://www.themarshallproject.org/2024/02/10/ai-artificial-intelligence-attorney-court

Versus commercialized products marketed to attorneys: The world’s first generative AI legal assistant is a year old!  https://casetext.com/blog/cocounsel-first-generative-ai-legal-assistant-one-year/ 

https://www.kingselab.org/blog/hackathon-precedent-ai 

I would like to download and experiment/tinker with SmolLM2 (1.7B), since it has a small parameter set, can be trained/tuned on a local PC, and has fairly high coherence. But since it is not easily modified, and no standalone SmolLM2_model.py and SmolLM2_tokenizer.py have been published, it is practically inaccessible to me. The deficiencies of SmolLM2 (1.7B) include: it has an over-150,000-token vocabulary (see my proposal
at https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs ); it is unnecessarily trained on multiple coding languages (diluting its parameters); and the Hugging Face makers have not published a standalone SmolLM2_model.py and SmolLM2_tokenizer.py that operate independently of the Hugging Face "transformers" library and its inflexible AutoTokenizer (which you struggled to overcome). Thus actual experimentation/tinkering/development is hindered and frustrated. See all of my remarks at https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B/discussions
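
In the meantime, one partial workaround I would consider (a sketch only; it sidesteps AutoTokenizer but is no substitute for a standalone model.py) is loading the repo's tokenizer.json directly with the lower-level tokenizers package:

```python
# Hedged sketch: inspect the SmolLM2 tokenizer without going through AutoTokenizer.
# Assumes tokenizer.json has been downloaded from the HuggingFaceTB/SmolLM2-1.7B repo.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("How many r's are in strawberry?")
print(enc.ids)               # token ids
print(enc.tokens)            # the subword strings, showing how words get split
print(tok.decode(enc.ids))   # round-trip back to text
```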
