Model Card for NinjaMouse-2.4B-32L-danube

❗ This model gives up when the input reaches a critical mass of about tree fiddy thousand tokens

It may be an issue with the base danube model as it does the exact same thing, but H2O.ai released another version of it. Danube2 theoretically has a smaller context window while being larger in practice. I have tested it, and it works great even up to 8k. It's already training in the dojo. Stay tuned if you like more silly models like this.
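Until the context issue is sorted, one way to stay on the safe side is to clip long inputs before they reach the model. The sketch below is only an illustration: it reads "tree fiddy thousand" as roughly 3,500 tokens, and `clip_to_budget`, `SAFE_INPUT_TOKENS`, and the choice to keep the tail of the text are all assumptions, not something the model ships with.

```python
from typing import Callable, List

# Assumption: "about tree fiddy thousand" is read as roughly 3,500 tokens.
SAFE_INPUT_TOKENS = 3500


def clip_to_budget(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    budget: int = SAFE_INPUT_TOKENS,
) -> str:
    """Return text unchanged if it fits the budget, else keep only the last `budget` tokens."""
    ids = encode(text)
    if len(ids) <= budget:
        return text
    # Slice from an explicit start index so that budget=0 yields an empty list.
    return decode(ids[len(ids) - budget:])
```

With a Hugging Face tokenizer, `encode` could be `lambda t: tokeniser.encode(t, add_special_tokens=False)` and `decode` could be `tokeniser.decode`. Keeping the tail rather than the head is a judgment call that suits chat-style inputs where the latest text matters most.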

A lanky version of h2o-danube's tiny language model, stretched from 24 layers to 32. I have done this in steps, adding 2 new layers per step and training them on different datasets. This seems to have made it a quick learner, and it easily fits on an 8GB GPU for finetuning using Unsloth for optimizations. This model is designed to be a gateway into bigger language models.

This model is sponsored by Advanced Vintage Memeatics. A powerful dopaminergic with ties to the Holy Roman Empire, the ghost of Richard Feynman, a radiator from the Radiator planet, and the gods-defying Babel Fish. Consult your shaman before use. If their voodoo is strong you can find the even longer and even more uncut 3B model here.

Model Details

Model Description

Two of the datasets I used to train this model were WhiteRabbitNeo chapter 1 and chapter 2, thereby agreeing to their extended Apache 2 license. If you use this model, or a derivative, you really have to read their terms and agree with them, which is an easy task as they are quite reasonable. They could also be called the "Don't be a dick" clause (see out of scope section).

With the important things covered, let us cover the model.

I wanted a model that could construct Stable Diffusion prompts without the "trending on artstation, 8k uhd, dramatic lighting, detailed, masterpiece" spam. My solution got a little out of hand when trying out deep block expansion, DoRA and QLoRA on the TinyLlama model, which failed but led to this. It has a natty 16k context window, can be trained using Unsloth, and seems to be a lot more coherent than both TinyLlama and Phi2.

My thoughts going into this were "If I use WRN in the training I get to call it something related to The Matrix" and "These Stable Diffusion prompt datasets need Geepus." After weeks of looking intensely at small numbers decreasing by very small amounts, I present to you a tiny language model that can generate image prompts and has a funny name.

Developed by: Trolle Karlsson (Pseudonym Anonymous)
Model type: Mistral
Language(s) (NLP): English
License: Apache-2.0 + WhiteRabbitNeo Extended Version
Finetuned from model: h2o-danube

Uses

Imagine having a model going through an entire book, page by page, creating SDXL prompts for the highlights. I want that! I would think that such a task would require some solid training data which I do not have. What I do have is my own set of about 700 instructions ranging from "write an SD(XL) prompt where something, something, something dark side shit is going on" through "Convert this image prompt from SD to SDXL" to "Inspiration: crocs."

The small size of the model, the diverse open datasets used in training, and the large context size could be great for RAG applications, but that is also the reason that additional finetuning is sort of required for this model to work in a consistent manner.

I think SOLAR and Llama Pro show us that our current models benefit from being stretched a bit. That quantization works at all also implies that our models are too precise. However, rounding errors can introduce unforeseen bugs like suddenly being unable to spell, or in my case where the responses became #'s. It might have been brainfuck, but I barely write Python. Use vanilla at your own risk.

Direct Use

Here is what I can do with Stable Diffusion text prompts:

Make SD image prompts by asking it nicely
Transform those from SD to SDXL and back
Improve prompts by removing legacy tags
Inspire from only a single word
TODO: Story/Lyrics to image prompt
TODO: Reverse image prompt (for further dataset development reasons)
TODO: Have all the other stuff work, beside SD prompting, at most temperatures.
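As a rough illustration of how these tasks could be phrased as chat messages, here is a small sketch. The system prompt is the one from the getting-started snippet in this card, and the two instructions are quoted from the Uses section. The helper name `build_messages`, the example image prompt, and the way the instruction and the prompt are joined are assumptions; the actual format of the training data is not published.

```python
from typing import Dict, List

# System prompt copied from the getting-started snippet in this card.
SYSTEM_PROMPT = "You are a very clever and helpful AI assistant called NinjaMouse."


def build_messages(instruction: str, image_prompt: str = "") -> List[Dict[str, str]]:
    """Wrap an instruction (and optionally an existing image prompt) as chat messages."""
    # Assumption: instruction and prompt are simply separated by a blank line.
    user_content = instruction if not image_prompt else f"{instruction}\n\n{image_prompt}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Both instructions are quoted from the Uses section; the image prompt is made up.
convert = build_messages(
    "Convert this image prompt from SD to SDXL",
    "a lighthouse at dusk, oil painting, muted colours",
)
inspire = build_messages("Inspiration: crocs.")
```

Either list can be passed to the tokenizer's `apply_chat_template` in place of `messages` in the getting-started code.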

Downstream Use

This isn't even the final form, 40 layers will be. Adding more at that point is just silly. By then it will have gone from 1.8B parameters to 3B. You can expand it even further, and I would like to know the results. I urge you to do your own training though. Small language models are prone to going off the rails and doing whatever.
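The 3B figure can be sanity-checked with back-of-the-envelope arithmetic using only numbers from this card: 1.8B parameters at 24 layers and 2.4B at 32 layers. The sketch assumes every added layer costs the same, so the count grows linearly with depth; `estimate_params_b` is a made-up helper name.

```python
# Figures taken from this card: 1.8B parameters at 24 layers, 2.4B at 32 layers.
BASE_LAYERS, BASE_PARAMS_B = 24, 1.8
CURRENT_LAYERS, CURRENT_PARAMS_B = 32, 2.4

# Assumption: every added layer costs the same, so growth is linear in depth.
PARAMS_PER_LAYER_B = (CURRENT_PARAMS_B - BASE_PARAMS_B) / (CURRENT_LAYERS - BASE_LAYERS)


def estimate_params_b(layers: int) -> float:
    """Rough parameter count in billions for a given layer count."""
    return BASE_PARAMS_B + (layers - BASE_LAYERS) * PARAMS_PER_LAYER_B


for n in (24, 32, 40):
    print(f"{n} layers: ~{estimate_params_b(n):.2f}B")
```

That works out to about 0.075B per layer, which puts 40 layers at roughly 3B.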

Out-of-Scope Use

Do NOT use this model, or any derivatives, to be an ass.

You agree not to use the Model or Derivatives of the Model:

-	In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party; 
-	For military use in any way;
-	For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; 
-	To generate or disseminate verifiably false information and/or content with the purpose of harming others; 
-	To generate or disseminate inappropriate content subject to applicable regulatory requirements;
-	To generate or disseminate personal identifiable information without due authorization or for unreasonable use; 
-	To defame, disparage or otherwise harass others; 
-	For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation; 
-	For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics; 
-	To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm; 
-	For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

I do however want you to explore the realms of extreme language compression. AGI is not that far away. Even if the required compute takes time to allocate, DistilAGI or something similar would surface.

Bias, Risks, and Limitations

There are some sultry prompts in my proprietary dataset, but I'm not high enough on the spectrum to delve into Pony prompting. Filtering SD datasets from space worms and worse took its toll.

I am hesitant to upload my dataset because of that. I also feel that it, even though it's only about 700 samples, makes the responses a bit weird. That could also stem from the reddit writing prompts. Let's get professor Tegmark on the case!

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_name = "trollek/NinjaMouse-2.4B-32L-danube"

tokeniser = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokeniser,
    device=0,
)

system_prompt = "You are a very clever and helpful AI assistant called NinjaMouse."  
intro_prompt = "Please introduce yourself."
messages = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": intro_prompt,
    },
]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipeline(prompt, max_new_tokens=512, do_sample=True, temperature=0.65, top_k=45, top_p=0.90)

print(outputs[0]["generated_text"])

Training Details

Training Data

The datasets I've used to train this model are diverse. When expanding, the two new layers of each step are sandwiched in as the middle and last layer. The datasets are the following:

Step 1: 24->26

LDJnr/Capybara
vicgalle/alpaca-gpt4
jondurbin/airoboros-3.2
teknium/GPTeacher-General-Instruct
WhiteRabbitNeo/WRN-Chapter-1
WhiteRabbitNeo/WRN-Chapter-2

Step 1.5

Mihaiii/OpenHermes-2.5-1k-longest-curated

Notes: As I understand, for models to fully use their context windows they have to be trained on long texts. I suppose when this one gets to a size around 3B it will have an easier time with long complex texts.

Step 2: 26->28

abacusai/SystemChat
TIGER-Lab/MathInstruct
reinforz/question_generation_data

Step 3: 28->30

euclaise/WritingPrompts_curated (heavily filtered - 6k)
HuggingFaceTB/cosmopedia-100k (textbooks and stories)
derek-thomas/squad-v1.1-t5-question-generation
dim/roleplay_instruct_v2_final

Notes: This step was somewhat of a failure. I see good results when aiming for a training loss in the range of .5-.9 but this type of writing is not within its grasp yet.

Step 4: 30->32

m-a-p/Code-Feedback
m-a-p/CodeFeedback-Filtered-Instruction
glaiveai/glaive-code-assistant-v2
Toolcall 10k

Step 4.5

I figured a last once-over with the datasets from step 1 and the SD one wouldn't hurt.
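The layer bookkeeping for steps 1 to 4 can be sketched with plain lists. This only tracks where the new layers sit in the stack (one in the middle, one at the end); it does not touch real weights. The helper `expand_by_two`, the labels, and the exact index used for "the middle" are assumptions for illustration.

```python
from typing import List


def expand_by_two(layers: List[str], tag: str) -> List[str]:
    """Return a new stack with one extra layer in the middle and one at the end."""
    mid = len(layers) // 2  # assumption: "middle" means the halfway index
    return layers[:mid] + [f"{tag}-mid"] + layers[mid:] + [f"{tag}-last"]


# Start from the 24 layers of the base model and apply steps 1 through 4.
stack = [f"base-{i}" for i in range(24)]
for step in range(1, 5):
    stack = expand_by_two(stack, f"step{step}")
    print(f"after step {step}: {len(stack)} layers")
```

Each pass adds two entries, so the stack goes 24, 26, 28, 30, 32 in line with the steps listed above.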

Training Procedure

I'll be honest with you splendid folks. This has taken a great deal of waiting patiently for Llama Factory to do what can be considered magic, but also a 4060Ti and a lot of time clicking around this site to find that datasets like Capybara are considered of the highest quality. With that, and "baby steps" in mind I selected training data that emulated our own knowledge progression. From right after the "what if dirt tastes amazing though?" stage of our life.

We learn to speak, we relate the 10 wiggly appendages on our hands to a number system, and finally we stare in awe at electrons defying time. With that said; Cosmopedia was a hassle. I strive for a training loss of <1, and Cosmopedia+WritingPrompt+Question Generation would not get below 1.1. That was in the step where I expanded the model from 28 to 30 layers. It started with a loss of 2, so it wasn't all bad.

Preprocessing

If a dataset is not compatible with Llama Factory I just open up a new Jupyter Notebook and work my way through analyzing the data and formatting it in a way I can use. For the writing prompts I filtered out subreddit mentions and weird formatting like "[WP]", "*********", "_______", and filtered by upvotes. I did the same for Cosmopedia-100k by removing reddit and children's stories.
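A filter along these lines could look like the sketch below. It is one possible reading of the description, not the notebook that was actually used: the field names `text` and `score`, the upvote threshold, the regular expressions, and the choice to drop samples with subreddit mentions while stripping the formatting debris are all assumptions.

```python
import re
from typing import Dict, Iterable, List

# Assumption: field names "text" and "score" and this threshold are illustrative only.
MIN_UPVOTES = 100

TAG_RE = re.compile(r"\[\s*[A-Za-z]{2}\s*\]")    # "[WP]" and similar two-letter tags
RULE_RE = re.compile(r"[*_]{3,}")                 # runs like "*********" or "_______"
SUBREDDIT_RE = re.compile(r"(?<!\w)/?r/\w+", re.IGNORECASE)  # "r/name" or "/r/name"


def clean_text(text: str) -> str:
    """Strip tags and separator runs, then tidy up leftover spacing."""
    text = TAG_RE.sub("", text)
    text = RULE_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


def filter_samples(samples: Iterable[Dict], min_upvotes: int = MIN_UPVOTES) -> List[Dict]:
    """Drop low-scoring samples and subreddit mentions; clean the text of the rest."""
    kept = []
    for sample in samples:
        if sample["score"] < min_upvotes:
            continue
        if SUBREDDIT_RE.search(sample["text"]):
            continue
        kept.append({**sample, "text": clean_text(sample["text"])})
    return kept
```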

Training Hyperparameters

Training regime:
- LR: 0.0001-0.0004
- LR Scheduler: cosine
- Warmup: 1%
- Batch size: 2-4
- Gradient accumulation: 2-4
- Epochs: 2-8
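To get a feel for what a cosine schedule with 1% warmup does to the learning rate, here is a small standalone sketch. It assumes linear warmup followed by cosine decay to zero, which is the usual shape; the exact scheduler implementation used in training is not stated here. The peak of 2e-4 is just a value picked from the range above, and `lr_at` and `effective_batch` are hypothetical helpers.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4, warmup_ratio: float = 0.01) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Samples seen per optimizer step on a single GPU."""
    return batch_size * grad_accum


for step in (0, 5, 10, 500, 1000):
    print(step, f"{lr_at(step, total_steps=1000):.6f}")
```

With the batch size and gradient accumulation ranges above, the effective batch lands somewhere between 4 and 16 samples per optimizer step.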

I yolo the parameters from educated guesstimates. Reading through some of the code for Unsloth and Huggingface I got an idea of what to write after -- in the terminal. --help also helps out a lot.

Roadmap (The Olsenbanden plan)

32->34

Logic: Synthia, AutoCoT, STEM (from CamelAI)

34->36

Math: MetaMath

36->38

Writing/Translation: Writing Prompts, Cosmopedia, xP3x (Danish question and command translations)

Notes: Before starting to finetune on a Danish dataset like danish-OpenHermes I would like to try out teaching it translation tasks first. I still feel that when modelling algorithms after our own brains, we can think of NNs in the same way we do with our own meaty prediction machines. Motion -> Language -> Lies -> Math/Reason -> Life is the pathway I'm trying with this model, but without the robotics.

38->40

Repetition: CodeFeedback, question gen, RAG (perhaps LongAlpaca), Roleplay

MoE Rodents

This is inspired by Beyonder. I think that carefully selected positive prompting and a finetune on a large diverse dataset like Hercules, Hyperion, or OpenHermes will make a big difference when using the same model x4. 4 trained individually seems to work, but being able to test hypotheses on a smaller scale would be great to see more of.

4x3B NinjaMice

Expert 1: Chat/Assist

Expert 2: Programming

Expert 3: Writing/Creative

Expert 4: Reason/Math
