bunch of questions
Hi, since loading is done in 4 bits anyway, wouldn't it be best to change the model id to `unsloth/llama-3-8b-Instruct-bnb-4bit` rather than `unsloth/llama-3-8b-Instruct`? (I've tried this on my system and it works fine with just the name change. The only note is that it uses `bfloat16`, and `transformers` prioritises that over the `bnb_config` we have, so no `float16`.)
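For reference, a minimal sketch of the two loading paths I'm comparing (model ids from above; the exact `BitsAndBytesConfig` values are my assumptions, not necessarily the repo's settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Option 1: quantize the full-precision checkpoint on the fly with a bnb config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed
    bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Option 2: load the pre-quantized repo directly; its stored quantization config
# (bfloat16 compute) takes priority over a passed-in bnb_config, as noted above
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    device_map="auto",
)
```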
Also, sorry for nitpicking: I was looking at the `torch.inference_mode()` docs, and it might be best to use the decorator form `@torch.inference_mode()` on the function.
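Something like this, where the decorator replaces wrapping the body in a `with` block (the function here is just a placeholder):

```python
import torch

@torch.inference_mode()  # same effect as `with torch.inference_mode():` around the whole body
def generate_caption(model, inputs):
    # no autograd tracking inside here, slightly faster than torch.no_grad()
    return model.generate(**inputs, max_new_tokens=128)
```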
Question: Why do you add 2 `image_token_index`'s if a `bos` token is found (in `tokenizer_image_token`)?
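For context, this is the LLaVA-style pattern I'm referring to (a sketch of that reference implementation, which I'm assuming this repo follows; it may differ here):

```python
def tokenizer_image_token(prompt, tokenizer, image_token_index):
    # split the prompt on the image placeholder and tokenize each text chunk
    chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]

    def insert_separator(xs, sep):
        # interleave the chunks with the separator: [c0, sep, c1, sep, ..., cN]
        return [el for pair in zip(xs, [sep] * len(xs)) for el in pair][:-1]

    input_ids, offset = [], 0
    if len(chunks) > 0 and len(chunks[0]) > 0 and chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(chunks[0][0])  # keep the bos once

    # when a bos is found the separator is 2 image tokens, but x[offset:] then
    # skips the first element of every piece, so only one image token survives
    for x in insert_separator(chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    return input_ids
```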
Question: I'm not super familiar with Python packaging naming; is `__main__` there so that, if it were packaged, it's the first thing to be executed?
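(For anyone else reading, this is the standard guard I mean:)

```python
# demo.py
def main():
    print("running the demo")

if __name__ == "__main__":  # runs only when executed directly, not when imported as a module
    main()
```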
Question: Why use the second-to-last hidden layer (which is the third-to-last layer overall, I think?) instead of the last layer?
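Concretely, I mean a selection along these lines (a sketch; the variable names are my assumptions):

```python
# run the vision tower and keep every layer's hidden states
outputs = vision_tower(pixel_values, output_hidden_states=True)

# hidden_states[-1] is the last layer; the penultimate one is used as the image features
image_features = outputs.hidden_states[-2]
```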
Question: Also, why is a single column removed from the `image_features`?
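i.e. the slice that drops the first token position, which for a CLIP-style encoder would be the [CLS] token (again a sketch of the pattern, not necessarily the exact code):

```python
# drop position 0 of every sequence: for CLIP this is the [CLS] token,
# but SigLIP has no [CLS] token, which is what this question is about
image_features = image_features[:, 1:]
```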
Question: I'd be really interested to learn how the training was done. Even in the AnyMAL paper, at least a LoRA is needed to pass on the learnt multimodal relation, while this release is just the projection layers, right?
https://huggingface.co/blog/vlms
https://huggingface.co/blog/AviSoori1x/seemore-vision-language-model
Answer (my own guess): The projection is trained alone with SigLIP and Llama 3 frozen. Finetuning isn't done because then the LLM is unfrozen and it would have to be included, right?
Question: If I'm correct, the final image embeddings are 4096-dimensional instead of 1152-dimensional, right?
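That is, I'm picturing a projector along these lines (a sketch under my assumptions: SigLIP so400m hidden size 1152, Llama 3 8B hidden size 4096; the real projector layout may differ):

```python
import torch.nn as nn

# maps each SigLIP patch embedding (dim 1152) into Llama 3's embedding space (dim 4096);
# a single Linear is the simplest form, LLaVA-style models often use a small MLP like this
projector = nn.Sequential(
    nn.Linear(1152, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
# image_features: (batch, num_patches, 1152) -> (batch, num_patches, 4096)
```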
Finally, I'd like to thank you for such a powerful release in literally 40 MB. My 500 GB SSD thanks you.
I understand this is a lot to process, so please feel free to answer whichever questions you want at whatever time you please, and thanks again.
thanks for writing this:
- for the nitpicks, you're correct and we'll fix those, but since this is an alpha we might keep a full bf16 model as we expect current users to be able to modify the quantization config or switch model (to maybe test finetunes of llama later, we'll see)
- Q1: we'll have to check but normally this shouldn't be the case
- Q2: yes, `__main__` is just the naming if it were in a module but doesn't matter much for now
- Q3: this is due to testing in different papers (i think llava but i'm not sure, will have to check) finding that using the penultimate layer of an encoder is better
- Q4: this works for a CLIP model, as we want to remove the [CLS] token, but it's an oversight for SigLIP: we forgot to remove that line, so thanks for finding it. don't ship at 2AM!
- Q5: the final embedding dimension size of the image features is 4096 as it is mapped to llama's embedding space
also your own answer is correct, this is just pretraining not pretraining+finetuning
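for the curious, the setup is along these lines (a rough sketch, the attribute names `vision_tower` / `language_model` / `mm_projector` are illustrative, not the exact code):

```python
import torch

# freeze both towers; only the projection layers get gradients during pretraining
for p in model.vision_tower.parameters():    # SigLIP encoder
    p.requires_grad = False
for p in model.language_model.parameters():  # Llama 3
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.mm_projector.parameters(), lr=1e-3)  # lr is a placeholder
```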
Thanks so much for taking the time, one last question and I'm done haha.
In regards to Q4, what effect do you think removing that has on the model? It removes only 1 token of information out of 729 right?
Awesome that we could both improve from this. Thanks again for helping me increase my knowledge.
> In regards to Q4, what effect do you think removing that has on the model? It removes only 1 token of information out of 729 right?
yes realistically barely any info is lost, but if a detail is in that certain patch we don't want to lose it haha