bunch of questions
Hi, since loading is done in 4 bits anyway, wouldn't it be best to change the model id to `unsloth/llama-3-8b-Instruct-bnb-4bit` rather than `unsloth/llama-3-8b-Instruct`? (I've tried this on my system and it works fine with just the name change. The only note is that it uses `bfloat16`, and `transformers` prioritises that over the `bnb_config` we have, so no `float16`.)
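For reference, a minimal sketch of the two loading paths I'm comparing (model ids from above; the exact `BitsAndBytesConfig` values are my assumptions, not necessarily the repo's settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Option 1: quantize the full-precision checkpoint on the fly with a bnb config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumed
    bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Option 2: load the pre-quantized repo directly; its stored quantization config
# (bfloat16 compute) takes priority over a passed-in bnb_config, as noted above
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    device_map="auto",
)
```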
Also, sorry for nitpicking: I was looking at the `torch.inference_mode()` docs, and it might be best to use the decorator form `@torch.inference_mode()` on the function.
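Something like this, where the decorator replaces wrapping the body in a `with` block (the function here is just a placeholder):

```python
import torch

@torch.inference_mode()  # same effect as `with torch.inference_mode():` around the whole body
def generate_caption(model, inputs):
    # no autograd tracking inside here, slightly faster than torch.no_grad()
    return model.generate(**inputs, max_new_tokens=128)
```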
Question: Why do you add 2 `image_token_index`'s if a `bos` token is found (in `tokenizer_image_token`)?
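For context, this is the LLaVA-style pattern I'm referring to (a sketch of that reference implementation, which I'm assuming this repo follows; it may differ here):

```python
def tokenizer_image_token(prompt, tokenizer, image_token_index):
    # split the prompt on the image placeholder and tokenize each text chunk
    chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]

    def insert_separator(xs, sep):
        # interleave the chunks with the separator: [c0, sep, c1, sep, ..., cN]
        return [el for pair in zip(xs, [sep] * len(xs)) for el in pair][:-1]

    input_ids, offset = [], 0
    if len(chunks) > 0 and len(chunks[0]) > 0 and chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(chunks[0][0])  # keep the bos once

    # when a bos is found the separator is 2 image tokens, but x[offset:] then
    # skips the first element of every piece, so only one image token survives
    for x in insert_separator(chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])
    return input_ids
```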
Question: I'm not super familiar with Python packaging naming; is `__main__` there so that, if it were packaged, it's the first thing to be executed?
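(For anyone else reading, this is the standard guard I mean:)

```python
# demo.py
def main():
    print("running the demo")

if __name__ == "__main__":  # runs only when executed directly, not when imported as a module
    main()
```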
Question: Why use the second-to-last hidden layer (which is the third-to-last layer overall, I think?) instead of the last layer?
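Concretely, I mean a selection along these lines (a sketch; the variable names are my assumptions):

```python
# run the vision tower and keep every layer's hidden states
outputs = vision_tower(pixel_values, output_hidden_states=True)

# hidden_states[-1] is the last layer; the penultimate one is used as the image features
image_features = outputs.hidden_states[-2]
```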
Question: Also, why is a single column removed from the `image_features`?
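i.e. the slice that drops the first token position, which for a CLIP-style encoder would be the [CLS] token (again a sketch of the pattern, not necessarily the exact code):

```python
# drop position 0 of every sequence: for CLIP this is the [CLS] token,
# but SigLIP has no [CLS] token, which is what this question is about
image_features = image_features[:, 1:]
```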
Question: I'd be really interested to learn how the training was done. Even in the AnyMAL paper, at least a LoRA is needed to pass on the learnt multimodal relation, while this release is just the projection layers, right?
https://huggingface.co/blog/vlms
https://huggingface.co/blog/AviSoori1x/seemore-vision-language-model
Answer (my own guess): The projection is trained alone with SigLIP and Llama 3 frozen. Finetuning isn't done because then the LLM is unfrozen and it would have to be included, right?
Question: If I'm correct, the final image embeddings are 4096-dimensional instead of 1152-dimensional, right?
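That is, I'm picturing a projector along these lines (a sketch under my assumptions: SigLIP so400m hidden size 1152, Llama 3 8B hidden size 4096; the real projector layout may differ):

```python
import torch.nn as nn

# maps each SigLIP patch embedding (dim 1152) into Llama 3's embedding space (dim 4096);
# a single Linear is the simplest form, LLaVA-style models often use a small MLP like this
projector = nn.Sequential(
    nn.Linear(1152, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
# image_features: (batch, num_patches, 1152) -> (batch, num_patches, 4096)
```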
Finally, I'd like to thank you for such a powerful release in literally 40 MB. My 500 GB SSD thanks you.
I understand this is a lot to process, so please feel free to answer whichever questions you want at whatever time you please, and thanks again.
thanks for writing this:
- for the nitpicks, you're correct and we'll fix those, but since this is an alpha we might keep a full bf16 model as we expect current users to be able to modify the quantization config or switch model (to maybe test finetunes of llama later, we'll see)
- Q1: we'll have to check but normally this shouldn't be the case
- Q2: yes, `__main__` is just the naming if it were in a module but doesn't matter much for now
- Q3: this is due to testing in different papers (i think llava but i'm not sure, will have to check) finding that using the penultimate layer of an encoder is better
- Q4: this works for a CLIP model, as we want to remove the [CLS] token, but it's an oversight for SigLIP: we forgot to remove that line, so thanks for finding it. don't ship at 2AM!
- Q5: the final embedding dimension size of the image features is 4096 as it is mapped to llama's embedding space
also your own answer is correct, this is just pretraining not pretraining+finetuning
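for the curious, the setup is along these lines (a rough sketch, the attribute names `vision_tower` / `language_model` / `mm_projector` are illustrative, not the exact code):

```python
import torch

# freeze both towers; only the projection layers get gradients during pretraining
for p in model.vision_tower.parameters():    # SigLIP encoder
    p.requires_grad = False
for p in model.language_model.parameters():  # Llama 3
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.mm_projector.parameters(), lr=1e-3)  # lr is a placeholder
```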
Thanks so much for taking the time, one last question and I'm done haha.
In regards to Q4, what effect do you think removing that has on the model? It removes only 1 token of information out of 729 right?
Awesome that we could both improve from this. Thanks again for helping me increase my knowledge.
> In regards to Q4, what effect do you think removing that has on the model? It removes only 1 token of information out of 729 right?
yes realistically barely any info is lost, but if a detail is in that certain patch we don't want to lose it haha