---
language:
- en
library_name: transformers
base_model: google/paligemma-3b-pt-224
pipeline_tag: visual-question-answering
inference: false
tags:
- paligemma
- coffe
- caption
license: mit
---

# Model Card for Fer14/paligemma_coffee_machine_caption

Google's PaliGemma VLM (Vision Language Model), fine-tuned to generate structured captions for images of coffee machines.

### Model Description

- **Developed by:** Komorebi AI
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** google/paligemma-3b-pt-224
- **Demo:** https://huggingface.co/spaces/Fer14/coffe_machine_caption

## Usage

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model_id = "Fer14/paligemma_coffee_machine_caption"

# Load the fine-tuned model and its matching processor.
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Replace the placeholder with the path to your own image.
image = Image.open("path to your image").convert("RGB")

# The prompt describes the caption template and the meaning of each field.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

output = model.generate(**inputs, max_length=1000)

# The decoded text starts with the prompt, so slice it off to keep only the caption.
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt):]
```

### Framework versions

- PEFT 0.11.1
- Transformers 4.41.2
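
### Parsing the generated caption

Because the prompt asks for a fixed caption structure, the output can be split back into its fields (color, type, accessories, shape, screen, number, b_color). The following is a minimal sketch, not part of the model or of `transformers`: `parse_caption` and `CAPTION_RE` are hypothetical names, the example caption is made up for illustration, and the regular expression assumes the model follows the template closely, with a single-word color and shape. Real outputs may deviate, in which case the function returns `None`.

```python
import re
from typing import Optional

# Hypothetical pattern for the caption template requested in the prompt.
# It accepts both "butons" (as spelled in the prompt) and "buttons".
CAPTION_RE = re.compile(
    r"^An?\s+(?P<color>\S+)\s+(?P<type>[^,]+),\s*"
    r"(?P<accessories>.*),\s*"
    r"(?P<shape>\S+)\s+shaped,\s*with\s+"
    r"(?P<screen>no screen|screen)\s+and\s+"
    r"(?P<number>\S+)\s+(?P<b_color>.+?)\s+but+ons?\.?$",
    re.IGNORECASE,
)


def parse_caption(caption: str) -> Optional[dict]:
    """Split a templated caption into its fields, or return None if it does not match."""
    match = CAPTION_RE.match(caption.strip())
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    # Convert the button count to an integer when it is written with digits.
    if fields["number"].isdigit():
        fields["number"] = int(fields["number"])
    return fields


# Made-up caption used only to illustrate the expected structure.
example = (
    "A black espresso coffee machine, a steam wand, a milk jug, "
    "round shaped, with no screen and 3 silver butons"
)
fields = parse_caption(example)
```

In practice, `decoded_output` from the usage snippet would be passed to `parse_caption`. Checking for `None` before using the result is advisable, since nothing guarantees that every generated caption fits the template.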