Edit model card

GIT-base fine-tuned for Image Captioning on High-Level descriptions of Scenes

GIT base trained on the HL dataset for scene generation of images

Model fine-tuning πŸ‹οΈβ€

  • Trained for 10 epochs
  • lr: 5eβˆ’5
  • Adam optimizer
  • half-precision (fp16)

Test set metrics 🧾

| Cider  | SacreBLEU  | Rouge-L|
|--------|------------|--------|
| 103.00 |    24.67   |  33.90 |

Model in Action πŸš€

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("git-base-captioning-ft-hl-scenes")
model = AutoModelForCausalLM.from_pretrained("git-base-captioning-ft-hl-scenes").to("cuda")

img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')


inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
            do_sample=True,
            top_k=120,
            top_p=0.9,
            early_stopping=True,
            num_return_sequences=1)

processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> "in a beach"

BibTex and citation info

@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
{R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
address = {Prague, Czech Republic},
  year={2023}
}
Downloads last month
16
Safetensors
Model size
177M params
Tensor type
I64
Β·
F32
Β·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train michelecafagna26/git-base-captioning-ft-hl-scenes