metadata

language: en
license: mit
arxiv: 2309.02561
tags:
  - vision
  - image-captioning
pipeline_tag: image-to-text

PG-InstructBLIP model

Finetuned version of InstructBLIP with Flan-T5-XXL as the language model. PG-InstructBLIP was introduced in the paper Physically Grounded Vision-Language Models for Robotic Manipulation by Gao et al (arxiv).

Model description

PG-InstructBLIP is finetuned using the PhysObjects dataset, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical concept annotations of common household objects. This fine-tuning improves its understanding of physical object concepts, by capturing human priors of these concepts from visual appearance.

Example Usage and Installation

This model is designed to be used with the LAVIS library. Please install salesforce-lavis from source and download this model through git-lfs or direct downloading.

After loading the model, you can disable the qformer text input to follow the same configuration we used for fine-tuning. However, the model still works well with it enabled, so we recommend users to experiment with both and choose the optimal configuration on a case-by-case basis.

Review the generate.py and test.py scripts provided in the Files tab for an example of using PG-InstructBLIP to determine the transparency of an opaque bowl.

import torch
from PIL import Image
from omegaconf import OmegaConf

from lavis.models import load_model, load_preprocess
from lavis.common.registry import registry

import requests

from generate import generate

url = "https://iliad.stanford.edu/pg-vlm/example_images/ceramic_bowl.jpg"
example_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

vlm = load_model(
    name='blip2_t5_instruct',
    model_type='flant5xxl',
    checkpoint='pgvlm_weights.bin',  # replace with location of downloaded weights
    is_eval=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

vlm.qformer_text_input = False  # Optionally disable qformer text

model_cls = registry.get_model_class('blip2_t5_instruct')
model_type = 'flant5xxl'
preprocess_cfg = OmegaConf.load(model_cls.default_config_path(model_type)).preprocess
vis_processors, _ = load_preprocess(preprocess_cfg)
processor = vis_processors["eval"]

question_samples = {
    'prompt': 'Question: Classify this object as transparent, translucent, or opaque? Respond unknown if you are not sure. Short answer:',
    'image': torch.stack([processor(example_image)], dim=0).to(vlm.device)
}

answers, scores = generate(vlm, question_samples, length_penalty=0, repetition_penalty=1, num_captions=3)
print(answers, scores)
# ['opaque', 'translucent', 'transparent'] tensor([-0.0373, -4.2404, -4.4436], device='cuda:0')

Note that the output of the generate function includes the log probabilities of each generation. For categorical properties (like material, transparency, and contents), these probabilities can be interpreted as confidences, as typical with VLMs. In the example above, PG-InstructBLIP is very confident that the ceramic bowl is opaque, which is true.

For continuous properties (like mass, fragility, and deformability), we recommend asking yes or no questions like "Is this object heavy?" and comparing the probabilities of the "yes" response between objects to determine which has a larger value.

For best results, we also recommend cropping input images to focus on the object in question, because PG-InstructBLIP is fine-tuned on object-centric data.