---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---

# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599
**Project Page:** https://snap-research.github.io/MyVLM/
**Code:** https://github.com/snap-research/MyVLM

# MyVLM Concept Heads & Concept Embeddings

As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper. These can be loaded using the `CLIPConceptHead` class and our inference scripts for reproducing the paper results.

This repository contains 5 concept heads for each object, representing five different training seeds and sets of images used for training.

## Concept Heads

For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.

As mentioned in the paper, we have two types of concept heads:

1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and face recognition model from [insightface](https://github.com/deepinsight/insightface/tree/master). See `concept_heads/face_recognition/head.py` for usage.

For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`). See `concept_heads/clip/head.py` for usage.
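For illustration only, the sketch below shows the general recipe for an object concept head: a single linear probe applied to image features from `DFN5B-CLIP-ViT-H-14-384`, loaded here via `open_clip`. This is not the repository's `CLIPConceptHead` implementation (see `concept_heads/clip/head.py` for that); the feature dimension, checkpoint path, and decision threshold are assumptions.

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the CLIP backbone used for the object concept heads
# (Hugging Face hub name assumed).
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
)
model.eval()

# A concept head is a single linear layer mapping CLIP image features to a
# presence score for one concept (the 1024-dim feature size is an assumption).
concept_head = nn.Linear(1024, 1)
# concept_head.load_state_dict(torch.load("path/to/concept_head.pt"))  # hypothetical path

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = features / features.norm(dim=-1, keepdim=True)
    score = torch.sigmoid(concept_head(features))

# Treat the concept as present if the score exceeds a chosen threshold
# (0.5 here is arbitrary).
is_present = score.item() > 0.5
```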

## Concept Embeddings

Having identified the presence of a user-specific concept within an image, a learned concept embedding representing the object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
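As a quick orientation before the format description below, this sketch shows how one of the released embedding files could be inspected and a single checkpoint selected. The file name is a placeholder.

```python
import torch

# Load a released concept embedding file (placeholder path; format described below).
checkpoints = torch.load("concept_embeddings.pt", map_location="cpu")

# Each top-level key is an optimization step; each entry holds the keys used
# during optimization and the concept embedding itself ("values").
for step, entry in checkpoints.items():
    print(step, entry["keys"].shape, entry["values"].shape)

# Pick one checkpoint, e.g. the latest step.
latest_step = max(checkpoints.keys())
concept_embedding = checkpoints[latest_step]["values"]
```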

The concept embeddings are saved as `.pt` files in the following format:

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```

where each entry in the dictionary represents a different checkpoint during the optimization process.

We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.

## License

This sample code is made available by Snap Inc. for non-commercial, academic purposes only. Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).