UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation
In Python, JavaScript, and Swift

The uform3-image-text-english-base UForm model is a tiny vision and English language encoder that maps images and text into a shared vector space. It produces embeddings of up to 256 dimensions and consists of:

Text encoder: 4-layer BERT for up to 64 input tokens.
Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders, which makes training more data- and parameter-efficient. Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code. If you need a larger, more accurate, or multilingual model, check our HuggingFace Hub. For more details on running the model, check out the UForm GitHub repository.

Evaluation

On text-to-image retrieval, it reaches 94.9% Recall@10 on zero-shot Flickr:

Dataset	Recall@1	Recall@5	Recall@10
Zero-Shot Flickr	0.727	0.915	0.949
MS-COCO ¹	0.510	0.761	0.838

¹ Note that the MS-COCO train split was present in the training data, so the MS-COCO results are not zero-shot.
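
For readers unfamiliar with the metric, Recall@K is the fraction of queries whose correct item appears among the top K retrieved results. The sketch below illustrates that definition on toy data; the function name `recall_at_k` and the image IDs are hypothetical, and it is not the evaluation script used to produce the table above.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose relevant item is within the top-k results.

    ranked_ids:   one ranked list of candidate IDs per query (best first).
    relevant_ids: the single correct candidate ID for each query.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if relevant in ranking[:k]:
            hits += 1
    return hits / len(relevant_ids)


# Hypothetical toy data: three text queries, each with one matching image.
rankings = [
    ["img_a", "img_b", "img_c"],  # correct image ranked 1st
    ["img_c", "img_b", "img_a"],  # correct image ranked 2nd
    ["img_b", "img_c", "img_a"],  # correct image ranked 2nd
]
targets = ["img_a", "img_b", "img_c"]

recall_1 = recall_at_k(rankings, targets, k=1)
recall_2 = recall_at_k(rankings, targets, k=2)
```

In this toy case only one of the three queries has its match at rank 1, while all three have it within the top 2, so Recall@2 is higher than Recall@1, the same pattern as in the table.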

Installation

pip install "uform[torch,onnx]"

Usage

To load the model:

from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-english-base'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

To encode the content:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
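
Because both encoders map into a shared vector space, a common way to compare a text and an image is the cosine similarity of their embeddings. The following is a minimal sketch in plain Python, assuming each embedding has first been converted to a flat list of floats (for example with `.flatten().tolist()`, which both NumPy arrays and PyTorch tensors support). The helper `cosine_similarity` and the short stand-in vectors are illustrative and not part of the UForm API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors (assumption): real embeddings would have up to 256
# dimensions and come from model_text.encode / model_image.encode.
text_vec = [0.1, 0.3, -0.2, 0.4]
image_vec = [0.2, 0.6, -0.4, 0.8]   # same direction as text_vec
other_vec = [0.4, -0.1, 0.3, 0.0]   # points in a different direction

match_score = cosine_similarity(text_vec, image_vec)
mismatch_score = cosine_similarity(text_vec, other_vec)
```

A higher score indicates a closer match between the text and the image; ranking a set of images by this score for a given caption is the basis of the text-to-image retrieval evaluated above.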
