metadata

license: cc-by-nc-4.0
model-index:
  - name: CondViT-B16-cat
    results:
      - dataset:
          name: LAION - Referred Visual Search - Fashion
          split: test
          type: Slep/LAION-RVS-Fashion
        metrics:
          - name: R@1 +10K Dist.
            type: recall_at_1|10000
            value: 93.44 ± 0.83
          - name: R@5 +10K Dist.
            type: recall_at_5|10000
            value: 98.07 ± 0.37
          - name: R@10 +10K Dist.
            type: recall_at_10|10000
            value: 98.69 ± 0.38
          - name: R@20 +10K Dist.
            type: recall_at_20|10000
            value: 98.98 ± 0.34
          - name: R@50 +10K Dist.
            type: recall_at_50|10000
            value: 99.55 ± 0.18
          - name: R@1 +100K Dist.
            type: recall_at_1|100000
            value: 85.90 ± 1.37
          - name: R@5 +100K Dist.
            type: recall_at_5|100000
            value: 94.22 ± 0.87
          - name: R@10 +100K Dist.
            type: recall_at_10|100000
            value: 96.04 ± 0.68
          - name: R@20 +100K Dist.
            type: recall_at_20|100000
            value: 97.18 ± 0.56
          - name: R@50 +100K Dist.
            type: recall_at_50|100000
            value: 98.28 ± 0.34
          - name: R@1 +500K Dist.
            type: recall_at_1|500000
            value: 78.19 ± 1.59
          - name: R@5 +500K Dist.
            type: recall_at_5|500000
            value: 88.70 ± 1.15
          - name: R@10 +500K Dist.
            type: recall_at_10|500000
            value: 91.46 ± 1.02
          - name: R@20 +500K Dist.
            type: recall_at_20|500000
            value: 94.07 ± 0.86
          - name: R@50 +500K Dist.
            type: recall_at_50|500000
            value: 96.11 ± 0.64
          - name: R@1 +1M Dist.
            type: recall_at_1|1000000
            value: 74.49 ± 1.23
          - name: R@5 +1M Dist.
            type: recall_at_5|1000000
            value: 85.38 ± 1.29
          - name: R@10 +1M Dist.
            type: recall_at_10|1000000
            value: 88.95 ± 1.15
          - name: R@20 +1M Dist.
            type: recall_at_20|1000000
            value: 91.35 ± 0.93
          - name: R@50 +1M Dist.
            type: recall_at_50|1000000
            value: 94.75 ± 0.75
          - name: Available Dists.
            type: n_dists
            value: 2000014
          - name: Embedding Dimension
            type: embedding_dim
            value: 512
          - name: Conditioning
            type: conditioning
            value: category
        source:
          name: LRVSF Leaderboard
          url: https://huggingface.co/spaces/Slep/LRVSF-Leaderboard
        task:
          type: Retrieval
tags:
  - lrvsf-benchmark
datasets:
  - Slep/LAION-RVS-Fashion

Conditional ViT - B/16 - Categories

Introduced in LRVS-Fashion: Extending Visual Search with Referring Instructions, Lepage et al., 2023.

Data	Code	Models	Spaces
Full Dataset	Training Code	Categorical Model	LRVS-F Leaderboard
Test set	Benchmark Code	Textual Model	Demo

General Information

Model fine-tuned from CLIP ViT-B/16 on LRVS-F at a resolution of 224x224. The embedding is conditioned on one of the following categories:

Bags
Feet
Hands
Head
Lower Body
Neck
Outwear
Upper Body
Waist
Whole Body
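
Since the model is conditioned on a fixed set of category names, it can help to check a user-supplied string against that set before building inputs. The following is a minimal sketch: the names are copied from the list above (including the spelling "Outwear"), while the `CATEGORIES` constant and the `check_category` helper are hypothetical and not part of the model's code. Whether the processor itself tolerates other spellings or casing is not stated here, so the helper simply maps input to the exact spelling listed.

```python
# Category names copied from the list above; the helper below is hypothetical.
CATEGORIES = (
    "Bags", "Feet", "Hands", "Head", "Lower Body",
    "Neck", "Outwear", "Upper Body", "Waist", "Whole Body",
)


def check_category(name: str) -> str:
    """Return the listed spelling of a category, or raise ValueError if unknown."""
    lookup = {c.lower(): c for c in CATEGORIES}
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    if key not in lookup:
        raise ValueError(f"Unknown category {name!r}; expected one of {CATEGORIES}")
    return lookup[key]
```

For example, `check_category("lower body")` returns `"Lower Body"`, and an unlisted name such as `"Shoes"` raises `ValueError`.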

Research use only.

How to Use

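from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the model and its matching processor from the Hub.
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat")

# Fetch an example image and pick one of the conditioning categories.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"

# Images and categories are passed as lists; here, one image with one category.
inputs = processor(images=[img], categories=[cat])
raw_embedding = model(**inputs)
# L2-normalize so that dot products between embeddings are cosine similarities.
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)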
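
Once embeddings are normalized, a typical use is to rank a gallery by cosine similarity and measure recall at k, the kind of metric reported in the metadata above. The sketch below illustrates the idea in pure Python on made-up 2-D vectors. It assumes nearest-neighbor ranking by cosine similarity and is not the benchmark's official scoring code; the function names `l2_normalize`, `rank_gallery`, and `recall_at_k` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_gallery(query, gallery):
    """Return gallery indices sorted by decreasing cosine similarity to the query."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, targets, gallery, k):
    """Fraction of queries whose target index appears in the top-k results."""
    hits = sum(
        targets[i] in rank_gallery(q, gallery)[:k] for i, q in enumerate(queries)
    )
    return hits / len(queries)


# Toy embeddings, made up for illustration.
# Gallery items 0 and 1 are the targets; items 2 and 3 act as distractors.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
queries = [[0.8, 0.2], [0.1, 0.9]]
targets = [0, 1]

print(recall_at_k(queries, targets, gallery, k=1))  # 0.5
print(recall_at_k(queries, targets, gallery, k=2))  # 1.0
```

In this toy setup the distractor at index 2 outranks the true target for the first query, so recall at 1 is 0.5 while recall at 2 is 1.0. With real embeddings, the same ranking would usually be computed as a single matrix product over batched tensors rather than Python loops.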