	---
	tags:
	- clip
	- e-commerce
	- fashion
	- multimodal retrieval
	library_name: open_clip
	pipeline_tag: zero-shot-image-classification
	license: apache-2.0
	language:
	- en
	metrics:
	- precision
	- recall
	- MRR
	---
	# Marqo-FashionCLIP Model Card
	Marqo-FashionCLIP leverages Generalised Contrastive Learning ([GCL](https://www.marqo.ai/blog/generalized-contrastive-learning-for-multi-modal-retrieval-and-ranking)), which allows the model to be trained not only on text descriptions but also on categories, styles, colors, materials, keywords, and fine details, to provide highly relevant search results on fashion products.
	The model was fine-tuned from ViT-B-16 (laion2b_s34b_b88k).

	GitHub Page: [Marqo-FashionCLIP](https://github.com/marqo-ai/marqo-FashionCLIP)

	Blog: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)

	## Usage
	The model can be loaded directly with [OpenCLIP](https://github.com/mlfoundations/open_clip):

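	```python
	import open_clip
	import torch
	from PIL import Image

	# Load the model, its preprocessing transforms, and the tokenizer from the Hugging Face Hub.
	model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
	tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')
	model.eval()  # switch to inference mode

	# Use the validation transform for inference; replace the path with any local image.
	image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
	text = tokenizer(["a hat", "a t-shirt", "shoes"])

	with torch.no_grad(), torch.cuda.amp.autocast():
	    image_features = model.encode_image(image)
	    text_features = model.encode_text(text)
	    # L2-normalize so that the dot product below is a cosine similarity.
	    image_features /= image_features.norm(dim=-1, keepdim=True)
	    text_features /= text_features.norm(dim=-1, keepdim=True)

	    # Scale the similarities and convert them to probabilities over the text labels.
	    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

	print("Label probs:", text_probs)
	```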
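
The last step of the snippet above turns cosine similarities into label probabilities. As a rough illustration of that arithmetic only, the sketch below reproduces it in plain Python on made-up 3-dimensional vectors. The vectors and the helper names `normalize` and `label_probs` are hypothetical and not part of OpenCLIP; the factor `100.0` is the scale used in the snippet above.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm, as done for the image and text features."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = normalize(image_vec)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_vecs]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up embeddings for illustration only; real embeddings have many more dimensions.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = label_probs(image, texts)
print("Label probs:", probs)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity produces a very peaked distribution, which is why the best-matching label usually receives almost all of the probability mass.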

	## Benchmark Results
	Average evaluation results on 6 public multimodal fashion datasets ([Atlas](https://huggingface.co/datasets/Marqo/atlas), [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), [Fashion200k](https://huggingface.co/datasets/Marqo/fashion200k), [KAGL](https://huggingface.co/datasets/Marqo/KAGL), and [Polyvore](https://huggingface.co/datasets/Marqo/polyvore)) are reported below:

	Text-To-Image (Averaged across 6 datasets)
	| Model                      | AvgRecall | Recall@1 | Recall@10 | MRR   |
	|----------------------------|-----------|----------|-----------|-------|
	| Marqo-FashionCLIP          | 0.192     | 0.094    | 0.290     | 0.200 |
	| FashionCLIP2.0             | 0.163     | 0.077    | 0.249     | 0.165 |
	| OpenFashionCLIP            | 0.132     | 0.060    | 0.204     | 0.135 |
	| ViT-B-16-laion2b_s34b_b88k | 0.174     | 0.088    | 0.261     | 0.180 |
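
For readers unfamiliar with the retrieval metrics in this table, the following is a minimal sketch of how Recall@k and MRR are commonly computed. It uses the standard textbook definitions on hypothetical toy data and is not the evaluation code behind the reported numbers (see the GitHub page for that). Treating AvgRecall as the mean of Recall@1 and Recall@10 is an assumption, though it is consistent with the table (for example, (0.094 + 0.290) / 2 = 0.192).

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical toy queries: (ranked result IDs, set of relevant IDs).
queries = [
    (["img3", "img7", "img1"], {"img3"}),  # relevant item at rank 1
    (["img5", "img2", "img9"], {"img2"}),  # relevant item at rank 2
    (["img4", "img8", "img6"], {"img0"}),  # relevant item not retrieved
]

# Average each metric over all queries.
recall_1 = sum(recall_at_k(r, rel, 1) for r, rel in queries) / len(queries)
recall_10 = sum(recall_at_k(r, rel, 10) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Assumption: AvgRecall is the mean of Recall@1 and Recall@10.
avg_recall = (recall_1 + recall_10) / 2

print(f"Recall@1={recall_1:.3f} Recall@10={recall_10:.3f} MRR={mrr:.3f} AvgRecall={avg_recall:.3f}")
```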

	Category-To-Product (Averaged across 5 datasets)
	| Model                      | AvgP  | P@1   | P@10  | MRR   |
	|----------------------------|-------|-------|-------|-------|
	| Marqo-FashionCLIP          | 0.705 | 0.734 | 0.676 | 0.776 |
	| FashionCLIP2.0             | 0.684 | 0.681 | 0.686 | 0.741 |
	| OpenFashionCLIP            | 0.646 | 0.653 | 0.639 | 0.720 |
	| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |

	Sub-Category-To-Product (Averaged across 4 datasets)
	| Model                      | AvgP  | P@1   | P@10  | MRR   |
	|----------------------------|-------|-------|-------|-------|
	| Marqo-FashionCLIP          | 0.707 | 0.747 | 0.667 | 0.772 |
	| FashionCLIP2.0             | 0.657 | 0.676 | 0.638 | 0.733 |
	| OpenFashionCLIP            | 0.598 | 0.619 | 0.578 | 0.689 |
	| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
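
The category and sub-category tables report precision rather than recall, because many products can match a single category query. As a minimal sketch under the standard definition of P@k, using hypothetical toy labels rather than the actual evaluation code, precision can be computed as follows. Treating AvgP as the mean of P@1 and P@10 is an assumption, though it matches the tables (for example, (0.734 + 0.676) / 2 = 0.705).

```python
def precision_at_k(ranked_labels, query_label, k):
    """Fraction of the top-k retrieved products whose label matches the query."""
    matches = sum(1 for label in ranked_labels[:k] if label == query_label)
    return matches / k

# Hypothetical ranked results for the category query "dress".
ranked = ["dress", "dress", "skirt", "dress", "top"]

p_at_1 = precision_at_k(ranked, "dress", 1)
p_at_5 = precision_at_k(ranked, "dress", 5)

# Assumption: the average is taken over the two reported cutoffs
# (k=5 stands in for k=10 here to keep the toy list short).
avg_p = (p_at_1 + p_at_5) / 2

print(f"P@1={p_at_1:.3f} P@5={p_at_5:.3f} AvgP={avg_p:.3f}")
```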