---
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
pipeline_tag: sentence-similarity
---
# clip-ViT-B-32

This is the Image & Text model [CLIP](https://arxiv.org/abs/2103.00020), which maps text and images to a shared vector space. For applications of the model, have a look at our documentation: [SBERT.net - Image Search](https://www.sbert.net/examples/applications/image-search/README.html)
## Usage

After installing [sentence-transformers](https://sbert.net) (`pip install sentence-transformers`), using this model is straightforward:
```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities between the image and each text
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
```
See our [SBERT.net - Image Search](https://www.sbert.net/examples/applications/image-search/README.html) documentation for more examples of how the model can be used for image search, zero-shot image classification, image clustering, and image deduplication.
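For instance, zero-shot image classification only requires encoding the candidate labels as text: the predicted class is the label whose embedding has the highest cosine similarity to the image embedding. Below is a minimal sketch of that idea; the image path and the label set are illustrative placeholders, not part of this model card.

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Candidate classes, phrased as short captions (illustrative labels)
labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
label_emb = model.encode(labels)

# 'some_image.jpg' is a placeholder path for the image to classify
img_emb = model.encode(Image.open('some_image.jpg'))

# The label with the highest cosine similarity is the prediction
scores = util.cos_sim(img_emb, label_emb)
print(labels[scores.argmax()])
```

Phrasing labels as captions (e.g. "a photo of a dog" rather than "dog") tends to match the caption-style text CLIP was trained on.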
## Performance

The following table shows the zero-shot ImageNet validation set accuracy:
| Model | Top-1 Accuracy (%) |
| --- | :---: |
| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 |
| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 |
| [clip-ViT-L-14](https://huggingface.co/sentence-transformers/clip-ViT-L-14) | 75.4 |
For a multilingual version of the CLIP model supporting 50+ languages, have a look at [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
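The multilingual model maps text into the same vector space as this image model, so image embeddings from clip-ViT-B-32 can be scored directly against captions in other languages. A minimal sketch, reusing the example image from above (the German caption is just an illustrative translation):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# The image model and the multilingual text model share one vector space
img_model = SentenceTransformer('clip-ViT-B-32')
text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = text_model.encode(['Two dogs in the snow', 'Zwei Hunde im Schnee'])

# One similarity score per caption, regardless of the caption language
print(util.cos_sim(img_emb, text_emb))
```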