---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---
|
|
|
# Model Card

## Details
|
This model was trained with CLIP4Clip, a CLIP-based video retrieval method described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
|
|
|
The model was trained on 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large-scale collection of short videos with corresponding textual descriptions sourced from the web.
|
|
|
To make the trained CLIP model easy to use, we adapted its weights to the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) implementation, with some modifications to the final layers (see the projection step in the example below).
|
|
|
### Use with Transformers
|
|
|
```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

# Apply the text projection to the last hidden states and pool at the EOS token
# (this reproduces the modified final layers of the text encoder):
text_projection = model.state_dict()['text_projection.weight']
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Normalize the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```
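
Given a matrix of pre-computed, L2-normalized video embeddings (see the notebook linked under Extra Information below), retrieval reduces to a dot product between the sentence embedding and each video embedding. A minimal sketch, where `video_embeddings.npy` is a hypothetical placeholder for your own file of shape `(num_videos, 512)`:

```python
import numpy as np

# Hypothetical file of pre-computed, L2-normalized video embeddings, shape (num_videos, 512)
video_embeddings = np.load("video_embeddings.npy")

# Cosine similarity reduces to a dot product because both sides are normalized
similarities = video_embeddings @ sequence_output[0]

# Indices of the best-matching videos, highest similarity first
top_k = 5
best_matches = np.argsort(-similarities)[:top_k]
print("Top videos:", best_matches, "scores:", similarities[best_matches])
```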
|
|
|
## Model Use

### Intended Use
|
|
|
This model is intended to be used for video retrieval; see, for example, this [**space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
|
|
|
### Extra Information
|
|
|
For the video side, an additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_EmbedVideos.ipynb) describes how to embed videos.
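
As a rough outline of what that notebook does, CLIP4Clip (in its mean-pooling variant) encodes sampled frames with the CLIP vision tower and averages the frame embeddings. The following is only an illustrative sketch: it uses the base `openai/clip-vit-base-patch32` vision weights as a stand-in for the fine-tuned video side and assumes frames have already been extracted as PIL images.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Stand-in vision weights; the linked notebook describes the actual video-side setup
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frames):
    """Embed a video from a list of PIL frames by mean-pooling frame embeddings."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_embeds = vision_model(**inputs).image_embeds  # (num_frames, 512)
    # Normalize each frame embedding, average, then normalize the result
    frame_embeds = frame_embeds / frame_embeds.norm(dim=-1, keepdim=True)
    video_embed = frame_embeds.mean(dim=0)
    return video_embed / video_embed.norm()
```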
|
|
|
|
|
## Performance and Limitations

### Performance
|
|
|
We evaluated the performance of different models on the last 10,000 video clips of the WebVid dataset.
|
|
|
| Model | R@1 | R@5 | R@10 | MedianR | MeanR |
|-------|-----|-----|------|---------|-------|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |
|
|
|
For more details about the evaluation, see this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
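
For reference, the metrics in the table can be computed from a text-to-video similarity matrix in a few lines. A minimal sketch, assuming `sims` has shape `(num_queries, num_videos)` and the ground-truth video for query `i` is at column `i`:

```python
import numpy as np

def retrieval_metrics(sims):
    """Compute R@1/R@5/R@10 and median/mean rank from a (queries x videos) similarity matrix."""
    # 1-based rank of the ground-truth video (the diagonal entry) for each query
    order = np.argsort(-sims, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(sims.shape[0])])
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "MedianR": float(np.median(ranks)),
        "MeanR": float(ranks.mean()),
    }
```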
|
|
|
|
|
|
|
|
|
|