About extracting embedding vectors of images and texts.

#10

by iceleaf97tech - opened Jun 11

Jun 11

Is it possible to extract embedding vectors of images and texts using these models?
If so, how should I do that?
Could you provide a code template? Thanks.

zRzRzRzRzRzRzR

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jun 17

VQA-style multimodal models are not recommended for embedding extraction:

VQA models are designed primarily for visual question answering, so their architecture and optimization goals differ from what embedding extraction needs.
The CLIP model is trained specifically to produce aligned image and text embeddings in a shared space, which gives better performance and greater adaptability for this purpose.

For embedding tasks, CLIP is the more efficient and better-suited choice.
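Since a code template was requested, here is a minimal sketch of what CLIP-based extraction could look like. It assumes the Hugging Face `transformers`, `torch`, and `Pillow` packages are installed. The checkpoint `openai/clip-vit-base-patch32` and the function names `embed_with_clip` and `cosine_similarity` are illustrative choices, not something this model repository provides. It also assumes a `transformers` 4.x release, where `get_image_features` and `get_text_features` return plain tensors; other versions may differ.

```python
import math
from typing import List, Sequence, Tuple

# Assumption: an illustrative public CLIP checkpoint, not one named in this thread.
MODEL_NAME = "openai/clip-vit-base-patch32"


def embed_with_clip(
    image_paths: Sequence[str],
    texts: Sequence[str],
    model_name: str = MODEL_NAME,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (image_embeddings, text_embeddings), each L2-normalized."""
    # Heavy dependencies are imported lazily so cosine_similarity works without them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    images = [Image.open(path).convert("RGB") for path in image_paths]
    with torch.no_grad():
        image_inputs = processor(images=images, return_tensors="pt")
        text_inputs = processor(
            text=list(texts), return_tensors="pt", padding=True, truncation=True
        )
        # Both projections land in the same shared embedding space.
        image_emb = model.get_image_features(**image_inputs)
        text_emb = model.get_text_features(**text_inputs)

    # Normalize so that a dot product equals cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return image_emb.tolist(), text_emb.tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Example usage ("cat.jpg" is a hypothetical local file):
# image_vecs, text_vecs = embed_with_clip(["cat.jpg"], ["a photo of a cat"])
# score = cosine_similarity(image_vecs[0], text_vecs[0])
```

Because the image and text vectors share one space, the cosine score can be used directly for retrieval or ranking; a higher score suggests a closer match between an image and a caption.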

zRzRzRzRzRzRzR changed discussion status to closed Jun 17
