import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("SigLIP")

st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1745476609686089800) (January 11, 2024)""", icon="ℹ️")
st.markdown(""" """)

st.markdown("""SigLIP just got merged to 🤗 Transformers and it's super easy to use!  
To celebrate this, I have created a repository on various SigLIP based projects!  
But what is it and how does it work?  
SigLIP an vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and text encoder such that the dot product of embeddings are most similar for the appropriate text-image pairs.  
The image below is taken from CLIP, where this contrastive pre-training takes place with softmax, but SigLIP replaces softmax with sigmoid. 📎
""")
st.markdown(""" """)

st.image("pages/SigLIP/image_1.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Highlights✨  
🖼️📝 Authors used medium sized B/16 ViT for image encoder and B-sized transformer for text encoder  
😍 More performant than CLIP on zero-shot  
🗣️ Authors trained a multilingual model too!  
⚡️ Super efficient, sigmoid is enabling up to 1M items per batch, but the authors chose 32k (see saturation on perf below)
""")
st.markdown(""" """)

st.image("pages/SigLIP/image_2.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
Below you can find prior CLIP models and SigLIP across different image encoder sizes and their performance on different datasets 👇🏻 
""")
st.markdown(""" """)

st.image("pages/SigLIP/image_3.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
With 🤗 Transformers integration there comes zero-shot-image-classification pipeline, makes SigLIP super easy to use! 
""")
st.markdown(""" """)

st.image("pages/SigLIP/image_4.jpg", use_column_width=True)
st.markdown(""" """)

st.markdown("""
What to use SigLIP for? 🧐  
Honestly the possibilities are endless, but you can use it for image/text retrieval, zero-shot classification, training multimodal models!  
I have made a repository with notebooks and applications that are also hosted on [Spaces](https://t.co/Ah1CrHVuPY).  
I have built ["Draw to Search Art"](https://t.co/DcmQWMc1qd) where you can input image (upload one or draw) and search among 10k images in wikiart!  
I've also built apps to [compare](https://t.co/m699TMvuW9) CLIP and SigLIP outputs.
""")
st.markdown(""" """)

st.image("pages/SigLIP/image_5.jpg", use_column_width=True)
st.markdown(""" """)

st.info("""
Resources:  
[Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343)  
by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer (2023)  
[GitHub](https://github.com/google-research/big_vision)  
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/siglip)""", icon="📚")  
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("DINOv2")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("OWLv2")