import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("ViTMAE")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1740688304784183664) (December 29, 2023)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""Just read VitMAE paper, sharing some highlights 🧶 | |
ViTMAE is a simply yet effective self-supervised pre-training technique, where authors combined vision transformer with masked autoencoder. | |
The images are first masked (75 percent of the image!) and then the model tries to learn about the features through trying to reconstruct the original image! | |
""") | |
st.markdown(""" """) | |
st.image("pages/VITMAE/image_1.jpeg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown("""The image is not masked, but rather only the visible patches are fed to the encoder (and that is the only thing encoder sees!). | |
Next, a mask token is added to where the masked patches are (a bit like BERT, if you will) and the mask tokens and encoded patches are fed to decoder. | |
The decoder then tries to reconstruct the original image. | |
""") | |
st.markdown(""" """) | |
st.image("pages/VITMAE/image_2.jpeg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown("""As a result, the authors found out that high masking ratio works well in fine-tuning for downstream tasks and linear probing 🤯🤯 | |
""") | |
st.markdown(""" """) | |
st.image("pages/VITMAE/image_3.jpeg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown("""If you want to try the model or fine-tune, all the pre-trained VITMAE models released released by Meta are available on [Huggingface](https://t.co/didvTL9Zkm). | |
We've built a [demo](https://t.co/PkuACJiKrB) for you to see the intermediate outputs and reconstruction by VITMAE. | |
Also there's a nice [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) by [@NielsRogge](https://twitter.com/NielsRogge). | |
""") | |
st.markdown(""" """) | |
st.image("pages/VITMAE/image_4.jpeg", use_column_width=True) | |
st.markdown(""" """) | |
st.info(""" | |
Ressources: | |
[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v3) | |
by LKaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021) | |
[GitHub](https://github.com/facebookresearch/mae) | |
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/vit_mae)""", icon="📚") | |
st.markdown(""" """) | |
st.markdown(""" """) | |
st.markdown(""" """) | |
col1, col2, col3 = st.columns(3) | |
with col1: | |
if st.button('Previous paper', use_container_width=True): | |
switch_page("OneFormer") | |
with col2: | |
if st.button('Home', use_container_width=True): | |
switch_page("Home") | |
with col3: | |
if st.button('Next paper', use_container_width=True): | |
switch_page("DINOV2") |