import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("OWLv2")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1748411972675150040) (January 19, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉🧶""")
st.markdown(""" """)
st.image("pages/OWLv2/image_1.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first.
OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training.
What's cool is that it can take both image and text queries! This works because the image and text features aren't fused together.
""")
st.markdown(""" """)
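# Illustrative sketch (not the authors' code) of why unfused features allow
# both text and image queries: each image patch embedding can be scored
# against any query embedding with a plain similarity, whether that query
# came from a text encoder or from another image. The helper names and the
# use of cosine similarity are assumptions made for this example.
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_patches(patch_embeddings, query_embedding):
    """Score every patch embedding against one query embedding."""
    return [cosine_similarity(p, query_embedding) for p in patch_embeddings]


def best_patch(patch_embeddings, query_embedding):
    """Index of the patch whose embedding is most similar to the query."""
    scores = score_patches(patch_embeddings, query_embedding)
    return max(range(len(scores)), key=scores.__getitem__)
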
st.image("pages/OWLv2/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""Taking a look at the architecture, the authors first do contrastive pre-training of a vision encoder and a text encoder (just like CLIP).
They then take that model, remove the final pooling layer, attach lightweight classification and box-detection heads, and fine-tune it.
""")
st.markdown(""" """)
st.image("pages/OWLv2/image_3.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""During fine-tuning for object detection, they calculate the loss over bipartite matches.
Simply put, the predicted objects are matched one-to-one with the ground-truth objects and the loss is computed over the matched pairs; the goal is to find the best matching between the two sets.
OWL-ViT is very scalable.
One can easily scale most language models or vision-language models because they require no supervision, but this isn't the case for object detection: you still need supervision.
Moreover, scaling only the encoders creates a bottleneck after a while.
""")
st.markdown(""" """)
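# Toy sketch of the bipartite matching described above, not the authors'
# implementation. It brute-forces every one-to-one assignment, which is only
# feasible for a handful of boxes; real detectors typically use the Hungarian
# algorithm. The L1 box cost is an assumption made for illustration, and the
# sketch assumes there are at least as many predictions as ground-truth boxes.
from itertools import permutations


def box_l1_cost(pred_box, gt_box):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Return (assignment, cost); assignment[j] is the prediction index
    matched to ground-truth box j under the cheapest one-to-one matching."""
    best_assignment, best_cost = None, float("inf")
    for assignment in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            box_l1_cost(pred_boxes[i], gt_boxes[j])
            for j, i in enumerate(assignment)
        )
        if cost < best_cost:
            best_assignment, best_cost = assignment, cost
    return best_assignment, best_cost
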
st.image("pages/OWLv2/image_1.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT to label images, "self-trained" a new detector on those labels, and fine-tuned that model on human-annotated data.
""")
st.markdown(""" """)
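# Rough sketch of the labelling step in the self-training recipe above: an
# existing detector's outputs become training labels for a new detector.
# Keeping only confident detections is a common way to do this; the function
# name, the dict layout and the 0.3 default are hypothetical and are not
# taken from the paper.
def filter_pseudo_labels(detections, score_threshold=0.3):
    """Keep detections confident enough to be used as pseudo-labels."""
    return [
        {"label": d["label"], "box": d["box"]}
        for d in detections
        if d["score"] >= score_threshold
    ]
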
st.image("pages/OWLv2/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Thanks to this, OWLv2 scales very well and tops the leaderboards on open-vocabulary object detection.
""")
st.markdown(""" """)
st.image("pages/OWLv2/image_5.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Want to try OWL models?
I've created a [notebook](https://t.co/ick5tA6nyx) for you to see how to use them with 🤗 Transformers.
If you want to play with them directly, you can use this [Space](https://t.co/oghdLOtoa5).
All the models and applications of the OWL series are in this [collection](https://huggingface.co/collections/merve/owl-series-65aaac3114e6582c300544df).
""")
st.markdown(""" """)
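# Minimal sketch of trying an OWLv2 checkpoint with the Transformers
# zero-shot-object-detection pipeline; the linked notebook is the
# authoritative walkthrough. The checkpoint name and the threshold default
# are assumptions. The import lives inside the function so this page still
# loads when transformers is not installed.
def detect_objects(
    image,
    candidate_labels,
    checkpoint="google/owlv2-base-patch16-ensemble",
    threshold=0.1,
):
    """Run zero-shot detection on a PIL image, file path or URL."""
    from transformers import pipeline

    detector = pipeline("zero-shot-object-detection", model=checkpoint)
    # Each prediction is a dict with "score", "label" and "box" keys.
    return detector(image, candidate_labels=candidate_labels, threshold=threshold)
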
st.info("""
Resources:
[Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683)
by Matthias Minderer, Alexey Gritsenko, Neil Houlsby (2023)
[GitHub](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/owlv2)""")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("SigLIP")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Backbone")