Spaces:
Runtime error
Runtime error
import streamlit as st | |
from streamlit_extras.switch_page_button import switch_page | |
st.title("Chameleon") | |
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1814278511785312320) (July 19, 2024)""", icon="ℹ️") | |
st.markdown(""" """) | |
st.markdown("""Chameleon 🦎 by Meta is now available in 🤗 Transformers. | |
A multimodal model that comes in 7B and 34B sizes 🤩 | |
But what makes this model so special? Keep reading ⇣ | |
""") | |
st.markdown(""" """) | |
st.video("pages/Chameleon/video_1.mp4", format="video/mp4") | |
st.markdown(""" """) | |
st.markdown(""" | |
[Demo](https://t.co/GsGE17fSdI) | [Models](https://t.co/cWUiVbsRz6) | |
Find below the API to load this model locally use it ⬇️ | |
""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_1.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown("""Chameleon is a unique model: it attempts to scale early fusion 🤨 | |
But what is early fusion? | |
Modern vision language models use a vision encoder with a projection layer to project image embeddings so it can be promptable to text decoder.""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_2.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown(""" | |
Early fusion on the other hand attempts to fuse all features together (image patches and text) by using an image tokenizer and all tokens are projected into a shared space, which enables seamless generation 😏 | |
""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_3.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown(""" | |
Authors have also introduced different architectural improvements (QK norm and revise placement of layer norms) for scalable and stable training. | |
This way they were able to increase the token count (5x tokens compared to Llama 3 which is a must with early-fusion IMO) . | |
""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_4.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown(""" | |
This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation are disabled to prevent malicious use. | |
""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_5.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.markdown(""" | |
One can also do text-only prompting, authors noted the model catches up with larger LLMs, and you can also see how it compares to VLMs with image-text prompting. | |
""") | |
st.markdown(""" """) | |
st.image("pages/Chameleon/image_6.jpg", use_column_width=True) | |
st.image("pages/Chameleon/image_6.jpg", use_column_width=True) | |
st.markdown(""" """) | |
st.info(""" | |
Resources: | |
- [Chameleon: Mixed-Modal Early-Fusion Foundation Models](https://arxiv.org/abs/2405.09818) | |
by Chameleon Team (2024) | |
- [GitHub](https://github.com/facebookresearch/chameleon) | |
- [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/chameleon) | |
- [Demo](https://huggingface.co/spaces/merve/chameleon-7b) | |
""", icon="📚") | |
st.markdown(""" """) | |
st.markdown(""" """) | |
st.markdown(""" """) | |
col1, col2, col3 = st.columns(3) | |
with col1: | |
if st.button('Previous paper', use_container_width=True): | |
switch_page("Llava-NeXT-Interleave") | |
with col2: | |
if st.button('Home', use_container_width=True): | |
switch_page("Home") | |
with col3: | |
if st.button('Next paper', use_container_width=True): | |
switch_page("Video-LLaVA") |