107 49 74

Hugo Laurençon

HugoLaurencon

HugoLaurencon

AI & ML interests

None yet

Articles

Docmatix - a huge dataset for Document Visual Question Answering

Jul 18

• 67

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Apr 15

• 166

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Mar 15

• 6

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 27

Putting ethical principles at the core of research lifecycle

May 19, 2022

Organizations

HugoLaurencon's activity

reacted to alex-abb's post with 🔥 5 months ago

Post

4768

Hi everyone!
I'm Alex, I'm 16, I've been an internship at Hugging Face for a little over a week and I've already learned a lot about using and prompting LLM models. With @victor as tutor I've just finished a space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize hugging face posts.

alex-abb/LLM_Feeling_Analyzer

4 replies

replied to their post 7 months ago

It was roughly trained for 1 month on 32 nodes of 8 H100s

posted an update 7 months ago

Post

2842

We release Idefics2-chatty, the chatbot-optimized version of Idefics2: HuggingFaceM4/idefics2-8b-chatty

Idefics2-chatty is better at following instructions and following Chain-of-Thoughts reasoning.

Moreover, we also release a paper, containing a lot of findings on how to build an efficient and performant Vision-Language Model: What matters when building vision-language models? (2405.02246)

How are you going to use the model, or what data are you going to fine-tune it on?

5 replies

reacted to VictorSanh's post with 🔥 7 months ago

Post

2762

💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!

It is a very efficient (8B parameters) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖Paper dissecting a lot of the experimental insights we learned building Idefics2:
- 🏎️TGI integration for blazing-fast inference (you can already run it locally with < 24GB GPU memory)
- 🏆 Ranking 2nd in its category (< 10B, open weights) in the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena

Ressources:
⏯️Playground: HuggingFaceM4/idefics2_playground
📖Paper: What matters when building vision-language models? (2405.02246)
🏋️‍♂️Model and red-teaming analysis: HuggingFaceM4/idefics2-8b-chatty
👀Ressources to get started: HuggingFaceM4/idefics2-8b-chatty
🏆Open VLM Leaderboard: opencompass/open_vlm_leaderboard
🏟️Vision arena: WildVision/vision-arena

1 reply

posted an update 7 months ago

Post

2462

Idefics2 is trained mostly on OBELICS, our open interleaved image-text document dataset.

Training on interleaved data is crucial to reaching high performance on VQA tasks, taking an arbitrary number of images as input, and doing in-context learning.

Dataset: HuggingFaceM4/OBELICS
Nomic visualization: https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f
Link to OBELICS thread: https://twitter.com/HugoLaurencon/status/1694005892839006301

posted an update 7 months ago

Post

2806

The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format, and ready to use to fine-tune any Vision Language Model.

The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between 2 images or converting a screenshot to a code.

HuggingFaceM4/the_cauldron

posted an update 7 months ago

Post

3082

We release Idefics2-8B, a foundation vision language model with SOTA results for its size on many benchmarks.

For Idefics2, we adopted a simple architecture:
-Images are fed to a vision encoder, then to a modality projection to match the input dimension of the LLM, and finally to a perceiver resampler for efficient pooling.
-Interleaved image-text data are then passed to the LLM.

During the pre-training:
-The modality projection and perceiver resampler weights are newly initialized.
-We start with pre-trained models for the vision encoder and the LLM, and continue the training with LoRA.
-In total, we see 1.5T images!

We pre-train on 3 types of data, all publicly available:
-Interleaved image-text documents: our dataset OBELICS HuggingFaceM4/OBELICS
-Image caption pairs: only synthetic captions!
-PDF documents: IDL and PDFA

We kept the aspect ratio of the images with the Patch n' Pack strategy, with a resolution of up to 980x980.
At inference, it's also more efficient for lower-resolution images.

For the SFT, we build The Cauldron, a collection of 50 high-quality datasets in the user/assistant format.
It is a ready-to-use dataset for the fine-tuning of any VLM.
HuggingFaceM4/the_cauldron

Most current models, like LLaVA-NeXT, encode images with an excessive number of tokens, like 2880.
Instead, we put a focus on being efficient at inference by training on a mix of images encoded with 64 tokens, and 320 tokens.
The result is that we perform favorably compared to the best models in our size class, while being efficient at inference.

replied to their post 8 months ago

Thanks for the feedback. If flash attention is a problem you can always enable/disable it in the loading of the model

We will publish all the details on how the foundation model was trained during its release!

posted an update 8 months ago

Post

With the new WebSight dataset, converting the screenshot of a web page to its corresponding HTML code is just one fine-tuning step away

We release a new version of our synthetic dataset:
-Real images within web pages 🖼️
-Tailwind CSS 🎨
-2M examples 📈

Our initial release, v0.1, featured web designs in HTML + CSS, using simple colored rectangles as image placeholders.
It was a good start to help models grasp the basics of web page structure and coding associations.
Yet, it was missing the look of a real website.

Improving visual appeal, we've now embedded actual images in our web designs, ensuring they match the site's content for a more authentic look.

Switching to Tailwind CSS offers a more compact representation of the code.

We've also expanded our dataset to 2 million examples!

After fine-tuning our forthcoming foundation vision-language model on this dataset, we've observed some encouraging capabilities, such as converting sketches directly into functional HTML code.

We're excited to hear your thoughts and suggestions for future versions. What would you like to see next? Feel free to open a discussion on the hub!

Dataset: HuggingFaceM4/WebSight
Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
Blog post: https://huggingface.co/blog/websight
Google Colab: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing

Work done with @VictorSanh @Leyo

4 replies

reacted to VictorSanh's post with 🔥 8 months ago

Post

When Greg Brockman demo-ed GPT4 by hand-sketching a joke website on a piece of paper and asking the system to convert that into an HTML webpage, it blew my mind.

Can you build your own Screenshot-to-HTML system with much fewer resources?

With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).

We have iterated on WebSight-v0.1 and are releasing its v0.2.
WebSight is an open dataset of synthetically generated webpages with their corresponding rendered screenshot.

A few noticeable improvements:
- 💨From traditional CSS to Tailwind CSS. Tailwind is CSS directly embedded in the HTML attribute class and is much more compact
- 🚛2M pairs of synthetic HTML webpages with their associated rendered screenshot, along with the prompt generated by an LLM to create that webpage
- 🖼️Much more visually appealing pages with the integration of real images

👀Blog: https://huggingface.co/blog/websight
💽Dataset: HuggingFaceM4/WebSight
📜Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
🎮Want to create your own synthetic data pipelines? A starting point: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing

Built with @HugoLaurencon & @Leyo

reacted to clem's post with 🤗 10 months ago

Post

Is synthetic data the future of AI? 🔥🔥🔥

@HugoLaurencon @Leyo & @VictorSanh are introducing HuggingFaceM4/WebSight , a multimodal dataset featuring 823,000 pairs of synthetically generated HTML/CSS codes along with screenshots of the corresponding rendered websites to train GPT4-V-like models 🌐💻

While crafting their upcoming foundation vision language model, they faced the challenge of converting website screenshots into usable HTML/CSS codes. Most VLMs suck at this and there was no public dataset available for this specific task, so they decided to create their own.

They prompted existing LLMs to generate 823k HTML/CSS codes of very simple websites. Through supervised fine-tuning of a vision language model on WebSight, they were able to generate the code to reproduce a website component, given a screenshot.

You can explore the dataset here: HuggingFaceM4/WebSight

What do you think?

12 replies