Victor Sanh PRO

VictorSanh

AI & ML interests

None yet

Recent Activity

New activity about 2 months ago

shuaishuaicdp/GUI-World:keyframes in `android.jsonl`?

liked a dataset about 2 months ago

agent-studio/GroundUI-18K

Articles

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Apr 15

• 166

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Mar 15

• 6

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 27

Organizations

VictorSanh's activity

Reacted to Abhaykoul's post with 🔥 3 months ago

Introducing HelpingAI2-9B, an emotionally intelligent LLM.
Model Link: OEvortex/HelpingAI2-9B
Demo Link: Abhaykoul/HelpingAI2

This model is part of the innovative HelpingAI series and it stands out for its ability to engage users with emotional understanding.

Key Features:
-----------------

* It scores 95.89 on EQ Bench, higher than all top-notch LLMs, reflecting advanced emotional recognition.
* It gives responses in an empathetic and supportive manner.

Must try our demo: Abhaykoul/HelpingAI2

Reacted to joylarkin's post with 🚀🔥 3 months ago

Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚

This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, unique dataset. Perfect for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the large Fineweb dataset and includes:

- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- Count column indicating duplication frequency
- Includes data from 95 Common Crawl crawls (2013-2024)
- Rows have been reduced from 1.279B to 0.324B after deduplication
- It comprises ~375B tokens (down from 1,320B in Fineweb-Edu)
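
As a rough illustration of the first and third bullets, exact-match deduplication with a duplication count could be sketched as follows. This is a minimal sketch, not the pipeline used to build the dataset: the `text` and `count` field names and the sample rows are assumptions made for the example. The final line works out the share of rows removed from the row counts quoted above.

```python
from collections import Counter


def dedupe_exact(rows):
    """Keep the first occurrence of each exact text and record how often it appeared.

    The "text" and "count" keys are hypothetical field names for this sketch.
    """
    counts = Counter(row["text"] for row in rows)
    seen = set()
    unique = []
    for row in rows:
        text = row["text"]
        if text in seen:
            continue  # exact duplicate of a row we already kept
        seen.add(text)
        unique.append({**row, "count": counts[text]})
    return unique


# Made-up sample rows, only to exercise the function.
rows = [
    {"text": "Photosynthesis converts light into chemical energy."},
    {"text": "The French Revolution began in 1789."},
    {"text": "Photosynthesis converts light into chemical energy."},
]
deduped = dedupe_exact(rows)

# Fraction of rows removed, from the figures in the post (1.279B -> 0.324B).
rows_removed = 1 - 0.324 / 1.279
```

With the figures from the post, `rows_removed` comes out to roughly 0.75, so about three quarters of the rows were exact duplicates.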

Access the entire Fineweb-Edu-Fortified dataset on Hugging Face → airtrain-ai/fineweb-edu-fortified

Try a semantic search demo via this Hugging Face Space → airtrain-ai/fineweb-edu-fortified-search-demo

Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗

Reacted to dvilasuero's post with 🤗❤️🚀🔥 5 months ago

Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!

28 replies

Reacted to lunarflu's post with ❤️ 6 months ago

cooking up something....anyone interested in a daily activity tracker for HF?

12 replies

Reacted to fdaudens's post with 🤗 6 months ago

How can AI help us write better headlines and reach more people?

I experimented with a new approach that is both useful and fun. It can help you overcome writer’s block, find better headlines, and make your blog posts and news articles climb in search engine results. Plus, we will learn new concepts along the way!

1️⃣ First, I scraped all the blog posts written on Hugging Face to create a dataset with the headlines, texts, dates, and authors' names.

2️⃣ I filtered the dataset to remove posts that were too long and would require a model with a longer context window. This was done to keep the project simple and cost-effective (actually, free).

3️⃣ Then, I used a dataset generation workflow built by @davanstrien to generate a DPO dataset.

4️⃣ As a last step, you can collectively rate these evaluations to improve the quality of the dataset using an easy-to-use interface with Argilla. Take a look at it and rate some of them! This way, you can contribute to making this dataset useful for different newsrooms that could use it as a starting point.
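
The length filter in step 2 could look something like the sketch below. It is only an illustration under stated assumptions: the `headline` and `text` field names, the whitespace-based token estimate, and the `max_tokens` threshold are all hypothetical, and the post does not say how length was actually measured.

```python
def approx_token_count(text):
    """Estimate length by whitespace-separated words (a crude stand-in for a tokenizer)."""
    return len(text.split())


def filter_short_posts(posts, max_tokens):
    """Keep only posts whose estimated length fits within max_tokens."""
    return [post for post in posts if approx_token_count(post["text"]) <= max_tokens]


# Made-up posts, only to exercise the filter.
posts = [
    {"headline": "Short post", "text": "word " * 50},
    {"headline": "Long post", "text": "word " * 5000},
]
kept = filter_short_posts(posts, max_tokens=2000)
```

In practice the count would come from the tokenizer of the model used for generation, since word counts and token counts can differ noticeably.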

𝐖𝐡𝐲 𝐢𝐭 𝐦𝐚𝐭𝐭𝐞𝐫𝐬. This example is compelling because, if you look at the dataset, you can see some examples where the headlines are enhanced by the addition of an important keyword or an action verb.
These tweaks can have a big impact on your position in search engines and, therefore, on your traffic. It’s also good leverage for our creativity since you can compare the initial idea with another one from an outside perspective.

Imagine if you’re a large news organization; you could run this experiment with thousands of news articles.

With a dataset of several hundred to thousands of entries, you could fine-tune a model to suggest headlines better tailored to your needs and writing style.

👉 Take a look at it and rate the headlines fdaudens/journalism-argilla-space
👉 Daniel's code https://github.com/huggingface/data-is-better-together/blob/main/dpo/README.md

1 reply

Reacted to albertvillanova's post with 😎👍 7 months ago

Recently, the Hugging Face 🤗 datasets team met with the Language Technologies team led by Marta Villegas ( @mvillegas ) at Barcelona Supercomputing Center @BSC-LT . We are eager to collaborate to promote AI across the Catalan, Spanish, Basque, and Galician languages and to share open-source datasets and models. 🤝 #AI #LanguageTech #OpenSource

1 reply

Reacted to HugoLaurencon's post with 🚀❤️ 7 months ago

We release Idefics2-chatty, the chatbot-optimized version of Idefics2: HuggingFaceM4/idefics2-8b-chatty

Idefics2-chatty is better at following instructions and at Chain-of-Thought reasoning.

We are also releasing a paper with many findings on how to build an efficient and performant Vision-Language Model: What matters when building vision-language models? (2405.02246)

How are you going to use the model, or what data are you going to fine-tune it on?

5 replies

posted an update 7 months ago

💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!

It is a very efficient (8B parameters) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖Paper dissecting a lot of the experimental insights we learned building Idefics2
- 🏎️TGI integration for blazing-fast inference (you can already run it locally with < 24GB GPU memory)
- 🏆 Ranking 2nd in its category (< 10B, open weights) in the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena

Resources:
⏯️Playground: HuggingFaceM4/idefics2_playground
📖Paper: What matters when building vision-language models? (2405.02246)
🏋️‍♂️Model and red-teaming analysis: HuggingFaceM4/idefics2-8b-chatty
👀Resources to get started: HuggingFaceM4/idefics2-8b-chatty
🏆Open VLM Leaderboard: opencompass/open_vlm_leaderboard
🏟️Vision arena: WildVision/vision-arena

1 reply

Reacted to jeffboudier's post with 🚀 7 months ago

TGI v2.0.2 is out!
- New models (idefics2, phi3)
- Cleaner VLM support in the openai layer
- Upgraded to pytorch 2.3.0

https://github.com/huggingface/text-generation-inference/releases/tag/v2.0.2

Kudos @Narsil @olivierdehaene @drbh and so many contributors!

Reacted to Pclanglais's post with 🔥 7 months ago

Announcing that we are on our way to solving a long-standing issue in document processing: the correction of OCR mistakes. Pleias is publishing the largest dataset to date with automated OCR correction: 1 billion words in English, French, German and Italian.

OCR quality is a long-standing issue in digitization. Cultural heritage texts are especially affected, because the primary sources are old documents (with many artifacts, blots, and degradation) and because OCR technology is limited for historical scripts. When we released Common Corpus, a 500-billion-word corpus in the public domain, this was the primary criticism.

This recent breakthrough in post-OCR correction was made possible by progress in open LLM research, several months of dedicated training and alignment by Pleias, and HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

Announcement: https://huggingface.co/blog/Pclanglais/post-ocr-correction

Post-OCR-Correction dataset: https://huggingface.co/datasets/PleIAs/Post-OCR-Correction

Reacted to victor's post with 🤗 7 months ago

The hype is real: a mysterious gpt2-chatbot model has appeared on the LLM Arena Leaderboard 👀.
It seems to be at least on par with the top performing models (closed and open).

To try it out: https://chat.lmsys.org/ -> then click on the Direct Chat tab and select gpt2-chatbot.

Place your bets: what do you think it is?