BramVanroy (Bram Vanroy)

reacted to davidberenstein1957's post with 🚀 3 months ago

Post

2991

🚀 We will be generating a preference dataset for DPO/ORPO and cleaning it with AI feedback during our upcoming meetup!

In this session, we'll walk you through the essentials of building a distilabel pipeline by exploring two key use cases: cleaning an existing dataset and generating a preference dataset for DPO/ORPO. You’ll also learn how to make the most of AI feedback, integrating Argilla to gather human feedback and improve the overall data quality.

This session is perfect for you
- if you’re getting started with distilabel or synthetic data
- if you want to learn how to use LLM inference endpoints for **free**
- if you want to discover new functionalities
- if you want to provide us with new feedback

Sign up here: https://lu.ma/dt0c7jru

posted an update 6 months ago

Post

1537

The InstructGPT paper mentions that they insert 10% pretraining data during SFT, which they find improves the effect of PPO (IIUC). Has anyone else done later ablations on this? I've only seen the inverse suggested, mixing in SFT data during pretraining.
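
To make the question concrete, here is a minimal sketch of what "inserting 10% pretraining data during SFT" could look like at the data level. It assumes the 10% is measured against the final mix, and the helper name `mix_in_pretraining` and the toy records are hypothetical. This is not the InstructGPT recipe itself, only one way to set up such an ablation.

```python
import random


def mix_in_pretraining(sft_examples, pretrain_examples, pretrain_fraction=0.1, seed=0):
    """Return a shuffled list in which about `pretrain_fraction` of the items
    are pretraining samples (hypothetical helper, for illustration only)."""
    if not 0 <= pretrain_fraction < 1:
        raise ValueError("pretrain_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_pre / (n_sft + n_pre) = pretrain_fraction for n_pre.
    n_pre = round(len(sft_examples) * pretrain_fraction / (1 - pretrain_fraction))
    # Never ask for more pretraining samples than are available.
    n_pre = min(n_pre, len(pretrain_examples))
    mixed = list(sft_examples) + rng.sample(list(pretrain_examples), n_pre)
    rng.shuffle(mixed)
    return mixed


# Toy data: 90 SFT records and a larger pool of pretraining documents.
sft = [{"source": "sft", "text": f"instruction {i}"} for i in range(90)]
pre = [{"source": "pretrain", "text": f"document {i}"} for i in range(1000)]

mixed = mix_in_pretraining(sft, pre, pretrain_fraction=0.1)
n_pretrain = sum(ex["source"] == "pretrain" for ex in mixed)
print(len(mixed), n_pretrain)
```

An ablation would then compare runs with `pretrain_fraction=0.0` against runs with a non-zero fraction, keeping everything else fixed.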

2 replies

·

reacted to lunarflu's post with ❤️ 6 months ago

Post

1898

cooking up something....anyone interested in a daily activity tracker for HF?

12 replies

·

posted an update 6 months ago

Post

2233

All my models seem to be plagued by infinite lists. When you ask a question that requires it to write a list, it most often keeps adding bullet points or enumeration. I am wondering whether this is a result of using chatty GPT-4 as DPO preferences. Any thoughts?

1 reply

·

replied to their post 7 months ago

Nice! In my experience preference tuning with the ultra feedback datasets does not really change benchmark scores (and sometimes even makes them worse) but it does seem to improve the real-world user experience when chatting with the model.

I'm also not sure whether ORPO on UF alone is better than SFT on UC + DPO on UF, especially if you're also trying to do language adaptation. That, or first continue pretraining the model and then do ORPO.

replied to their post 7 months ago

While the "rules" of OpenAI do get frustrating from time to time, I do not blame others who do not follow the same path as I do. If I am asked why my licenses are different from someone else's, I will answer according to what I've written in the post above (the rules suck and are vague, I understand why people do what they do, and I do what I do for other reasons). But I definitely do not want to go around pointing fingers pre-emptively in the hope that people will just use my models. Our community for Dutch is already quite small, so I'd rather we lift each other up and build on each other's work through friendly "competition" than compete in bad faith.

So I think that for my future models, I'll just make use of ultrachat+ultrafeedback, which should be cleared for training apache 2.0 models because they were created with Azure. This may negatively impact the model's performance (especially for code because it does not include the Stack Overflow set) but I hope the impact is limited.

replied to their post 7 months ago

What do you mean by compliance in this context? I'm not sure how I can market being non-commercial as a good thing 😅

replied to their post 7 months ago

Cool! Looking forward to what you'll build with this!

posted an update 7 months ago

Post

2272

🥳 New license for datasets: Apache 2.0!

I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:

- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?

Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models on apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.

Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through Azure docs, this allows me to change all recent GPT4-generated datasets to apache 2.0! 🥳

- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch

I will have to mull over what I'll do for the older GPT3.5 datasets. What do you think that I should do?

9 replies

·

replied to their post 8 months ago

I think the correct place for this is to make a new issue on their issue tracker: https://github.com/ScandEval/ScandEval/issues

posted an update 8 months ago

Post

2436

🎈 LLM Benchmarks Update!

**tl;dr: do not depend on benchmark leaderboards to choose your "chatbot" model! (Especially for non-English languages.)**

First of all, I'm discontinuing the Open #Dutch #LLM Leaderboard (https://lnkd.in/eFnsaFR6). It will stay online for now, but I urge the use of the ScandEval leaderboard instead (https://scandeval.com/dutch-nlg/) by @saattrupdan . It contains more tasks, has better reproducibility and statistics (CI) and a flexible back-end library (scandeval) to run your own benchmarks with. As part of project "Leesplank" (with Michiel Buisman and Maarten Lens-FitzGerald) we recently added GPT-4-1106-preview scores to add a good "target" to the leaderboard.

An important note here is that benchmark leaderboards are not the ground truth. Evaluating generative models is especially hard. You run into issues like prompt engineering (and the sensitivity of models to one prompt or another), structured output generation, and - quite simply - "how to automatically evaluate open-ended generation".

💡 Another important but under-discussed facet is the discrepancy between models' capability of understanding vs. generating *in different languages* (so the NLU part of NLG benchmarking). In other words: some of the listed models score really well on, e.g., MCQ benchmarks but are not suitable to use as DUTCH chat bots. Interestingly, some of these models seem to understand questions in Dutch and are able to pick the right answer (because they have good knowledge or reasoning skills), but generating fluent and grammatical Dutch is something else entirely! This is perhaps also true for humans: it's easier to sort-of grasp the meaning of a new language and answer with "Yes" or "No", but answering fluently in the language is much harder! Yet, your language production fluency does not necessarily say anything about your knowledge and reasoning skills.

Hopefully we can get a chat arena for Dutch some day - user feedback is the most powerful metric!

3 replies

·

replied to their post 8 months ago

Understandable. I'm especially attracted to the broad vocabulary, which can be of use for language adaptation.

replied to their post 8 months ago

What kind of weird results? In terms of loss, or really qualitative output?

posted an update 8 months ago

Post

2391

Does anyone have experience with finetuning Gemma? Even the 2B variant feels more memory heavy than mistral 7B. I know that its vocabulary is much larger (250k) but I'm a bit surprised that the max batch size that I can get in an A100 80GB is only 2 whereas I could fit 4 with mistral 7B - even though Gemma is much smaller except for the embedding layer. Both runs were using FA, same sequence length, same deepspeed zero 3 settings. Oh and yes I'm using the most recent hot fix of transformers that solves a memory issue with Gemma and others.

Any prior experience that you can share or suggestions to improve throughput?
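
One factor that may contribute (a guess, not a diagnosis) is the size of the output logits, which grows linearly with the vocabulary. The back-of-the-envelope sketch below assumes a vocabulary of 256,000 for Gemma (the "250k" above), 32,000 for Mistral 7B, fp32 logits, and a hypothetical sequence length of 8192. The function name `logits_gib` is made up for this example.

```python
def logits_gib(batch_size, seq_len, vocab_size, bytes_per_value=4):
    """Size in GiB of a single (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 1024**3


SEQ_LEN = 8192  # hypothetical; the actual sequence length is not stated above

# Assumed vocabulary sizes: 256,000 (Gemma) and 32,000 (Mistral 7B).
gemma = logits_gib(batch_size=2, seq_len=SEQ_LEN, vocab_size=256_000)
mistral = logits_gib(batch_size=4, seq_len=SEQ_LEN, vocab_size=32_000)

print(f"Gemma   (batch 2): {gemma:.2f} GiB per logits tensor")
print(f"Mistral (batch 4): {mistral:.2f} GiB per logits tensor")
print(f"Ratio: {gemma / mistral:.1f}x")
```

Under these assumptions, the logits alone cost several times more for Gemma at half the batch size, and training usually holds more than one tensor of that shape (gradients, upcast copies for the loss). Whether this fully explains the difference would need to be checked with a memory profiler.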

4 replies

·

replied to their post 8 months ago

Indeed, there is not a lot of metadata. There's also a discrepancy between the number of scores/languages and the number of paragraphs in the text. I've notified the authors about that. CulturaX is an attractive dataset, too!

posted an update 8 months ago

Post

1723

🖴 The HPLT monolingual dataset has a new home!

After being in touch with the HPLT folks, I've transferred the data to their org. That only makes sense. You can find it below.

HPLT/hplt_monolingual_v1_2

3 replies

·

posted an update 9 months ago

Post

🗄️ Massive data release on the HF Hub for 75 languages!

https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2

In December of last year, HPLT (https://hplt-project.org/) released version 1.2 of their dataset. It covers web-crawled data for 75 languages, in raw format as well as in deduplicated and cleaned sections. In total, we're talking about over 40TB of data! This data was already accessible via their website, but I figured the accessibility could be improved by integrating it with Hugging Face tooling. 🤗 So I added the dataset here to the Hugging Face hub, enabling direct use in your conventional training pipelines for LLMs or other language technologies. The data will automatically be downloaded and optimised with just one line of code:

load_dataset("BramVanroy/hplt_monolingual_v1_2", "nl_cleaned")

Let's use this big blob of data to build something awesome in our languages! 🥳

1 reply

·

replied to dvilasuero's post 9 months ago

Really cool, crowd-sourcing like this can be very powerful!

Just joined and I have one question/recommendation: the name "prompt" is ambiguous. I got the following conversation:

Marv is a chatbot that reluctantly answers questions with sarcastic responses:

You: How many pounds are in a kilogram?
Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.
You: What does HTML stand for?
Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.
You: When did the first airplane fly?
Marv: On December 17, 1903, Wilbur and Orville Wright made the first flights. I wish they’d come and take me away.
You: What is the meaning of life?
Marv: I’m not sure. I’ll ask my friend Google.
You: Why is the sky blue?

Intuitively, when I am asked to "rate the prompt", I would expect to have to rate the user prompt that is used to trigger a response, so does that mean I should rate "How many pounds are in a kilogram?" Or do I have to rate all the responses of "Marv"? Or do I have to rate the whole conversation? The guidelines are also not entirely clear to me, because the example that I get is a conversation, so there is a lot that could potentially be rated (one user prompt, all user prompts, the whole conversation, etc.)

Hope you see my confusion. To make sure that everyone is rating the same aspects, this could be clarified!

posted an update 10 months ago

Post

📣 DPO Dutch model release + datasets

After teasing for a while, I am finally releasing **GEITje 7B Ultra**, building upon the great GEITje 7B by @Rijgersberg . New contributions include: large new datasets for SFT (instruction/chat), two datasets for DPO training (i.e. RLAIF), and an SFT and DPO version of GEITje. The READMEs describe everything well (I hope), and I'll also share more info on social media tomorrow.

For me this is a huge release, the datasets more so than the models. I'm especially pleased with UltraChat, which I created with the intent of having a diverse dataset - the model must be able to communicate with different types of users. So the user questions are created as if they were written by different personas, e.g. language learners, young children, experts, critics, etc. The focus with this is "building a good communication bot that is accessible and can handle different kinds of user input".

I wish I could find the time to also write a paper to get some "academic recognition" but that'll have to wait for now. I just want to bring it to the public so that others can play with it and use it to build new, cool stuff!

I hope that you can all appreciate the work. Let's build some cool stuff with it!

Models:
- Demo: https://huggingface.co/spaces/BramVanroy/GEITje-7B-ultra
- DPO Model: BramVanroy/GEITje-7B-ultra
- SFT model (not recommended): BramVanroy/GEITje-7B-ultra-sft

Datasets with GPT-4 turbo completions:
- No robots (~10k instructions): BramVanroy/no_robots_dutch
- UltraChat (~200k instructions): BramVanroy/ultrachat_200k_dutch
- UltraFeedback (DPO with GPT4+GEITje chat, ~50k): BramVanroy/ultra_feedback_dutch
- Orca DPO Pairs (DPO with GPT4+GEITje chat, ~10k): BramVanroy/orca_dpo_pairs_dutch

3 replies

·

replied to their post 10 months ago

From my limited experience, potential unwanted text properties of chosen vs. rejected are another thing to investigate. I think the model may learn, for instance, that longer sequences are better and will therefore learn to generate longer sequences regardless of the quality of the content. You can catch this in the metrics, I believe, but only in the log probs (which would then also be very low for the chosen text, perhaps even lower than for the rejected text). You'll likely not notice this in the rewards metrics.
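
A quick way to check for such a length bias before training is to compare the lengths of chosen and rejected responses directly. The sketch below is only an illustration: it assumes each preference pair is a dict with `"chosen"` and `"rejected"` text fields, uses whitespace-separated word counts as a rough proxy for token counts, and the helper name `length_bias_report` is hypothetical.

```python
from statistics import mean


def length_bias_report(pairs):
    """Summarise how often, and by how much, chosen texts are longer than rejected ones."""
    chosen_lens = [len(p["chosen"].split()) for p in pairs]
    rejected_lens = [len(p["rejected"].split()) for p in pairs]
    longer = sum(c > r for c, r in zip(chosen_lens, rejected_lens))
    return {
        "mean_chosen_len": mean(chosen_lens),
        "mean_rejected_len": mean(rejected_lens),
        "chosen_longer_fraction": longer / len(pairs),
    }


# Toy preference pairs; a real check would iterate over the full DPO dataset.
pairs = [
    {"chosen": "a b c d e f", "rejected": "a b"},
    {"chosen": "a b c d", "rejected": "a b c"},
    {"chosen": "a", "rejected": "a b c"},
    {"chosen": "a b c d e", "rejected": "a b c d"},
]

report = length_bias_report(pairs)
print(report)
```

If `chosen_longer_fraction` is far above 0.5, length may be acting as a shortcut for the preference signal, and it could be worth filtering or rebalancing the pairs.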
