DavidGF (David Golchinfar)

posted an update 30 days ago

Post

2992

🎉 Celebrating One Year of #SauerkrautLM with Two Groundbreaking Releases!

We're thrilled to announce the release of SauerkrautLM-v2-14b in two specialized versions: VAGOsolutions/SauerkrautLM-v2-14b-SFT and VAGOsolutions/SauerkrautLM-v2-14b-DPO. Built on the robust Qwen2.5-14B foundation, these models represent a significant leap forward in multilingual AI capabilities.

🔬 Technical Breakthroughs:
💠 Innovative three-phase Fine-Tuning approach
💠 Two-step Spectrum SFT + one-step Spectrum DPO optimization phase for enhanced performance
💠 Balance of German and English language capabilities
💠 Advanced function calling - almost on par with Claude-3.5-Sonnet-20240620

🇩🇪 German Language Excellence:
What sets this release apart is our unique achievement in simultaneously improving both German and English capabilities. Through our specialized training approach with over 1.2B tokens across two phases, we've managed to:
💠 Enhance German language understanding and generation (SFT Version > DPO Version)
💠 Maintain authentic German linguistic nuances
💠 Improve cross-lingual capabilities
💠 Preserve cultural context awareness

📊 Training Innovation:
Our three-phase approach targeted specific layer percentages (15%, 20% and 25%) with carefully curated datasets, including:
💠 Mathematics-focused content (proprietary classifier-selected)
💠 High-quality German training data
💠 Specialized function calling datasets
💠 Premium multilingual content

🎁 Community Contribution:
We're also releasing two new datasets in a few days:
1️⃣ SauerkrautLM-Fermented-GER-DPO: 3,300 high-quality German training samples
2️⃣ SauerkrautLM-Fermented-Irrelevance-GER-DPO: 2,000 specialized samples for optimized function call irrelevance handling

Thank you to our incredible community and partners who have supported us throughout this journey. Here's to another year of AI innovation! 🚀

reacted to MoritzLaurer's post with ❤️ 2 months ago

Post

4252

#phdone - I defended my PhD yesterday! A key lesson: it is amazing how open science and open source can empower beginners with limited resources:

I first learned about instruction-based classifiers like BERT-NLI 3-4 years ago, through the @HuggingFace ZeroShotClassificationPipeline. Digging deeper into this, it was surprisingly easy to find new datasets, newer base models, and reusable fine-tuning scripts on the HF Hub to create my own zeroshot models - although I didn't know much about fine-tuning at the time.

Thanks to the community effect of the Hub, my models were downloaded hundreds of thousands of times after a few months. Seeing my research being useful for people motivated me to improve and upload newer models. Leaving my contact details in the model cards led to academic cooperation and consulting contracts (and eventually my job at HF).

That's the power of open science & open source: learning, sharing, improving, collaborating.

I mean every word in my thesis acknowledgments (screenshot). I'm very grateful to my supervisors @vanatteveldt @CasAndreu @KasperWelbers for their guidance; to @profAndreaRenda and @CEPS_thinktank for enabling me to work part-time during the first year; to @huggingface for creating awesome tools and an awesome platform; and to many others who are not active on social media.

Links to the full thesis and the collection of my most recent models are below.

PS: If someone happens to speak Latin, let me know if my diploma contains some hidden Illuminati code or something :D

4 replies

·

reacted to flozi00's post with ❤️ 3 months ago

Post

1892

🌟 Progress in the German FineWeb edu reproduction 🌟

We're delighted to share the launch of our new Data Quality Classification Model, designed specifically for evaluating educational content in German. This tool uses advanced machine learning techniques to assess texts across all educational levels, from primary school to university.

🔍 Inspired by Huggingface's fine web edu dataset, we've worked hard to refine data classification methods ensuring educators and learners access top-quality resources.
We're excited about the future as we continue improving our models and expanding our datasets.

Access the model here: pL-Community/GermanEduScorer-Qwen2-1.5b

🙏 A huge thank you to David and Daryoush from Vago Solutions; Björn and Jan from Ellamind / DiscoResearch for their expert insights throughout this project. Your support has been crucial.
This project was made possible by the support of PrimeLine AI.

1 reply

·

reacted to alex-abb's post with 🔥 5 months ago

Post

4784

Hi everyone!
I'm Alex, I'm 16, I've been an internship at Hugging Face for a little over a week and I've already learned a lot about using and prompting LLM models. With @victor as tutor I've just finished a space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize hugging face posts.

alex-abb/LLM_Feeling_Analyzer

4 replies

·

reacted to singhsidhukuldeep's post with 🔥 6 months ago

Post

3182

Here is a thought, instead of telling LLMs what to do, show them! 🎭

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. 🗣️🌍

DITTO from Stanford University proposes that LLMs can be tuned with less than 10 samples! 🤯

What's DITTO? Demonstration ITerated Task Optimization (definitely came up with the acronym first! 😂)

Here is the step-by-step implementation: 🛠️

Initialization: Start with a reference language model (LM), a set of expert demonstrations, a sample size, and a frequency of sampling. 🏁

Supervised Fine-Tuning (SFT): Begin by fine-tuning the reference LM on the set of expert demonstrations to create an initial policy P0. 🎚️

Iterative Comparison Sampling: For each iteration t:
Sample multiple completions from the policy Pt for each demonstration to create a new dataset Dt.
Construct a batch of comparisons where the demonstrations are ranked higher than all sampled model outputs from the current and previous iterations. 🔄

Policy Update:
Update the policy Pt using a Direct Preference Optimization (DPO) algorithm, which incorporates feedback from the batch of comparisons.
Increment the iteration and repeat the sampling and updating process until convergence. ⏭️

Result: The final policy P after sufficient iterations aligns more closely with the expert demonstrations, effectively tuning the LM to reflect user-specific preferences and behaviors. 🎯

DITTO outperforms few-shot prompting. 🚀

Paper: Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (2406.00888) 📄

posted an update 6 months ago

Post

1457

Introducing Kraken-LoRA – a lightweight version of Kraken that uses LoRA-Adapters as Experts based on the base model.

@fernandofernandes , me, @Crystalcareai , @ehartford created the Kraken-LoRA!

🔍 What’s the big deal?

✅ Size Consistency: While Kraken’s size increases with more Experts, Kraken-LoRA remains as compact as the base model (e.g., 8b if you use Meta-Llama3-8b-Instruct).
✅ VRAM Efficiency: Kraken-LoRA is highly VRAM efficient, maintaining the power of all experts without the bloat.
✅ Dynamic Adaptation: LoRA adapters are applied dynamically at runtime, following the routing process.
✅ High Efficiency: Enjoy increased efficiency without compromising performance, as long as the LoRA adapters match the base model.

💡 Conclusion: Kraken-LoRA empowers businesses to experience enhanced flexibility and performance from our architecture, enabling further scalability without sacrificing performance.

Check out the model here: VAGOsolutions/Kraken-LoRA
Explore the code here: https://github.com/cognitivecomputations/kraken/tree/main/Kraken-LoRA

Have fun with Kraken-LoRA! 🐙

posted an update 7 months ago

Post

1567

The kraken has awakened!
A Game-Changer in LLM Flexibility and Performance!

Over the past few weeks, VAGO solutions teamed up with Cognitive Computations and HyperSpace to develop a groundbreaking architecture that redefines flexibility in combining different LLM into one model.

@fernandofernandes , me, @Crystalcareai , @ehartford created the Kraken!

What Can It Do? 🐙
✅ Versatile Architecture: Kraken allows the seamless combination of LLMs with varying sizes, quantizations, and model architectures. It currently supports quantizations in 4-bit, 8-bit, and AWQ, with more on the way. And it runs on Hugging Face Transformers 4.40+

✅ Kraken Router: Utilizing a custom sequence classification model with a context length of 32k tokens, The Kraken Router directs inputs to the most suitable Expert based on their characteristics.

✅ Adaptability: Enhanced input formatting supports the model’s adaptability to diverse conversational contexts.

✅ Extreme Versatility: Easily swap experts within Kraken for your specific use cases without retraining the entire model. For example, if you've built a Kraken for coding in Python you can upgrade your Python model without retraining the router or add a C# model by retraining the router.

✅ Open Source Pipeline: We’re sharing the entire pipeline, including router creation, training, architecture setup, and Kraken inference, on JupyterNotebooks: https://github.com/cognitivecomputations/kraken

Kraken marks the beginning of an exciting new journey in #OpenSource LLM. Why? Because it empowers the open source community in accelerating the catch-up process to proprietary LLMs like #GPT and #Claude 🤩

We proudly introduce the very first 2 Kraken models, that integrates top-tier LLM and Multilingual capabilities:
cognitivecomputations/Kraken
VAGOsolutions/Kraken-Multilingual
Right now it's supported by Hugging Face transformers library. Would love to see the integration into VLM and TGWI!

reacted to anakin87's post with ❤️ 7 months ago

Post

1278

Do you want to play a game against Llama 3? 🦙🦙🦙

Meet 🧑‍🏫 𝐀𝐮𝐭𝐨𝐐𝐮𝐢𝐳𝐳𝐞𝐫, a new LLM application that you can use for learning or just for fun.

Try it out on Hugging Face Spaces 🤗 deepset/autoquizzer

𝐇𝐨𝐰 𝐢𝐭 𝐰𝐨𝐫𝐤𝐬
You provide an URL -> A multiple-choice quiz is instantly generated.

🔹 You can play the quiz yourself.

🔹 You can let the LLM play in two different ways
📕 Closed book: the LLM responds only by knowing the general topic and using its parametric knowledge and reasoning abilities.
🔎🌐 Web RAG: for each question, a Google search is done and the top 3 snippets are included in the prompt for the LLM.

𝐒𝐭𝐚𝐜𝐤
🏗️ Haystack LLM framework https://haystack.deepset.ai/
🦙 Llama 3 8B Instruct
⚡ Groq

Original idea: @Tuana

1 reply

·

reacted to sosoai's post with 🤗 7 months ago

Post

2967

Wow i can post on HF now!
Love HF so much 🤗❤️

6 replies

·

posted an update 8 months ago

Post

1734

Please... feed this Llama some Sauerkraut! 🍲

Said and done. Here it is. Our Sauerkraut Version of the strong Llama3-8b by Meta. Released from HANNOVER MESSE, just in front of meta booth.
VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct

According to benchmarks (LM-Evaluation-Harness 0.4.2), our #SauerkrautLM Dataset and fine-tuning pipeline improved the Model noticeably (AVG = 74,57), especially Reasoning and Common Sense capabilities.

Again we provide some more detail on the whole process:
✅ Original model: Llama-3-8b-Instruct
✅ Training Duration: 12 hours
✅ Training procedure: 2-staged DPO
✅ Trained data: 70k (first stage) and 20k (second stage)
✅ GPU: 4x RTX6000 ADA
✅ New model: Llama-3-SauerkrautLM-8b-Instruct
✅ Total training costs: 54,72 Dollar 💴 - RunPod FTW (excluding synthesizing data, curating data, benchmarks, error handling, testing)

See our model card on Hugging Face for more details: VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct

There will be more details on benchmarks during the next days.

posted an update 8 months ago

Post

3044

"How expensive is it actually to teach a #LanguageModel German through #finetuning 💰💰💰? We get asked this quite often.

There is no one-size-fits-all answer to this question, as among other factors:
⏹ each fine-tuning is different,
⏹ the hardware used can be a major cost driver,
⏹ the amount and type of training data can extend the process,
⏹ and the skills to be trained can increase the difficulty of fine-tuning.

However, we have broken down the costs incurred for our latest fine-tune ( VAGOsolutions/SauerkrautLM-Qwen-32b)

Base model: Qwen/Qwen1.5-32B
Fine-Tuning Goal: Train German language
Training dataset size: 160,000 SFT data / 110,000 DPO data
Training duration: 72.5 hours (2 epochs SFT / 1 epoch DPO)
GPU: 2x A100 SXM
New model: VAGOsolutions/SauerkrautLM-Qwen-32b

Total cost: 312 euros 💶

These are quite reasonable training costs considering the model now speaks passable German (previously very broken). Depending on the use case and process requirements, this can even be a real alternative to the costly continuous pre-training of foreign language models.

reacted to macadeliccc's post with ❤️ 10 months ago

Post

Reducing perplexity in LLM's through layer selective rank reduction

Layer-Selective Rank Reduction (LASER) is a denoising method that improves reasoning by the strategic removal of higher-order components from weight matrices in the multi-layer perceptron (MLP) layers without the need for additional parameters or training data. This process leverages singular value decomposition to identify and eliminate these components. This simple, yet effective, method has shown to improve question-answering performance by up to 27.4 percentage points.

LaserRMT implements this through a process by calculating signal to noise ratio (SNR) for each layer and selectively reducing the rank of these layers.The SNR method meticulously computes the SNR by leveraging singular value decomposition (SVD) to separate the signal (higher-order components) from the noise (lower-order components) within the weight matrices of the model's layers. The SNR calculation is what determines which layers would benefit from rank reduction without compromising the models integrity.

If a layer is identified that could benefit from rank reduction, then the layer will enter an incremental process where the weight matrices are reduced and reconstructed by retaining only the singular values that surpass the threshold. In the case of laserRMT, the threshold is calculated by Marchenko-Pastur Law.

@staticmethod
    def marchenko_pastur_threshold(sigma, n, m):
        beta = n / m if n < m else m / n
        threshold = sigma * np.sqrt((1 + np.sqrt(beta))**2)
        return thr

The two primary benefits of applying this method are reducing computational overhead of large language models and simultaneously improving output quality.

Credit to @ehartford @fernandofernandes @DavidGF for laserRMT

Resources:
☄️ AutoLaser: https://colab.research.google.com/drive/11j0e-w6BfvqeFN1gUrpOqdW0vcKqfVqP?usp=sharing
laserRMT: https://github.com/cognitivecomputations/laserRMT
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2312.13558)

8 replies

·

David Golchinfar PRO

AI & ML interests

Recent Activity

Articles

SauerkrautLM's Multi-Phase Spectrum Training: A Technical Deep Dive

Organizations

DavidGF's activity