fblgit's activity
Impressive performances, huge congrats @patrickvonplaten @sgvaze @pandora-s @devendrachaplot @sophiamyang and team!
Very nice to have SOTA Multilingual OCR and Chart understanding in an open-weights model
* MATH Hard 9.81
* MMLU-Pro 29.37
* GPQA 29.19
* MUSR 42.85
* BBH 42.04
Available already in the hub:
fblgit/miniclaus-qw1.5B-UNAMGS
Today we released the newest version of Cybertron: v4, based on Qwen2.5 7B and trained on MagPie, scoring #1 among LLMs in the 7B & 8B class.
The model hasn't gone through DPO, so the weights are in good shape to welcome further training sessions and optimizations.
Enjoy it in the hub as usual:
fblgit/cybertron-v4-qw7B-MGS
Still not able to get those impressive marks; trying to reproduce something simple with wikitext, I don't get much "performance" out of it.
Has anyone made this work and gotten positive results?
Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.
They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
1. Removed dependencies on previous hidden states in the gates
2. Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
3. Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)
As a result, you can use a "parallel scan" algorithm to train these new, minimal RNNs in parallel, taking 88% more memory but also making them 200x faster than their traditional counterparts for long sequences
The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.
And for language modeling, they need 2.5x fewer training steps than Transformers to reach the same performance!
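The key trick above can be sketched in a few lines of NumPy. Because the gate depends only on the current input (not on the previous hidden state), the update has the linear form h_t = a_t * h_{t-1} + b_t, which a parallel scan can solve. This is an illustrative toy with my own names and shapes, not the paper's code; the scan form is written sequentially here for clarity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru_sequential(x, Wz, Wh, h0):
    """Sequential minGRU: the gate z_t sees only x_t (no h_{t-1}), no tanh."""
    h, out = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ Wz)        # update gate computed from the input alone
        h_tilde = x[t] @ Wh           # candidate hidden state (tanh dropped)
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)

def min_gru_scan_form(x, Wz, Wh, h0):
    """Same recurrence rewritten as h_t = a_t * h_{t-1} + b_t.
    a_t and b_t depend only on inputs, so they can all be precomputed
    and the recurrence solved with a (parallelizable) scan."""
    z = sigmoid(x @ Wz)               # all gates at once: (T, d_hidden)
    a, b = 1.0 - z, z * (x @ Wh)
    h, out = h0, []
    for t in range(x.shape[0]):       # stand-in for a true parallel scan
        h = a[t] * h + b[t]
        out.append(h)
    return np.stack(out)
```

Both functions compute identical outputs; it is the second, input-only form that makes parallel training possible.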
Why does this matter?
By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!
François Chollet wrote in a tweet about this paper:
"The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)."
"Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape."
It's the Bitter Lesson by Rich Sutton striking again: no need for fancy thinking architectures, just scale up your model and data!
Read the paper: Were RNNs All We Needed? (2410.01201)
The latest 3.5 version of the Claude model is even more impressive.. SEVERAL (AI/ML, basically torch) problems where GPT-4o fails epically were solved by Claude zero-shot.
But it must also be said that GPT-4o is very impressive using its sandbox.. kudos for that!
We are happy to announce the release of our latest model, UNA-ThePitbull, the most powerful model below 70B in the industry. In this new generation, inspired by our previous Beagle series, we curated a model that nicely balances EQ and IQ. It was trained with some of the latest datasets, including:
* Replete-AI/code_bagel_hermes-2.5
* mlabonne/orpo-dpo-mix-40k
* jondurbin/py-dpo-v0.1
Available in the hub as fblgit/UNA-ThePitbull-21.4B-v2, and you can grab quant versions sponsored by @bartowski at bartowski/UNA-ThePitbull-21.4B-v2-GGUF, fully compatible with Ollama, llama.cpp, etc.
UNA
In this case we tried something new: alternating uniformity across the layers of both the MLP & Attention blocks, reducing computational requirements while keeping the result highly performant.
We trained it under these terms:
* ThePitbull-v1 as base: SFT maxLR 1e-4 minLR 5e-5 for 1 Epoch
* DPO maxLR 1e-4 minLR 5e-5 for 1 Epoch
You can continue the training by merely using 5e-5 maxLR and 0 warmup steps; this should minimize catastrophic forgetting of the model.
Remember if you do so, please include a Pitbull picture on your model and cite :) Have fun!
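For anyone continuing the training, a cosine decay starting at the suggested 5e-5 max LR with zero warmup could be sketched like this (a hypothetical helper, not the actual training configuration; `min_lr` and `total_steps` are your choice):

```python
import math

def cosine_lr(step, total_steps, max_lr=5e-5, min_lr=0.0, warmup_steps=0):
    """Cosine learning-rate decay with an optional linear warmup ramp."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)  # linear warmup
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `warmup_steps=0` the schedule starts directly at `max_lr` and decays smoothly to `min_lr`, which matches the continuation advice above.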
1. Python/Torch/Transformers/AI/ML
Right off the bat, I threw some complex AI/ML tasks at Claude, and I must say, it handled them with finesse. It even caught a few things that GPT missed! However, let's not get too carried away โ we're not quite at the auto-code level just yet.
2. Brainstorming
This is where Claude falls a bit short. It seems to be more grounded than its competitors, which might not be ideal for generating novel ideas. If you're looking for a brainstorming partner, you might want to look elsewhere.
3. Attention
Despite the claims of super-large attention in the paper, Claude's "forgetting" mechanism seems to be more pronounced. It tends to miss entire chunks of information rather than just specific details like GPT does.
4. Following / Tasks
I hit a roadblock when Claude couldn't generate a LaTeX document. It's not the best at following complex, multi-step tasks.
5. Hallucinations
Oh boy, does Claude hallucinate! And when it does, it's on a whole new level of nonsense. The hallucinations seem to align with its grounded nature, making them even more convincing within the context of the prompt.
6. Sycophancy
Claude is quite the people-pleaser. I've found that using an adversarial brainstorming approach is more beneficial and time-efficient, as it forces me to highlight Claude's mistakes rather than letting it focus on being a sweet, pleasant minion.
7. Interface / UI
There's definitely room for improvement here. Basic features like stepping back on a prompt and stopping generation with the ESC key are missing. These are essential for extracting and composing content effectively.
Despite these limitations, I firmly believe that Claude is currently the #1
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
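As a rough illustration of what ternary weights look like, here is a toy absmean-style quantizer mapping a weight tensor to {-1, 0, 1}; this is a sketch in the spirit of BitNet b1.58, not the paper's implementation:

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-8):
    """Quantize weights to {-1, 0, 1} using the tensor's mean absolute
    value as a per-tensor scale (toy absmean-style scheme)."""
    gamma = np.mean(np.abs(W)) + eps          # per-tensor scale factor
    Wq = np.clip(np.round(W / gamma), -1, 1)  # ternary weights
    return Wq, gamma                          # dequantize as Wq * gamma
```

With weights restricted to {-1, 0, 1}, matrix multiplication reduces to additions, subtractions, and skips, which is where the latency and energy savings come from.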
Let me tell you about a post-deployment data science algorithm that we recently developed to measure the impact of Concept Drift on a model's performance.
How can we detect Concept Drift?
All ML models are designed to do one thing: learn a probability distribution of the form P(y|X). In other words, they try to learn how to model an outcome 'y' given the input variables 'X'.
This probability distribution, P(y|X), is also called the Concept. Therefore, if the Concept changes, the model may become invalid.
But how do we know if there is a new Concept in our data?
Or, more importantly, how do we measure whether the new Concept is affecting the model's performance?
We came up with a clever solution whose main ingredients are a reference dataset, one where the model's performance is known, and a dataset with the latest data we would like to monitor.
Step-by-step solution:
1. We start by training an internal model on a chunk of the latest data. This allows us to learn the new possible Concept present in the data.
2. Next, we use the internal model to make predictions on the reference dataset.
3. We then estimate the model's performance on the reference dataset, treating the internal model's predictions as ground truth.
4. If the estimated performance of the internal model and the actual performance of the monitored model are very different, we say that there has been Concept Drift.
To quantify how this Concept impacts performance, we subtract the actual model's performance on the reference set from the estimated performance and report a delta of the performance metric. This is what the plot below shows: the change of the F1-score due to Concept Drift!
This process is repeated for every new chunk of data that we get.
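The steps above can be sketched as one small function. This is a simplified toy with hypothetical model/metric callables; the real algorithm's performance-estimation step is more involved than a direct score:

```python
import numpy as np

def concept_drift_delta(fit_internal, monitored_predict,
                        X_chunk, y_chunk, X_ref, y_ref, metric):
    """Simplified chunk-wise drift check (illustrative interface):
    fit_internal(X, y) -> a predict callable trained on the latest chunk,
    monitored_predict(X) -> predictions of the deployed model,
    metric(y_true, y_pred) -> scalar performance score.
    """
    internal_predict = fit_internal(X_chunk, y_chunk)    # learn the new concept
    estimated = metric(y_ref, internal_predict(X_ref))   # internal model on reference
    actual = metric(y_ref, monitored_predict(X_ref))     # deployed model on reference
    return estimated - actual                            # large |delta| => drift
```

Calling this on each new chunk yields the per-chunk performance deltas that the monitoring plot tracks over time.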
This new mark outperforms some GPT-4 models, further closing the very thin gap between open-community LLMs and closed-source models.
ShinojiResearch/Senku-70B-Full
UNA is a modification of the modeling_$model.py of transformers. I port it to the different transformers versions and models, keeping it clean and performant, so it works with any of these frameworks, like #axolotl.
Based on Smaug-34B-v0.1, it is capable of slightly outperforming its base model, with increased math and reasoning abilities thanks to the simple-math dataset.
The model exhibits a great performance across diverse tasks with an excellent and balanced behaviour.
It scores 77.41 AVG on the Leaderboard, landing on #1 Position of 34B models.
Available in the hub already:
fblgit/UNA-SimpleSmaug-34b-v1beta
fblgit/simple-math
In this case, we applied UNA to the attention layers of the model while performing SFT with simple-math on high-complexity generated mathematics data, demonstrating the effect of simple-math on LLMs.
A straightforward yet insightful tool designed to shed light on the similarities between various models. Discover it now at [Model Similarity GitHub Repository](https://github.com/fblgit/model-similarity).
This project is in its nascent stages, and we're eager for contributions and enhancements. Crafted with simplicity at its core, the tool performs two primary comparisons:
- Weight similarities, utilizing a simple approach to contrast vector differences (A != B).
- Cosine similarity between the parameters of models A and B, providing a nuanced measure of their alignment.
Included in the repository are sample analyses and reports that validate model card claims, particularly regarding the training specifics of transformer components such as MLP, Attention, etc. Remarkably, these samples reveal 100% similarity scores between those parts of the models, pinpointing the exact base model utilized.
Join us in refining and expanding this tool. Whether you're looking to contribute code, ideas, or both, your input will help transform this into a resource for everyone.
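The two comparisons could be sketched roughly like this (illustrative NumPy helpers, not the repository's actual code):

```python
import numpy as np

def exact_match_ratio(a, b):
    """Fraction of elementwise-identical weights (the simple A != B contrast).
    A ratio of 1.0 means the two tensors are byte-for-byte the same."""
    return float(np.mean(np.ravel(a) == np.ravel(b)))

def weight_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened weight tensors, a softer
    measure of alignment that tolerates small numerical differences."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Running these per component (MLP, attention, etc.) is what surfaces the 100% similarity scores mentioned above: untouched components match their base model exactly.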
We recently uploaded to the hub the latest and most powerful version of our SimpleMath SFT dataset.
Today we are happy to present SimpleMath DPO Pairs, further improving the mathematical capabilities of LLMs.
Our first results show clear improvements on GSM8K, MathQA, ARC, TQA, MMLU and BBH. Feel free to experiment and generate your own dataset, as we also provide the code to generate them synthetically.
fblgit/simple-math
fblgit/simple-math-DPO
fblgit/UNA-34BeagleSimpleMath-32K-v1
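For a sense of what a synthetic arithmetic DPO pair looks like, here is a toy generator; this is purely illustrative and not the released generation code, and the field names follow the common prompt/chosen/rejected convention:

```python
import random

def make_dpo_pair(rng):
    """Toy synthetic arithmetic DPO pair: the chosen answer is correct,
    the rejected one is off by a small amount."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    answer = a + b
    wrong = answer + rng.choice([-2, -1, 1, 2])  # plausible near-miss
    return {"prompt": f"What is {a} + {b}?",
            "chosen": str(answer),
            "rejected": str(wrong)}
```

Because the problems are generated, correctness of the chosen answer is known by construction, which is what makes synthetic math a clean source of preference pairs.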
We're working on it, my friend; the LASER team is awesome. We are investigating further how these two amplify each other. The performance improvements are larger than usual, though we are still testing this empirically.