Abstract
The recent breakthrough successes in machine learning are mainly attributed to scale: namely large-scale attention-based architectures and datasets of unprecedented scale. This paper investigates the impact of training at scale for chess. Unlike traditional chess engines that rely on complex heuristics, explicit search, or a combination of both, we train a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games. We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search algorithms. We also show that our model outperforms AlphaZero's policy and value networks (without MCTS) and GPT-3.5-turbo-instruct. A systematic investigation of model and dataset size shows that strong chess performance only arises at sufficient scale. To validate our results, we perform an extensive series of ablations of design choices and hyperparameters.
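To make the data-generation step concrete, here is a minimal sketch of how one could annotate a board with per-move action-values using python-chess and a local Stockfish binary. The engine path, the short per-move time limit, and the raw centipawn scores are assumptions for illustration; the paper's actual pipeline (e.g., converting evaluations to win probabilities and binning them into classification targets) may differ.

```python
import chess
import chess.engine

# Hypothetical path to a local Stockfish binary; adjust for your system.
ENGINE_PATH = "stockfish"

def annotate_action_values(board, engine, time_limit=0.05):
    """Map each legal move (UCI) to Stockfish's evaluation of the position
    reached after playing it, from the perspective of the player moving."""
    values = {}
    for move in board.legal_moves:
        board.push(move)
        info = engine.analyse(board, chess.engine.Limit(time=time_limit))
        # The score is reported relative to the side to move in the child
        # position, so negate it to value the move for the player who made it.
        score = info["score"].relative
        values[move.uci()] = -score.score(mate_score=100000)
        board.pop()
    return values

if __name__ == "__main__":
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    board = chess.Board()  # starting position
    print(annotate_action_values(board, engine))
    engine.quit()
```

Annotating every legal move of every board this way is what turns 10 million games into the roughly 15 billion state-action data points mentioned in the abstract.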
Community
This tells me that our current architectures are way more powerful than we give them credit for, and that they have a lot of untapped potential that can be unlocked with "smarter, more complex" data. I.e., imagine we had the Internet Archive data from an advanced alien civilization and trained our current models on it; they would be orders of magnitude better, but with the same architecture.
yeah, that "untapped potential" = thousands of GPUs = $$$
Sure, you can throw GPUs at it, but that's not what I mean. I'm talking about current architectures and methodologies, but trained on (non-existent) ultra-high-quality data.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Checkmating One, by Using Many: Combining Mixture of Experts with MCTS to Improve in Chess (2024)
- Fast and Knowledge-Free Deep Learning for General Game Playing (Student Abstract) (2023)
- AI capabilities can be significantly improved without expensive retraining (2023)
- The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2023)
- PokerGPT: An End-to-End Lightweight Solver for Multi-Player Texas Hold'em via Large Language Model (2024)
" We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine"
As I understand it, the hard part of learning the state-action values was outsourced to a specialist engine (which, by the way, uses search to create those predictions).
While I think it's an interesting experiment, I don't immediately see what new insights it gives us.
This paper contains almost no novelty. To imply the method does not use search is disingenuous: it is trained on Stockfish search targets, which not only use search but also incorporate many human-developed heuristics. This method therefore simply performs one stage of expert iteration, except using a highly fine-tuned, sophisticated expert that is not even fully AI. Presumably their only contribution is the neural network architecture, which does not seem considerably novel or different from those used in Leela Zero. Would recommend skipping this paper.
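For contrast with the expert-iteration framing above, here is a minimal sketch of what search-free play looks like at inference time: the trained model scores every legal move and the argmax is played, with no tree search. `predict_action_values` is a hypothetical stand-in for the trained transformer; it is not code from the paper.

```python
import chess

def select_move(board, predict_action_values):
    """Greedy, search-free move selection: score every legal move with the
    model's predicted action-value and play the best one.

    `predict_action_values` is a hypothetical callable that takes a FEN
    string and a list of UCI moves and returns one predicted value per move.
    """
    moves = [m.uci() for m in board.legal_moves]
    values = predict_action_values(board.fen(), moves)
    best_move, _ = max(zip(moves, values), key=lambda mv: mv[1])
    return chess.Move.from_uci(best_move)
```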
@eelang
I think these days combining human-annotated datasets with pseudo-labelled datasets and feeding them to large models (or even self-training) just works well. It raises the question of why we haven't tried this before. This pattern helped develop many foundation models in other domains, like OWLv2 or SAM. I guess this is just yet another adaptation of the same recipe.
So the trend of 2024 seems to be: 1. find a scalable architecture; 2. scale it not only with GPUs (which was last year's trend) but also with data coverage through manually labelled + pseudo-labelled datasets; 3. (optionally) distill/quantize.
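As a rough illustration of the "manually labelled + pseudo-labelled" part of that recipe, here is a minimal, framework-agnostic sketch of mixing human annotations with confidence-filtered teacher pseudo-labels. The `teacher` callable, the confidence threshold, and the per-example weights are all assumptions for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Example:
    inputs: str          # e.g. a board encoded as FEN, an image path, ...
    label: int           # class index (ground truth or pseudo-label)
    weight: float = 1.0  # optionally down-weight pseudo-labelled examples

def build_training_set(
    labelled: List[Tuple[str, int]],
    unlabelled: List[str],
    teacher: Callable[[str], Tuple[int, float]],  # returns (label, confidence)
    confidence_threshold: float = 0.9,
    pseudo_weight: float = 0.5,
) -> List[Example]:
    """Mix human-annotated examples with teacher-generated pseudo-labels,
    keeping only pseudo-labels the teacher is confident about."""
    data = [Example(x, y) for x, y in labelled]
    for x in unlabelled:
        label, confidence = teacher(x)
        if confidence >= confidence_threshold:
            data.append(Example(x, label, weight=pseudo_weight))
    return data
```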
@theswifter01
A lot of GPUs only gets you so far; data coverage is the 👑