license: openrail
datasets:
- teknium/OpenHermes-2.5
- wikimedia/wikipedia
library_name: transformers
mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast
This repository contains the mini-mistral-360M model, a 360 million parameter version of the Mistral architecture, trained for a single epoch. The model was trained on a diverse dataset comprising Wikipedia articles and the OpenHermes dataset. While this model is still in its early stages and not particularly useful as of now, it serves as an experimental showcase of integrating the Grokfast algorithm into the training process.
Model Details
- Architecture: Mistral
- Parameters: 360 million
- Training Duration: 1 epoch
- Training Dataset: Wikipedia articles and OpenHermes dataset
- Training Method: Grokfast-enhanced Transformers
- Training Hardware: 2 x Nvidia RTX 3060 12GB
Purpose
The primary goal of this experiment was to observe the impact of the Grokfast algorithm on the training dynamics of a 360M parameter Mistral model. During training, it was noted that the evaluation loss followed the training loss closely, which is an intriguing behavior warranting further investigation.
Usage
To use this model, you can load it with the transformers
library from HuggingFace:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("RoboApocalypse/mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast")
model = AutoModel.from_pretrained("RoboApocalypse/mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast")
# Example usage
input_text = "Hello, world!"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
Training Insights
This experiment was inspired by the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" by Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee, aims to accelerate the generalization of models under the grokking phenomenon. The paper is available at https://arxiv.org/abs/2405.20233
Acknowledgments
Special thanks to the YouTube channel Tunadorable for bringing the Grokfast paper to my attention in his video "Accelerated Training by Amplifying Slow Gradients". Tunadorable reads and discusses AI papers from arXiv, providing valuable insights into the latest research.
Disclaimer
This model is not optimized for practical use and should be considered experimental. It has only been trained for a single epoch, and its performance is not guaranteed to be reliable or accurate. Future iterations and more extensive training may improve its capabilities.
Contributing
If you are interested in discussing, contributing or have any suggestions, please reach out or open an issue on the repository.
License
This model is licensed under the OpenRAIL License.
Feel free to check out the model and experiment with it here. Your feedback and insights are welcome as I try and figure out wtf I'm doing.