---
license: openrail
datasets:
  - teknium/OpenHermes-2.5
  - wikimedia/wikipedia
library_name: transformers
---

# mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast

This repository contains **mini-mistral-360M**, a 360-million-parameter model based on the Mistral architecture, trained for a single epoch on a dataset combining Wikipedia articles and the OpenHermes dataset. The model is still in its early stages and not yet practically useful; it serves as an experimental showcase of integrating the Grokfast algorithm into the training process.

## Model Details

- **Architecture**: Mistral
- **Parameters**: 360 million
- **Training Duration**: 1 epoch
- **Training Dataset**: Wikipedia articles and the OpenHermes dataset
- **Training Method**: Grokfast-enhanced Transformers
- **Training Hardware**: 2 x Nvidia RTX 3060 12GB

## Purpose

The primary goal of this experiment was to observe how the Grokfast algorithm affects the training dynamics of a 360M-parameter Mistral model. During training, the evaluation loss tracked the training loss closely, which is intriguing behavior that warrants further investigation.

## Usage

You can load this model with the `transformers` library from HuggingFace:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RoboApocalypse/mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Example usage: generate a short continuation
input_text = "Hello, world!"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Insights

This experiment was inspired by the paper ["Grokfast: Accelerated Grokking by Amplifying Slow Gradients"](https://arxiv.org/abs/2405.20233) by Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee, which aims to accelerate the generalization of models under the grokking phenomenon.

## Acknowledgments

Special thanks to the YouTube channel [Tunadorable](https://youtube.com/@tunadorable) for bringing the Grokfast paper to my attention in the video ["Accelerated Training by Amplifying Slow Gradients"](https://youtu.be/__xQw60y200). Tunadorable reads and discusses AI papers from arXiv, providing valuable insights into the latest research.

## Disclaimer

This model is not optimized for practical use and should be considered experimental. It has been trained for only a single epoch, and its performance is not guaranteed to be reliable or accurate. Future iterations and more extensive training may improve its capabilities.

## Contributing

If you are interested in discussing, contributing, or have suggestions, please reach out or open an issue on the repository.

## License

This model is licensed under the OpenRAIL License.

---

Feel free to check out the model and experiment with it [here](https://huggingface.co/RoboApocalypse/mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast). Your feedback and insights are welcome as I try to figure out what I'm doing.
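For readers curious about what the Grokfast training method mentioned above actually does: the EMA variant from the paper low-pass filters the gradient history and amplifies the slow-varying component before the optimizer step. Below is a minimal scalar sketch of that update; `alpha` and `lamb` follow the paper's notation, but this illustration is my own simplification — the actual implementation applies the filter to every parameter's gradient tensor between `backward()` and `optimizer.step()`, and this is not the exact training code used for this model.

```python
# Simplified sketch of the Grokfast-EMA gradient filter (Lee et al., 2024).
# Scalar illustration only; real training applies this per parameter tensor.

def grokfast_ema(grad, ema, alpha=0.98, lamb=2.0):
    """Low-pass filter the gradient history and amplify the slow component.

    grad  -- current raw gradient (a scalar here, a tensor in practice)
    ema   -- running exponential moving average of past gradients
    alpha -- EMA decay; higher keeps a slower (lower-frequency) component
    lamb  -- amplification factor for the slow component
    Returns (filtered_grad, new_ema).
    """
    new_ema = alpha * ema + (1.0 - alpha) * grad
    return grad + lamb * new_ema, new_ema

# A steady gradient direction gets progressively amplified over steps,
# while a rapidly oscillating one is largely left alone.
ema = 0.0
for step in range(5):
    g, ema = grokfast_ema(1.0, ema)
```

In a training loop, the filtered gradient would be written back into each parameter's `.grad` before calling the optimizer, which is what lets an off-the-shelf optimizer benefit from the amplified slow gradients.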