---
license: openrail
datasets:
  - teknium/OpenHermes-2.5
  - wikimedia/wikipedia
library_name: transformers
---

mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast

This repository contains mini-mistral-360M, a 360-million-parameter model based on the Mistral architecture and trained for a single epoch on a dataset combining English Wikipedia articles and the OpenHermes-2.5 dataset. While the model is still in its early stages and not yet particularly useful, it serves as an experimental showcase of integrating the Grokfast algorithm into the training process.

Model Details

  • Architecture: Mistral
  • Parameters: 360 million
  • Training Duration: 1 epoch
  • Training Data: English Wikipedia articles (20231101.en) and the OpenHermes-2.5 dataset (ChatML format)
  • Training Method: Grokfast-enhanced Transformers
  • Training Hardware: 2 x Nvidia RTX 3060 12GB

Purpose

The primary goal of this experiment was to observe the impact of the Grokfast algorithm on the training dynamics of a 360M-parameter Mistral model. During training, the evaluation loss tracked the training loss closely, an intriguing behavior that warrants further investigation.
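For readers unfamiliar with Grokfast: its core idea is to keep an exponential moving average (EMA) of each parameter's gradient and add an amplified copy of that slow component back onto the raw gradient before the optimizer step. The sketch below is a minimal pure-Python illustration of the EMA variant; the function name and the `alpha`/`lamb` defaults here are my own choices for demonstration, not the exact values used to train this model.

```python
def grokfast_ema_filter(grads, ema, alpha=0.98, lamb=2.0):
    """Amplify the slow (low-frequency) component of the gradients.

    grads: dict of name -> current gradient (plain floats here for illustration)
    ema:   dict of name -> running EMA of past gradients (updated in place)
    Returns the filtered gradients to hand to the optimizer.
    """
    filtered = {}
    for name, g in grads.items():
        # Update the exponential moving average (the "slow" component).
        ema[name] = alpha * ema.get(name, 0.0) + (1 - alpha) * g
        # Add the amplified slow component back onto the raw gradient.
        filtered[name] = g + lamb * ema[name]
    return filtered

# Toy example: with a constant gradient, the slow component builds up
# over successive steps, so the filtered gradient grows.
ema = {}
for step in range(3):
    out = grokfast_ema_filter({"w": 1.0}, ema)
```

In a real training loop this filter would run between `loss.backward()` and `optimizer.step()`, rewriting each parameter's `.grad` in place.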

Usage

To use this model, load it with the Hugging Face transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "RoboApocalypse/mini-mistral-360M-wikipedia-20231101.en-science-sci-fi-OpenHermes-2.5-chatML-Grokfast"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Example usage: generate a short continuation
input_text = "Hello, world!"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
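Since the model was fine-tuned on ChatML-formatted conversations, prompts should follow the ChatML layout. Whether this checkpoint's tokenizer ships a built-in chat template is an assumption I have not verified, so this sketch builds the prompt string by hand; the helper name is hypothetical, while the `<|im_start|>`/`<|im_end|>` markers follow the standard ChatML convention.

```python
def build_chatml_prompt(messages):
    """Format a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, world!"},
])
```

The resulting string can be passed to the tokenizer in place of a plain `input_text`.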

Training Insights

This experiment was inspired by the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" by Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee, which aims to accelerate the generalization of models under the grokking phenomenon. The paper is available at https://arxiv.org/abs/2405.20233.

Acknowledgments

Special thanks to the YouTube channel Tunadorable for bringing the Grokfast paper to my attention in his video "Accelerated Training by Amplifying Slow Gradients". Tunadorable reads and discusses AI papers from arXiv, providing valuable insights into the latest research.

Disclaimer

This model is not optimized for practical use and should be considered experimental. It has only been trained for a single epoch, and its performance is not guaranteed to be reliable or accurate. Future iterations and more extensive training may improve its capabilities.

Contributing

If you are interested in discussing or contributing, or have any suggestions, please reach out or open an issue on the repository.

License

This model is licensed under the OpenRAIL License.


Feel free to check out the model and experiment with it here. Your feedback and insights are welcome as I try and figure out wtf I'm doing.