Andromeda Model Training Standard Operating Procedure

This document explains how to train the Andromeda model end-to-end using the provided code. The training procedure is built around three scripts: build_dataset.py, model.py, and train_distributed.py. Follow the steps below in order.

Prerequisites

Before starting the training process, make sure the following requirements are met:

  • Python 3.7 or higher
  • PyTorch 1.9 or higher
  • Transformers library
  • Datasets library
  • Accelerate library
  • Wandb library (optional, for logging)
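
If any of these are missing, they can typically be installed with pip. PyTorch may need a platform-specific install command, and the unpinned package names below are an assumption rather than an official requirements file:

    pip install torch transformers datasets accelerate wandb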

Step 1: Building the Dataset

The first step is to build the dataset required for training. The build_dataset.py script processes the training data and prepares it for training. Follow the instructions below to build the dataset:

  1. Open the build_dataset.py script.
  2. Set the configuration parameters in the CFG class according to your requirements (a hedged sketch of this configuration is shown after this list):
    • HF_ACCOUNT_REPO: Set to the Hugging Face Hub repository ID (for example, your-username/your-dataset-repo) that the processed dataset will be pushed to. You must be logged in to the Hub (for example, via huggingface-cli login) for the push to succeed.
    • TOKENIZER: Choose the tokenizer model to use (e.g., "EleutherAI/gpt-neox-20b").
    • DATASET_NAME: Choose the dataset to process (e.g., "tiiuae/falcon-refinedweb").
    • SEQ_LEN: Set the desired sequence length, in tokens, for each training example.
  3. Save the changes to the script.
  4. Open a terminal or command prompt and navigate to the directory containing the build_dataset.py script.
  5. Run the following command to execute the script:
    python build_dataset.py
    
  6. The script will process the dataset and push it to your Hugging Face account repository specified by HF_ACCOUNT_REPO.
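
The exact contents of build_dataset.py are not reproduced here; the following is a minimal sketch of what the CFG block and processing pipeline for this step typically look like, using the Hugging Face datasets and transformers APIs. The repository name, the "content" text column, and the helper names are illustrative assumptions.

    from itertools import chain

    from datasets import load_dataset
    from transformers import AutoTokenizer


    class CFG:
        HF_ACCOUNT_REPO = "your-username/andromeda-pretokenized"  # Hub repo that receives the processed dataset
        TOKENIZER = "EleutherAI/gpt-neox-20b"
        DATASET_NAME = "tiiuae/falcon-refinedweb"
        SEQ_LEN = 8192


    def build():
        tokenizer = AutoTokenizer.from_pretrained(CFG.TOKENIZER)
        dataset = load_dataset(CFG.DATASET_NAME, split="train")

        def tokenize(examples):
            # "content" is the text column in falcon-refinedweb; adjust for other datasets.
            return {"input_ids": tokenizer(examples["content"])["input_ids"]}

        def group_texts(examples):
            # Concatenate all token ids and slice them into fixed SEQ_LEN blocks.
            ids = list(chain.from_iterable(examples["input_ids"]))
            total = (len(ids) // CFG.SEQ_LEN) * CFG.SEQ_LEN
            return {"input_ids": [ids[i:i + CFG.SEQ_LEN] for i in range(0, total, CFG.SEQ_LEN)]}

        tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
        packed = tokenized.map(group_texts, batched=True)
        packed.push_to_hub(CFG.HF_ACCOUNT_REPO)


    if __name__ == "__main__":
        build()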

Step 2: Defining the Andromeda Model

The second step is to define the Andromeda model architecture. The model.py script contains the model definition and configuration. Follow the instructions below to configure the Andromeda model:

  1. Open the model.py script.
  2. Set the configuration parameters in the AndromedaTokenizer and Andromeda classes according to your requirements (a hedged sketch is shown after this list):
    • tokenizer: Choose the pretrained tokenizer to load and the maximum sequence length it should produce; this should match the tokenizer used in Step 1.
    • Andromeda: Configure the model architecture, for example the vocabulary size, maximum sequence length, model width, depth, and number of attention heads.
  3. Save the changes to the script.
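
The following is a minimal, self-contained sketch of how these two pieces are commonly structured, using the x-transformers package as a stand-in for the decoder implementation. The dimensions, the pad-token handling, and the factory-style Andromeda definition are illustrative assumptions, not the repository's exact code.

    from transformers import AutoTokenizer
    from x_transformers import AutoregressiveWrapper, Decoder, TransformerWrapper


    class AndromedaTokenizer:
        def __init__(self, name="EleutherAI/gpt-neox-20b", seq_len=8192):
            self.tokenizer = AutoTokenizer.from_pretrained(name)
            # The GPT-NeoX tokenizer ships without a pad token; reuse EOS for padding.
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.seq_len = seq_len

        def tokenize(self, texts):
            return self.tokenizer(
                texts,
                padding="max_length",
                truncation=True,
                max_length=self.seq_len,
                return_tensors="pt",
            ).input_ids


    def Andromeda(num_tokens=50432, max_seq_len=8192, dim=2560, depth=32, heads=24):
        # Decoder-only Transformer wrapped for autoregressive (next-token) training.
        model = TransformerWrapper(
            num_tokens=num_tokens,
            max_seq_len=max_seq_len,
            attn_layers=Decoder(dim=dim, depth=depth, heads=heads),
        )
        return AutoregressiveWrapper(model)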

Step 3: Training the Andromeda Model

The final step is to train the Andromeda model using the train_distributed.py script. Follow the instructions below to start the training process:

  1. Open the train_distributed.py script.
  2. Set the configuration parameters in the TrainAndromeda.CFG class according to your requirements (a hedged sketch of this configuration is shown after this list):
    • BATCH_SIZE: Set the batch size for training.
    • GRADIENT_ACCUMULATE_EVERY: Set the number of gradient accumulation steps.
    • LEARNING_RATE: Set the learning rate for the optimizer.
    • WEIGHT_DECAY: Set the weight decay for the optimizer.
    • SEQ_LEN: Set the desired sequence length.
    • USE_DEEPSPEED: Set to True if using DeepSpeed for optimization.
    • USE_FSDP: Set to True if using Fully Sharded Data Parallelism.
    • USE_PRETOKENIZED: Set to True if using a pre-tokenized dataset.
    • USE_ACTIVATION_CHECKPOINTING: Set to True if using activation checkpointing.
    • RESUME_FROM_CHECKPOINT: Set to the path of a checkpoint to resume training from.
    • CHECKPOINTING_STEPS: Set the number of steps between checkpoints.
    • OUTPUT_DIR: Set the output directory for saving the model checkpoints and logs.
    • ENTITY_NAME: Set the Wandb entity name for logging (optional).
  3. Save the changes to the script.
  4. Open a terminal or command prompt and navigate to the directory containing the train_distributed.py script.
  5. Run the following command to start the training:
    python train_distributed.py
    
  6. The script will train the Andromeda model using the specified configuration and dataset.
  7. During training, the progress will be displayed in the terminal, and logs will be saved to the specified output directory.
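
For reference, the training configuration typically looks like the sketch below. The field names mirror the parameters documented above; the default values are placeholders to edit, not the repository's recommended settings.

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class CFG:
        BATCH_SIZE: int = 4
        GRADIENT_ACCUMULATE_EVERY: int = 1
        LEARNING_RATE: float = 3e-4
        WEIGHT_DECAY: float = 0.1
        SEQ_LEN: int = 8192
        USE_DEEPSPEED: bool = True
        USE_FSDP: bool = False
        USE_PRETOKENIZED: bool = True
        USE_ACTIVATION_CHECKPOINTING: bool = True
        RESUME_FROM_CHECKPOINT: Optional[str] = None  # e.g. a saved checkpoint directory
        CHECKPOINTING_STEPS: int = 1000
        OUTPUT_DIR: str = "checkpoints/"
        ENTITY_NAME: str = "my-wandb-entity"          # only used when Wandb logging is enabled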

Other Training Methods

Training can also be launched through Hugging Face Accelerate with the train_distributed_accelerate.py script:

  1. Generate an Accelerate configuration by running:
    accelerate config
    
  2. When prompted, enable DeepSpeed and select ZeRO Stage 3.
  3. Launch the training with:
    accelerate launch train_distributed_accelerate.py
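
For orientation, the core Accelerate training pattern that such a script builds on is sketched below with a toy model and dataset. This is not the repository's code; the gradient-accumulation value, checkpoint directory, and toy modules are illustrative assumptions.

    import torch
    from accelerate import Accelerator
    from torch.utils.data import DataLoader, TensorDataset

    accelerator = Accelerator(gradient_accumulation_steps=4)

    model = torch.nn.Linear(16, 1)  # toy stand-in for the Andromeda model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    data = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=8, shuffle=True)

    # Accelerate wraps the model, optimizer, and dataloader for the chosen backend
    # (DeepSpeed, FSDP, or plain DDP) based on the accelerate config.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for inputs, targets in loader:
        with accelerator.accumulate(model):  # handles gradient accumulation boundaries
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

    accelerator.wait_for_everyone()
    accelerator.save_state("checkpoints/")  # illustrative checkpoint directory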