## Accelerate config for Megatron-LM

Below is an example YAML for BF16 mixed-precision training using Megatron-LM with 2x Data Parallelism, 2x Pipeline Parallelism, and 2x Tensor Parallelism on 8 GPUs. It also uses Sequence Parallelism, selective activation checkpointing, and a sharded (distributed) optimizer.

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_num_micro_batches: 2
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: true
  megatron_lm_tp_degree: 2
  megatron_lm_use_distributed_optimizer: true
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
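If you would rather not keep a separate YAML file, the same settings can also be expressed programmatically. The following is a minimal sketch, assuming your installed Accelerate version exposes `MegatronLMPlugin` in `accelerate.utils` with field names mirroring the `megatron_lm_*` keys above (minus the prefix); verify the exact fields against your version before relying on it.

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# The parallelism degrees must multiply up to the number of processes:
# 8 GPUs = 2 (tensor parallel) x 2 (pipeline parallel) x 2 (data parallel).
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,                     # tensor parallelism
    pp_degree=2,                     # pipeline parallelism
    num_micro_batches=2,             # micro-batches per pipeline schedule
    gradient_clipping=1.0,
    sequence_parallelism=True,
    use_distributed_optimizer=True,  # shard optimizer state across data-parallel ranks
)

# Field names above are assumptions based on the YAML keys; check your Accelerate version.
accelerator = Accelerator(mixed_precision="bf16", megatron_lm_plugin=megatron_lm_plugin)
```

You would still start the script with `accelerate launch` (or another distributed launcher) so that all 8 processes are spawned.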
## Training script changes

The changes required in the training script are shown below as a diff (the reasons for each change are summarized later in this document):

```diff
from accelerate import Accelerator
+from accelerate.utils import MegatronLMDummyScheduler
  
accelerator = Accelerator()

...

-lr_scheduler = get_scheduler(
-    name=args.lr_scheduler_type,
-    ...
-)
+# Megatron-LM uses its own optimizer implementation, so the compatible
+# dummy scheduler from Accelerate must be used instead of `get_scheduler`.
+lr_scheduler = MegatronLMDummyScheduler(
+    optimizer=optimizer,
+    total_num_steps=...,
+    warmup_num_steps=...,
+)
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)

-total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+# Megatron-LM tracks the global batch size itself (micro-batch size x num micro-batches x data-parallel degree).
+total_batch_size = accelerator.state.megatron_lm_plugin.global_batch_size
# in evaluation loop
for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        outputs = model(**batch)
    loss = outputs.loss
-    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
+    losses.append(loss) # For Megatron-LM, the losses are already averaged across the data parallel group
-losses = torch.cat(losses)
+losses = torch.tensor(losses)
```
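Checkpointing also changes: with Megatron-LM, the model and optimizer state are sharded across the tensor-, pipeline- and data-parallel ranks, so checkpoints should be written with Accelerate's state utilities rather than the usual Transformers saving flow (this is change 4 in the list later in this document). A minimal sketch, where the checkpoint directory name is only illustrative:

```python
# Continuing from the script above, where `accelerator = Accelerator()` was created.

# Save the sharded Megatron-LM checkpoint (model, optimizer, scheduler and RNG states).
accelerator.save_state("checkpoints/latest")

# Later, e.g. when resuming training, restore from the same directory.
accelerator.load_state("checkpoints/latest")
```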
## Launching training

If the YAML was generated through the `accelerate config` command:

```bash
accelerate launch {script_name.py} {--arg1} {--arg2} ...
```

If the YAML is saved to a `~/config.yaml` file:

```bash
accelerate launch --config_file ~/config.yaml {script_name.py} {--arg1} {--arg2} ...
```

Or you can pass the configuration parameters directly to `accelerate launch` and skip the `config.yaml` file entirely:

```bash
accelerate launch \
  --use_megatron_lm \
  --num_processes=8 \
  --mixed_precision=bf16 \
  --megatron_lm_tp_degree=2 \
  --megatron_lm_pp_degree=2 \
  --megatron_lm_num_micro_batches=2 \
  --megatron_lm_sequence_parallelism=true \
  --megatron_lm_recompute_activations=true \
  --megatron_lm_use_distributed_optimizer=true \
  {script_name.py} {--arg1} {--arg2} ...
```

## Supported models and required changes

For Megatron-LM, the supported models are the Transformers GPT2, Megatron-BERT and T5 models, covering the decoder-only, encoder-only and encoder-decoder model classes. Given the complexity of Megatron-LM's features, four changes are required to get started:

1. Use `accelerate.utils.MegatronLMDummyScheduler`: Megatron-LM uses its own optimizer implementation, so the scheduler compatible with it must be used in place of the regular one.
2. Computing the total batch size must now be cognizant of the tensor and pipeline parallel sizes; read it from `accelerator.state.megatron_lm_plugin.global_batch_size`, as in the diff above.
3. Losses are already averaged across the data parallel group, so they should not be gathered again.
4. Save the model using `accelerator.save_state` instead of the Transformers `save_pretrained` method (see the sketch above).

The Accelerate Megatron-LM integration also supports many advanced features, such as:

- Leveraging custom training steps
- Using Megatron-LM indexed datasets
- Checkpoint reshaping and interoperability utilities
- Using `megatron_generate` for text generation with Tensor and Pipeline Parallelism
- Support for RoPE/ALiBi positional embeddings and Multi-Query Attention

However, each of these requires more changes to your source code than what is presented here.

## Further reading

To learn more, check out the related documentation:

- How to use Megatron-LM
- Examples showcasing the Megatron-LM integration of Accelerate