Optimizing Deep Learning Training Techniques
Lingvanex specializes in machine translation and provides innovative solutions that help users and businesses effectively overcome language barriers. Our machine translation technologies ensure accuracy, speed, and convenience in communication across different languages, offering high-quality translations for both individuals and companies.
This article delves into several advanced techniques designed to improve training efficiency and effectiveness. We will discuss methods that help in the gradual adjustment of model parameters, which can lead to more stable learning processes. By fine-tuning how and when model weights are updated, these techniques aim to enhance convergence and ultimately yield better results. Furthermore, the article will cover strategies for managing learning rates, which play a pivotal role in determining how quickly a model learns. Understanding how to adjust these rates over time can significantly influence the training dynamics and lead to faster and more accurate models.
Finally, we will explore the importance of checkpoint management, which allows for better utilization of trained models by averaging weights from multiple training sessions. This can help to mitigate the effects of overfitting and ensure that the model retains the best features learned during its training journey.
Exponential Moving Average
In the default configuration file of the Transformer model, the moving_average_decay parameter is not set. When moving_average_decay is set to a value close to one (as recommended by the TensorFlow documentation), an exponential moving average of the model weights is computed during training. According to the documentation, applying moving_average_decay to the model weights can significantly improve the model results. The moving_average_decay algorithm is as follows:
- at each training step, after calculating and applying gradients, the MovingAverage class is initialized;
- after initializing MovingAverage, the function for updating the model weights is called;
- the model weights are updated as follows:
- the decay coefficient is calculated: decay = 1 - min(0.9999, (1.0 + training_step) / (10.0 + training_step))
- the following algorithm is applied to each model weight: shadow_weight = previous_weight - (previous_weight - current_weight) * decay (at the first training step, previous_weight = current_weight);
- the smoothed weights after each training step are stored in the MovingAverage class; the trained weights are replaced with the smoothed weights only when a model checkpoint is saved.
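The update rule above can be sketched in plain Python (a minimal illustration of the formula as described, not the actual OpenNMT-tf MovingAverage implementation; the function names are ours, and weights are plain floats instead of TensorFlow variables):

```python
def ema_decay(step, max_decay=0.9999):
    # Decay coefficient from the step above:
    # decay = 1 - min(max_decay, (1 + step) / (10 + step))
    # Early in training this is large (shadow weights follow the model closely);
    # late in training it approaches 1 - max_decay (heavy smoothing).
    return 1.0 - min(max_decay, (1.0 + step) / (10.0 + step))

def ema_update(shadow_weights, current_weights, step):
    # shadow_weight = previous_weight - (previous_weight - current_weight) * decay
    decay = ema_decay(step)
    return [prev - (prev - cur) * decay
            for prev, cur in zip(shadow_weights, current_weights)]

# At the first training step, previous_weight = current_weight.
shadow = [0.5, -1.0]
shadow = ema_update(shadow, [0.4, -0.8], step=1)
```

The smoothed values would then be kept alongside the trained weights and swapped in only when a checkpoint is saved, as the article describes.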
Simplified calling sequence:
- def __call__(), class Trainer, module training.py
- def __init__(), class MovingAverage, module training.py
- def _update_moving_average(), class Trainer, module training.py
- def update(), class MovingAverage, module training.py
Learning Rate Decay Mechanism
The learning rate decay mechanism uses variables initialized in the NoamDecay and ScheduleWrapper classes. After each training step, the following transformations occur in the ScheduleWrapper class:
- the step variable is calculated using the tf.math.maximum function → tf.maximum(step - step_start, 0) = 1;
- the step variable is adjusted by the step_duration value by integer division → step //= step_duration = 1 // 1 = 1;
- the step variable adjusted at the previous step is passed to the NoamDecay class.
In the NoamDecay class the following transformations occur:
- the step variable is calculated → step = step + 1 = 2;
- intermediate value a: using the tf.math.pow function, the model_dim value is raised to the power of -0.5, which is equivalent to one divided by the square root of model_dim → 1 / sqrt(4) = 0.5;
- intermediate value b: using the tf.pow function, the step value obtained above is raised to the power of -0.5, which is equivalent to one divided by the square root of step → 1 / sqrt(2) = 0.7071;
- intermediate value c: using the tf.pow function, the warmup_steps value is raised to the power of -1.5 and multiplied by the step value → (1 / 8000^1.5) * 2 = 0.000001397 * 2 = 0.000002795;
- using the tf.math.minimum function, the minimum of the two intermediate values b and c is determined → min(b, c) = 0.000002795;
- the resulting minimum value is multiplied by the intermediate value a and by scale → 0.000002795 * 0.5 * 2 = 0.000002795;
- the full cycle of intermediate transformations looks like this (scale * tf.pow(model_dim, -0.5) * tf.minimum(tf.pow(step, -0.5), step * tf.pow(warmup_steps, -1.5)));
- the value obtained above, 0.000002795, is returned to the ScheduleWrapper class.
In the ScheduleWrapper class, the final value of the coefficient is determined: learning_rate = tf.maximum(learning_rate, minimum_learning_rate) → learning_rate = max(0.000002795, 0.0001) = 0.0001. This is the value that is output to the training log: Step = 1; Learning rate = 0.000100; Loss = 3.386743.
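The full chain of transformations can be reproduced in a few lines of plain Python (a sketch of the formula walked through above with the same example values; the function name and keyword arguments are ours, not the OpenNMT-tf API):

```python
def noam_learning_rate(step, model_dim, warmup_steps, scale,
                       step_start=0, step_duration=1,
                       minimum_learning_rate=0.0001):
    # ScheduleWrapper part: offset and rescale the raw training step.
    step = max(step - step_start, 0)
    step //= step_duration
    # NoamDecay part:
    # scale * model_dim^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step += 1
    lr = scale * model_dim ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)
    # ScheduleWrapper clamps the result from below.
    return max(lr, minimum_learning_rate)

# The example from the walkthrough: model_dim=4, warmup_steps=8000, scale=2.
print(noam_learning_rate(1, model_dim=4, warmup_steps=8000, scale=2))  # 0.0001
```

With these toy values the raw Noam value (0.000002795) is below the minimum, so the clamp produces the 0.0001 seen in the training log.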
Using the algorithm described above, we will plot the change in the optimizer learning rate for a model with a dimension of 768 and the parameter Learning rate = 2 specified in the trainer configuration file.
Now let's plot the same curve for a model with a dimension of 768 and the parameter Learning rate = 6 in the trainer configuration file.
From the graphs, we can conclude that decreasing the warmup_steps value makes the optimizer learning rate grow more rapidly, while increasing the learning rate in the config lets the optimizer reach higher learning rate values, which can contribute to faster training with a large model dimension.
You can also influence the learning rate schedule with the start_decay_steps parameter: it specifies how many steps after the start of training the warmup_steps mechanism and the subsequent decay are applied.
The graph below shows that with start_decay_steps = 10,000, for the first 10 thousand steps the model is trained with a fixed learning rate equal to the minimum, and after 10 thousand steps the warmup_steps mechanism with decay starts to work.
The decay_step_duration parameter can be used to increase the duration of the warmup_steps mechanism and slow down the decay rate.
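Both parameters act on the step value before it reaches NoamDecay, which can be sketched as follows (an illustration of the step adjustment described above; the function name is ours, and start_decay_steps / decay_step_duration correspond to the step_start / step_duration variables used in ScheduleWrapper):

```python
def adjusted_step(step, step_start=0, step_duration=1):
    # start_decay_steps delays the schedule: every step before step_start
    # maps to 0, so the minimum learning rate is used until then.
    # decay_step_duration stretches the schedule: each schedule step
    # lasts step_duration real training steps, slowing warmup and decay.
    return max(step - step_start, 0) // step_duration

print(adjusted_step(5000, step_start=10000))                    # 0
print(adjusted_step(12000, step_start=10000))                   # 2000
print(adjusted_step(12000, step_start=10000, step_duration=2))  # 1000
```

An adjusted step of 0 keeps the schedule at its starting (minimum) value, which matches the flat segment seen in the start_decay_steps = 10,000 graph.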
Simplified calling sequence:
- def __call__(), class ScheduleWrapper, module schedules/lr_schedules.py
- def __call__(), class NoamDecay, module schedules/lr_schedules.py
- def __call__(), class ScheduleWrapper, module schedules/lr_schedules.py
Checkpoint Averaging Mechanism
A checkpoint is a model state at a certain training step. A trained model checkpoint stores the model weights changed during training, the optimizer variables for each layer (the optimizer state at a certain training step), and a computation graph. An example of a small computation graph for a simple network is shown in image number 6, the computation graph. The optimizer is highlighted in red, regular variables in blue, and optimizer slot variables in orange; other nodes are shown in black. Slot variables are part of the optimizer state but are created for a specific variable. For example, the 'm' edges above correspond to the momentum that the Adam optimizer tracks for each variable.
At the end of training, the last checkpoints of the model are read from the model directory and restored in a quantity equal to the average_last_checkpoints parameter.
According to the architecture of the trained model, weights are initialized with a value of zero for all layers of the model.
Next, in a loop over each restored checkpoint, the weights are read; the weights of each layer are divided by the number of checkpoints specified in the average_last_checkpoints parameter, and the resulting values are added to the zero-initialized weights using variable.assign_add(value / num_checkpoints) (the embeddings layer is summed only with the embeddings layer, and so on).
The image below shows the averaging mechanism for a small layer from our example model, averaging the last two checkpoints.
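The averaging loop can be sketched in plain Python (a minimal illustration of the algorithm described above; checkpoints are represented here as dictionaries mapping layer names to weight lists rather than real TensorFlow checkpoints, and the function name is ours):

```python
def average_checkpoints(checkpoints):
    num_checkpoints = len(checkpoints)
    # Initialize the accumulator with zeros for every layer of the model,
    # matching the architecture of the trained model.
    averaged = {name: [0.0] * len(weights)
                for name, weights in checkpoints[0].items()}
    # For each restored checkpoint, add weight / num_checkpoints into the
    # accumulator layer by layer (embeddings only with embeddings, etc.),
    # mirroring variable.assign_add(value / num_checkpoints).
    for checkpoint in checkpoints:
        for name, weights in checkpoint.items():
            for i, value in enumerate(weights):
                averaged[name][i] += value / num_checkpoints
    return averaged

# Averaging the last two checkpoints of a tiny two-layer model:
ckpt_1 = {"embeddings": [0.25, 0.5], "dense": [1.0]}
ckpt_2 = {"embeddings": [0.75, 1.5], "dense": [3.0]}
print(average_checkpoints([ckpt_1, ckpt_2]))
# {'embeddings': [0.5, 1.0], 'dense': [2.0]}
```

Accumulating value / num_checkpoints per checkpoint, rather than summing first and dividing at the end, is the same order of operations the article describes and avoids building up large intermediate sums.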