Is using a validation set useful for end-to-end learning in robotics?
Introduction
In classical supervised learning, it is common during training to compute metrics like accuracy for classification or mean squared error for regression on a held-out validation set. These metrics are strong indicators of a model's ability to generalize to unseen inputs collected in the same context as the training set. Thus, they are used to select the best model checkpoint or to "early-stop" training.
However, in the context of end-to-end imitation learning for real-world robotics, there is no clear consensus among practitioners on the best metrics and practices for using a validation set to select the best checkpoint. This is because the metric that roboticists aim to optimize is the success rate: the percentage of trials in which the robot accomplishes the task. Measuring it requires running the policy on the robot in the test environment for a long period of time to keep the variance caused by external factors low. For instance, the lighting conditions can shift, the room layout can change from day to day, the dynamics of the robot's motors can change with usage, etc. More importantly, the success rate cannot be computed on a validation set. Only the validation loss or other proxy metrics, such as the mean squared error in action space, can be computed.
Since computing the success rate for each model checkpoint is too costly, some practitioners recommend using the validation loss to select the best checkpoint. For instance, ACT and Aloha authors Zhao et al. indicate that "at test time, we load the policy that achieves the lowest validation loss and roll it out in the environment". On the other hand, the Stanford Robomimic authors noticed "that the best [policy early stopped on validation loss] is 50 to 100% worse than the best performing policy [when we evaluate all the checkpoints]", which suggests that selecting the checkpoint with the lowest validation loss does not ensure the highest success rate.
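To make the practice concrete, selecting a checkpoint by validation loss boils down to something like the following minimal sketch, where the `checkpoints` dictionary is a hypothetical stand-in for values read from training logs:

```python
# Minimal sketch of checkpoint selection by lowest validation loss.
# The `checkpoints` mapping (training step -> validation loss) is a
# hypothetical stand-in for values read from training logs.
checkpoints = {10_000: 0.052, 20_000: 0.041, 30_000: 0.047, 40_000: 0.055}

best_step = min(checkpoints, key=checkpoints.get)
print(f"Selected checkpoint at step {best_step} "
      f"(validation loss = {checkpoints[best_step]:.3f})")
```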
A few hypotheses could explain why a low validation loss is not predictive of a high success rate. First, there might be a distribution shift between the data collected during training through human teleoperation and the data encountered during evaluation when the policy controls the robot. This shift can be due to all of the environmental and hardware changes listed above, but also to small prediction errors of the policy that accumulate over time and move the robot outside common trajectories. As a result, the inputs seen by the policy during evaluation can be quite different from the ones seen during training. In this context, the validation loss might not be helpful, since the loss function optimizes imitation of a human demonstrator's trajectory; it does not account for the ability to generalize outside of the training distribution, nor does it directly optimize the success rate of completing a task in a possibly changing environment.
In this study, we explore whether the validation loss can be used to select the checkpoint with the highest success rate. If it turns out not to be the case, computing a validation loss on a held-out subset of the training set could be useless and may even hurt performance, since training is then done on a smaller portion of the data. We also discuss alternatives to using a validation loss. Our experiments are conducted in two commonly used simulation environments, PushT and Aloha Transfer Cube, with two different policies, respectively Diffusion Policy and ACT (Action Chunking with Transformers). Simulation allows us to accurately compute the success rate at every checkpoint (every 10K steps), which is challenging in real environments as explained earlier.
PushT
Experimental Setup
Fig. 1: PushT Environment
The diffusion policy was trained on the PushT dataset, with 206 episodes at 10 FPS, yielding a total of 25,650 frames with an average episode duration of 12 seconds.
We use the same hyperparameters as the authors of Diffusion Policy. We train the policy with three different seeds, then compute the naive mean of each metric.
During training, evaluation is done in simulated environments every 10K steps. We roll out the policy for 50 episodes and calculate the success rate.
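As a rough sketch (not the exact evaluation code we ran), the loop looks like the following; the `select_action` interface, the `is_success` flag in `info`, and the environment id in the example call are assumptions:

```python
import gymnasium as gym

def evaluate_success_rate(policy, env_id, n_episodes=50, max_steps=300):
    """Roll out `policy` for `n_episodes` and return the fraction of successes."""
    env = gym.make(env_id)
    successes = 0
    for _ in range(n_episodes):
        observation, info = env.reset()
        for _ in range(max_steps):
            # `select_action` is an assumed policy interface returning one action.
            action = policy.select_action(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                break
        # The environment is assumed to report task success in `info`.
        successes += bool(info.get("is_success", False))
    env.close()
    return successes / n_episodes

# Example call with an illustrative environment id:
# success_rate = evaluate_success_rate(policy, "gym_pusht/PushT-v0", n_episodes=50)
```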
Training for 100K steps plus evaluation every 10K steps took about 5 hours on a standard GPU. Running the evaluation and calculating success rates is the most costly part, taking on average 15 minutes for each batch of rollouts.
Quantitative Results
We compute the Diffusion validation loss on the output of the denoising network. It is the same loss used for training: the error between the predicted noise and the actual noise. We also compute a more explicit metric to assess the performance of the policy for action prediction: the mean squared error (MSE) between N Action Steps worth of predicted actions and the ground-truth actions. To do so, we replicate the action-selection process carried out during inference, with a queue of observations and actions.
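A minimal sketch of this MSE computation is shown below; the `episode` layout and the `predict_chunk` call are hypothetical placeholders for the actual dataset and policy interfaces:

```python
import numpy as np

def action_mse(policy, episode, n_action_steps):
    """MSE between predicted and ground-truth actions on a held-out episode,
    replicating the inference-time action-selection loop."""
    squared_errors = []
    t = 0
    while t + n_action_steps <= len(episode["actions"]):
        observation = episode["observations"][t]
        # Predict a chunk of actions and, as at inference time, keep only the
        # first n_action_steps before querying the policy again.
        predicted = np.asarray(policy.predict_chunk(observation))[:n_action_steps]
        ground_truth = np.asarray(episode["actions"][t : t + n_action_steps])
        squared_errors.append(np.mean((predicted - ground_truth) ** 2))
        t += n_action_steps
    return float(np.mean(squared_errors))
```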
We notice a divergent pattern for the validation loss with regard to the success rate, and no correlation between the MSE and the success rate.
Fig. 2: PushT Validation Loss
Fig. 3: PushT Mean Squared Error
Fig. 4: PushT Success Rate
From 10,000 steps until 60,000 steps, the validation loss continuously increases, and it does not return to its minimum level by the end of training. In contrast, despite this continuous increase in validation loss, the success rate consistently improves over these steps across all seed runs.
The variations of the mean squared error cannot be used as a reliable point of reference either. The MSE increases between 40K and 60K steps while the success rate improves, which contradicts the usual association between lower MSE and higher performance seen in classical supervised learning. The MSE decreases between 60K and 70K steps and increases between 70K and 80K steps, but over both of these intervals, the success rate falls.
This shows that no clear signal can be inferred from the changes in the action-prediction loss, especially since the standard deviation (std) of the MSE at a given step can be of the same magnitude as its changes across steps.
We confirm these results by running costly evaluations on 500 episodes to obtain more samples and decrease variance. To confirm that there is no correlation between the validation loss and the success rate, we evaluate the checkpoints at 20K, 50K, and 90K steps (Fig. 5). We show the changes relative to the first column.
Metric | 20K steps | 50K steps | 90K steps |
---|---|---|---|
Success Rate (%) | 40.47 | +55.27% | +25.73% |
Validation Loss | 0.0412 | +134.57% | +35.94% |
Fig. 5: PushT success rate and denoising validation loss across steps averaged over 3 seeds
The validation loss is more than twice as high after 50K training steps as after 20K training steps, while the success rate improves by over 50% on average. Furthermore, the validation loss decreases between 50K and 90K steps, but the success rate decreases as well.
This suggests limitations of using only validation loss to interpret policy performance.
The variations of the MSE loss are not indicators of evaluation success rate either.
To confirm that there is no correlation between the MSE and the success rate, we evaluate the checkpoints at 40K, 60K, and 80K steps (Fig. 6). We show the changes across steps relative to the first column.
Metric | 40K steps | 60K steps | 80K steps |
---|---|---|---|
MSE Loss | 0.02023 | +3.22% | +2.66% |
Success Rate (%) | 61.08 | +2.73% | -17.82% |
Fig. 6: PushT success rate and MSE loss across steps averaged over 3 seeds
These findings suggest that monitoring these metrics alone may not be sufficient to predict performance in end-to-end imitation learning, nor to make informed judgments about when to stop training.
Qualitative Results
As training progresses, the policy plans increasingly smooth trajectories.
Fig. 7: PushT original example from training set
Fig. 8: PushT rollout episode rendered at a higher resolution
We notice that the policy becomes less jerky as the number of training steps increases and adapts better to out-of-distribution states. It is also able to plan longer trajectories and to predict actions that are more precise in terms of the distance from the current position to the next position.
Fig. 10 and Fig. 11 have the same starting position, but the policy is only able to match the exact T position at the 80K step count.
Fig. 9: PushT Diffusion Policy after 20K steps
Fig. 10: PushT Diffusion Policy after 50K steps
Fig. 11: PushT Diffusion Policy after 80K steps
But even at 90K training steps, there are still some failure cases:
Fig. 12: PushT failure case
Transfer Cube
Experimental Setup
In the second simulation, we use the Aloha arms environment on the Transfer-Cube task, with 50 episodes of human-recorded data. Each episode consists of 400 frames at 50 FPS, resulting in 8-second episodes captured with a single top-mounted camera.
Fig. 13: Aloha Transfer-Cube Environment
We use the same hyperparameters as the authors of ACT. As with PushT, we train the policy with three different seeds.
Training for 100K steps plus evaluation every 10K steps took about 6 hours on a standard GPU. Running the evaluation and calculating success rates is still the most costly part, taking on average 20 minutes for each batch of rollouts on this task.
Quantitative Results
In the case of Transfer Cube, we also compute the validation loss. In addition, we replicate the inference-time action-selection process and compute the MSE on the predicted N Action Steps worth of actions every 10,000 steps. Here, N Action Steps is equal to 100, i.e., the policy predicts 100 actions at a time.
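With the same hypothetical `action_mse` helper sketched in the PushT section, the only change is the chunk size:

```python
# Reusing the hypothetical helper from the PushT section; for ACT we compare
# one 100-action chunk at a time against the ground truth.
mse = action_mse(act_policy, held_out_episode, n_action_steps=100)
```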
We notice that while the validation loss plateaus, the success rate continues to grow. We also notice that the variations of the MSE loss are not synchronized with those of the success rate and are too noisy to be informative.
The success rate computed during training has high variance (an average over only 50 evaluation episodes) and cannot be conclusive, which is why we run additional evaluations on 500 episodes.
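Concretely, this corresponds to calling the hypothetical evaluation helper from the PushT section with a larger episode budget (the environment id is again illustrative):

```python
# More rollouts reduce the variance of the success-rate estimate.
success_rate = evaluate_success_rate(
    act_policy, env_id="gym_aloha/AlohaTransferCube-v0", n_episodes=500
)
```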
To confirm that there is no correlation between the validation loss and the success rate, we calculate the success rate at 30K, 70K, and 100K steps (Fig. 17). We show the changes relative to the first column.
Metric | 30K steps | 70K steps | 100K steps |
---|---|---|---|
Success Rate (%) | 53.33 | +12.94% | +16.67% |
Validation Loss | 0.2289 | -2.04% | -2.03% |
Fig. 17: Transfer Cube success rate and validation loss averaged over 3 seeds
So while the validation loss stays roughly the same, decreasing by only about 2%, the success rate increases by more than 15%. It is challenging to early-stop on such a weak signal; for our task it does not appear to be effective.
We run additional evaluations at 50K and 60K steps to confirm that there is no correlation between the MSE loss and the success rate (Fig. 18). We show the changes relative to the first column.
Metric | 30K steps | 50K steps | 60K steps |
---|---|---|---|
Mean Success Rate (%) | 53.33 | 55.65 (+4.35%) | 63.22 (+18.54%) |
MSE Loss | 0.8178 | 0.8153 (-0.31%) | 0.8156 (-0.27%) |
Fig. 18: Transfer Cube success rate and MSE loss averaged over 3 seeds
While the MSE loss barely differs across the evaluated checkpoints, the performance of the model steadily improves.
Qualitative Results
We notice that the policy is good at completing unseen trajectories and adapting to out-of-distribution data.
But when rolling out the policy during evaluation, we notice that in many episodes the trajectory is readjusted when the arm is already starting to rise. This is probably because we train with only one top camera, so the policy does not have a good perception of depth and misjudges the distance to the cube during rollout. The policy often readjusts multiple times, which shows robustness to out-of-distribution states.
Fig. 20: Episode with multiple trajectory adjustments
In some cases, the robot fails to grasp the cube, even after a few attempts.
Fig. 21: Failure case
While there aren't any informative differences in the losses between 50K and 90K steps, there is improvement in the smoothness of the trajectory:
Fig 22: Aloha ACT after 50K steps
Fig 23: Aloha ACT after 90K steps
Conclusion
Our experiments reveal a significant discrepancy between the validation loss and the task success rate. On our tasks, it is clear that we should not use the validation loss to early-stop training, since this strategy does not ensure the highest success rate. Further studies could investigate the behaviour of models trained for longer, as this could possibly reduce the variance in losses and success rates. In our case, we trained the models until the baseline success rate reported for each architecture was reached.
In the real world, it is extremely costly to assess the success rate of a given checkpoint with low variance, and it certainly cannot be done at every checkpoint during training. Instead, we advise running a few evaluations and focusing mainly on qualitative assessment, such as the emergence of new capabilities and the fluidity of the robot's movements. When no more progress is noticed, training can be stopped.
For instance, when training PollenRobotics' Reachy2 (see demo) to grab a cup and place it on a rack, then grab an apple and hand it to a person sitting on the opposite side, and finally rotate back to the initial position, we noticed that the policy gradually learned more advanced behaviours and trajectories:
- At checkpoint 20k, the robot was only able to grasp the cup, but it was failing to place it on the rack.
- At checkpoint 30k, it learned to place it smoothly on the rack, but was not grasping the apple.
- At checkpoint 40k, it learned to grasp the apple but was not rotating.
- At checkpoint 50k, it learned to rotate and give the apple, but it was not rotating back.
- Finally, it learned to rotate back into the desired final position and complete the full trajectory.
Doing frequent, small qualitative assessments is an efficient way to spot bugs, get a feel for the policy's capabilities and stability from one checkpoint to the next, and get inspiration for ways to improve it.
In addition, more involved approaches consist of evaluating a policy trained on real data in a realistic simulation, and using the simulation success rate as a proxy for the real success rate. These approaches are challenging since they require thorough modeling of robots, environments, and tasks with a highly realistic physics engine. As simulations improve, scaling these approaches could lead to more efficient training and reduce the time and resource costs of evaluation in the real world. In this line of work, we can cite:
- Li et al. 2024: Evaluating Real-World Robot Manipulation Policies
- Li et al. 2024: Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics