Is using a validation set useful for end-to-end learning in robotics?
Introduction
In classical supervised learning, it is common during training to compute metrics like accuracy for classification or mean squared error for regression on a held-out validation set. These metrics are strong indicators of a model's ability to generalize to unseen inputs collected in the same context as the training set. Thus, they are used to select the best model checkpoint or to "early-stop" training.
However, in the context of end-to-end imitation learning for real-world robotics, there is no clear consensus among practitioners on the best metrics and practices for using a validation set to select the best checkpoint. This is because the metric that roboticists aim to optimize is the success rate: the percentage of trials in which the robot accomplishes the task. Measuring it requires running the policy on the robot in the test environment for a long period of time to keep the variance caused by external factors low. For instance, the lighting conditions can shift, the room layout can change from day to day, the dynamics of the robot's motors can change with usage, etc. More importantly, the success rate cannot be computed on a validation set. Only the validation loss or other proxy metrics, such as the mean squared error in action space, can be computed.
Since computing the success rate for each model checkpoint is too costly, some practitioners recommend using the validation loss to select the best checkpoint. For instance, ACT and Aloha authors Zhao et al. indicate that "at test time, we load the policy that achieves the lowest validation loss and roll it out in the environment". On the other hand, the Stanford Robomimic authors noticed "that the best [policy early stopped on validation loss] is 50 to 100% worse than the best performing policy [when we evaluate all the checkpoints]", which suggests that selecting the checkpoint with the lowest validation loss does not ensure the highest success rate.
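To make the practice concrete, selecting a checkpoint by validation loss boils down to something like the following minimal sketch, where the `checkpoints` dictionary is a hypothetical stand-in for values read from training logs:

```python
# Minimal sketch of checkpoint selection by lowest validation loss.
# The `checkpoints` mapping (training step -> validation loss) is a
# hypothetical stand-in for values read from training logs.
checkpoints = {10_000: 0.052, 20_000: 0.041, 30_000: 0.047, 40_000: 0.055}

best_step = min(checkpoints, key=checkpoints.get)
print(f"Selected checkpoint at step {best_step} "
      f"(validation loss = {checkpoints[best_step]:.3f})")
```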
A few hypotheses could explain why a low validation loss is not predictive of a high success rate. First, there might be a distribution shift between the data collected during training through human teleoperation and the data encountered during evaluation when the policy controls the robot. This shift can be due to all of the environmental and hardware changes listed above, but also to small prediction errors of the policy that accumulate over time and move the robot outside common trajectories. As a result, the inputs seen by the policy during evaluation can be quite different from the ones seen during training. In this context, the validation loss might not be helpful, since the loss function optimizes imitation of a human demonstrator's trajectory; it does not account for the ability to generalize outside of the training distribution, nor does it directly optimize the success rate of completing a task in a possibly changing environment.
In this study, we explore whether the validation loss can be used to select the checkpoint with the highest success rate. If it turns out not to be the case, computing a validation loss on a held-out subset of the training set could be useless and may even hurt performance, since training is then done on a smaller portion of the data. We also discuss alternatives to using a validation loss. Our experiments are conducted in two commonly used simulation environments, PushT and Aloha Transfer Cube, with two different policies, respectively Diffusion Policy and ACT (Action Chunking with Transformers). Simulation allows us to accurately compute the success rate at every checkpoint (every 10K steps), which is challenging in real environments as explained earlier.
PushT
Experimental Setup
Fig. 1: PushT Environment
The diffusion policy was trained on the PushT dataset, with 206 episodes at 10 FPS, yielding a total of 25,650 frames with an average episode duration of 12 seconds.
We use the same hyperparameters as the authors of Diffusion Policy. We train the policy with three different seeds, then compute the naive mean of each metric.
During training, evaluation is done in simulated environments every 10K steps. We roll out the policy for 50 episodes and calculate the success rate.
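As a rough sketch (not the exact evaluation code we ran), the loop looks like the following; the `select_action` interface, the `is_success` flag in `info`, and the environment id in the example call are assumptions:

```python
import gymnasium as gym

def evaluate_success_rate(policy, env_id, n_episodes=50, max_steps=300):
    """Roll out `policy` for `n_episodes` and return the fraction of successes."""
    env = gym.make(env_id)
    successes = 0
    for _ in range(n_episodes):
        observation, info = env.reset()
        for _ in range(max_steps):
            # `select_action` is an assumed policy interface returning one action.
            action = policy.select_action(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                break
        # The environment is assumed to report task success in `info`.
        successes += bool(info.get("is_success", False))
    env.close()
    return successes / n_episodes

# Example call with an illustrative environment id:
# success_rate = evaluate_success_rate(policy, "gym_pusht/PushT-v0", n_episodes=50)
```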
Training for 100K steps plus evaluation every 10K steps took about 5 hours on a standard GPU. Running the evaluation and calculating success rates is the most costly part, taking on average 15 minutes for each batch of rollouts.
Quantitative Results
We compute the Diffusion validation loss on the output of the denoising network. It is the same loss used for training: the error between the predicted noise and the actual noise. We also compute a more explicit metric to assess the performance of the policy for action prediction: the mean squared error (MSE) between N Action Steps worth of predicted actions and the ground-truth actions. To do so, we replicate the action-selection process carried out during inference, with a queue of observations and actions.
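A minimal sketch of this MSE computation is shown below; the `episode` layout and the `predict_chunk` call are hypothetical placeholders for the actual dataset and policy interfaces:

```python
import numpy as np

def action_mse(policy, episode, n_action_steps):
    """MSE between predicted and ground-truth actions on a held-out episode,
    replicating the inference-time action-selection loop."""
    squared_errors = []
    t = 0
    while t + n_action_steps <= len(episode["actions"]):
        observation = episode["observations"][t]
        # Predict a chunk of actions and, as at inference time, keep only the
        # first n_action_steps before querying the policy again.
        predicted = np.asarray(policy.predict_chunk(observation))[:n_action_steps]
        ground_truth = np.asarray(episode["actions"][t : t + n_action_steps])
        squared_errors.append(np.mean((predicted - ground_truth) ** 2))
        t += n_action_steps
    return float(np.mean(squared_errors))
```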
We notice a divergent pattern for the validation loss with regard to the success rate, and no correlation between the MSE and the success rate.
Fig. 2: PushT Validation Loss
Fig. 3: PushT Mean Squared Error
Fig. 4: PushT Success Rate
From 10,000 steps until 60,000 steps, the validation loss continuously increases, and it does not return to its minimum level by the end of training. In contrast, despite this continuous increase in validation loss, the success rate consistently improves over these steps across all seed runs.
The variations of the mean squared error cannot be used as a reliable point of reference either. The MSE increases between 40K and 60K steps while the success rate improves, which contradicts the usual association between lower MSE and higher performance seen in classical supervised learning. The MSE decreases between 60K and 70K steps and increases between 70K and 80K steps, but over both of these intervals, the success rate falls.
This shows that no clear signal can be inferred from the changes in the action-prediction loss, especially since the standard deviation (std) of the MSE at a given step can be of the same magnitude as its changes across steps.
We confirm these results by running costly evaluations on 500 episodes to obtain more samples and decrease variance. To confirm that there is no correlation between the validation loss and the success rate, we evaluate the checkpoints at 20K, 50K, and 90K steps (Fig. 5). We show the changes relative to the first column.
Metric | 20K steps | 50K steps | 90K steps |
---|---|---|---|
Success Rate (%) | 40.47 | +55.27% | +25.73% |
Validation Loss | 0.0412 | +134.57% | +35.94% |
Fig. 5: PushT success rate and denoising validation loss across steps averaged over 3 seeds
The validation loss is more than twice as high after 50K training steps as after 20K training steps, while the success rate improves by over 50% on average. Furthermore, the validation loss decreases between 50K and 90K steps, but the success rate decreases as well.
This suggests limitations of using only validation loss to interpret policy performance.
The variations of the MSE loss are not indicators of evaluation success rate either.
To confirm that there is no correlation between the MSE and the success rate, we evaluate the checkpoints at 40K, 60K, and 80K steps (Fig. 6). We show the changes across steps relative to the first column.
Metric | 40K steps | 60K steps | 80K steps |
---|---|---|---|
MSE Loss | 0.02023 | +3.22% | +2.66% |
Success Rate (%) | 61.08 | +2.73% | -17.82% |
Fig. 6: PushT success rate and MSE loss across steps averaged over 3 seeds
These findings suggest that monitoring these metrics alone may not be sufficient to predict performance in end-to-end imitation learning, nor to make informed judgments about when to stop training.
Qualitative Results
As training progresses, the policy plans increasingly smooth trajectories.
Fig. 7: PushT original example from training set
Fig. 8: PushT rollout episode rendered at a higher resolution
We notice that the policy becomes less jerky as the number of training steps increases and adapts better to out-of-distribution states. It is also able to plan longer trajectories and to predict actions that are more precise in terms of the distance from the current position to the next position.
Fig. 10 and Fig. 11 have the same starting position, but the policy is only able to match the exact T position at the 80K step count.
Fig. 9: PushT Diffusion Policy after 20K steps
Fig. 10: PushT Diffusion Policy after 50K steps
Fig. 11: PushT Diffusion Policy after 80K steps
But even at 90K training steps, there are still some failure cases:
Fig. 12: PushT failure case
Transfer Cube
Experimental Setup
In the second simulation, we use the Aloha arms environment on the Transfer-Cube task, with 50 episodes of human-recorded data. Each episode consists of 400 frames at 50 FPS, resulting in 8-second episodes captured with a single top-mounted camera.
Fig. 13: Aloha Transfer-Cube Environment
We use the same hyperparameters as the authors of ACT. As with PushT, we train the policy with three different seeds.
Training for 100K steps plus evaluation every 10K steps took about 6 hours on a standard GPU. Running the evaluation and calculating success rates is still the most costly part, taking on average 20 minutes for each batch of rollouts on this task.
Quantitative Results
In the case of Transfer Cube, we also compute the validation loss. In addition, we replicate the inference-time action-selection process and compute the MSE on the predicted N Action Steps worth of actions every 10,000 steps. Here, N Action Steps is equal to 100, i.e., the policy predicts 100 actions at a time.
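With the same hypothetical `action_mse` helper sketched in the PushT section, the only change is the chunk size:

```python
# Reusing the hypothetical helper from the PushT section; for ACT we compare
# one 100-action chunk at a time against the ground truth.
mse = action_mse(act_policy, held_out_episode, n_action_steps=100)
```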
We notice that while the validation loss plateaus, the success rate continues to grow. We also notice that the variations of the MSE loss are not synchronized with those of the success rate and are too noisy to be informative.
The success rate computed during training has high variance (an average over only 50 evaluation episodes) and cannot be conclusive, which is why we run additional evaluations on 500 episodes.
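Concretely, this corresponds to calling the hypothetical evaluation helper from the PushT section with a larger episode budget (the environment id is again illustrative):

```python
# More rollouts reduce the variance of the success-rate estimate.
success_rate = evaluate_success_rate(
    act_policy, env_id="gym_aloha/AlohaTransferCube-v0", n_episodes=500
)
```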
To confirm that there is no correlation between the validation loss and the success rate, we calculate the success rate at 30K, 70K, and 100K steps (Fig. 17). We show the changes relative to the first column.
Metric | 30K steps | 70K steps | 100K steps |
---|---|---|---|
Success Rate (%) | 53.33 | +12.94% | +16.67% |
Validation Loss | 0.2289 | -2.04% | -2.03% |
Fig. 17: Transfer Cube success rate and validation loss averaged over 3 seeds
So while the validation loss stays roughly the same, decreasing by only about 2%, the success rate increases by more than 15%. It is challenging to early-stop on such a weak signal; for our task it does not appear to be effective.
We run additional evaluations at 50K and 60K steps to confirm that there is no correlation between the MSE loss and the success rate (Fig. 18). We show the changes relative to the first column.
Metric | 30K steps | 50K steps | 60K steps |
---|---|---|---|
Mean Success Rate (%) | 53.33 | 55.65 (+4.35%) | 63.22 (+18.54%) |
MSE Loss | 0.8178 | 0.8153 (-0.31%) | 0.8156 (-0.27%) |
Fig. 18: Transfer Cube success rate and MSE loss averaged over 3 seeds
While the MSE loss barely differs across the evaluated checkpoints, the performance of the model steadily improves.
Qualitative Results
We notice that the policy is good at completing unseen trajectories and adapting to out-of-distribution data.
But when rolling out the policy during evaluation, we notice that in many episodes the trajectory is readjusted when the arm is already starting to rise. This is probably because we train with only one top camera, so the policy does not have a good perception of depth and misjudges the distance to the cube during rollout. The policy often readjusts multiple times, which shows robustness to out-of-distribution states.
Fig. 20: Episode with multiple trajectory adjustments
In some cases, the robot fails to grasp the cube, even after a few attempts.
Fig. 21: Failure case
While there aren't any informative differences in the losses between 50K and 90K steps, there is improvement in the smoothness of the trajectory:
Fig 22: Aloha ACT after 50K steps
Fig 23: Aloha ACT after 90K steps
Conclusion
Our experiments reveal a significant discrepancy between the validation loss and the task success rate. On our tasks, it is clear that we should not use the validation loss to early-stop training, since this strategy does not ensure the highest success rate. Further studies could investigate the behaviour of models trained for longer, as this could possibly reduce the variance in losses and success rates. In our case, we trained the models until the baseline success rate reported for each architecture was reached.
In the real world, it is extremely costly to assess the success rate of a given checkpoint with low variance, and it certainly cannot be done at every checkpoint during training. Instead, we advise running a few evaluations and focusing mainly on qualitative assessment, such as the emergence of new capabilities and the fluidity of the robot's movements. When no more progress is noticed, training can be stopped.
For instance, when training PollenRobotics' Reachy2 (see demo) to grab a cup and place it on a rack, then grab an apple and hand it to a person sitting on the opposite side, and finally rotate back to the initial position, we noticed that the policy gradually learned more advanced behaviours and trajectories:
- At checkpoint 20k, the robot was only able to grasp the cup, but it was failing to place it on the rack.
- At checkpoint 30k, it learned to place it smoothly on the rack, but was not grasping the apple.
- At checkpoint 40k, it learned to grasp the apple but was not rotating.
- At checkpoint 50k, it learned to rotate and give the apple, but it was not rotating back.
- Finally, it learned to rotate back into the desired final position and complete the full trajectory.
Doing frequent, small qualitative assessments is an efficient way to spot bugs, get a feel for the policy's capabilities and stability from one checkpoint to the next, and get inspiration for ways to improve it.
In addition, more involved approaches consist of evaluating a policy trained on real data in a realistic simulation, and using the simulation success rate as a proxy for the real success rate. These approaches are challenging since they require thorough modeling of robots, environments, and tasks with a highly realistic physics engine. As simulations improve, scaling these approaches could lead to more efficient training and reduce the time and resource costs of evaluation in the real world. In this line of work, we can cite:
- Li et al. 2024: Evaluating Real-World Robot Manipulation Policies
- Li et al. 2024: Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics