Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, optimization becomes increasingly difficult, leading to low sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning. Specifically, SODA imposes a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data, while the RL optimization process uses strictly non-augmented data. Empirical evaluations are performed on diverse tasks from the DeepMind Control Suite as well as a robotic manipulation task, and we find that SODA significantly improves sample efficiency, generalization, and training stability over state-of-the-art vision-based RL methods.
SODA optimizes a self-supervised representation learning task jointly with the RL objective. In the representation learning task, SODA learns to map augmented and non-augmented data to similar points in latent space, formulated as a simple latent prediction task. At the same time, the RL objective is optimized using strictly non-augmented data.
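The decoupling described above can be illustrated with a minimal sketch. It is not the released implementation: the linear encoder, the noise augmentation, the slowly updated target encoder, and every function name here are assumptions made for illustration, and plain Python lists stand in for image tensors.

```python
import math
import random


def encode(weights, obs):
    """Hypothetical linear encoder: one dot product per latent dimension."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def noise_augment(obs, scale=0.1, rng=random):
    """Stand-in augmentation (assumption): additive uniform noise."""
    return [x + rng.uniform(-scale, scale) for x in obs]


def latent_prediction_loss(online_weights, target_weights, obs, augment):
    """Squared distance between the normalized latent of the augmented
    observation and the normalized latent of the non-augmented one."""
    prediction = l2_normalize(encode(online_weights, augment(obs)))
    # The target is treated as a constant (no gradient would flow through it).
    target = l2_normalize(encode(target_weights, obs))
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def ema_update(target_weights, online_weights, tau=0.005):
    """Assumed design choice: target weights slowly track the online weights."""
    return [
        [(1 - tau) * t + tau * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target_weights, online_weights)
    ]


def joint_losses(online_weights, target_weights, obs, augment, rl_loss_fn):
    """Return (rl_loss, aux_loss). Only the auxiliary loss sees augmented data;
    the RL loss is computed from the non-augmented observation."""
    aux_loss = latent_prediction_loss(online_weights, target_weights, obs, augment)
    rl_loss = rl_loss_fn(encode(online_weights, obs))
    return rl_loss, aux_loss
```

In this sketch the augmentation can be made arbitrarily strong without changing the input to the RL loss, which is the property the text attributes to SODA; how the two losses are weighted and scheduled is left open here.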
SODA demonstrates significantly improved generalization over previous methods, exhibits stable training, and has a sample efficiency comparable to that of the SAC baseline. The average return of SODA and baselines in the train (training) and color_hard (evaluation) environments is shown below. See our paper for comparisons to PAD, RAD, and CURL.
We additionally evaluate methods in robotic manipulation, where we find SODA to generalize to a variety of unseen environments.
DMControl Generalization Benchmark
With the release of SODA, we open-source a new benchmark for generalization in continuous control from pixels to accelerate research in RL from pixels. It consists of two benchmarks, random colors and video backgrounds, each offered in easy and hard variants. Samples are shown below.
Additionally, the benchmark contains standardized implementations of SODA, PAD, RAD, CURL, and SAC, which allows researchers to quickly evaluate, compare, and extend prior work in a unified framework.