# Temporal Difference Learning for Model Predictive Control

#### Nicklas Hansen, Xiaolong Wang\*, Hao Su\*

UC San Diego

\*Equal advising

We present TD-MPC, a framework for model predictive control (MPC) using a Task-Oriented Latent Dynamics (TOLD) model and terminal value function learned jointly by temporal difference learning. Our method compares favorably to prior model-free and model-based methods and solves high-dimensional Humanoid and Dog locomotion tasks in 1M environment steps (see above). This is, to the best of our knowledge, the first documented result solving the challenging Dog tasks.

## Abstract

Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and use a learned terminal value function to estimate long-term return, both of which are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance over prior work on both state and image-based continuous control tasks from DMControl and Meta-World.

## Dog

TD-MPC solves challenging Dog tasks with 38-dimensional continuous action spaces. Soft Actor-Critic (SAC) fails to learn any meaningful behavior.

Videos: Run (SAC), Run (ours), Walk, Trot.

## Humanoid

TD-MPC solves high-dimensional Humanoid locomotion tasks in 1M environment steps. SAC converges considerably slower and achieves a lower asymptotic performance.

Videos: Run (SAC), Run (ours), Stand, Walk.

## DMControl

We evaluate TD-MPC on 23 diverse continuous control tasks from DMControl; trajectories for 8 representative tasks are shown below. TD-MPC solves a wide variety of control tasks in less than an hour on a single GPU.

## Meta-World

We also benchmark TD-MPC on 50 goal-conditioned tasks from Meta-World; trajectories for 8 representative tasks are shown below. TD-MPC consistently matches or outperforms SAC across all tasks.

## Multi-modal RL

Unlike prior MPC-based methods, TD-MPC is agnostic to the choice of input modality. We trivially extend TD-MPC to a multi-modal setting, where the agent navigates using both proprioceptive data and an egocentric camera.

Videos: Obstacles (Blind), Obstacles, Corridor.

## Planning with TD-MPC

TD-MPC is a framework for MPC using a Task-Oriented Latent Dynamics (TOLD) model and terminal value function, both learned jointly by temporal difference learning. During inference, we perform trajectory optimization with Model Predictive Path Integral (MPPI) control over (latent) model rollouts and use the value function for long-term return estimates. Our method additionally learns a policy $\pi_{\theta}$ that guides planning, and we propose several other extensions to the MPC framework that leverage ideas from model-free RL; see our paper for details.
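As a rough illustration of the planning loop described above, the following is a minimal NumPy sketch of an MPPI-style search that scores short latent rollouts by discounted predicted reward plus a discounted terminal value. It is not our released implementation: the function `mppi_plan`, its hyperparameter defaults, and the toy `toy_*` components are all hypothetical stand-ins for the learned TOLD networks, and using the policy only to choose the terminal action is a simplifying assumption.

```python
import numpy as np

def mppi_plan(z0, dynamics, reward, value, policy, horizon=5, num_samples=256,
              num_elites=32, iterations=6, temperature=0.5, gamma=0.99,
              action_dim=1, seed=0):
    """Return the first action of a plan found by an MPPI-style search (sketch)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample action sequences from the current Gaussian, clipped to [-1, 1].
        noise = rng.standard_normal((num_samples, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        # Roll every sequence out in latent space, accumulating discounted reward.
        z = np.repeat(z0[None, :], num_samples, axis=0)
        returns = np.zeros(num_samples)
        for t in range(horizon):
            returns += gamma ** t * reward(z, actions[:, t])
            z = dynamics(z, actions[:, t])
        # Bootstrap beyond the horizon with the terminal value estimate.
        returns += gamma ** horizon * value(z, policy(z))
        # Keep the top-k sequences and weight them by exponentiated return.
        elite_idx = np.argsort(returns)[-num_elites:]
        elite_returns = returns[elite_idx]
        elite_actions = actions[elite_idx]
        weights = np.exp(temperature * (elite_returns - elite_returns.max()))
        weights /= weights.sum()
        # Refit the sampling distribution to the weighted elites.
        mean = np.einsum("n,nha->ha", weights, elite_actions)
        var = np.einsum("n,nha->ha", weights, (elite_actions - mean) ** 2)
        std = np.sqrt(var) + 1e-2  # small floor keeps some exploration
    return mean[0]

# Hypothetical toy stand-ins for the learned components (1-D latent state).
def toy_dynamics(z, a):
    return z + 0.5 * a

def toy_reward(z, a):
    return -np.sum(z ** 2, axis=-1)

def toy_value(z, a):
    return -10.0 * np.sum(z ** 2, axis=-1)

def toy_policy(z):
    return np.clip(-z, -1.0, 1.0)

# Plan from latent state 3.0; the goal of the toy task is to reach 0.
action = mppi_plan(np.array([3.0]), toy_dynamics, toy_reward, toy_value, toy_policy)
```

In a receding-horizon loop, only this first action would be executed before replanning from the next encoded observation.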

## Training a Task-Oriented Latent Dynamics Model

Training. A trajectory of length $H$ is sampled from a replay buffer, and the first observation is encoded into a latent representation $\mathbf{z}_{0}$. Then, the Task-Oriented Latent Dynamics (TOLD) model recurrently predicts the following latent states $\mathbf{z}_{1}, \mathbf{z}_{2},\dots,\mathbf{z}_{H}$, as well as values, rewards, and actions for each latent state, and we optimize the TOLD model using temporal difference learning, without any reconstruction. Subsequent observations are encoded using a target network and used as latent targets only during training (illustrated in gray).
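The training procedure above can be sketched as a loss over one sampled trajectory. The snippet below is a simplified NumPy illustration, not our released code: `told_loss`, its argument names, the loss coefficients, and the temporal weight `lam` are placeholders chosen for the example, the exact form of the TD target is an assumption (see the paper for the objective we use), and the policy objective is omitted. It only evaluates the loss; a real implementation would use an autodiff framework to update the networks.

```python
import numpy as np

def told_loss(obs, actions, rewards, encoder, target_encoder, dynamics,
              reward_fn, q_fn, target_q_fn, policy, gamma=0.99, lam=0.5,
              c_reward=0.5, c_value=0.1, c_consistency=2.0):
    """Temporally weighted loss over one trajectory (illustrative sketch).

    obs has H + 1 entries; actions and rewards have H entries.
    """
    z = encoder(obs[0])  # only the first observation goes through the online encoder
    total = 0.0
    for t in range(len(actions)):
        a, r = actions[t], rewards[t]
        # Latent target for the next step, produced by the target encoder.
        z_next_target = target_encoder(obs[t + 1])
        # One-step TD target computed with the target value function (assumed form).
        td_target = r + gamma * target_q_fn(z_next_target, policy(z_next_target))
        z_pred = dynamics(z, a)
        step_loss = (c_reward * (reward_fn(z, a) - r) ** 2
                     + c_value * (q_fn(z, a) - td_target) ** 2
                     + c_consistency * np.sum((z_pred - z_next_target) ** 2))
        total += lam ** t * step_loss  # later steps are down-weighted
        z = z_pred  # recurrent rollout: the next prediction starts from z_pred
    return total

# Hypothetical toy components: identity encoder, additive dynamics, zero value function.
identity = lambda s: np.asarray(s, dtype=float)
add_dynamics = lambda z, a: z + a
neg_sq_reward = lambda z, a: -float(np.sum(z ** 2))
zero_q = lambda z, a: 0.0
toy_policy = lambda z: -z

obs = [np.array([1.0]), np.array([0.5]), np.array([0.0])]
actions = [np.array([-0.5]), np.array([-0.5])]
rewards = [-1.0, -0.25]

loss = told_loss(obs, actions, rewards, identity, identity, add_dynamics,
                 neg_sq_reward, zero_q, zero_q, toy_policy)
```

In this toy setup the reward and dynamics predictions are exact, so the only nonzero contribution comes from the value term, which is what TD learning would then reduce.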

## Benchmark Results

DMControl. We compare our method (TD-MPC) to SAC (model-free), LOOP (hybrid model-free/model-based), MPC (model-based) with access to a ground-truth model (simulator), and two ablations: (i) our method without latent dynamics, and (ii) our method without our proposed latent state consistency regularization. Experiments are conducted on 23 diverse state-based tasks from DMControl, 5 image-based tasks from DMControl, as well as 50 goal-conditioned manipulation tasks from Meta-World (see our paper). TD-MPC achieves superior sample efficiency and asymptotic performance compared to baselines, and successfully solves challenging Dog tasks with 38-dimensional continuous action spaces.

Learning from pixels. With trivial modifications to our method, we match the performance of state-of-the-art image-based methods on the sample-efficient DMControl 100k benchmark. MuZero/EfficientZero here use a discretized action space.

## Comparison to Related Work

Below, we compare key components of TD-MPC to prior model-based and model-free methods. Model objective describes which objective is used to learn a (latent) dynamics model, value denotes whether a value function is learned, inference provides a simplified overview of action selection at inference time, continuous denotes whether an algorithm can handle continuous action spaces, and compute is a simplified estimate of the relative computational cost of methods during training and inference.

## Paper

N. Hansen, X. Wang, H. Su
Temporal Difference Learning for Model Predictive Control

ICML 2022

## Citation

If you use our method or code in your research, please consider citing the paper as follows:

    @inproceedings{Hansen2022tdmpc,
      title={Temporal Difference Learning for Model Predictive Control},
      author={Nicklas Hansen and Xiaolong Wang and Hao Su},
      booktitle={ICML},
      year={2022}
    }