We present Diffusion-ES, a method that combines gradient-free optimization with trajectory denoising to optimize black-box, non-differentiable objectives while staying on the data manifold.
Reward-gradient guided denoising has recently been proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. However, it requires a differentiable reward function fitted to both clean and noised samples, which limits its applicability as a general trajectory optimizer. Diffusion-ES instead samples trajectories from a diffusion model during evolutionary search and scores them with a black-box reward function. It mutates high-scoring trajectories using a truncated diffusion process that applies only a small number of noising and denoising steps, enabling much more efficient exploration of the solution space.
Our method can guide any diffusion model at test time with any black-box reward function, requiring no retraining, no assumptions about the model architecture, and only the ability to evaluate the reward function on clean samples.
We show that Diffusion-ES achieves state-of-the-art performance on nuPlan, an established closed-loop planning benchmark for autonomous driving. Diffusion-ES outperforms existing sampling-based planners, reactive deterministic and diffusion-based policies, and reward-gradient guidance. Additionally, we show that unlike prior guidance methods, our method can optimize non-differentiable, language-shaped reward functions generated by few-shot LLM prompting. When guided by a human teacher who issues instructions to follow, our method can generate novel, highly complex behaviors, such as aggressive lane weaving, which are not present in the training data. This allows us to solve the hardest nuPlan scenarios, which are beyond the capabilities of existing trajectory optimization methods and driving policies.
Diffusion-ES leverages gradient-free evolutionary search to perform reward-guided sampling from trained diffusion models. An initial population of trajectories is generated by sampling from our diffusion model. At each iteration, we score the (clean) trajectories with our reward function, select the highest-reward samples, and mutate them.
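The search loop can be summarized as follows. This is a minimal sketch, assuming hypothetical callables `sample_fn` (draws clean trajectories from the diffusion model), `mutate_fn` (the truncated-diffusion mutation described below), and `reward_fn` (the black-box scorer); population sizes and iteration counts are illustrative.

```python
import numpy as np

def diffusion_es(sample_fn, mutate_fn, reward_fn,
                 n_samples=64, n_elites=8, n_iters=10):
    """Gradient-free evolutionary search over trajectories from a diffusion model.

    sample_fn(n)     -> (n, H, D) array of clean trajectories
    mutate_fn(trajs) -> renoised-and-denoised copies of trajs (see sketch below)
    reward_fn(trajs) -> (len(trajs),) array of scalar rewards
    """
    population = sample_fn(n_samples)        # initial population from the diffusion model
    for _ in range(n_iters):
        rewards = reward_fn(population)      # score clean trajectories, no gradients needed
        elites = population[np.argsort(rewards)[-n_elites:]]
        # resample elites with replacement and mutate them to form the next generation
        parents = elites[np.random.randint(n_elites, size=n_samples)]
        population = mutate_fn(parents)
    rewards = reward_fn(population)
    return population[np.argmax(rewards)]    # best trajectory found
```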
Our key insight is that we can leverage a truncated diffusion process to mutate trajectories while staying on the data manifold. We can run the first t steps of the forward diffusion process to get noised samples, and then run t steps of the reverse diffusion process to denoise the samples again. We only need a small fraction of the total number of diffusion steps to perform mutations this way, which makes our sampling-based optimization much more efficient.
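A sketch of this mutation under standard DDPM/DDIM assumptions is shown below; the `denoiser` interface, the deterministic reverse update, and the step indexing are assumptions for illustration rather than the exact implementation.

```python
import torch

@torch.no_grad()
def truncated_diffusion_mutate(trajs, denoiser, alphas_cumprod, t_trunc=10):
    """Mutate trajectories by partially noising them, then denoising them back.

    trajs:          (N, H, D) tensor of clean trajectories
    denoiser:       noise-prediction network, denoiser(x_t, t) -> predicted epsilon
    alphas_cumprod: (T,) tensor, cumulative product of the DDPM alphas
    t_trunc:        number of noising/denoising steps, a small fraction of T
    """
    # forward process: jump directly to noise level t_trunc in closed form
    a_bar = alphas_cumprod[t_trunc - 1]
    noise = torch.randn_like(trajs)
    x = a_bar.sqrt() * trajs + (1.0 - a_bar).sqrt() * noise

    # reverse process: deterministic DDIM-style updates from t_trunc back to 0
    for t in reversed(range(t_trunc)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_t)
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        eps = denoiser(x, t_batch)
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
    return x
```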
We validate our approach by using black-box planning rewards to guide a trajectory diffusion model. Specifically, we adopt the scorer used in PDM-Closed, with small tweaks to handle more diverse trajectory proposals. Diffusion-ES achieves state-of-the-art performance on nuPlan, a closed-loop driving benchmark.
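For concreteness, the sketch below shows the general shape of such a trajectory scorer: hard safety gates multiplied by a weighted average of soft sub-scores. The metric names and weights are illustrative assumptions, not the exact PDM-Closed or nuPlan implementation.

```python
def score_trajectory(metrics: dict) -> float:
    """PDM-Closed-style scoring sketch: multiplicative safety gates times a
    weighted average of soft sub-scores in [0, 1]. Names and weights are
    illustrative, not the exact nuPlan scorer."""
    # hard constraints: any violation zeroes the score
    gate = float(metrics["no_at_fault_collision"]) * float(metrics["stays_in_drivable_area"])
    # soft terms, each assumed normalized to [0, 1]
    weights = {"progress": 5.0, "time_to_collision": 5.0, "comfort": 2.0}
    soft = sum(w * metrics[k] for k, w in weights.items()) / sum(weights.values())
    return gate * soft
```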
Our planner navigates challenging urban driving scenarios with dense traffic, outperforming prior learning-based methods and matching rule-based planners. Unlike prior work, it drives assertively, performing unprotected turns and changing lanes without dense waypoint guidance.
Diffusion-ES can optimize arbitrary reward functions at test time without retraining. To highlight this capability, we use few-shot LLM prompting to synthesize novel reward functions that execute language instructions, and then optimize those reward functions online with our method. This allows us to execute arbitrary language instructions without additional training.
Similar to prior work, we expose a Python API of reward-shaping functions. These functions can be invoked to alter the behavior of the base reward function, e.g., by adding a dense lane-following reward. We provide paired examples of language instructions and corresponding programs that use the API, then prompt an LLM (GPT-4) with these examples and a novel language instruction to automatically generate a program at test time, as sketched below.
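Below is a minimal sketch of how such a few-shot prompt could be assembled. The API function names (set_target_lane, set_speed_limit, add_overtake_reward) and the example programs are hypothetical placeholders, not the actual Diffusion-ES API; the LLM's completion would be executed against the real API to reshape the reward that Diffusion-ES then optimizes online.

```python
# Hypothetical reward-shaping API names and example programs; the real
# Diffusion-ES / nuPlan interfaces differ.
FEW_SHOT_EXAMPLES = """\
# Instruction: stay in the right lane and keep below 10 m/s
set_target_lane("right")
set_speed_limit(10.0)

# Instruction: overtake the lead vehicle, then merge back
set_target_lane("left")
add_overtake_reward(target="lead_vehicle")
set_target_lane("right", after_seconds=8.0)
"""

def build_prompt(instruction: str) -> str:
    """Assemble a few-shot prompt pairing language instructions with reward-shaping programs."""
    return (
        "You control a driving planner by writing short Python programs that call a "
        "reward-shaping API (set_target_lane, set_speed_limit, add_overtake_reward, ...).\n\n"
        + FEW_SHOT_EXAMPLES
        + f"\n# Instruction: {instruction}\n"
    )

# At test time the LLM's completion (a short program) is executed against the API,
# and Diffusion-ES optimizes the resulting reshaped reward online.
```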
We also report quantitative performance for instruction following on a suite of language controllability tasks. Task success is determined by whether the provided instruction is followed and the scenario objective is accomplished.
We find that although rule-based methods achieve strong results on the original nuPlan benchmark, they struggle with more complex scenarios which require changing lanes and driving assertively.
@misc{yang2024diffusiones,
title={Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following},
author={Brian Yang and Huangyuan Su and Nikolaos Gkanatsios and Tsung-Wei Ke and Ayush Jain and Jeff Schneider and Katerina Fragkiadaki},
year={2024},
eprint={2402.06559},
archivePrefix={arXiv},
primaryClass={cs.LG}
}