Justin Fu

Building Better Benchmarks for Offline Reinforcement Learning

In the last decade, one of the biggest drivers of success in machine learning has arguably been the rise of high-capacity models such as neural networks, along with large datasets such as ImageNet, to produce accurate models. While we have seen deep neural networks applied successfully to reinforcement learning (RL) in domains such as robotics, poker, board games, and team-based video games, a significant barrier to getting these methods working on real-world problems is the difficulty of large-scale online data collection. Not only is online data collection time-consuming and expensive, it can also be dangerous in safety-critical domains such as driving or healthcare. For example, it would be unreasonable to allow reinforcement learning agents to explore, make mistakes, and learn while controlling an autonomous vehicle or treating patients in a hospital. This makes learning from pre-collected experience enticing, and we are fortunate that in many of these domains, such as self-driving cars, healthcare, and robotics, large datasets already exist. Therefore, the ability of RL algorithms to learn offline from these datasets (a setting referred to as offline or batch RL) has an enormous potential impact in shaping the way we build machine learning systems for the future.

The predominant method for benchmarking offline deep RL has been limited to a single scenario: the dataset is generated from some random or previously trained policy, and the goal of the algorithm is to improve in performance over the original policy [e.g., 1,2,3,4,5,6]. The problem with this approach is that real-world datasets are unlikely to be generated by a single RL-trained policy, and many of the situations not covered by this evaluation method are unfortunately known to be problematic for RL algorithms. This makes it difficult to know how well our algorithms will perform when actually used outside of these benchmark tasks.
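To make this single-scenario protocol concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: the five-state chain environment, the helper names (`step`, `collect_dataset`, `offline_q_iteration`, `evaluate_greedy`), and the use of tabular Q-iteration as a stand-in for a deep offline RL algorithm. It is not taken from the cited benchmarks. A uniformly random behavior policy generates a fixed dataset, the learner sees only that dataset, and the resulting policy is compared against the behavior policy.

```python
import random

# Hypothetical toy setup: a 5-state chain where state 4 is the goal.
N_STATES = 5
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
HORIZON = 20          # maximum steps per episode
GAMMA = 0.9           # discount factor used by the learner


def step(state, action):
    """Deterministic chain dynamics: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def collect_dataset(n_episodes, rng):
    """Roll out a uniformly random behavior policy.

    Returns the logged transitions and the behavior policy's mean return.
    """
    transitions, returns = [], []
    for _ in range(n_episodes):
        state, total = 0, 0.0
        for _ in range(HORIZON):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            transitions.append((state, action, reward, next_state, done))
            total += reward
            state = next_state
            if done:
                break
        returns.append(total)
    return transitions, sum(returns) / len(returns)


def offline_q_iteration(transitions, n_sweeps=50):
    """Tabular Q-iteration that only ever touches the fixed dataset."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        for s, a, r, s2, done in transitions:
            # Dynamics are deterministic, so a direct backup is sufficient.
            q[s][a] = r + (0.0 if done else GAMMA * max(q[s2]))
    return q


def evaluate_greedy(q):
    """Run the greedy policy for one episode and return its total reward."""
    state, total = 0, 0.0
    for _ in range(HORIZON):
        action = max(ACTIONS, key=lambda a: q[state][a])
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return total


rng = random.Random(0)
dataset, behavior_return = collect_dataset(200, rng)
q_table = offline_q_iteration(dataset)
learned_return = evaluate_greedy(q_table)
print(f"behavior policy return: {behavior_return:.2f}")
print(f"offline-learned policy return: {learned_return:.2f}")
```

The toy problem is easy precisely because a single random policy covers every state and action. Datasets that lack this kind of coverage, or that come from several different sources, are the situations this style of evaluation does not probe.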

In order to develop effective algorithms for offline RL, we need widely available benchmarks that are easy to use and can accurately measure progress on this problem. Using real-world data, such as in autonomous driving, would be a great indicator of progress, but evaluating the algorithm becomes a challenge: most research labs do not have the resources to deploy their algorithm on a real vehicle in order to test whether their method really works. To fill the gap between realistic but infeasible real-world tasks and the somewhat lacking but easy-to-use simulated tasks, we recently introduced the D4RL benchmark (Datasets for Deep Data-Driven Reinforcement Learning) for offline RL. The goal of D4RL is simple: we propose tasks that are designed to exercise dimensions of the offline RL problem which may make real-world application difficult, while keeping the entire benchmark in simulated domains that allow any researcher around the world to efficiently evaluate their method. In total, the D4RL benchmark includes over 40 tasks across 7 qualitatively distinct domains that cover application areas such as robotic manipulation, navigation, and autonomous driving.
