An important, emerging branch of Machine Learning is Reinforcement Learning (RL). In RL, the machine learns which action to take in order to maximize its reward. The action can be physical, like a robot moving an arm, or conceptual, like a computer game selecting which chess piece to move and where to move it. For example, in a chess game, the RL agent analyzes the current chess board and suggests a move that maximizes the probability of winning.
One of the challenges with RL is how to balance exploiting rewards and exploring the environment in order to achieve robust, generalizable learning. This blog is based on a paper accepted at ICML 2019 as an oral presentation. Authored by Shauharda Khadka, Somdeb Majumdar, Santiago Miret and other researchers from Intel AI Lab and Oregon State University, the paper presents our proposed solution to the exploit/explore challenge, which we call Collaborative Evolutionary Reinforcement Learning (CERL).
Reinforcement learning involves training a neural network, often called a policy network, to map observations in the environment to a set of actions at every step. Training is done by learning to associate actions with positive/negative outcomes. The networks are initialized randomly and produce noisy policies at first. However, as they explore different policies and try out various actions, networks learn to produce policies that have a higher probability of getting positive rewards.
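To make the idea of a policy network concrete, here is a minimal sketch of a randomly initialized policy that maps an observation to a continuous action. It is an illustration only: the single tanh layer, the names init_policy and act, and the dimensions are assumptions made for this example, not the architecture used in the paper.

```python
import math
import random


def init_policy(obs_dim, act_dim, rng):
    """Randomly initialise a one-layer policy: one weight row per action dimension."""
    return [[rng.gauss(0.0, 0.1) for _ in range(obs_dim)] for _ in range(act_dim)]


def act(policy, observation):
    """Map an observation vector to an action vector with entries in [-1, 1]."""
    return [
        math.tanh(sum(w * o for w, o in zip(row, observation)))
        for row in policy
    ]


rng = random.Random(0)
policy = init_policy(obs_dim=3, act_dim=2, rng=rng)
action = act(policy, [0.5, -1.0, 0.25])  # a noisy action, since the weights are random
```

Training would then adjust the weights so that actions leading to positive rewards become more likely.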
Policy gradient-based RL methods are commonly used by AI researchers today. While these methods are able to exploit rewards for learning, they suffer from limited exploration and costly gradient computations. Another well-known approach is the Evolutionary Algorithm (EA), which attempts to mimic the process of natural evolution in addressing RL problems. EA is a population-based, gradient-free algorithm that can handle sparse rewards and potentially improve exploration. At every generation, strong candidates are selected based on an evaluation condition and, in turn, generate new candidates for future generations. A downside is that the evolution method takes significant processing time because the candidates are only evaluated at the end of a complete episode.
The core ML dichotomy is revealed again: the choice to either explore the world to get more information while sacrificing short-term gains or to exploit the current state of knowledge towards improving performance.
CERL: A Unique Approach
To resolve this conflict, we’ve developed a new approach to RL called Collaborative Evolutionary Reinforcement Learning (CERL) that combines policy gradient and evolution methods to address the exploit/explore challenge.
Figure 1: High-level CERL Schematic.
On the left, policy gradient learners (shown as L1 – Lk policy networks) learn a task based on rewards over a varying set of time-horizons. They typically use proven RL algorithms such as TD3. Policy gradient learners can start learning quickly as they integrate rewards over a shorter time-horizon compared to the complete task.
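One common way to give learners different time-horizons is to vary the discount factor applied to future rewards. The sketch below assumes that mechanism for illustration. The function name discounted_return and the gamma values are hypothetical choices for this example.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting a reward t steps ahead by gamma ** t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A sparse task: the only reward arrives at the final step of a 100-step episode.
rewards = [0.0] * 99 + [1.0]

short_horizon = discounted_return(rewards, gamma=0.9)    # barely sees the late reward
long_horizon = discounted_return(rewards, gamma=0.999)   # still values the late reward
```

A learner with a small discount factor reacts quickly to nearby rewards, while one with a discount factor close to 1 keeps distant rewards in view.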
At the beginning of the task, the replay buffer is empty. During processing, the state-action-reward experiences that the networks actually encounter are stored in the replay buffer. In a traditional setup, there is one replay buffer for each network. In CERL, we share the replay buffer, which allows all policy networks across the population to learn from each other’s experiences. Because each network can train on data from every other network, exploration speeds up substantially compared to the traditional approach.
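The shared replay buffer can be sketched as a single bounded store that every network writes to and samples from. This is a minimal illustration, not the CERL implementation: the class name SharedReplayBuffer, the transition layout, and the source labels are assumptions made for this example.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """One bounded experience store shared by every learner and actor."""

    def __init__(self, capacity, seed=0):
        # The oldest transitions are dropped automatically once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done, source):
        """Store one transition, tagged with the network that generated it."""
        self._storage.append((state, action, reward, next_state, done, source))

    def sample(self, batch_size):
        """Draw a uniform random batch, regardless of which network produced it."""
        count = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), count)

    def __len__(self):
        return len(self._storage)


buffer = SharedReplayBuffer(capacity=1000)
buffer.add([0.0], [0.1], 1.0, [0.1], False, source="policy_gradient_learner")
buffer.add([0.1], [-0.2], 0.5, [0.0], False, source="evolution_actor")
batch = buffer.sample(2)  # contains experience from both kinds of network
```

Because sampling ignores the source tag, a gradient-based learner can train on transitions collected by an evolutionary actor, and vice versa.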
The resource manager provides an additional sample-efficiency knob by probabilistically allocating compute and data resources in proportion to the cumulative rewards of each learner.
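As a rough sketch of this allocation step, the snippet below hands out a fixed budget of rollouts by sampling learners with probability proportional to their cumulative reward. The function name allocate_rollouts, the epsilon term, and the requirement that rewards be non-negative are simplifying assumptions for this example. The actual resource manager may use a different rule.

```python
import random


def allocate_rollouts(cumulative_rewards, num_rollouts, rng, epsilon=1e-6):
    """Return how many rollouts each learner receives, sampled by reward share."""
    if min(cumulative_rewards) < 0:
        raise ValueError("this sketch assumes non-negative cumulative rewards")
    # epsilon keeps every learner selectable even when its reward is zero.
    weights = [reward + epsilon for reward in cumulative_rewards]
    chosen = rng.choices(range(len(weights)), weights=weights, k=num_rollouts)
    counts = [0] * len(weights)
    for index in chosen:
        counts[index] += 1
    return counts


rng = random.Random(0)
counts = allocate_rollouts([10.0, 1000.0, 10.0], num_rollouts=1000, rng=rng)
```

Because the allocation is probabilistic rather than greedy, weaker learners still receive occasional resources and can recover if they improve later.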
On the right, the neuroevolution block shows that the actors are evaluated on a given task and ranked by their performance. A portion of the actors is probabilistically selected, with a probability of selection proportional to their rank, and the rest are discarded. We also perform mutation (cloning with small perturbations) and crossover (linearly combining parameters) on the elites to generate high-performing offspring that backfill the discarded networks. The emergent learner is the top-performing network in the evolutionary population. This allows us to handle rewards over arbitrarily long time-horizons and to maintain a large diversity of solutions simultaneously, two areas where policy gradient learners are crucially limited.
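One generation of this evolutionary loop can be sketched as follows. This is a simplified illustration in which each actor is a flat list of parameters. The function names, the linear rank weights, the mutation scale, and the 50/50 split between crossover and mutation are assumptions made for this example, not the settings used in CERL.

```python
import random


def mutate(params, rng, sigma=0.05):
    """Clone a parent and add a small Gaussian perturbation to each parameter."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def crossover(parent_a, parent_b, rng):
    """Linearly combine the parameters of two parents."""
    alpha = rng.random()
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(parent_a, parent_b)]


def next_generation(population, fitness_fn, num_survivors, rng):
    """Rank actors, keep a rank-weighted sample, and backfill with offspring."""
    ranked = sorted(population, key=fitness_fn, reverse=True)
    size = len(ranked)
    rank_weight = [size - i for i in range(size)]  # best actor gets the largest weight
    candidates = list(range(size))
    survivors = []
    while len(survivors) < num_survivors:
        weights = [rank_weight[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(pick)  # sample without replacement
        survivors.append(ranked[pick])
    offspring = []
    while len(survivors) + len(offspring) < size:
        if len(survivors) >= 2 and rng.random() < 0.5:
            parent_a, parent_b = rng.sample(survivors, 2)
            offspring.append(crossover(parent_a, parent_b, rng))
        else:
            offspring.append(mutate(rng.choice(survivors), rng))
    return survivors + offspring


def toy_fitness(params):
    """Hypothetical stand-in for an episode return: higher when parameters are near 1."""
    return -sum((p - 1.0) ** 2 for p in params)


rng = random.Random(0)
population = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(6)]
new_population = next_generation(population, toy_fitness, num_survivors=3, rng=rng)
emergent_learner = max(new_population, key=toy_fitness)
```

In a real setting, the fitness function would run each actor for a complete episode and return its cumulative reward, which is why evolution can handle long time-horizons but evaluates candidates slowly.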
Typical RL solutions depend on hyper-parameters, for example the network topology and training parameters such as the learning rate. We did not tune any hyper-parameters in our CERL solution, which frees us from that design overhead in each new environment we test in.
Our biggest gains came from our key insight to share the experiences generated by the portfolio of policy gradient learners and neuroevolution actors. Further, the portfolio of learners is also optimized across varying resolutions of the same underlying task. This enables them to reach parts of the search space that they would not have reached by themselves. The CERL implementation facilitates collective exploitation of these diverse experiences, enabling solutions that are not feasible for any individual algorithm on its own. Using CERL, we were able to solve robotics benchmarks using fewer cumulative training samples than traditional methods that rely on gradient-based or evolutionary learning alone. Since our solution involves a memory-intensive process in which a large population of networks performs forward propagation, it was able to take advantage of large CPU clusters rather than memory-limited GPUs.
We tested CERL with these standard academic benchmarks: Humanoid, Hopper, Swimmer, HalfCheetah, and Walker2D. The most complex benchmarks involve continuous control tasks, where a robot produces actions that are not discrete. For example, the OpenAI Gym Humanoid benchmark requires a 3D humanoid model to learn to walk forward as fast as possible without falling. A cumulative reward reflects the level of success for this task. This problem is difficult because of the relatively large state space and the continuous action space, since the walking speed can be selected from a continuous range of values.
Until recently, the Humanoid benchmark was unsolved (robots could learn to walk, but they couldn’t keep up a sustained walk). Our team solved it using CERL and achieved a score of about 4,702 in 1M time steps. Another team from UC Berkeley recently solved the Humanoid benchmark using a complementary approach and we are working to combine both teams’ approaches.
The CERL solution fits any large-scale decision-making problem that involves a complex task. It is especially suited to tasks that have multiple hierarchies of requirements, such as physical robotics, complex games, and autonomous driving. Our team is currently working on applying CERL to automatically map various parts of a neural network to various memory and compute units on processors and thus satisfy multiple performance specifications simultaneously.