Friday, March 8, 2019

Reinforcement Learning Series, #1: Policy Iteration - The Bellman Expectation Equation in Dynamic Programming

This blog series will cover the basics of reinforcement learning, starting from the Bellman Expectation Equation. I will explain how it works in Gridworld, a simple game where the goal is for the red square (the agent) to move toward the blue circle (which has a positive reward value) while avoiding the green triangles (which have negative reward values). Gridworld uses dynamic programming: the agent acts sequentially, moving from grid cell to grid cell to reach the reward (the blue circle).

To write this post, I referenced the book “Reinforcement Learning with Python and Keras” (in Korean) by Woongwon Lee et al. To find the full code, go to

The number at the bottom right of each grid cell is its state value. The arrows show the policy (the actions). At the start, the agent doesn’t know what to do, so it treats all actions equally. The R: (value) label on the circle and the triangles is the reward value. I modified the code to display these arrows and state values to aid understanding.

The state value function gives the expected (discounted) rewards the agent will reap if it follows its policy from that state. It is expressed by this equation:
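(The equation image is missing from this version of the post; the standard definition of the state value function, which the text describes, is the expected discounted return:)

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \,\middle|\, S_t = s \,\right]
```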

The problem is that computing the state value directly from this definition becomes impractical the more future rewards you consider. So the Bellman Expectation Equation is used instead, which uses only the current and next state values, as shown below.
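(Again the equation image is missing; the Bellman Expectation Equation the text refers to relates the current state value to the reward plus the discounted next state value:)

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \,\middle|\, S_t = s \,\right]
```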

The Bellman Expectation Equation is rewritten in a form suitable for coding, as below:
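(The image of the computable form is missing; assuming the deterministic transitions described later in the post, the iterative update is a sum over actions weighted by the policy:)

```latex
v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s)\,\bigl( R_{t+1} + \gamma\, v_k(s') \bigr)
```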

When we first click the “Evaluate” button, the screen changes like this. Several state values change according to the Bellman Expectation Equation. This happens in lines 19-20:

The algorithm applies the equation sequentially to each grid cell, left to right, top to bottom.
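One such evaluation sweep could be sketched in Python like this (the grid size, variable names, and helper functions here are illustrative, not the book's actual code):

```python
# A minimal sketch of one policy-evaluation sweep over a 5x5 Gridworld,
# assuming deterministic moves and a uniform initial policy.

GAMMA = 0.9
WIDTH = HEIGHT = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(row, col, action):
    """Move deterministically; bumping into a wall leaves the agent in place."""
    r = min(max(row + action[0], 0), HEIGHT - 1)
    c = min(max(col + action[1], 0), WIDTH - 1)
    return r, c

def evaluate(values, rewards, policy):
    """One sweep of the Bellman expectation backup, cell by cell."""
    new_values = [[0.0] * WIDTH for _ in range(HEIGHT)]
    for row in range(HEIGHT):
        for col in range(WIDTH):
            total = 0.0
            for a, action in enumerate(ACTIONS):
                nr, nc = step(row, col, action)
                # pi(a|s) * (R + gamma * v(s'))
                total += policy[row][col][a] * (
                    rewards[nr][nc] + GAMMA * values[nr][nc])
            new_values[row][col] = round(total, 2)
    return new_values
```

With all state values initialised to zero, only cells next to the circle or a triangle change on the first sweep, which matches the screenshot.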

Let us zoom in on the grey area highlighted above. Why is the state value of the red square 0.25? The picture below shows the grey area before evaluation.

Before (left) and after (right) evaluation

The state value, v𝛑(s) (the little number at the bottom of the red cell), changes from 0.0 (left) to 0.25 (right). I will explain why using the Bellman equation and the relevant code below.

- The state value is determined by the four neighbouring cells. The Bellman equation computes one term for each of the four cells (lines 17-20), then adds them all together in a summation (line 15).
- The policy, 𝛑(a|s), is the probability that the agent moves in a given direction from the current state. The agent can move in all four directions (up, down, left, right), so the initial probability of each action is 0.25.
- Rt+1 is the reward the agent gets by taking that action. The reward is 1 for the blue circle, -1 for the triangles, and 0 everywhere else. (line 17)
- 𝛾v𝛑(s') is the state value of each of the four cells multiplied by the discount factor 𝛾, which diminishes a reward the further away it is. 𝛾 is 0.9 for all four cells, but since every state value starts at 0, 𝛾v𝛑(s') is zero everywhere. (line 18)
- I assumed the agent always follows the policy with no mistakes. This is called a deterministic environment: the state transition probability is 100%. So the transition probability P(s'|s,a) is 1, and I omitted it from the Bellman Equation.
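Putting the points above together, the 0.25 can be checked by hand. This is a sketch of the arithmetic under those assumptions, not the book's code; the only nonzero term comes from the action that leads to the blue circle:

```python
# Hand-computing the red square's new state value under the initial
# uniform policy. One neighbour is the blue circle (reward +1); the
# other three actions give reward 0, and all current v(s') are 0.

gamma = 0.9
pi = 0.25  # pi(a|s): uniform over the four directions

# action toward the blue circle: R = 1, v(s') = 0
toward_circle = pi * (1 + gamma * 0.0)
# the three remaining actions: R = 0, v(s') = 0
other_actions = 3 * pi * (0 + gamma * 0.0)

value = toward_circle + other_actions
print(value)  # 0.25
```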

Then, once we click the “Improve” button, the most rewarding pathways are kept and the other pathways are removed:

Notice some of the arrows have vanished. Looking at the code, we can understand why:

The value (reward + discount factor × next state value) is computed for each action (lines 37-41), then the actions with the highest value are found (lines 44-49), and the policy value in the Bellman Equation, 𝛑(a|s), is updated accordingly (line 52). This lets the agent follow better policies that lead it to the reward more reliably.
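The improvement step for a single cell could be sketched like this (again with illustrative names and grid layout, not the book's exact code):

```python
# A minimal sketch of greedy policy improvement for one Gridworld cell.

GAMMA = 0.9
WIDTH = HEIGHT = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(row, col, action):
    """Move deterministically; bumping into a wall leaves the agent in place."""
    r = min(max(row + action[0], 0), HEIGHT - 1)
    c = min(max(col + action[1], 0), WIDTH - 1)
    return r, c

def improve_cell(row, col, values, rewards):
    """Return the new pi(.|s): keep only the best action(s), drop the rest."""
    # reward + discount factor * next state value, for each action
    q = [rewards[nr][nc] + GAMMA * values[nr][nc]
         for nr, nc in (step(row, col, a) for a in ACTIONS)]
    best = max(q)
    winners = [i for i, v in enumerate(q) if v == best]
    # equal probability over tied best actions, zero for pruned actions
    return [1.0 / len(winners) if i in winners else 0.0
            for i in range(len(ACTIONS))]
```

When several actions tie for the best value, the probability is split evenly among them, which is why some cells keep more than one arrow in the screenshot.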

Screenshots after another evaluation (left) and improvement (right)

As we alternate evaluation and improvement repeatedly, the algorithm gradually finds the optimal policy for the agent. During evaluation the state values change; during improvement the policy values (𝛑(a|s)) change based on them, which in turn affects the state values at the next evaluation. Once the best policy is reached, the resulting value satisfies the Bellman optimality equation.

The video below shows the process of evaluation and improvement: