This blog series will go through the basics of reinforcement learning, starting from the Bellman Expectation Equation. I will explain how it works in Gridworld, a simple game where the goal is for the red square (the agent) to move to the blue circle (which carries a positive reward) while avoiding the green triangles (which carry negative rewards). Gridworld is solved with dynamic programming: the agent (the square) acts sequentially, moving from grid to grid to reach the reward (the blue circle).
To write this post, I referenced the book “Reinforcement Learning with Python and Keras” (in Korean) by Woongwon Lee et al. The full code is at https://github.com/rlcode/reinforcement-learning-kr/blob/master/1-grid-world/1-policy-iteration/policy_iteration.py
The numbers at the bottom right of each grid square are the state values, and the arrows show the policy (actions). At the start, the agent doesn’t know what to do, so it treats all actions equally. The R: (value) label on the circle and triangles is the reward value. I modified the code to display these arrows and state values to aid understanding.
The state value function gives the total reward the agent can expect to reap if it follows the policy from that state onward. It is expressed by this equation:
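$$v_\pi(s) = \mathbb{E}_\pi\!\left[\,R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;\middle|\; S_t = s\,\right]$$

Here γ is the discount factor, explained further below.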
The problem is that computing the state value with this function becomes impractical the more future rewards you consider. So the Bellman Expectation Equation is used instead, which needs only the current and next state values, as shown below.
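$$v_\pi(s) = \mathbb{E}_\pi\!\left[\,R_{t+1} + \gamma\, v_\pi(S_{t+1}) \;\middle|\; S_t = s\,\right]$$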
The Bellman Expectation Equation is rearranged into a computable form for the code, as below:
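$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left[\,R_{t+1} + \gamma\, v_\pi(s')\,\right]$$

where s′ is the state that action a leads to, and the sum runs over the four possible actions.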
When we first click the “Evaluate” button, the screen changes like this. Several state values change according to the Bellman Expectation Equation; the update happens in lines 19-20 of the code. The algorithm applies the equation sequentially to each grid square, left to right, top to bottom.
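For reference, here is a minimal sketch of what one evaluation sweep looks like. It is not the exact rlcode source; the env object and helper names like state_after_action and get_reward are assumptions.

def evaluate_policy(env, value_table, policy, discount=0.9):
    # Compute a fresh table so every update reads the old state values.
    next_value_table = {}
    for state in env.all_states:  # left to right, top to bottom
        value = 0.0
        for action in env.possible_actions:  # up, down, left, right
            next_state = env.state_after_action(state, action)
            reward = env.get_reward(state, action)
            # Bellman expectation backup: pi(a|s) * (R + gamma * v(s'))
            value += policy[state][action] * (reward + discount * value_table[next_state])
        next_value_table[state] = round(value, 2)
    return next_value_table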
Let us zoom in on the grey area highlighted above. Why is the state value of the red square 0.25? The picture below shows the grey area before evaluation.
Before (left) and after (right) evaluation
The state value (vπ(s)) for the red grid square (the small number at the bottom) changes from 0.0 (left) to 0.25 (right).
I will explain why using the Bellman equation and the relevant code below.

The state value is determined by the four grid squares around it. The code computes the terms of the Bellman equation for each of the four neighboring squares (lines 17-20), then adds them all together in a summation (line 15).

The policy, π(a|s), is the probability that the agent will move in a given direction from the given state. The agent can move in all four directions (up, down, left, right), so initially each action has probability 0.25.
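As a quick illustration, the initial uniform policy can be represented like this (a sketch; the 5×5 grid size and dictionary layout are assumptions):

ACTIONS = ["up", "down", "left", "right"]
all_states = [(row, col) for row in range(5) for col in range(5)]  # assumed 5x5 grid

# With no prior knowledge, every action is equally likely: pi(a|s) = 0.25.
policy = {state: {action: 0.25 for action in ACTIONS} for state in all_states}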

The reward the agent gets for acting on that policy is Rₜ₊₁. The reward is +1 for the blue circle and −1 for the triangles, and 0 everywhere else (line 17).
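In code, that reward layout might look like this. The coordinates are assumptions for illustration only:

# Assumed positions: blue circle at (2, 2), green triangles at (1, 2) and (2, 1).
REWARDS = {(2, 2): 1.0, (1, 2): -1.0, (2, 1): -1.0}

def get_reward(state):
    # Every other grid square gives no reward.
    return REWARDS.get(state, 0.0)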

γvπ(s') is the state value of each of the four neighboring squares multiplied by the discount factor. The discount factor diminishes a reward the further away it lies; here it is 0.9 for all four squares. But since every state value starts at zero, γvπ(s') is zero everywhere at this point (line 18).

I assumed the agent always follows the policy with no mistakes. This is called a deterministic environment: the state transition probability is 100%. So I took the transition probability P(s'|s, a) to be 1 and omitted it from the Bellman Equation.
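Putting these pieces together explains the 0.25. One of the red square's four neighbors is the blue circle, which gives reward +1, while the other three directions give reward 0, and every neighboring state value is still 0:

$$v_\pi(s) = 0.25\,(1 + 0.9 \times 0) + 3 \times 0.25\,(0 + 0.9 \times 0) = 0.25$$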
Then once we click the “Improve” button, the pathways with the highest rewards are found, and the other pathways are removed:
The (reward + discount factor × next state value) for each action is computed (lines 37-41), then the policies with the highest values are found (lines 44-49). The policy probability in the Bellman Equation, π(a|s), is updated accordingly (line 52), so the agent can follow better policies that lead it to the reward more reliably.
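Here is a minimal sketch of that greedy improvement step (again not the exact rlcode source; the helper names are the same assumptions as above):

def improve_policy(env, value_table, policy, discount=0.9):
    for state in env.all_states:
        # Score each action by reward + discounted next state value.
        action_values = {}
        for action in env.possible_actions:
            next_state = env.state_after_action(state, action)
            reward = env.get_reward(state, action)
            action_values[action] = reward + discount * value_table[next_state]
        # Keep only the best actions and split the probability among them.
        best = max(action_values.values())
        best_actions = [a for a, v in action_values.items() if v == best]
        policy[state] = {a: (1.0 / len(best_actions) if a in best_actions else 0.0)
                         for a in env.possible_actions}
    return policy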
Screenshots after another evaluation (left) and
improvement (right)
As we alternate evaluation and improvement repeatedly, the algorithm gradually finds the optimal policy for the agent. During evaluation the state values change, and during improvement the policy probabilities (π(a|s)) change based on them, which in turn affects the state values again. Once the best policy is reached, the resulting value is expressed by the Bellman optimality equation.
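In its standard form, the Bellman optimality equation replaces the expectation over the policy with a maximum over actions:

$$v_*(s) = \max_a \mathbb{E}\left[\,R_{t+1} + \gamma\, v_*(S_{t+1}) \;\middle|\; S_t = s, A_t = a\,\right]$$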
The video below shows the process of evaluation and improvement: