Introduction to Reinforcement Learning

This article covers reinforcement learning. This technique differs from many other machine learning techniques and has many applications in training agents (AIs) to interact with environments such as games. Rather than feeding our machine learning model millions of examples, we let the model come up with its own examples by exploring an environment.

The concept is simple. Humans learn by exploring and by learning from mistakes and past experiences, so let’s have our computer do the same.


Before we dive into explaining reinforcement learning, we need to define a few key pieces of terminology.

Environment: In reinforcement learning tasks we have a notion of the environment. This is what our agent will explore. For example, when training an AI to play a game of Mario, the environment would be the level we are training the agent on.

Agent: An agent is an entity that explores the environment. Our agent will interact with the environment and take different actions within it. In our Mario example, the Mario character within the game would be our agent.

State: Our agent is always in what we call a state. The state simply tells us about the status of the agent. The most common example of a state is the location of the agent within the environment. Moving locations would change the agent’s state.

Action: Any interaction between the agent and the environment is considered an action. For example, moving to the left or jumping would be an action. An action may or may not change the current state of the agent. In fact, doing nothing is an action as well! In our Mario example, that would be the action of not pressing a key.

Reward: Every action that our agent takes results in a reward (positive, negative, or zero). The goal of our agent is to maximize its reward in an environment. Sometimes the reward will be clear: for example, if an agent performs an action that increases its score in the environment, we could say it has received a positive reward. If the agent performs an action that results in it losing score or possibly dying in the environment, then it receives a negative reward.

One of the most important parts of reinforcement learning is determining how to reward the agent. After all, the goal of the agent is to maximize its rewards, so we should reward it in a way that leads it to the desired goal.


Now that we have a rough idea of how reinforcement learning works, it’s time to talk about a specific technique in reinforcement learning called Q-Learning.

Q-Learning is a simple yet quite powerful technique in machine learning that involves learning a matrix of action-reward values. This matrix is often referred to as a Q-Table or Q-Matrix. The matrix has shape (number of possible states, number of possible actions), where each value at matrix[n, m] represents the agent’s expected reward given that it is in state n and takes action m. The Q-Learning algorithm defines the way we update the values in the matrix and decide what action to take in each state. The idea is that after successfully training this Q-Table/matrix, we can determine the action an agent should take in any state by looking at that state’s row in the matrix and choosing the column with the maximum value as the action.

Consider this example.

Let’s say A1-A4 are the possible actions and we have 3 states, one per row (state 1 to state 3).

If that were our Q-Table/matrix, then the following would be the preferred actions in each state.

State 1: A3

State 2: A2

State 3: A1

We can see that this is because the values in each of those columns are the highest for those states!
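
To make the lookup concrete, here is a minimal sketch in plain Python. The numbers in the table are invented for illustration (they are not taken from a trained agent) and were chosen only so that the preferred actions match the list above. The helper name best_action is our own.

```python
# Hypothetical Q-Table: 3 states (rows) x 4 actions (columns A1..A4).
# The values are made up for illustration only.
ACTIONS = ["A1", "A2", "A3", "A4"]
q_table = [
    [0.0, 0.0, 10.0, 5.0],   # state 1: A3 has the largest value
    [5.0, 10.0, 0.0, 0.0],   # state 2: A2 has the largest value
    [10.0, 5.0, 0.0, 0.0],   # state 3: A1 has the largest value
]


def best_action(q_table, state):
    """Return the index of the column with the highest value in the state's row."""
    row = q_table[state]
    return max(range(len(row)), key=lambda action: row[action])


preferred = [ACTIONS[best_action(q_table, state)] for state in range(len(q_table))]
print(preferred)  # ['A3', 'A2', 'A1']
```

Note that the rows are indexed from 0 in the code, so "state 1" in the text is row 0 here.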

Learning the Q-Table

So that’s simple, right? Now how do we create this table and find those values? This is where we discuss how the Q-Learning algorithm updates the values in our Q-Table.

I’ll start by noting that our Q-Table starts off with all 0 values. This is because the agent has yet to learn anything about the environment.

Our agent learns by exploring the environment and observing the outcome/reward from each action it takes in each state. But how does it know what action to take in each state? There are two ways that our agent can decide which action to take.

  • Randomly picking a valid action
  • Using the current Q-Table to find the best action.

Near the beginning of our agent’s learning it will mostly take random actions in order to explore the environment and enter many different states. As it explores more of the environment, it will gradually rely more on its learned values (Q-Table) to take actions. This means that as our agent explores more of the environment, it will develop a better understanding and start to take “correct” or better actions more often. It’s important that the agent keeps a good balance between taking random actions and using learned values, to ensure it does not get trapped in a local maximum.

After each new action our agent will record the new state (if any) that it has entered and the reward that it received from taking that action. These values will be used to update the Q-Table. The agent will stop taking new actions only once a certain time limit is reached, it has achieved the goal, or it has reached the end of the environment.

Updating Q-Values

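The formula for updating the Q-Table after each action is the standard Q-Learning update, where s is the current state, a is the action taken, r is the reward received and s' is the new state:

Q(s, a) ← Q(s, a) + α · [r + γ · max over a' of Q(s', a') − Q(s, a)]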

  • α stands for the Learning Rate
  • γ stands for the Discount Factor

Learning Rate α

The learning rate α is a numeric constant that defines how much change is permitted on each Q-Table update. A high learning rate means that each update will introduce a large change to the current state-action value. A small learning rate means that each update makes a more subtle change. Modifying the learning rate will change how the agent explores the environment and how quickly it determines the final values in the Q-Table.

Discount Factor γ

The discount factor, also known as gamma (γ), is used to balance how much focus is put on the current reward versus future rewards. A high discount factor means that future rewards will be considered more heavily.

To perform updates on this table we will let the agent explore the environment for a certain period of time and use each of its actions to make an update. Slowly we should start to notice the agent learning and choosing better actions.
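
To get a feel for what α and γ do, here is a small worked sketch of a single update, assuming the standard Q-Learning rule (new value = old value + α × (reward + γ × best next value − old value)). The function name q_update and all of the numbers are our own, picked only for illustration.

```python
def q_update(old_value, reward, best_next_value, alpha, gamma):
    """Apply one Q-Learning update to a single state-action value."""
    target = reward + gamma * best_next_value   # what the value "should" be
    return old_value + alpha * (target - old_value)


# Illustrative numbers: current value 0.5, reward 1.0, best value in the next state 2.0.
old, reward, best_next = 0.5, 1.0, 2.0

small_step = q_update(old, reward, best_next, alpha=0.1, gamma=0.9)     # about 0.73
large_step = q_update(old, reward, best_next, alpha=0.9, gamma=0.9)     # about 2.57
short_sighted = q_update(old, reward, best_next, alpha=0.1, gamma=0.0)  # about 0.55

print(small_step, large_step, short_sighted)
```

With the same experience, the larger learning rate moves the value most of the way to the target of 2.8, while setting γ to 0 makes the agent ignore the next state entirely and look only at the immediate reward.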

Q-Learning Example

For this example we will use the Q-Learning algorithm to train an agent to navigate a popular environment from OpenAI Gym. OpenAI Gym was developed so that programmers could practice machine learning using unique environments. Interesting fact: Elon Musk is one of the founders of OpenAI!

Let’s start by looking at what OpenAI Gym is.

Once you import gym, you can load an environment using a line such as env = gym.make('FrozenLake-v0').

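There are a few other commands that can be used to interact with and get information about the environment. For an environment with discrete states and actions such as this one, env.observation_space.n gives the number of states and env.action_space.n gives the number of actions. env.reset() resets the environment to its starting state, env.action_space.sample() returns a random action, env.step(action) takes an action (in the classic Gym API it returns the new state, the reward, a done flag and some extra info), and env.render() draws the current state of the environment.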

Frozen Lake Environment

Now that we have a basic understanding of how the gym environment works, it’s time to discuss the specific problem we will be solving.

The environment we loaded above, FrozenLake-v0, is one of the simplest environments in OpenAI Gym. The goal of the agent is to navigate a frozen lake and find the Goal without falling through the ice (render the environment above to see an example).

There are:

  • 16 states (one for each square)
  • 4 possible actions (LEFT, RIGHT, DOWN, UP)
  • 4 different types of blocks (F: frozen, H: hole, S: start, G: goal)
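
The numbers above can be sketched in plain Python without Gym installed. The layout below is what we believe the default 4x4 FrozenLake map looks like, and the action numbering (0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP) is likewise our assumption about the environment's convention. Treat both as illustrative, and note that tile_at is a helper of our own, not part of Gym.

```python
# Assumed 4x4 layout (S: start, F: frozen, H: hole, G: goal).
LAKE_MAP = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]
N_COLS = len(LAKE_MAP[0])
N_STATES = len(LAKE_MAP) * N_COLS

# Assumed action numbering.
ACTIONS = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def tile_at(state):
    """Map a state index (0..15, counted row by row) to the tile letter at that square."""
    row, col = divmod(state, N_COLS)
    return LAKE_MAP[row][col]


print(N_STATES, len(ACTIONS))               # 16 4
print(tile_at(0), tile_at(5), tile_at(15))  # S H G
```

States are counted row by row, so state 0 is the top-left start square and state 15 is the bottom-right goal.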

Building the Q-Table

The first thing we need to do is build an empty Q-Table that we can use to store and update our values.
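
A minimal sketch of such a table, using plain Python lists so that it runs without any extra packages. The sizes are hard-coded from the list above. With Gym and NumPy available, they could instead be read from env.observation_space.n and env.action_space.n, and the table created with np.zeros.

```python
STATES = 16   # one per square of the lake
ACTIONS = 4   # LEFT, RIGHT, DOWN, UP

# One row per state, one column per action, every value starting at 0
# because the agent has not learned anything yet.
q_table = [[0.0] * ACTIONS for _ in range(STATES)]

print(len(q_table), len(q_table[0]))  # 16 4
```

The list comprehension matters here: writing [[0.0] * ACTIONS] * STATES would make every row the same list object, so updating one state would silently update all of them.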


As we discussed we need to define some constants that will be used to update our Q-Table and tell our agent when to stop training.
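
As a sketch, the constants could look like this. The specific values are assumptions chosen for illustration and are not tuned; different values may work better.

```python
EPISODES = 2000        # how many times the agent starts the environment from the beginning
MAX_STEPS = 100        # maximum number of steps allowed in each episode
LEARNING_RATE = 0.81   # alpha: how much each update changes a Q-Table value
GAMMA = 0.96           # discount factor: how heavily future rewards are weighed
```

EPISODES and MAX_STEPS together decide when training stops, while LEARNING_RATE and GAMMA are the α and γ from the update formula.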

Picking an Action

Remember that we can pick an action using one of two methods:

  • Randomly picking a valid action
  • Using the current Q-Table to find the best action.

Here we will define a new value ϵ that will tell us the probability of selecting a random action. This value will start off very high and slowly decrease as the agent learns more about the environment.

code to pick action
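
A minimal sketch of this choice (often called an epsilon-greedy policy), written with plain lists and the standard random module so that it runs without Gym. The function name pick_action, the starting value of epsilon and the EPSILON_DECAY step are our own assumptions.

```python
import random


def pick_action(q_table, state, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best known one."""
    n_actions = len(q_table[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)                      # explore: random valid action
    row = q_table[state]
    return max(range(n_actions), key=lambda action: row[action])  # exploit: use the Q-Table


# Tiny one-state table for demonstration; action 2 has the highest value.
q_table = [[0.0, 0.2, 0.9, 0.1]]
print(pick_action(q_table, 0, epsilon=0.0))  # 2

# Epsilon starts high and shrinks a little after every episode (hypothetical schedule).
EPSILON_DECAY = 0.001
epsilon = 0.9
epsilon = max(0.0, epsilon - EPSILON_DECAY)
```

With epsilon at 0 the agent always trusts the Q-Table, and with epsilon at 1 it always acts randomly; training moves gradually from the second behaviour towards the first.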

Updating Q Values

The code below implements the formula discussed above.
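
A minimal sketch of that update, assuming the standard Q-Learning rule and a Q-Table stored as a list of rows. The function name update_q and the small two-state table are ours, used only to show one update.

```python
def update_q(q_table, state, action, reward, new_state, alpha, gamma):
    """Update q_table[state][action] in place after one step in the environment."""
    best_next = max(q_table[new_state])                 # best value reachable from the new state
    error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * error


# Two states, two actions. State 1 already knows that its action 1 is worth 1.0.
q_table = [[0.0, 0.0],
           [0.0, 1.0]]

# The agent was in state 0, took action 1, received no reward and landed in state 1.
update_q(q_table, state=0, action=1, reward=0.0, new_state=1, alpha=0.5, gamma=0.9)
print(q_table[0])  # [0.0, 0.45]
```

Even though the reward for this step was 0, the value of the action rises, because it leads to a state from which a reward is already known to be reachable. This is how rewards spread backwards through the table.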

Putting it Together

Now that we know how to do some basic things, we can combine them to create our Q-Learning algorithm.
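
Here is a sketch of the whole loop under some loudly stated assumptions. To keep it runnable without Gym, it uses a small stand-in for the environment (the step function below) instead of the real FrozenLake-v0. The stand-in is deterministic, while the real environment is slippery by default, and its map and action numbering are assumed. All names and constants are our own, and ties between equally good actions are broken at random so that an all-zero row does not always produce the same move.

```python
import random

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]   # assumed layout: S start, F frozen, H hole, G goal
SIZE = len(LAKE_MAP)
N_ACTIONS = 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}   # LEFT, DOWN, RIGHT, UP


def step(state, action):
    """Deterministic stand-in for env.step: returns (new_state, reward, done)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)   # the edges keep the agent on the grid
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = LAKE_MAP[row][col]
    reward = 1.0 if tile == "G" else 0.0       # only reaching the goal is rewarded
    return row * SIZE + col, reward, tile in ("G", "H")


def greedy_action(row_values, rng):
    """Best action for a row of Q-values, breaking ties at random."""
    best = max(row_values)
    return rng.choice([a for a, value in enumerate(row_values) if value == best])


def train(episodes=2000, max_steps=100, alpha=0.81, gamma=0.96,
          epsilon=0.9, epsilon_decay=0.001, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0] * N_ACTIONS for _ in range(SIZE * SIZE)]
    for _ in range(episodes):
        state = 0                                            # every episode starts at S
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)            # explore
            else:
                action = greedy_action(q_table[state], rng)  # exploit
            new_state, reward, done = step(state, action)
            target = reward + gamma * max(q_table[new_state])
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = new_state
            if done:
                break
        epsilon = max(0.0, epsilon - epsilon_decay)          # rely more on the table over time
    return q_table


def greedy_path(q_table, max_steps=100):
    """Follow the best known action from the start and return the visited states."""
    state, path = 0, [0]
    for _ in range(max_steps):
        action = max(range(N_ACTIONS), key=lambda a: q_table[state][a])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
print(greedy_path(q_table))   # the last state should be 15, the goal
```

To try this against the real environment, the step function would be replaced by calls to env.reset() and env.step(action). Because the real lake is slippery, the agent would then only reach the goal some of the time, even after training.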

We can do much more than this, such as implementing AI for classic games like Flappy Bird and Mario, but that is too much for this blog post. If you are interested, you can check out the GitHub repo to play with the code.


Just a boring guy who falls in love with machine learning