Reinforcement Learning - Theoretical Foundations: Part I


Recently I've been learning about reinforcement learning from amazing lectures from David Silver. These provide an overview of the classical algorithms in RL and potential challenges for future researches, in the subsequent blogs, I'll talk about the major aspects of RL and provide some solid math details on how algorithms in RL is executed.

What is Reinforcement Learning

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function
    • It is a map from state to action
    • Deterministic policy: a =
    • Stochastic policy:
  • Value function: how good is each state and/or action
    • Value function is a prediction of future reward
    • Used to evaluate the goodness/badness of states
    • And therefore to select between actions
  • Model: agent’s representation of the environment
    • A model predicts what the environment will do next
    • predicts the next state
    • predicts the next (immediate) reward, this often takes the form of expectation

It is derived for a classical problem - Exploration vs Exploitation. There is no supervisor, only a reward signal. Sometimes, feedback is delayed, not instantaneous. Time really matters (sequential, non i.i.d data); and Agent's actions affect the subsequent data it receives.

Prediction vs Control

RL problems is often classified into a prediction problem or a control problem

  • Prediction: Given a policy, evaluate the future
  • Control: Find the best policy

Markov Decision Process (MDP)

Before venturing into the exact algorithms, let's lay out some fundamental math concepts here.

Prior Knowledge

  • Basic Probability Knowledge
  • Basic Stochastic Process:
    • Markov Chain
    • Transition Matrix

Problem setup

  1. This is an RL setting where the environment is fully observable
  2. The current state completely characterises the process
  3. Almost all RL problems can be formalised as MDPs (Bandit are MDP with 1 state & finite/infinite actions)


  1. Markov Reward Process
  • A Markov reward process is a Markov chain with values.
  • In addition to , we have reward function and a discount factor
  1. Return
  • The return is the total discounted reward from time-step :
  • Intuitively, the discount factor favours future rewards at a nearer date
  1. State-value function
  • The state value function of an MRP is the expected return starting from state
  1. Bellman Equation
  • We apply one-step analysis on and observe that:
  • Therefore, we find that value function of any state is only dependent on its outgoing states (successors)
  • We can then construct a matrix equation: , which can be solved in (hence very expensive)
  • For large state set, more efficient algorithms utilising Dynamic programming (DP) is preferred
  1. MDP
  • A Markov decision process (MDP) is a Markov reward process with decisions
  • In addition to states, we have a set of actions
  • The probability and reward functions now conditionally depend on and simultaneous as follows
  1. Policy
  • A policy is a distribution over actions given states:
  • A policy is stationary (the probability does not change for different iterations)
  1. State-value function and Action-value function for MDP (Differs from 3.)
  • State-value function: the expectation taken w.r.t the policy
  • Action-value function:
  • Note that
  1. Applying Bellman equation on and
  • Note that we cannot do simple one-step analysis since there are 2 variables and now.
  • Instead, we try to apply OSA on w.r.t , and then do OSA on w.r.t to get Bellman Expectation Equation on , and swap the 2 variables’ order to derive equation for
  • First step (Bellman Expectation Equation):
  • Second step (Bellman Optimality Equation):
  1. Optimality
  • We try to maximize and (Goal 1)
  • Theorem suggests existence of policy that achieves Goal 1 deterministically.
  • An optimal policy can be found by choosing actions for each state such that the choices maximise over
  1. Solving for optimality
  • The Bellman equations in 8 is often nonlinear and has no closed form in general
  • We often need to apply sequential methods to solve it
  • Value iteration
  • Policy Iteration
  • Q-Learning
  • Sarsa
  1. To read more on extensions, refer to Page 49 of this slides.

Reinforcement Learning - Theoretical Foundations: Part I


Zhenlin Wang

Posted on


Updated on


Licensed under