Reinforcement Learning - Theoretical Foundations: Part III

Model - Free Prediction

1. Introduction

This is an important chapter that lays the fundamental work for model-free control algorithms. In this blog we shall see a few important ideas (MC, TD, online/offline, forward/backward learning) being discussed. While this chapter is not math-intense, it is imperative for us to remember the concepts before moving onto control algorithms.

To begin with, note that this is a prediction problem. Hence we are still only going to predict the final based on a given . However, the Model-free part suggests that we no longer require an MDP to be explicitly defined/given. This is because we are going to use a sampling strategy.

  • Sampling:
    If a strategy derives certain functions (in this case ) directly via episodes of observations, we say that this strategy applies sampling method.

2. Monte Carlo (MC) Policy Evaluation

  • MC learns from complete episodes (no bootstrapping)
  • Monte-Carlo policy evaluation uses empirical mean return instead of expected return
  1. First-step MC
  • For each state , only conisder the first time that is visited in each episode (update at most 1 time per run)
  • Increment counter
  • Increment total return
  • Estimate mean return value
  • The estimator converges to when number of visits approaches infinity
  1. Every-step MC
  • For each state , conisder each time that is visited in each episode (update at most 1 time per run)
  • For instance, , then we update and where
  • The remaining part are the same as First-step MC
  1. Incremental Monte-Carlo Updates
  • Based on the idea

  • Hence we take perspective of each episodes observations instead of states .

  • In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes. Hence the formula is tweeked.

    Since MC only updates when the episode terminates (hence 'complete'), it cannot be applied to process which may run infinitely.

3. Temporal Difference (TD) Learning

  • A solution to the 'incomplete' episodes problem by applying bootstrapping
  • The above is replaced by an estimate (estimated return) based on the Bellman Expectation Equation.
  • is called TD target
  • is called TD error
  • As now only depends on the next time-step and not the entire episode result , we say it is online (and hence the other method which updates everything when episode ends is offline).

4. Bias Variance trade-off

  • Computation using is unbiased, but with high variance due to all rewards from transitions and actions selected at random
  • Computation using TD target is biased, but with much lower variance, since we we only have one reward to be varied (note that is known and fixed during update of
  • By the classical trade-off analysis, we see that
    • TD is more sensitive to inital values
    • TD is usually more efficient than MC

5. More comparisons between TD and MC

converges to solution with minimum mean-squared error max likelihood Markov model
Convergence with function approximation Yes No
Exploits Markov property No Yes
More efficient in Non-Markov env Markov env

6. TD()

Now we explore alternative estimators for

  • n-step return:
  • -return:

Here by utilising , we have TD() as a new evaluation policy. However, notice from the expression above that is forward-looking, meaning that this would require transition steps until the episodes end. This faces exactly the same limitation as MC!!!

7. Solution: Backward online evaluation

Instead of look for values in future time-steps, we use values from earlier time-steps. This method is called a backward view of TD(). In essence, we update online, every step, from incomplete sequences.

Based on a theorem, the sum of offline updates is identical for forward-view and backward-view TD().

8. Eligibility Trace

The key to Backward TD() is eligbility trace, we intuitively derive contributing factors from earlier time steps as follows:

  • Frequency heuristic: assign credit to most frequent states
  • Recency heuristic: assign credit to most recent states ( discount rate)

Combing 2 heuristics above, we obtain a rule and 𝟙.
Hence the new update formula is . Observe now that we need to update for every state upon each time-step.

9. Additonal note

  • TD(1) is roughly equivalent to every-visit Monte-Carlo
  • TD(0) is exactly equivalent to simple TD

Model Free Control

1. Main objective

Instead of estimating the value function, we try to optimize the value function with an unknown MDP.

2. Recap of On-Policy vs Off-Policy

  1. On-policy
    Learn about from experience sampled from previous rounds in .
    In essence trying to improve by running it using current agent iteself.

  2. Off-policy
    Learn about from experience sampled from previous rounds (or complete run) in .
    In essence trying to improve by Observing another policy getting run by another agent and deduce several directions to improve .

3. Generalised Policy Iteration With Monte-Carlo Evaluation

In general we can following the policy iteration method introduced in chapter 3 following 2 steps:

  • Policy evaluation - Monte-Carlo policy evaluation:

    • We want to evaluate . However, with out MDP, we cannot determine easily using a simple State-Value Function. So we must resort to a Action-Value Function, i.e., .
    • We further observe that in each iteration, we must run the full policy to obtain for every action/state pair. This is highly inefficient as we do not know how long it takes. Hence instead we do episodic updates using . That is, we do not fully evaluate that policy, but sample state-action pair with current policy for times per episode and immediately improve the policy upon that. This manually set by us imposes guarantee on sampling complexity.
    • We call the strategy above GILE MC control as it satisfies the GILE property.
    • In conclusion, the evaluation phase is as follows:
      • Sample kth episode using
      • For each state and action in the episode:
      • ,
  • Policy improvement: -Greedy policy improvement

    • We choose actions that greedily maximizes the . We allow some degree of exploration by making such greedy choice with probability.

    • Nothat this -Greedy policy works as we always have improvement like the proof shown in DP note.

4. GILE property

Greedy in the Limit with Infinite Exploration (GLIE) has the following 2 parts:

  • All state-action pairs are explored infinitely many times:

  • The policy converges on a greedy policy,

For example, -greedy is GLIE if reduces to zero at .
This property enables the following theorem:

  • GLIE Monte-Carlo control converges to the optimal action-value function:

5. Temporal Difference method

  1. From MC to TD
    Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
  • Lower variance
  • Online
  • Incomplete sequences
  1. Sarsa update of
    The most basic form is . Now recall from model-free prediction the variations of TD:
  • n-step Sarsa
  • Forward View Sarsa()
  • Backward View Sarsa(): we use eligibility traces in an online algorithm
    • .
    • In each iteration, we upate for every pair.
  1. Off-policy learning
    In this case, we evaluate target policy to compute or , but the evaluation was based on another (ongoing or completed) policy run

It has the following advantages:

  • Learn from observing humans or other agents
  • Re-use experience generated from old policies
  • Learn about optimal policy while following exploratory policy
  • Learn about multiple policies while following one policy
  1. Importance sampling for Off-policy
    We note that we can estimate the expectation of a different distribution via:

One may trie to use returns generated from to evaluate via multiple sampling corrections:

and then

However, this multiple chaining may result in:

  • Invalud computation when one of the while
  • Dramatically increasing variance

Hence, we consider adopting TD target for importance sampling instead of actual return . This removes the multiple chaining as the expression becomes:

Unfortunately, in the above expression, we are still sticking to the policy in choosing when we update our for . This is not very reasonable, as our policy could potentially have a better choice of action. Importance sampling discounts this fact. Hence we may seek for alternative solution that removes to need to do importance sampling.

6. Q-Learning

Q-learning is a method that resolves the above issue. We now consider update based on : we choose maximizer . This allows both behaviour and target policies to improve. Note that in this case, is improved via a -greedy case since is chosen randomly with probability and by theorem.

Reinforcement Learning - Theoretical Foundations: Part III


Zhenlin Wang

Posted on


Updated on


Licensed under