Reinforcement Learning - Theoretical Foundations: Part III

2021-01-07Blogs

Model - Free Prediction

1. Introduction

This is an important chapter that lays the fundamental work for model-free control algorithms. In this blog we shall see a few important ideas (MC, TD, online/offline, forward/backward learning) being discussed. While this chapter is not math-intense, it is imperative for us to remember the concepts before moving onto control algorithms.

To begin with, note that this is a prediction problem. Hence we are still only going to predict the final based on a given . However, the Model-free part suggests that we no longer require an MDP to be explicitly defined/given. This is because we are going to use a sampling strategy.

Sampling:
If a strategy derives certain functions (in this case ) directly via episodes of observations, we say that this strategy applies sampling method.

2. Monte Carlo (MC) Policy Evaluation

MC learns from complete episodes (no bootstrapping)
Monte-Carlo policy evaluation uses empirical mean return instead of expected return

First-step MC

For each state , only conisder the first time that is visited in each episode (update at most 1 time per run)
Increment counter
Increment total return
Estimate mean return value
The estimator converges to when number of visits approaches infinity

Every-step MC

For each state , conisder each time that is visited in each episode (update at most 1 time per run)
For instance, , then we update and where
The remaining part are the same as First-step MC

Incremental Monte-Carlo Updates

Based on the idea
Hence we take perspective of each episodes observations instead of states .
In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes. Hence the formula is tweeked.

Limitation:
Since MC only updates when the episode terminates (hence 'complete'), it cannot be applied to process which may run infinitely.

3. Temporal Difference (TD) Learning

A solution to the 'incomplete' episodes problem by applying bootstrapping
The above is replaced by an estimate (estimated return) based on the Bellman Expectation Equation.
is called TD target
is called TD error
As now only depends on the next time-step and not the entire episode result , we say it is online (and hence the other method which updates everything when episode ends is offline).

4. Bias Variance trade-off

Computation using is unbiased, but with high variance due to all rewards from transitions and actions selected at random
Computation using TD target is biased, but with much lower variance, since we we only have one reward to be varied (note that is known and fixed during update of
By the classical trade-off analysis, we see that
- TD is more sensitive to inital values
- TD is usually more efficient than MC

5. More comparisons between TD and MC

	MC	TD
converges to solution with	minimum mean-squared error	max likelihood Markov model
Convergence with function approximation	Yes	No
Exploits Markov property	No	Yes
More efficient in	Non-Markov env	Markov env

6. TD()

Now we explore alternative estimators for

n-step return:
-return:

Here by utilising , we have TD() as a new evaluation policy. However, notice from the expression above that is forward-looking, meaning that this would require transition steps until the episodes end. This faces exactly the same limitation as MC!!!

7. Solution: Backward online evaluation

Instead of look for values in future time-steps, we use values from earlier time-steps. This method is called a backward view of TD(). In essence, we update online, every step, from incomplete sequences.

Based on a theorem, the sum of offline updates is identical for forward-view and backward-view TD().

8. Eligibility Trace

The key to Backward TD() is eligbility trace, we intuitively derive contributing factors from earlier time steps as follows:

Frequency heuristic: assign credit to most frequent states
Recency heuristic: assign credit to most recent states ( discount rate)

Combing 2 heuristics above, we obtain a rule and $𝟙$ .
Hence the new update formula is . Observe now that we need to update for every state upon each time-step.

9. Additonal note

TD(1) is roughly equivalent to every-visit Monte-Carlo
TD(0) is exactly equivalent to simple TD

Model Free Control

1. Main objective

Instead of estimating the value function, we try to optimize the value function with an unknown MDP.

2. Recap of On-Policy vs Off-Policy

On-policy
Learn about from experience sampled from previous rounds in .
In essence trying to improve by running it using current agent iteself.
Off-policy
Learn about from experience sampled from previous rounds (or complete run) in .
In essence trying to improve by Observing another policy getting run by another agent and deduce several directions to improve .

3. Generalised Policy Iteration With Monte-Carlo Evaluation

In general we can following the policy iteration method introduced in chapter 3 following 2 steps:

Policy evaluation - Monte-Carlo policy evaluation:
- We want to evaluate . However, with out MDP, we cannot determine easily using a simple State-Value Function. So we must resort to a Action-Value Function, i.e., .
- We further observe that in each iteration, we must run the full policy to obtain for every action/state pair. This is highly inefficient as we do not know how long it takes. Hence instead we do episodic updates using . That is, we do not fully evaluate that policy, but sample state-action pair with current policy for times per episode and immediately improve the policy upon that. This manually set by us imposes guarantee on sampling complexity.
- We call the strategy above GILE MC control as it satisfies the GILE property.
- In conclusion, the evaluation phase is as follows:
  - Sample kth episode using
  - For each state and action in the episode:
  - ,
Policy improvement: -Greedy policy improvement
- We choose actions that greedily maximizes the . We allow some degree of exploration by making such greedy choice with probability.
- Nothat this -Greedy policy works as we always have improvement like the proof shown in DP note.

4. GILE property

Greedy in the Limit with Infinite Exploration (GLIE) has the following 2 parts:

All state-action pairs are explored infinitely many times:
The policy converges on a greedy policy,

For example, -greedy is GLIE if reduces to zero at .
This property enables the following theorem:

GLIE Monte-Carlo control converges to the optimal action-value function:

5. Temporal Difference method

From MC to TD
Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)

Lower variance
Online
Incomplete sequences

Sarsa update of
The most basic form is . Now recall from model-free prediction the variations of TD:

n-step Sarsa
Forward View Sarsa()
Backward View Sarsa(): we use eligibility traces in an online algorithm
- .
- In each iteration, we upate for every pair.

Off-policy learning
In this case, we evaluate target policy to compute or , but the evaluation was based on another (ongoing or completed) policy run

It has the following advantages:

Learn from observing humans or other agents
Re-use experience generated from old policies
Learn about optimal policy while following exploratory policy
Learn about multiple policies while following one policy

Importance sampling for Off-policy
We note that we can estimate the expectation of a different distribution via:

One may trie to use returns generated from to evaluate via multiple sampling corrections:

and then

However, this multiple chaining may result in:

Invalud computation when one of the while
Dramatically increasing variance

Hence, we consider adopting TD target for importance sampling instead of actual return . This removes the multiple chaining as the expression becomes:

Unfortunately, in the above expression, we are still sticking to the policy in choosing when we update our for . This is not very reasonable, as our policy could potentially have a better choice of action. Importance sampling discounts this fact. Hence we may seek for alternative solution that removes to need to do importance sampling.

6. Q-Learning

Q-learning is a method that resolves the above issue. We now consider update based on : we choose maximizer . This allows both behaviour and target policies to improve. Note that in this case, is improved via a -greedy case since is chosen randomly with probability and by theorem.

Reinforcement Learning - Theoretical Foundations: Part III

https://criss-wang.github.io/post/blogs/rl/reinfocement-3/

Author

Zhenlin Wang

Posted on

2021-01-07

Updated on

2022-01-22

Reinforcement Learning - Theoretical Foundations: Part III

Model - Free Prediction

1. Introduction

2. Monte Carlo (MC) Policy Evaluation

3. Temporal Difference (TD) Learning

4. Bias Variance trade-off

5. More comparisons between TD and MC

6. TD()

7. Solution: Backward online evaluation

8. Eligibility Trace

9. Additonal note

Model Free Control

1. Main objective

2. Recap of On-Policy vs Off-Policy

3. Generalised Policy Iteration With Monte-Carlo Evaluation

4. GILE property

5. Temporal Difference method

6. Q-Learning

Author

Posted on

Updated on

Licensed under

Tags

Catalogue