Reinforcement Learning - Theoretical Foundations: Part III
Model-Free Prediction
1. Introduction
This is an important chapter that lays the groundwork for model-free control algorithms. In this blog we discuss a few important ideas: Monte-Carlo (MC) and temporal-difference (TD) learning, online vs. offline updates, and forward vs. backward views. While this chapter is not math-intensive, it is imperative that we keep these concepts in mind before moving on to control algorithms.
To begin with, note that this is a prediction problem. Hence we are still only trying to predict the value function of a given, fixed policy; improving the policy is deferred to the control part.
- Sampling: if a strategy derives the functions of interest (here, value functions) directly from episodes of observations, we say that this strategy applies a sampling method.
2. Monte Carlo (MC) Policy Evaluation
- MC learns from complete episodes (no bootstrapping)
- Monte-Carlo policy evaluation uses empirical mean return instead of expected return
- First-visit MC
  - For each state $s$, only consider the first time-step $t$ at which $s$ is visited in each episode (so each state is updated at most once per episode):
    - Increment counter: $N(s) \leftarrow N(s) + 1$
    - Increment total return: $S(s) \leftarrow S(s) + G_t$
    - Estimate mean return value: $V(s) = S(s) / N(s)$
  - The estimator $V(s)$ converges to $v_\pi(s)$ when the number of visits $N(s)$ approaches infinity.
- Every-visit MC
  - For each state $s$, consider every time-step at which $s$ is visited in each episode (so a state may be updated several times per episode).
  - For instance, if $s$ is visited at time-steps $t_1$ and $t_2$, then we update $N(s)$ and $S(s)$ at both visits, using the returns $G_{t_1}$ and $G_{t_2}$.
  - The remaining steps are the same as for first-visit MC.
- Incremental Monte-Carlo Updates
  Based on the incremental-mean identity $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$, we take the perspective of each episode's observations $S_1, A_1, R_2, \dots, S_T$ instead of aggregating per state afterwards:
  $N(S_t) \leftarrow N(S_t) + 1, \quad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$
  In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes. Hence the formula is tweaked to use a constant step size: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$ (a code sketch follows below).
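To make the incremental update concrete, here is a minimal Python sketch of every-visit MC prediction. The episode format (a list of `(state, reward)` pairs already sampled under the policy) and the names `mc_prediction`, `gamma`, `alpha` are my own illustrative assumptions, not anything fixed by the lecture.

```python
# Minimal sketch of incremental every-visit Monte-Carlo prediction.
# Assumptions (illustrative): each episode is a list of (state, reward)
# pairs generated under the policy being evaluated.
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99, alpha=None):
    V = defaultdict(float)   # value estimate V(s)
    N = defaultdict(int)     # visit counter N(s)
    for episode in episodes:
        # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            # Running mean (1/N) or constant step size alpha (non-stationary case)
            step = alpha if alpha is not None else 1.0 / N[state]
            V[state] += step * (G - V[state])
    return V
```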
Limitation:
Since MC only updates when an episode terminates (hence "complete" episodes), it cannot be applied to processes that may run indefinitely, i.e. non-terminating tasks.
3. Temporal Difference (TD) Learning
- A solution to the 'incomplete' episodes problem by applying bootstrapping
- The return $G_t$ above is replaced by an estimated return $R_{t+1} + \gamma V(S_{t+1})$, based on the Bellman expectation equation:
  $V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$
  $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
- As the update now only depends on the next time-step and not the entire episode's return $G_t$, we say TD is online (and hence MC, which only updates everything when the episode ends, is offline). A code sketch of this update loop follows below.
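Here is a minimal sketch of tabular TD(0) prediction. It assumes a hypothetical Gym-style environment interface (`env.reset()` returning a state, `env.step(action)` returning `(state, reward, done, info)`) and a `policy(state)` function; these names are illustrative, not from the original notes.

```python
# Minimal sketch of tabular TD(0) prediction for a fixed policy.
# Assumptions (illustrative): Gym-style env API and a callable policy.
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, gamma=0.99, alpha=0.1):
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target bootstraps from the next state's estimate;
            # terminal states are worth 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])  # step along the TD error
            state = next_state
    return V
```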
4. Bias-variance trade-off
- Computation using the full return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is unbiased, but has high variance, since it depends on all the rewards from transitions and actions selected at random along the episode.
- Computation using the TD target $R_{t+1} + \gamma V(S_{t+1})$ is biased, but has much lower variance, since only the single reward $R_{t+1}$ (and one transition) varies; note that $V(S_{t+1})$ is known and fixed during the update of $V(S_t)$.
- By the classical bias-variance trade-off analysis, we see that:
  - TD is more sensitive to initial values
  - TD is usually more efficient than MC
5. More comparisons between TD and MC
|  | MC | TD |
|---|---|---|
| Converges to the solution with | minimum mean-squared error | maximum-likelihood Markov model |
| Convergence with function approximation | Yes | Not always |
| Exploits Markov property | No | Yes |
| More efficient in | non-Markov environments | Markov environments |
6. TD($\lambda$)
Now we explore alternative estimators for the return $G_t$:
- n-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$
- $\lambda$-return: $G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Here, by utilising the geometrically decaying weights $(1-\lambda)\lambda^{n-1}$, all n-step returns are combined into one target, and the forward-view TD($\lambda$) update is $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{\lambda} - V(S_t)\big)$. Like MC, however, this forward view needs the complete episode before $G_t^{\lambda}$ can be computed.
7. Solution: Backward online evaluation
Instead of looking at values from future time-steps, we use values from earlier time-steps. This method is called the backward view of TD($\lambda$), and it can be run online, one step at a time.
By a known theorem, the sum of offline updates is identical for the forward view and the backward view of TD($\lambda$).
8. Eligibility Trace
The key to backward TD($\lambda$) is the eligibility trace $E_t(s)$, which assigns credit for the TD error to states according to two heuristics:
- Frequency heuristic: assign credit to the most frequently visited states
- Recency heuristic: assign credit to the most recently visited states (decayed by the discount rate $\gamma$ and the trace parameter $\lambda$)
Combining the two heuristics above, we obtain the rule
$E_0(s) = 0, \quad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$
Hence the new update formula is
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \quad V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \text{ for all } s$
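Putting the eligibility-trace rule and the update formula together, here is a minimal sketch of backward-view TD($\lambda$), under the same hypothetical Gym-style environment assumptions as the TD(0) sketch above.

```python
# Minimal sketch of backward-view TD(lambda) with accumulating traces.
# Assumptions (illustrative): Gym-style env API and a callable policy.
from collections import defaultdict

def td_lambda_prediction(env, policy, num_episodes=1000,
                         gamma=0.99, alpha=0.1, lam=0.9):
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility trace E(s), reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0             # frequency + recency heuristics
            # Broadcast the TD error to all states in proportion to their trace
            for s in list(E):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam     # decay the trace
            state = next_state
    return V
```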
9. Additional notes
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- TD(0) is exactly equivalent to simple TD
Model-Free Control
1. Main objective
Instead of merely estimating the value function of a given policy, we now try to optimise the value function (and hence the policy) for an unknown MDP.
2. Recap of On-Policy vs Off-Policy
On-policy
Learn about a policy $\pi$ from experience sampled while following $\pi$ itself.
In essence, we try to improve $\pi$ by running it with the current agent itself.
Off-policy
Learn about a target policy $\pi$ from experience sampled from previous rounds (or complete runs) of a different behaviour policy $\mu$.
In essence, we try to improve $\pi$ by observing another policy $\mu$ being run (possibly by another agent) and deducing from that experience how to improve $\pi$.
3. Generalised Policy Iteration With Monte-Carlo Evaluation
In general we can follow the policy-iteration scheme introduced in chapter 3, alternating two steps:

Policy evaluation (Monte-Carlo policy evaluation):
- We want to evaluate the current policy $\pi$. However, without the MDP model we cannot easily act greedily using a simple state-value function $V(s)$, so we must resort to the action-value function $Q(s, a) \approx q_\pi(s, a)$.
- We further observe that, in each iteration, fully evaluating the policy would require obtaining $q_\pi(s, a)$ for every state-action pair. This is highly inefficient, as we do not know how long it takes. Hence, instead, we do episodic updates of $Q$: we do not fully evaluate the policy, but sample state-action pairs with the current policy for a fixed number of times per episode and immediately improve the policy upon that. This number, set manually by us, imposes a guarantee on the sampling complexity.
- We call the strategy above GLIE MC control when it satisfies the GLIE property (Section 4 below).
- In conclusion, the evaluation phase is as follows:
  - Sample the $k$-th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
  - For each state $S_t$ and action $A_t$ in the episode:
    $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
    $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\big(G_t - Q(S_t, A_t)\big)$

Policy improvement ($\epsilon$-greedy policy improvement):
- We choose the action that greedily maximises $Q(s, a)$, allowing some degree of exploration by making this greedy choice only with probability $1 - \epsilon$ and picking a uniformly random action otherwise:
  $\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon/m & \text{otherwise} \end{cases}$, where $m$ is the number of actions.
- Note that this $\epsilon$-greedy policy improvement works because we always have $v_{\pi'}(s) \ge v_\pi(s)$, by an argument analogous to the policy-improvement proof shown in the DP note. A code sketch of the resulting control loop is given below.
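As referenced above, here is a minimal sketch of the GLIE MC control loop with $\epsilon_k = 1/k$. It assumes a hypothetical Gym-style environment with a discrete action space of size `n_actions`; all helper names are illustrative.

```python
# Minimal sketch of GLIE Monte-Carlo control with an epsilon-greedy policy.
# Assumptions (illustrative): Gym-style env API, discrete action space.
import random
from collections import defaultdict

def glie_mc_control(env, n_actions, num_episodes=10000, gamma=0.99):
    Q = defaultdict(float)    # Q(s, a), keyed by (state, action)
    N = defaultdict(int)      # visit counter N(s, a)

    def epsilon_greedy(state, eps):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                       # GLIE schedule: epsilon decays to 0
        # Sample the k-th episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(state, eps)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Incremental every-visit update of Q towards the observed return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```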
4. GLIE property
Greedy in the Limit with Infinite Exploration (GLIE) has the following two parts:
- All state-action pairs are explored infinitely many times: $\lim_{k \to \infty} N_k(s, a) = \infty$
- The policy converges to a greedy policy: $\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\big(a = \arg\max_{a'} Q_k(s, a')\big)$
For example, $\epsilon$-greedy exploration is GLIE if $\epsilon$ is reduced to zero as $\epsilon_k = 1/k$.
This property enables the following theorem:
- GLIE Monte-Carlo control converges to the optimal action-value function: $Q(s, a) \to q_*(s, a)$
5. Temporal Difference method
- From MC to TD
  Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC):
  - Lower variance
  - Online
  - Can learn from incomplete sequences
- Sarsa update of $Q(S, A)$
  The most basic form is $Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma Q(S', A') - Q(S, A)\big)$, using the tuple $(S, A, R, S', A')$. Now recall from model-free prediction the variations of TD:
  - n-step Sarsa: $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})$
  - Forward-view Sarsa($\lambda$): $q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$, with update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(q_t^{\lambda} - Q(S_t, A_t)\big)$
  - Backward-view Sarsa($\lambda$): we use eligibility traces $E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$ in an online algorithm. In each iteration, we update $Q(s, a)$ for every $(s, a)$ pair: $Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, E_t(s, a)$, where $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$. A sketch of the basic Sarsa loop follows below.
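Here is a minimal sketch of the most basic on-policy control loop, Sarsa(0) with a fixed $\epsilon$-greedy behaviour policy, under the same hypothetical Gym-style environment assumptions as the sketches above.

```python
# Minimal sketch of tabular Sarsa(0) with an epsilon-greedy behaviour policy.
# Assumptions (illustrative): Gym-style env API, discrete action space.
import random
from collections import defaultdict

def sarsa(env, n_actions, num_episodes=10000, gamma=0.99, alpha=0.1, eps=0.1):
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy TD target: bootstrap from the action actually chosen next
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```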
- Off-policy learning
  In this case, we evaluate a target policy $\pi(a \mid s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, but the evaluation is based on experience generated by running another (ongoing or completed) behaviour policy $\mu$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \mu$.
  It has the following advantages:
  - Learn from observing humans or other agents
  - Re-use experience generated from old policies
  - Learn about the optimal policy while following an exploratory policy
  - Learn about multiple policies while following one policy
- Importance sampling for off-policy learning
  We note that we can estimate an expectation under a different distribution via:
  $\mathbb{E}_{X \sim P}[f(X)] = \sum_X P(X) f(X) = \sum_X Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[\frac{P(X)}{Q(X)} f(X)\right]$
  One may try to use returns generated from $\mu$, correcting them over the whole episode,
  $G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)} \, G_t,$
  and then update $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{\pi/\mu} - V(S_t)\big)$.
  However, this multiplicative chaining may result in:
  - Invalid computation when one of the $\mu(A_k \mid S_k) = 0$ while $\pi(A_k \mid S_k) \neq 0$
  - Dramatically increased variance
  Hence, we consider adopting the TD target instead, which needs only a single correction:
  $V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \big(R_{t+1} + \gamma V(S_{t+1})\big) - V(S_t) \right)$
  Unfortunately, in the above expression, we are still relying on the importance-sampling ratio $\pi(A_t \mid S_t) / \mu(A_t \mid S_t)$.
6. Q-Learning
Q-learning is a method that resolves the above issue. We now consider off-policy learning of action values $Q(s, a)$ with no importance sampling: the next action is taken from the behaviour policy, $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$, but we bootstrap from an alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$ chosen by the target policy, and update $Q(S_t, A_t)$ towards the value of that alternative action. Taking the target policy to be greedy with respect to $Q$ yields the classic Q-learning update
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \big).$
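A minimal sketch of tabular Q-learning with an $\epsilon$-greedy behaviour policy and a greedy bootstrapping target, under the same hypothetical Gym-style assumptions as the earlier sketches.

```python
# Minimal sketch of tabular Q-learning: behave epsilon-greedily, but
# bootstrap from the greedy (max) action value.
# Assumptions (illustrative): Gym-style env API, discrete action space.
import random
from collections import defaultdict

def q_learning(env, n_actions, num_episodes=10000, gamma=0.99, alpha=0.1, eps=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy exploration
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Target policy: greedy, so no importance sampling is needed
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```

Note how the behaviour policy keeps exploring while the max over actions plays the role of the greedy target policy, which is exactly why no importance-sampling ratio appears.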