# Reinforcement Learning - Theoretical Foundations: Part III

### Model-Free Prediction

#### 1. Introduction

This is an important chapter that lays the groundwork for **model-free control** algorithms. In this post we discuss a few important ideas (MC, TD, online/offline updates, forward/backward views). While this chapter is not math-intensive, it is imperative that we remember these concepts before moving on to control algorithms.

To begin with, note that this is a **prediction** problem: we are still only estimating the value function of a given policy. The **model-free** part means that we no longer require an MDP to be explicitly defined/given, because we are going to use a sampling strategy.

- Sampling:

If a strategy derives the required functions (here, the value function) directly from episodes of observed experience, we say that the strategy uses a sampling method.

#### 2. Monte Carlo (MC) Policy Evaluation

- MC learns from complete episodes (no bootstrapping)
- Monte-Carlo policy evaluation uses the *empirical mean return* instead of the *expected return*

- First-visit MC

  - For each state $s$, only consider the first time-step $t$ at which $s$ is visited in each episode (update at most once per episode)
  - Increment the counter: $N(s) \leftarrow N(s) + 1$
  - Increment the total return: $S(s) \leftarrow S(s) + G_t$
  - Estimate the mean return: $V(s) = S(s) / N(s)$
  - The estimator $V(s)$ converges to $v_\pi(s)$ as the number of visits $N(s) \to \infty$

- Every-visit MC

  - For each state $s$, consider every time-step $t$ at which $s$ is visited in each episode (possibly multiple updates per episode)
  - For instance, if $s$ is visited at time-steps $t_1$ and $t_2$, we update with both returns $G_{t_1}$ and $G_{t_2}$, incrementing $N(s)$ twice
  - The remaining steps are the same as first-visit MC (both visit rules are sketched below)
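To make the two visit rules concrete, here is a minimal sketch of tabular first-visit and every-visit Monte-Carlo evaluation. The episode-sampling interface `sample_episode(policy)` (returning `(state, reward)` pairs) and the parameter defaults are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict

def mc_evaluate(sample_episode, policy, gamma=1.0, n_episodes=1000, first_visit=True):
    """Tabular Monte-Carlo policy evaluation.

    `sample_episode(policy)` is assumed to return a list of (state, reward)
    pairs, where `reward` is the reward received after leaving that state.
    """
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return  S(s)

    for _ in range(n_episodes):
        episode = sample_episode(policy)
        # Compute the return G_t for every time-step by scanning backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue              # first-visit: update at most once per episode
            seen.add(state)
            N[state] += 1
            S[state] += G

    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```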

- Incremental Monte-Carlo Updates

Based on the incremental-mean identity

$$\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$

we take the perspective of each episode's observations instead of aggregating over states: after each episode, for every visited state $S_t$ with return $G_t$,

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$

In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes. Hence the formula is tweaked to use a constant step size:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
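A minimal sketch of this constant step-size update, as one might use in a non-stationary problem; `alpha` and the `(state, G_t)` episode format are illustrative assumptions.

```python
def incremental_mc_update(V, episode_returns, alpha=0.1):
    """Constant step-size MC update: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).

    `episode_returns` is assumed to be a list of (state, G_t) pairs for one
    episode; old episodes are gradually forgotten because alpha is constant.
    """
    for state, G in episode_returns:
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V
```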

**Limitation**:

Since MC only updates when an episode terminates (hence 'complete' episodes), it cannot be applied to processes that may run indefinitely.

#### 3. Temporal Difference (TD) Learning

- A solution to the 'incomplete' episodes problem, obtained by applying bootstrapping
- The return $G_t$ above is replaced by an estimated return $R_{t+1} + \gamma V(S_{t+1})$ based on the Bellman Expectation Equation. $R_{t+1} + \gamma V(S_{t+1})$ is called the **TD target**; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the **TD error**.
- As the update now only depends on the next time-step and not on the whole-episode return $G_t$, we say it is **online** (and hence the other method, which only updates when the episode ends, is **offline**). A minimal sketch of the update loop follows.
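The sketch below shows tabular TD(0) prediction. The step interface `env.reset()` / `env.step(action)` returning `(next_state, reward, done)`, and the hyper-parameter values, are assumptions made for illustration.

```python
from collections import defaultdict

def td0_evaluate(env, policy, gamma=0.99, alpha=0.1, n_episodes=500):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from V(next_state); terminal states contribute 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])   # online, per-step update
            state = next_state
    return V
```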

#### 4. Bias Variance trade-off

- Computation using the full return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is unbiased, but has high variance, because it depends on many rewards from transitions and actions selected at random
- Computation using the TD target is biased, but has much lower variance, since only one reward $R_{t+1}$ varies (note that $V(S_{t+1})$ is known and fixed during the update of $V(S_t)$)
- By the classical trade-off analysis, we see that:
  - TD is more sensitive to initial values
  - TD is usually more efficient than MC

#### 5. More comparisons between TD and MC

| | MC | TD |
|---|---|---|
| Converges to the solution with | minimum mean-squared error | maximum-likelihood Markov model |
| Convergence with function approximation | Yes | No |
| Exploits Markov property | No | Yes |
| More efficient in | Non-Markov environments | Markov environments |

#### 6. TD($\lambda$)

Now we explore alternative estimators for $G_t$.

**n-step return**:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

**$\lambda$-return**:

$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

Here, by utilising geometrically decaying weights $(1-\lambda)\lambda^{n-1}$, the $\lambda$-return combines all n-step returns into a single estimator. Like MC, however, this forward view can only be computed from complete episodes.
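As an illustration of how the n-step returns combine into the $\lambda$-return, here is a small sketch that computes both for a finite trajectory. The array conventions (`rewards[k]` is $R_{k+1}$, `values[k]` is $V(S_k)$) and the rule that returns truncate to the full Monte-Carlo return at the end of the episode are assumptions for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)
    n = min(n, T - t)                       # truncate at episode end
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    if t + n < T:                           # bootstrap only if S_{t+n} is non-terminal
        G += gamma**n * values[t + n]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: geometrically weighted mix of all n-step returns."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full (Monte-Carlo) return.
    G_lam += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```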

#### 7. Solution: Backward online evaluation

Instead of looking ahead to returns at future time-steps, we propagate the current TD error back to states visited at earlier time-steps. This method is called the backward view of TD($\lambda$).

By a theorem, the sum of offline updates is identical for forward-view and backward-view TD($\lambda$).

#### 8. Eligibility Trace

The key to backward TD($\lambda$) is the *eligibility trace* $E_t(s)$.

- *Frequency heuristic*: assign credit to the most frequently visited states
- *Recency heuristic*: assign credit to the most recently visited states (via a discount rate)

Combining the 2 heuristics above, we obtain the trace-update rule

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$

Hence the new update formula is

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) \quad \text{for all } s$$
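A minimal sketch of one episode of backward-view TD($\lambda$) with an accumulating eligibility trace. The step interface, the assumption that `V` is a `defaultdict(float)`, and the parameter values are illustrative, not from the original text.

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, gamma=0.99, lam=0.9, alpha=0.1):
    """One episode of backward-view TD(lambda); V is assumed to be a defaultdict(float)."""
    E = defaultdict(float)                   # eligibility trace E(s), reset per episode
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]  # TD error
        E[state] += 1.0                      # frequency + recency heuristics combined
        for s in list(E):                    # update every state in proportion to its trace
            V[s] += alpha * delta * E[s]
            E[s] *= gamma * lam              # decay the trace
        state = next_state
    return V
```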

#### 9. Additional note

- TD(1) is roughly equivalent to every-visit Monte-Carlo
- TD(0) is exactly equivalent to simple TD

### Model-Free Control

#### 1. Main objective

Instead of merely estimating the value function of a given policy, we now try to optimise the value function for an unknown MDP.

#### 2. Recap of On-Policy vs Off-Policy

**On-policy**

Learn about policy $\pi$ from experience sampled from $\pi$ itself in previous rounds.

In essence, we try to improve $\pi$ by running it **using the current agent itself**.

**Off-policy**

Learn about target policy $\pi$ from experience sampled from another policy $\mu$ in previous rounds (or complete runs).

In essence, we try to improve $\pi$ by **observing another policy** being run by **another agent** and **deducing directions** to improve.

#### 3. Generalised Policy Iteration With Monte-Carlo Evaluation

In general, we can follow the policy iteration method introduced in chapter 3, which alternates 2 steps:

**Policy evaluation** - Monte-Carlo policy evaluation:

- We want to evaluate $\pi$. However, without the MDP, we cannot act greedily from a simple **state-value function** $V(s)$ (that would require the transition model). So we must resort to an **action-value function**, i.e., $Q(s, a)$.
- We further observe that in each iteration, fully evaluating the policy requires estimating $Q(s, a)$ for every state-action pair, which is highly inefficient as we do not know how long it takes. Hence instead we do episodic updates: we do **not fully evaluate** that policy, but **sample state-action pairs with the current policy** for a fixed number of episodes and **immediately improve** the policy upon that. This episode budget, manually set by us, imposes a guarantee on sampling complexity.
- We call the strategy above *GLIE MC control*, as it satisfies the **GLIE** property.
- In conclusion, the evaluation phase is as follows:
  - Sample the $k$-th episode using $\pi$
  - For each state $S_t$ and action $A_t$ in the episode:

$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1, \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\left(G_t - Q(S_t, A_t)\right)$$
- We want to evaluate
**Policy improvement**: $\epsilon$-greedy policy improvement

- We choose the action that greedily maximises $Q(s, a)$, while allowing some degree of exploration: with probability $1 - \epsilon$ we take the greedy action, and with probability $\epsilon$ we pick an action uniformly at random.
- Note that this $\epsilon$-greedy policy works because we always obtain an improvement, as in the proof shown in the DP note. (A minimal sketch of the full control loop follows below.)
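The sketch below runs one $\epsilon$-greedy Monte-Carlo control loop: sample an episode, update $Q$ incrementally, then act $\epsilon$-greedily with respect to the new $Q$. The episode-sampling interface and the schedule $\epsilon_k = 1/k$ (the GLIE example given in the next section) are assumptions for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick the greedy action w.p. 1-epsilon, otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def glie_mc_control(sample_episode, actions, gamma=1.0, n_episodes=10000):
    """GLIE Monte-Carlo control sketch with epsilon_k = 1/k."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k
        policy = lambda s: epsilon_greedy(Q, s, actions, epsilon)
        episode = sample_episode(policy)          # list of (state, action, reward)
        # Accumulate returns backwards, then apply incremental Q updates.
        G, updates = 0.0, []
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            updates.append((state, action, G))
        for state, action, G in reversed(updates):
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```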

#### 4. GLIE property

Greedy in the Limit with Infinite Exploration (GLIE) has the following 2 parts:

- All state-action pairs are explored infinitely many times:

$$\lim_{k \to \infty} N_k(s, a) = \infty$$

- The policy converges to a greedy policy:

$$\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\left(a = \arg\max_{a'} Q_k(s, a')\right)$$

For example, $\epsilon$-greedy is GLIE if $\epsilon_k = \frac{1}{k}$.

This property enables the following theorem:

- GLIE Monte-Carlo control converges to the optimal action-value function: $Q(s, a) \to q_*(s, a)$

#### 5. Temporal Difference method

- From MC to TD

Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)

- Lower variance
- Online
- Incomplete sequences

- Sarsa update of $Q(s, a)$

The most basic form is

$$Q(S, A) \leftarrow Q(S, A) + \alpha\left(R + \gamma Q(S', A') - Q(S, A)\right)$$

Now recall from model-free prediction the variations of TD (the one-step form is sketched below):

  - **n-step Sarsa**: replace the one-step target with the n-step return
  - **Forward View Sarsa($\lambda$)**: use the $\lambda$-return as the target
  - **Backward View Sarsa($\lambda$)**: use *eligibility traces* in an online algorithm; in each iteration, we update $Q(s, a)$ for every $(s, a)$ pair
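A minimal sketch of the basic (one-step) Sarsa loop under an $\epsilon$-greedy behaviour policy; the environment interface and hyper-parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """One-step Sarsa: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)

    def choose(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        state = env.reset()
        action = choose(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = choose(next_state)   # A' comes from the same policy (on-policy)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```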

- Off-policy learning

In this case, we evaluate the target policy $\pi$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, but the evaluation is based on episodes generated by another (ongoing or completed) behaviour policy $\mu$.

It has the following advantages:

- Learn from observing humans or other agents
- Re-use experience generated from old policies
- Learn about the optimal policy while following an *exploratory* policy
- Learn about multiple policies while following **one** policy

- Importance sampling for Off-policy

We note that we can estimate an expectation under a different distribution via:

$$\mathbb{E}_{X \sim P}\left[f(X)\right] = \sum_X P(X) f(X) = \sum_X Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)} f(X)\right]$$

One may try to use returns generated from $\mu$ to evaluate $\pi$, weighting the whole-episode return by the product of importance-sampling ratios:

$$G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)} G_t$$

and then update

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{\pi/\mu} - V(S_t)\right)$$

However, this multiple chaining may result in:

- Invalid computation when some $\mu(A_k \mid S_k) = 0$ while $\pi(A_k \mid S_k) \neq 0$
- Dramatically increased variance

Hence, we consider adopting the TD target with a single importance-sampling correction:

$$V(S_t) \leftarrow V(S_t) + \alpha\left(\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\left(R_{t+1} + \gamma V(S_{t+1})\right) - V(S_t)\right)$$

Unfortunately, in the above expression, we are still tied to the action $A_t$ chosen by the behaviour policy $\mu$, and the importance ratio can still inflate the variance.
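To make the two corrections concrete, here is a small sketch that computes the fully-weighted Monte-Carlo return and the single-step corrected TD update for one transition. The policy representation (`pi[state][action]` probabilities) and the function names are assumptions for illustration.

```python
def importance_sampled_return(episode, pi, mu, gamma=1.0):
    """MC return weighted by the full product of ratios pi/mu (high variance)."""
    ratio, G, discount = 1.0, 0.0, 1.0
    for state, action, reward in episode:
        ratio *= pi[state][action] / mu[state][action]   # breaks if mu is 0 where pi is not
        G += discount * reward
        discount *= gamma
    return ratio * G

def is_corrected_td_update(V, state, action, reward, next_state, pi, mu, gamma=0.99, alpha=0.1):
    """TD(0) update with a single importance-sampling correction on the TD target."""
    rho = pi[state][action] / mu[state][action]
    td_target = rho * (reward + gamma * V.get(next_state, 0.0))
    v = V.get(state, 0.0)
    V[state] = v + alpha * (td_target - v)
    return V
```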

#### 6. Q-Learning

Q-learning is a method that resolves the above issue. We now consider learning action-values $Q(s, a)$ directly, with no importance sampling: actions are still taken from the behaviour policy $\mu$, but the update target bootstraps with the action the target policy would choose, e.g. the greedy action in standard Q-learning:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right)$$
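A minimal tabular Q-learning sketch: the behaviour policy is $\epsilon$-greedy, but the target uses the greedy (target-policy) action, so no importance sampling is needed. The environment interface and hyper-parameters are assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """Tabular Q-learning: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy: greedy, so the target bootstraps with max over next actions.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```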
