
Reinforcement Learning - Theoretical Foundations: Part III

RL Continued - Model-free algorithms

2021.01.07 · 10 min read · by Zhenlin Wang · updated 2022-01-22

Model-Free Prediction

1. Introduction

This is an important chapter that lays the groundwork for model-free control algorithms. In this blog we discuss a few important ideas (MC, TD, online/offline updates, forward/backward views). While this chapter is not math-intensive, it is imperative that we internalise these concepts before moving on to control algorithms.

To begin with, note that this is a prediction problem: we still only estimate the value function $v_{\pi}(s)$ for a given policy $\pi$. However, the model-free part means that we no longer require the MDP to be explicitly defined or given, because we estimate values from sampled experience instead.

2. Monte Carlo (MC) Policy Evaluation

  1. First-visit MC
  1. Every-visit MC
  1. Incremental Monte-Carlo updates (see the sketch below)
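
A minimal Python sketch of the incremental, every-visit variant (the `env` object with `reset()`/`step()` returning `(next_state, reward, done)` and the `policy` callable are assumed interfaces for illustration, not part of the original material):

```python
from collections import defaultdict

def mc_evaluate(env, policy, num_episodes=1000, gamma=0.99, alpha=0.05):
    """Every-visit incremental MC evaluation: V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Roll out one complete episode under the given policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk the episode backwards, accumulating the sampled return G_t.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            # Incremental update towards the sampled return (constant step-size alpha).
            V[state] += alpha * (G - V[state])
    return V
```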

3. Temporal Difference (TD) Learning

4. Bias-variance trade-off

5. More comparisons between TD and MC

|  | MC | TD |
| --- | --- | --- |
| Converges to solution with | minimum mean-squared error | max likelihood Markov model |
| Convergence with function approximation | Yes | No |
| Exploits Markov property | No | Yes |
| More efficient in | Non-Markov env | Markov env |

6. TD($\lambda$)

Now we explore alternative estimators for $G_t$. The $\lambda$-return combines all $n$-step returns $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$ into a single geometrically weighted estimator:

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

Here, by using $G^{\lambda}_t$ as the target, we obtain TD($\lambda$) as a new evaluation method. However, notice from the expression above that $G^{\lambda}_t$ is forward-looking: computing it requires waiting until the episode ends. This faces exactly the same limitation as MC!
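
As an illustrative sketch only (the per-step `rewards` and bootstrapped `values` lists from one finite rollout are assumed inputs), the forward-view $\lambda$-return at $t = 0$ could be computed as:

```python
def lambda_return(rewards, values, gamma=0.99, lam=0.9):
    """Forward-view lambda-return for t = 0 of a finite episode.

    rewards: [R_1, ..., R_T]; values: [V(S_1), ..., V(S_T)] bootstrapped estimates.
    The residual weight lambda^(T-1) is assigned to the full Monte-Carlo return.
    """
    T = len(rewards)
    # n-step returns G^(n) for n = 1..T (the last one is the full Monte-Carlo return).
    n_step_returns = []
    for n in range(1, T + 1):
        G = sum(gamma ** k * rewards[k] for k in range(n))
        if n < T:  # bootstrap from the value estimate unless the episode has ended
            G += gamma ** n * values[n - 1]
        n_step_returns.append(G)
    # Geometric weighting; the remaining mass goes to the terminal (MC) return.
    G_lambda = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1] for n in range(1, T))
    G_lambda += lam ** (T - 1) * n_step_returns[T - 1]
    return G_lambda
```

Note how every $n$-step return, and hence the whole episode, is needed before the estimate is available; this is precisely the forward-view limitation mentioned above.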

7. Solution: Backward online evaluation

Instead of looking forward to values at future time-steps, we propagate the current TD error back to states visited at earlier time-steps. This method is called the backward view of TD($\lambda$). In essence, we update online, at every step, from incomplete sequences.

By a classical equivalence theorem, the sum of offline updates is identical for forward-view and backward-view TD($\lambda$).

8. Eligibility Trace

The key to backward TD($\lambda$) is the eligibility trace. Intuitively, we assign credit to earlier time-steps using two heuristics: a frequency heuristic (more credit to the most frequently visited states) and a recency heuristic (more credit to the most recently visited states).

Combining the two heuristics above, we obtain the rule $E_0(s) = 0$ and $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbb{1}(S_{t} = s)$. Hence the new update formula is $V(s) \leftarrow V(s) + \alpha\delta_t E_t(s)$, where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the one-step TD error. Observe that we now update $V$ for every state $s$ at each time-step.
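
A minimal sketch of the backward view in Python, assuming the same illustrative `env`/`policy` interface as before:

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes=1000, gamma=0.99, lam=0.9, alpha=0.05):
    """Backward-view TD(lambda): online updates V(s) += alpha * delta_t * E_t(s)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)            # eligibility traces, reset at the start of each episode
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD error (bootstrap target is 0 at terminal states).
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            # Decay all traces, then bump the trace of the current state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0
            # Update every state in proportion to its eligibility.
            for s, e in E.items():
                V[s] += alpha * delta * e
            state = next_state
    return V
```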

9. Additional note

Model-Free Control

1. Main objective

Instead of merely estimating the value function of a given policy, we now try to optimise the value function, still with an unknown MDP.

2. Recap of On-Policy vs Off-Policy

  1. On-policy: Learn about $\pi$ from experience sampled from previous rounds of $\pi$ itself. In essence, we try to improve $\pi$ by running it with the current agent itself.

  2. Off-policy: Learn about $\pi$ from experience sampled from previous rounds (or complete runs) of another policy $\pi'$. In essence, we try to improve $\pi$ by observing another policy being run by another agent and deducing directions in which to improve $\pi$.

3. Generalised Policy Iteration With Monte-Carlo Evaluation

In general, we can follow the policy iteration method introduced in Chapter 3, which alternates two steps: policy evaluation (here, Monte-Carlo evaluation of $Q = q_{\pi}$) and policy improvement (e.g. $\epsilon$-greedy improvement with respect to $Q$).

4. GLIE property

Greedy in the Limit with Infinite Exploration (GLIE) has the following 2 parts: (i) all state-action pairs are explored infinitely many times, and (ii) the policy converges to a greedy policy with respect to the learned action-value function.

For example, $\epsilon$-greedy is GLIE if $\epsilon$ is reduced to zero, e.g. with $\epsilon_k = \frac{1}{k}$. This property enables the convergence theorem that GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
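
For instance, a GLIE $\epsilon$-greedy action selection with $\epsilon_k = \frac{1}{k}$ might be sketched as follows (the tabular `Q` dictionary keyed by `(state, action)`, the `actions` list, and the episode counter `k` are assumptions for illustration):

```python
import random

def epsilon_greedy_action(Q, state, actions, k):
    """GLIE epsilon-greedy: epsilon_k = 1/k decays to zero, so the policy becomes greedy in the limit."""
    epsilon = 1.0 / k                                   # decaying exploration rate (k = 1, 2, 3, ...)
    if random.random() < epsilon:
        return random.choice(actions)                   # explore: pick a random action
    return max(actions, key=lambda a: Q[(state, a)])    # exploit: greedy w.r.t. the current Q
```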

5. Temporal Difference method

  1. From MC to TD: Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC): lower variance, online updates, and learning from incomplete sequences.
  1. Sarsa update of $Q$: The most basic form is $Q(S, A) \leftarrow Q(S, A) + \alpha (R + \gamma Q(S', A') - Q(S, A))$. The variations of TD from model-free prediction ($n$-step returns, eligibility traces) carry over analogously.
  1. Off-policy learning: We evaluate the target policy $\pi(a\mid s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$, but the experience is generated by another (ongoing or completed) behaviour policy $\mu(a\mid s)$: $(S_1, A_1, R_2, \ldots, S_T) \sim \mu$

Off-policy learning has several practical advantages, for example re-using experience generated from old policies and learning about the optimal policy while following an exploratory behaviour policy.

  1. Importance sampling for off-policy: We note that we can estimate an expectation under one distribution using samples from another via: $$\mathbb{E}_{X \sim P}[f(X)] = \sum_X P(X)f(X) = \sum_X Q(X)\frac{P(X)}{Q(X)}f(X) = \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)}f(X)\right]$$
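
A toy numerical check of this identity (the distributions $P$, $Q$ and the function $f$ below are invented purely for illustration):

```python
import random

def estimate_expectation_via_is(f, p, q, sample_q, n=100_000):
    """Estimate E_{X~P}[f(X)] using samples from Q, re-weighted by P(X)/Q(X)."""
    total = 0.0
    for _ in range(n):
        x = sample_q()                     # draw X ~ Q
        total += (p(x) / q(x)) * f(x)      # importance weight times f
    return total / n

# P puts 0.7 on heads, Q puts 0.5 on heads; f is the indicator of heads.
p = lambda x: 0.7 if x == "H" else 0.3
q = lambda x: 0.5
sample_q = lambda: "H" if random.random() < 0.5 else "T"
f = lambda x: 1.0 if x == "H" else 0.0
print(estimate_expectation_via_is(f, p, q, sample_q))   # approximately 0.7
```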

One may try to use returns generated from $\mu$ to evaluate $\pi$ via multiple sampling corrections:

$$G_t^{\frac{\pi}{\mu}} = G_t^{\mu} \times \frac{\pi(A_t\mid S_t)}{\mu(A_t\mid S_t)}\frac{\pi(A_{t+1}\mid S_{t+1})}{\mu(A_{t+1}\mid S_{t+1})}\cdots\frac{\pi(A_T\mid S_T)}{\mu(A_T\mid S_T)}$$

and then

$$Q(S, A) ← Q(S, A) + \alpha (G_t^{\frac{\pi}{\mu}} − Q(S, A))$$
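
A sketch of this chained correction (the `pi(a, s)` / `mu(a, s)` probability functions and the `(state, action, reward)` episode format are assumptions for illustration):

```python
def importance_weighted_return(episode, pi, mu, gamma=0.99):
    """Correct a Monte-Carlo return sampled under behaviour policy mu towards target policy pi.

    episode: list of (state, action, reward) tuples generated by following mu.
    pi(a, s) / mu(a, s): action probabilities under the target / behaviour policies.
    """
    G, weight = 0.0, 1.0
    for t, (state, action, reward) in enumerate(episode):
        weight *= pi(action, state) / mu(action, state)   # one importance ratio per step
        G += gamma ** t * reward                           # plain discounted return under mu
    return weight * G                                      # fully chained correction of the return
```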

However, this multiple chaining of importance-sampling ratios can dramatically increase the variance of the estimate, and the correction is undefined whenever $\mu$ is zero where $\pi$ is non-zero.

Hence, we consider importance sampling with the one-step TD target $R + \gamma Q(S', A')$ instead of the full return $G_t$. This removes the multiple chaining, since only a single importance-sampling correction is applied per step.

Unfortunately, we are still sticking to the $\mu$ policy when choosing $A_{t+1}$ in the update of $Q$ towards $\pi$. This is not ideal, as the target policy could potentially make a better choice of action, and importance sampling merely re-weights this choice rather than replacing it. Hence we seek an alternative solution that removes the need for importance sampling.

6. Q-Learning

Q-learning resolves the above issue. We now update $Q$ towards the target policy $\pi$: the successor action is chosen as the maximiser $A' = \arg\max_{a'} Q(S_{t+1}, a')$, i.e. $A' \sim \pi(\cdot \mid S_{t+1})$ with $\pi$ greedy with respect to $Q$. This allows both the behaviour and target policies to improve: the behaviour policy $\mu$ is $\epsilon$-greedy with respect to $Q$ (the action actually taken is still random with probability $\epsilon$), and $Q \to q_*$ by the Q-learning convergence theorem.
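
A compact Q-learning sketch under the same assumed tabular `env` interface, with an $\epsilon$-greedy behaviour policy and a greedy (max) bootstrap target:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Q-learning: behave epsilon-greedily, but bootstrap from the greedy target policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: epsilon-greedy with respect to the current Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy, i.e. bootstrap from max_a' Q(S', a').
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```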