# Reinforcement Learning - Theoretical Foundations: Part V

### Policy Gradient

#### 1. General Overview

Model-based RL:

**Pros**
- 'Easy' to learn: trained via a procedure similar to supervised learning
- Learns everything contained in the data

**Cons**
- The objective captures irrelevant information
- May focus on irrelevant details
- Computing the policy (planning) is non-trivial and expensive (offline/static evaluation)

Value-based RL:

**Pros**
- Closer to the true objective
- Fairly well understood - somewhat similar to regression

**Cons**
- Still not the true objective (a similar issue to model-based RL)

Policy-based RL:

**Pros**
- Directly targets the true objective (the policy itself)

**Cons**
- Ignores some useful learnable knowledge (like state/action values) [but this can be overcome by combining with value-function approximation]

#### 2. Model-free Policy based RL

- We directly parametrize the policy via $\pi_{\theta}(s,a) = \mathbb{P}[a \mid s, \theta]$ (a minimal sketch of such a parametrization follows this list).
- Advantages:
  - Better convergence properties (gradient method)
  - Effective in high-dimensional or continuous action spaces
  - Can learn stochastic optimal policies (e.g. for a scissors-paper-stone game)
  - Sometimes the policy is simple while values and models are more complex (large environment but easy policy)
- Disadvantages:
  - Susceptible to local optima (especially with a non-linear function approximator)
  - Often obtains knowledge that is specific and does not always generalize well (high variance)
  - Ignores a lot of information in the data (when used in isolation)
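
As a concrete illustration of direct policy parametrization, here is a minimal sketch of a linear softmax policy over discrete actions. The feature map `phi(s, a)` and the class interface are assumptions made purely for this example, not something prescribed by the notes:

```python
import numpy as np

class SoftmaxPolicy:
    """Linear softmax (Gibbs) policy: pi_theta(a | s) proportional to exp(phi(s, a) . theta)."""

    def __init__(self, n_features, seed=0):
        self.theta = np.zeros(n_features)        # policy parameters theta
        self.rng = np.random.default_rng(seed)

    def probs(self, phi_sa):
        """phi_sa: array of shape (n_actions, n_features), rows are phi(s, a)."""
        prefs = phi_sa @ self.theta              # action preferences phi(s, a) . theta
        prefs -= prefs.max()                     # shift for numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def sample(self, phi_sa):
        """Sample an action index from pi_theta(. | s)."""
        p = self.probs(phi_sa)
        return self.rng.choice(len(p), p=p)

    def score(self, phi_sa, a):
        """Score function: grad_theta log pi_theta(a | s) = phi(s, a) - sum_b pi(b | s) phi(s, b)."""
        return phi_sa[a] - self.probs(phi_sa) @ phi_sa
```

The score function returned here is the quantity used throughout the gradient computations in the following sections.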

#### 3. Policy Objective Functions

Here is a list of functions that can potentially be used to measure the quality of a policy $\pi_{\theta}$:

- In an episodic environment, we can use the **start value**:
  $$J_1(\theta) = V^{\pi_{\theta}}(s_1) = \mathbb{E}_{\pi_{\theta}}[v_1]$$
- In a continuing environment, we can use the **average value**:
  $$J_{avV}(\theta) = \sum_{s} d^{\pi_{\theta}}(s)\, V^{\pi_{\theta}}(s)$$
  where $d^{\pi_{\theta}}(s)$ is the probability of being in state $s$ in the long run (the long-term proportion of time spent in $s$ under $\pi_{\theta}$).
- Otherwise, we replace the value function with the reward function to obtain the **average reward per time-step**:
  $$J_{avR}(\theta) = \sum_{s} d^{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(s,a)\, \mathcal{R}_{s}^{a}$$

Since the main target is now to maximize $J(\theta)$, policy-based RL becomes an optimization problem over $\theta$, and we can follow the gradient (**gradient ascent**):

$$\Delta\theta = \alpha \nabla_{\theta} J(\theta)$$
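
To make the average-reward objective concrete, here is a small numerical sketch. The 2-state, 2-action MDP below (its transition probabilities, rewards, and the fixed policy) is invented purely for illustration: it computes the stationary distribution $d^{\pi}$ of the induced Markov chain and then $J_{avR}$.

```python
import numpy as np

# Invented 2-state, 2-action MDP, used only to illustrate the computation.
# P[a, s, s2] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],     # action 0
              [[0.5, 0.5],
               [0.6, 0.4]]])    # action 1
R = np.array([[1.0, 0.0],       # rewards in state 0 for actions 0 / 1
              [0.0, 2.0]])      # rewards in state 1 for actions 0 / 1
pi = np.array([[0.7, 0.3],      # pi(a | s = 0)
               [0.4, 0.6]])     # pi(a | s = 1)

# Markov chain induced by pi: P_pi[s, s2] = sum_a pi(a | s) * P[a, s, s2]
P_pi = np.einsum('sa,ast->st', pi, P)

# Stationary distribution d solves d = d P_pi with sum(d) = 1.
A = np.vstack([P_pi.T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
d = np.linalg.lstsq(A, b, rcond=None)[0]

# Average reward per time-step: J_avR = sum_s d(s) sum_a pi(a | s) R(s, a)
J_avR = d @ (pi * R).sum(axis=1)
print(d, J_avR)
```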

#### 4. Computing the policy gradient analytically

For any policy function $\pi_{\theta}(s,a)$, the most important requirement is that it is differentiable (wherever it is non-zero), so that we can compute the gradient $\nabla_{\theta}\pi_{\theta}(s,a)$.

We then call $\nabla_{\theta}\log\pi_{\theta}(s,a)$ the **score function** (gradient base) of the policy.

We further note that if the objective can be written as an expectation $J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)]$ for some function $f$, then

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)\nabla_{\theta} \log \pi_{\theta}(S,A)]$$

- This is called the **score function trick** (it relies on the identity $\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)$).
- One useful property is that $\mathbb{E}_{\pi_{\theta}}[\,b\,\nabla_{\theta}\log\pi_{\theta}(S,A)] = 0$ if $b$ does not depend on the action $A$. (**expr 1**)
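
As worked examples, the score functions of the two most common policy classes take the following standard forms (the feature maps $\phi(s,a)$ and $\phi(s)$ are the usual assumptions and are not defined elsewhere in these notes). For a softmax policy $\pi_{\theta}(s,a) \propto e^{\phi(s,a)^{\top}\theta}$:

$$\nabla_{\theta}\log\pi_{\theta}(s,a) = \phi(s,a) - \mathbb{E}_{\pi_{\theta}}[\phi(s,\cdot)]$$

For a Gaussian policy $a \sim \mathcal{N}(\mu(s), \sigma^{2})$ with mean $\mu(s) = \phi(s)^{\top}\theta$:

$$\nabla_{\theta}\log\pi_{\theta}(s,a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^{2}}$$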

Now consider the **Policy Gradient Theorem**:

For any differentiable policy $\pi_{\theta}(s,a)$, the policy gradient is

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(S,A)\, Q^{\pi_{\theta}}(S,A)]$$

where $Q^{\pi_{\theta}}(s,a)$ is the long-term (action) value.

*Proof sketch*: Consider the expected return $J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\sum_{t \ge 1} \gamma^{t-1} R_t\big]$ as the objective, where the expectation is over trajectories $\tau = (S_1, A_1, R_2, S_2, A_2, \dots)$ (the filtration). Note that $\tau$ depends on $\theta$, since the policy affects which trajectories occur. The probability of a trajectory is $p_{\theta}(\tau) = p(S_1)\prod_t \pi_{\theta}(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)$, and the transition terms do not depend on $\theta$. Applying the score function trick:

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\Big[\Big(\sum_{k \ge 1} \gamma^{k-1} R_k\Big)\Big(\sum_{t \ge 1} \nabla_{\theta}\log\pi_{\theta}(A_t \mid S_t)\Big)\Big]$$

- So the gradient is a sum over time steps $t$ of terms $\mathbb{E}_{\pi_{\theta}}\big[\big(\sum_{k \ge 1}\gamma^{k-1}R_k\big)\,\nabla_{\theta}\log\pi_{\theta}(A_t \mid S_t)\big]$.
- We further notice that, for each pair $(S_t, A_t)$, any reward received before $A_t$ is taken does not depend on $A_t$, so by **expr 1** above those terms vanish. Only the return from time $t$ onwards remains, and its conditional expectation given $(S_t, A_t)$ is (up to the discount factor $\gamma^{t-1}$) exactly $Q^{\pi_{\theta}}(S_t, A_t)$, which gives the stated result.
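
The theorem immediately suggests a Monte-Carlo policy-gradient algorithm (REINFORCE), which uses the sampled return $G_t$ as an unbiased estimate of $Q^{\pi_{\theta}}(S_t, A_t)$. Below is a minimal sketch reusing the SoftmaxPolicy above; the environment interface (`reset`/`step`) and the `featurize` helper are hypothetical assumptions for illustration:

```python
import numpy as np

def reinforce_episode(env, policy, featurize, alpha=0.01, gamma=0.99):
    """One episode of Monte-Carlo policy gradient (REINFORCE).

    Assumed interfaces (hypothetical, for illustration):
      env.reset() -> s,  env.step(a) -> (s_next, reward, done)
      featurize(s) -> array of shape (n_actions, n_features)
      policy -> the SoftmaxPolicy sketch above (sample / score / theta)
    """
    feats, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        phi_sa = featurize(s)
        a = policy.sample(phi_sa)
        s, r, done = env.step(a)
        feats.append(phi_sa)
        actions.append(a)
        rewards.append(r)

    # Returns G_t = r_{t+1} + gamma * G_{t+1}, computed backwards through the episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # Stochastic gradient ascent: theta <- theta + alpha * G_t * grad log pi(a_t | s_t)
    for phi_sa, a, G in zip(feats, actions, returns):
        policy.theta += alpha * G * policy.score(phi_sa, a)
```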

#### 5. Actor-Critic Algorithm

- Most basic Q Actor-Critic (QAC)

  - The Monte-Carlo policy gradient still has high variance.
  - We can use a *critic* to estimate the action-value function: $Q_w(s,a) \approx Q^{\pi_{\theta}}(s,a)$.
  - Actor-critic algorithms maintain *two* sets of parameters:
    - Critic: updates the action-value function parameters $w$
    - Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
  - Actor-critic algorithms follow an approximate policy gradient:
    $$\nabla_{\theta}J(\theta) \approx \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(S,A)\, Q_w(S,A)], \qquad \Delta\theta = \alpha\, \nabla_{\theta}\log\pi_{\theta}(s,a)\, Q_w(s,a)$$
  - The critic is solving a familiar problem: policy evaluation, which can now be solved using value-based methods.
  - However, this approximation of the policy gradient introduces bias, and a biased policy gradient may not find the right solution. We can choose the value function approximation carefully so that this bias is removed. This is possible because of the **Compatible Function Approximation Theorem** below:
    - If the following two conditions are satisfied:
      1. The value function approximator is compatible with the policy: $\nabla_w Q_w(s,a) = \nabla_{\theta}\log\pi_{\theta}(s,a)$
      2. The value function parameters $w$ minimise the mean-squared error $\varepsilon = \mathbb{E}_{\pi_{\theta}}[(Q^{\pi_{\theta}}(S,A) - Q_w(S,A))^2]$
    - Then the policy gradient is exact: $\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(S,A)\, Q_w(S,A)]$.
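
Below is a minimal sketch of one episode of the basic QAC loop, using a linear critic $Q_w(s,a) = \phi(s,a)^{\top} w$ updated with a SARSA-style TD(0) target. The step sizes, the linear critic, and the environment interface are illustrative assumptions (and this particular critic is not necessarily compatible in the sense of the theorem above):

```python
import numpy as np

def qac_episode(env, policy, featurize, w, alpha=0.01, beta=0.05, gamma=0.99):
    """One episode of the basic Q actor-critic (QAC) - a minimal sketch.

    Critic: linear action-value function Q_w(s, a) = phi(s, a) . w,
            updated with a SARSA-style TD(0) target.
    Actor:  theta <- theta + alpha * Q_w(s, a) * grad log pi(a | s).
    w:      initial critic parameters, a float array of shape (n_features,).
    Assumed env / featurize / policy interfaces as in the REINFORCE sketch above.
    """
    s = env.reset()
    phi_sa = featurize(s)
    a = policy.sample(phi_sa)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        q = phi_sa[a] @ w                      # Q_w(s, a)
        if done:
            td_target = r
        else:
            phi_next = featurize(s_next)
            a_next = policy.sample(phi_next)
            td_target = r + gamma * (phi_next[a_next] @ w)

        # Actor step: move theta in the direction suggested by the critic.
        policy.theta += alpha * q * policy.score(phi_sa, a)
        # Critic step: semi-gradient TD(0) update of w towards the SARSA target.
        w += beta * (td_target - q) * phi_sa[a]

        if not done:
            phi_sa, a = phi_next, a_next
    return w
```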

- Advantage Actor-Critic

  - Recall **expr 1**: we can apply it again to $Q^{\pi_{\theta}}(s,a)$ to further reduce the variance introduced by large values.
  - Consider $A^{\pi_{\theta}}(s,a) = Q^{\pi_{\theta}}(s,a) - V^{\pi_{\theta}}(s)$. This is called the **advantage function**. Since $V^{\pi_{\theta}}(s)$ does not depend on the action, it acts as a baseline $b$ and plays no role in the expectation. Applying **expr 1** with this baseline results in the following:
    $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(S,A)\, A^{\pi_{\theta}}(S,A)]$$

- TD Actor-Critic

  - We now apply approximation again, this time to the advantage function: the TD error $\delta^{\pi_{\theta}} = r + \gamma V^{\pi_{\theta}}(s') - V^{\pi_{\theta}}(s)$ is an unbiased estimate of $A^{\pi_{\theta}}(s,a)$, so we can use it in place of the advantage in the policy gradient.
  - In practice we use an approximate TD error $\delta_v = r + \gamma V_v(s') - V_v(s)$, computed from an estimated value function $V_v(s)$.
  - So now the actor update is $\Delta\theta = \alpha\, \delta_v\, \nabla_{\theta}\log\pi_{\theta}(s,a)$ and the critic update is $\Delta v = \beta\, \delta_v\, \nabla_v V_v(s)$.
  - Note that with this variant the critic can estimate the value function $V_v(s)$ from many targets at different time-scales (MC, TD(0), forward/backward TD($\lambda$)). A minimal code sketch of one such update follows below.
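
Below is a minimal sketch of one TD actor-critic update, with a linear state-value critic $V_v(s) = x(s)^{\top} v$. The state feature vectors `x_s` / `x_next` come from a hypothetical state featurization, distinct from the action features used by the policy:

```python
import numpy as np

def td_actor_critic_step(policy, v, phi_sa, a, r, x_s, x_next, done,
                         alpha=0.01, beta=0.05, gamma=0.99):
    """One TD(0) actor-critic update - a minimal sketch.

    Critic: linear state-value function V_v(s) = x(s) . v.
    Approximate TD error: delta = r + gamma * V_v(s') - V_v(s).
    Actor:  theta <- theta + alpha * delta * grad log pi(a | s).
    x_s / x_next: state feature vectors (hypothetical featurization),
    phi_sa: action features used by the SoftmaxPolicy sketch above.
    """
    v_next = 0.0 if done else x_next @ v
    delta = r + gamma * v_next - (x_s @ v)                     # advantage estimate

    policy.theta += alpha * delta * policy.score(phi_sa, a)    # actor update
    v += beta * delta * x_s                                     # critic update (TD(0))
    return delta
```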

- Natural Actor-Critic

  - Refer to the Policy Gradient lecture for more on this topic.
