Reinforcement Learning - Theoretical Foundations: Part V
Policy Gradient
1. General Overview
Model-based RL:
- Pros
    - 'Easy' to learn: via a procedure similar to supervised learning
    - Learns everything from the data
- Cons
    - Objective captures irrelevant information
    - May focus on irrelevant details
    - Computing the policy (planning) is non-trivial and expensive (offline/static evaluation)
Value-based RL:
- Pros
    - Closer to the true objective
    - Fairly well understood: somewhat similar to regression
- Cons
    - Still not the true objective (similar to model-based)
Policy-based RL:
- Pros
    - Directly targets the true objective (the policy itself)
- Cons
    - Ignores some useful learnable knowledge (such as state/action values) [but this can be overcome by combining it with value-function approximation]
2. Model-free Policy-based RL
- We directly parametrize the policy: $\pi_{\theta}(s, a) = \mathbb{P}[A_t = a \mid S_t = s, \theta]$ (see the sketch after this list)
- Advantages:
    - Better convergence properties (gradient method)
    - Effective in high-dimensional or continuous action spaces
    - Can learn stochastic optimal policies (e.g. in a scissors-paper-stone game)
    - Sometimes policies are simple while values and models are more complex (large environment but easy policy)
- Disadvantages:
    - Susceptible to local optima (especially with non-linear function approximators)
    - Often obtains knowledge that is specific and does not always generalize well (high variance)
    - Ignores a lot of information in the data (when used in isolation)
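Below is a minimal sketch of one way to realize such a direct parametrization: a linear softmax policy over per-action preferences. The feature matrix `PHI`, the dimensions, and the function names are illustrative assumptions, not something specified in these notes.

```python
import numpy as np

# Minimal sketch of a directly parametrized stochastic policy: a linear
# softmax over per-action preferences. Features and dimensions are toy choices.

N_ACTIONS, N_FEATURES = 3, 4
rng = np.random.default_rng(0)
PHI = rng.standard_normal((N_ACTIONS, N_FEATURES))  # toy per-action feature vectors

def policy(theta, state_features):
    """pi_theta(s, .) proportional to exp(theta . phi(s, a))."""
    prefs = (PHI * state_features) @ theta           # one preference per action
    prefs -= prefs.max()                             # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

theta = 0.1 * rng.standard_normal(N_FEATURES)
state_features = np.ones(N_FEATURES)                 # toy state encoding
probs = policy(theta, state_features)
action = rng.choice(N_ACTIONS, p=probs)              # stochastic policy: sample an action
print(probs, action)
```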
3. Policy Objective Functions
Here is a list of functions that can potentially be used to measure the quality of a policy $\pi_{\theta}$:
- In episodic environments, we can use the start value:
$$J_1(\theta) = V^{\pi_{\theta}}(s_1) = \mathbb{E}_{\pi_{\theta}}[v_1]$$
- In continuing environments, we can use the average value:
$$J_{avV}(\theta) = \sum_s d^{\pi_{\theta}}(s)\, V^{\pi_{\theta}}(s)$$
    - where $d^{\pi_{\theta}}(s)$ is the stationary distribution of the Markov chain induced by $\pi_{\theta}$, i.e. the probability of being in state $s$ in the long run (long-term proportion of time spent in $s$)
- Otherwise, we replace the value function with the reward function and use the average reward per time-step:
$$J_{avR}(\theta) = \sum_s d^{\pi_{\theta}}(s) \sum_a \pi_{\theta}(s,a)\, \mathcal{R}_s^a$$

Since the main target is now to optimize $J(\theta)$, policy-based RL becomes an optimization problem: find the $\theta$ that maximizes $J(\theta)$, for example by gradient ascent on $\nabla_{\theta} J(\theta)$ (a small numerical illustration of $J_{avR}$ is given below).
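As a concrete illustration of the average-reward objective, the following sketch computes the stationary distribution $d^{\pi}$ and $J_{avR}$ for a made-up two-state, two-action MDP under a fixed stochastic policy; the arrays `P`, `R`, and `pi` are arbitrary numbers chosen only for this example.

```python
import numpy as np

# Numerical illustration of J_avR = sum_s d(s) sum_a pi(s,a) R(s,a)
# on a toy 2-state, 2-action MDP with made-up numbers.

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                # pi(s, a): some fixed stochastic policy
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(s,a) P[s,a,s']
P_pi = np.einsum('sa,sab->sb', pi, P)

# Stationary distribution d (long-term proportion of time in each state),
# found here by simple power iteration.
d = np.full(2, 0.5)
for _ in range(1000):
    d = d @ P_pi
d /= d.sum()

# Average reward per time-step under pi.
J_avR = np.sum(d[:, None] * pi * R)
print("d =", d, " J_avR =", J_avR)
```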
4. Computing the policy gradient analytically
We know that the most important part of any policy function $\pi_{\theta}(s,a)$, for optimization purposes, is its gradient $\nabla_{\theta} \pi_{\theta}(s,a)$ (we assume the policy is differentiable wherever it is non-zero).
We then say that the score function of a policy is $\nabla_{\theta} \log \pi_{\theta}(s,a)$, which relates to the gradient via the likelihood-ratio identity $\nabla_{\theta} \pi_{\theta}(s,a) = \pi_{\theta}(s,a)\, \nabla_{\theta} \log \pi_{\theta}(s,a)$.
We further note that if $f(S,A)$ is any function of states and actions (not itself depending on $\theta$), then
$$\nabla_{\theta}\, \mathbb{E}_{\pi_{\theta}}[f(S,A)] = \mathbb{E}_{\pi_{\theta}}[f(S,A)\, \nabla_{\theta} \log \pi_{\theta}(S,A)]$$
- This is called the score function trick.
- One useful property is that
$$\mathbb{E}_{\pi_{\theta}}[b(S)\, \nabla_{\theta} \log \pi_{\theta}(S,A)] = 0$$
if $b$ does not depend on the action $A$. (expr 1; a quick numerical check follows)
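As a sanity check, the following self-contained snippet verifies expr 1 numerically for a softmax policy over three actions; the parametrization and numbers are arbitrary choices made only for the check.

```python
import numpy as np

# Numerical check of expr 1 for a softmax policy over 3 actions:
# E_pi[ b * grad_theta log pi_theta(A) ] = b * sum_a grad_theta pi_theta(a) = 0
# whenever b does not depend on the action.

rng = np.random.default_rng(0)
theta = rng.standard_normal(3)            # one preference per action
probs = np.exp(theta - theta.max())
probs /= probs.sum()

# Score of the softmax: grad_theta log pi(a) = onehot(a) - probs
scores = np.eye(3) - probs                # row a holds the score for action a
b = 4.2                                   # any action-independent baseline

expectation = sum(probs[a] * b * scores[a] for a in range(3))
print(expectation)                        # ~ [0, 0, 0] up to floating point
```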
Now consider the full multi-step MDP setting.
- Policy Gradient Theorem: for any differentiable policy $\pi_{\theta}$, the policy gradient is
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}[Q^{\pi_{\theta}}(S,A)\, \nabla_{\theta} \log \pi_{\theta}(S,A)]$$
where $Q^{\pi_{\theta}}$ is the long-term (action-)value.
- Proof: Let's consider the expected return
$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}}[G(\tau)], \qquad G(\tau) = \sum_{t \ge 0} \gamma^{t} R_{t+1}$$
as the objective, where $\tau = (S_0, A_0, R_1, S_1, A_1, \ldots)$ is the trajectory (filtration). Note that the distribution of $\tau$ depends on $\theta$, since the policy affects which trajectories occur. Applying the score function trick:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}[G(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)]$$
where $p_{\theta}(\tau)$ is the probability of the trajectory under the policy. Since $\log p_{\theta}(\tau) = \log \mu(S_0) + \sum_t \log \pi_{\theta}(S_t, A_t) + \sum_t \log P(S_{t+1} \mid S_t, A_t)$, only the policy terms depend on $\theta$.
- So:
$$\nabla_{\theta} J(\theta) = \mathbb{E}\Big[G(\tau) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(S_t, A_t)\Big]$$
- We further notice that, for each pair $(S_t, A_t)$, the rewards collected before time $t$ do not depend on the action $A_t$. By expr 1 above, their contribution to the term multiplying $\nabla_{\theta} \log \pi_{\theta}(S_t, A_t)$ vanishes, leaving only the return from time $t$ onwards:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\Big[\sum_{t} \gamma^{t} G_t\, \nabla_{\theta} \log \pi_{\theta}(S_t, A_t)\Big]$$
and taking the conditional expectation $\mathbb{E}[G_t \mid S_t, A_t] = Q^{\pi_{\theta}}(S_t, A_t)$ gives the stated result. Sampling this expectation from complete episodes gives a Monte-Carlo policy gradient estimator (a sketch follows below).
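Here is a minimal, self-contained sketch of this Monte-Carlo policy gradient (REINFORCE-style) on a made-up fixed-horizon problem; the environment, the tabular softmax parametrization, and the step size are assumptions made purely for illustration.

```python
import numpy as np

# Monte-Carlo policy gradient sketch on a toy fixed-horizon problem:
# 3 timesteps, 2 actions, tabular softmax policy theta[state, action],
# reward 1 for action 1 and 0 otherwise. All details are made up.

rng = np.random.default_rng(0)
H, N_STATES, N_ACTIONS = 3, 3, 2      # state = current timestep
GAMMA, ALPHA = 1.0, 0.1

def policy(theta, s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Score of the tabular softmax: onehot(a) - pi(s, .) in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

theta = np.zeros((N_STATES, N_ACTIONS))
for episode in range(2000):
    # Sample one trajectory tau under pi_theta.
    states, actions, rewards = [], [], []
    for t in range(H):
        s = t
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        r = 1.0 if a == 1 else 0.0
        states.append(s); actions.append(a); rewards.append(r)
    # Update: for each t, weight the score by the return from t onwards, G_t.
    for t in range(H):
        G_t = sum(GAMMA**k * rewards[t + k] for k in range(H - t))
        theta += ALPHA * GAMMA**t * G_t * grad_log_pi(theta, states[t], actions[t])

print(policy(theta, 0))   # probability of action 1 should approach 1
```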
5. Actor-Critic Algorithm
- Most basic Q Actor-Critic
    - The Monte-Carlo policy gradient still has high variance
    - We can use a critic to estimate the action-value function: $Q_w(s,a) \approx Q^{\pi_{\theta}}(s,a)$
    - Actor-critic algorithms maintain two sets of parameters:
        - Critic: updates action-value function parameters $w$
        - Actor: updates policy parameters $\theta$, in the direction suggested by the critic
    - Actor-critic algorithms follow an approximate policy gradient (a code sketch of the resulting loop is given at the end of this sub-section):
$$\nabla_{\theta} J(\theta) \approx \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, Q_w(S,A)], \qquad \Delta\theta = \alpha\, \nabla_{\theta} \log \pi_{\theta}(s,a)\, Q_w(s,a)$$
    - The critic is solving a familiar problem: policy evaluation, which can now be solved with value-based methods (e.g. MC or TD policy evaluation).
    - However, this approximation of the policy gradient introduces bias, and a biased policy gradient may not find the right solution. We can choose the value function approximation carefully so that this bias is removed. This is possible because of the Compatible Function Approximation Theorem below:
    - If the following two conditions are satisfied:
        - the value function approximator is compatible with the policy: $\nabla_w Q_w(s,a) = \nabla_{\theta} \log \pi_{\theta}(s,a)$
        - the value function parameters $w$ minimise the mean-squared error $\varepsilon = \mathbb{E}_{\pi_{\theta}}\big[(Q^{\pi_{\theta}}(S,A) - Q_w(S,A))^2\big]$
    - Then the policy gradient is exact:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, Q_w(S,A)]$$
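For concreteness, here is a minimal sketch of one possible Q actor-critic loop on a tabular toy MDP, with a SARSA-style TD update for the critic; the environment, step sizes, and update schedule are assumptions made for this example rather than a prescription from the notes.

```python
import numpy as np

# Tabular Q actor-critic sketch on a made-up 2-state MDP: the actor follows
# the approximate gradient grad_log_pi(s,a) * Q(s,a); the critic updates Q
# with a SARSA-style TD rule.

rng = np.random.default_rng(0)
N_S, N_A, GAMMA, ALPHA, BETA = 2, 2, 0.9, 0.05, 0.1
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]

theta = np.zeros((N_S, N_A))               # actor: tabular softmax policy
Q = np.zeros((N_S, N_A))                   # critic: tabular action values

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
a = rng.choice(N_A, p=pi(s))
for step in range(20000):
    r = R[s, a]
    s2 = rng.choice(N_S, p=P[s, a])
    a2 = rng.choice(N_A, p=pi(s2))
    # Actor: theta <- theta + alpha * grad_log_pi(s,a) * Q(s,a)
    grad_log = -pi(s); grad_log[a] += 1.0
    theta[s] += ALPHA * grad_log * Q[s, a]
    # Critic: SARSA-style TD update of Q towards r + gamma * Q(s', a')
    Q[s, a] += BETA * (r + GAMMA * Q[s2, a2] - Q[s, a])
    s, a = s2, a2

print(pi(0), pi(1))   # the policy should drift toward actions with higher long-term value
```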
- Advantage Actor-Critic
    - Recall expr 1: we can apply it again, this time to $Q^{\pi_{\theta}}(s,a)$, to further reduce the variance introduced by large values. Consider
$$A^{\pi_{\theta}}(s,a) = Q^{\pi_{\theta}}(s,a) - V^{\pi_{\theta}}(s)$$
This is called the advantage function. Since $V^{\pi_{\theta}}(s)$ does not depend on the action, it plays no role in the gradient, and applying expr 1 to the policy gradient formula gives
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, A^{\pi_{\theta}}(S,A)]$$
- TD Actor-Critic
    - We now apply approximation again, this time to the advantage function, using an estimated TD error: the TD error $\delta^{\pi_{\theta}} = R + \gamma V^{\pi_{\theta}}(S') - V^{\pi_{\theta}}(S)$ is an unbiased estimate of the advantage, since $\mathbb{E}_{\pi_{\theta}}[\delta^{\pi_{\theta}} \mid S, A] = A^{\pi_{\theta}}(S,A)$.
    - In practice we can use an approximate TD error $\delta_v = R + \gamma V_v(S') - V_v(S)$, so the actor update becomes $\Delta\theta = \alpha\, \delta_v\, \nabla_{\theta} \log \pi_{\theta}(s,a)$
    - So now the critic only needs to update the state-value function parameters $v$ (a sketch of this variant is given below).
    - Note that in this variant the critic can estimate the value function $V_v(s)$ from many targets at different time-scales (MC, TD(0), forward/backward TD($\lambda$)).
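A minimal sketch of this TD(0) actor-critic variant on the same made-up two-state MDP as in the previous sketch; again, all problem details and step sizes are illustrative assumptions.

```python
import numpy as np

# TD(0) actor-critic sketch: the critic learns a state-value table V, and the
# TD error delta = r + gamma * V[s'] - V[s] is used as a sample of the
# advantage in the actor update. Same made-up 2-state MDP as above.

rng = np.random.default_rng(1)
N_S, N_A, GAMMA, ALPHA, BETA = 2, 2, 0.9, 0.05, 0.1
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

theta = np.zeros((N_S, N_A))   # actor parameters (tabular softmax)
V = np.zeros(N_S)              # critic parameters (tabular state values)

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for step in range(20000):
    a = rng.choice(N_A, p=pi(s))
    r = R[s, a]
    s2 = rng.choice(N_S, p=P[s, a])
    delta = r + GAMMA * V[s2] - V[s]        # approximate TD error
    grad_log = -pi(s); grad_log[a] += 1.0   # score of the softmax policy
    theta[s] += ALPHA * delta * grad_log    # actor: advantage-weighted score
    V[s] += BETA * delta                    # critic: TD(0) update of V
    s = s2

print(pi(0), pi(1), V)
```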
- Natural Actor-Critic
    - Refer to the Policy Gradient lecture for more on this content.