Reinforcement Learning - Theoretical Foundations: Part V

Policy Gradient

1. General Overview

  • Model-based RL:

    • Pros
      • 'Easy' to learn: via procedure similar to supervised learning
      • Learns everything from the data
    • Cons
      • The modelling objective captures irrelevant information
      • May spend capacity on irrelevant details
      • Computing the policy from the model (planning) is non-trivial and expensive (offline/static evaluation)
  • Value-based RL:

    • Pros
      • Closer to the true objective
      • Fairly well understood; somewhat similar to regression
    • Cons
      • Still not the true objective (similar to model-based)
  • Policy-based RL:

    • Pros
      • Directly targets the true objective (the policy itself)
    • Cons
      • Ignores other useful learnable knowledge (such as state/action values) [but this can be overcome by combining with value-function approximation]

2. Model-free Policy-based RL

  • We directly parametrize the policy: $\pi_{\theta}(s, a) = \mathbb{P}[a \mid s; \theta]$
  • Advantages:
    • Better convergence properties (gradient method)
    • Effective in high-dimensional or continuous action spaces
    • Can learn stochastic optimal policies (as needed in, e.g., a rock-paper-scissors game)
    • Sometimes policies are simple while values and models are more complex (large environment but easy policy)
  • Disadvantages:
    • Susceptible to local optima (especially with non-linear function approximator)
    • Often obtains knowledge that is specific and does not always generalize well (high variance)
    • Ignores a lot of information in the data (when used in isolation)
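To make the parametrization concrete, a common choice is a linear softmax policy over state-action features. The sketch below is a minimal illustration; the features `phi` are hypothetical random vectors, not anything prescribed by the notes.

```python
import numpy as np

def softmax_policy(theta, features):
    """Softmax policy: pi(a|s) proportional to exp(theta . phi(s, a)).

    `features` is an (n_actions, d) matrix of hypothetical
    state-action features phi(s, a); `theta` is the d-dim parameter.
    """
    prefs = features @ theta
    prefs -= prefs.max()              # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))         # 3 actions, 4 features (made up)
theta = np.zeros(4)                   # at theta = 0 the policy is uniform
print(softmax_policy(theta, phi))     # uniform over the 3 actions
```

Because the policy is an explicit differentiable function of $\theta$, its gradient (and hence the score function used below) is available in closed form.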

3. Policy Objective Functions

Here is a list of functions that can potentially be used to measure the quality of a policy $\pi_{\theta}$. Which one to use depends on the setting we are concerned about:

  • In episodic environments, we can use the start value: $J_1(\theta) = V^{\pi_{\theta}}(s_1)$
  • In continuing environments, we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_{\theta}}(s)\, V^{\pi_{\theta}}(s)$
    • where $d^{\pi_{\theta}}(s)$ is the probability of being in state $s$ in the long run (the long-term proportion of time spent in $s$ under $\pi_{\theta}$)
  • Otherwise, we can replace the value function with the reward function to get the average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_{\theta}}(s) \sum_a \pi_{\theta}(s,a)\, \mathcal{R}^a_s$

Since the main target is now to optimize $J(\theta)$, we can simply apply a gradient-based method to solve the problem (in this case, gradient ascent): $\Delta\theta = \alpha\, \nabla_{\theta} J(\theta)$, where $\alpha$ is a step-size parameter.
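As a concrete illustration of the average-reward objective, here is a small sketch on a made-up two-state, two-action MDP (all transition probabilities, rewards, and the policy are invented for the example): compute the long-run state distribution $d^{\pi_{\theta}}$ by power iteration, then evaluate $J_{avR}$.

```python
import numpy as np

# Toy 2-state MDP with 2 actions; all numbers are made up for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                    # R[s, a]
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                   # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

# State-to-state transition matrix under pi
P_pi = np.einsum('sa,sat->st', pi, P)

# Long-run state distribution d: fixed point of d = d P_pi (power iteration)
d = np.ones(2) / 2
for _ in range(1000):
    d = d @ P_pi

# Average-reward objective J_avR = sum_s d(s) sum_a pi(a|s) R(s, a)
J_avR = d @ (pi * R).sum(axis=1)
print(d, J_avR)
```

For this fixed policy the objective is a plain number; policy-gradient methods adjust $\theta$ to push that number up.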

4. Computing the policy gradient analytically

We know that the most important part of any policy objective is the policy expression $\pi_{\theta}(s,a)$ itself. Hence, assuming that $\pi_{\theta}$ is differentiable (and non-zero wherever we evaluate it), we find:

$$\nabla_{\theta} \pi_{\theta}(s,a) = \pi_{\theta}(s,a)\, \frac{\nabla_{\theta} \pi_{\theta}(s,a)}{\pi_{\theta}(s,a)} = \pi_{\theta}(s,a)\, \nabla_{\theta} \log \pi_{\theta}(s,a)$$

We then say that the score function (gradient base) of a policy is $\nabla_{\theta} \log \pi_{\theta}(s,a)$.

We further note that if $J(\theta)$ is an expectation dependent on $\pi_{\theta}$, i.e., $J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)]$, we can always apply this gradient base inside the expectation for gradient computation:

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)\, \nabla_{\theta} \log \pi_{\theta}(S,A)]$$

  • This is called the score function trick.
  • One useful property is that $\mathbb{E}_{\pi_{\theta}}[b\, \nabla_{\theta} \log \pi_{\theta}(S,A)] = 0$ if $b$ does not depend on the action $A$. (expr 1)
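We can verify expr 1 exactly for a single-state tabular softmax policy, where the score of action $a$ has the closed form $e_a - \pi$. The preferences and the baseline below are arbitrary numbers chosen for the sketch.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.5, -1.2, 2.0])   # arbitrary tabular action preferences
pi = softmax(theta)

# Score of action a under a tabular softmax: grad_theta log pi(a) = e_a - pi
scores = np.eye(len(theta)) - pi     # row a holds the score of action a

# E_pi[b * score] = b * sum_a pi(a) (e_a - pi) = b * (pi - pi) = 0
b = 3.7                              # any action-independent baseline
expectation = b * (pi @ scores)
print(expectation)                   # numerically zero in every component
```

The cancellation holds for any $b$ that is constant in the action, which is exactly what the advantage actor-critic below exploits.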

Now, more formally, we have the following Policy Gradient Theorem:

  • For any differentiable policy $\pi_{\theta}$, the policy gradient is
    $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, Q^{\pi_{\theta}}(S,A)]$$
    where $Q^{\pi_{\theta}}(s,a)$ is the long-term value of taking action $a$ in state $s$ and following $\pi_{\theta}$ thereafter.

  • Proof: Consider the expected return as the objective, $J(\theta) = \mathbb{E}_{\tau}[R(\tau)]$, where $\tau = (S_0, A_0, R_1, S_1, A_1, \dots)$ is the trajectory and $R(\tau)$ is its total (discounted) reward. Note that $\tau$ depends on $\theta$, as the policy affects which trajectories occur. Now:

    $$\nabla_{\theta}J(\theta) = \nabla_{\theta} \sum_{\tau} p_{\theta}(\tau)\, R(\tau) = \mathbb{E}_{\tau}[R(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)]$$

    • where $p_{\theta}(\tau) = p(S_0) \prod_t \pi_{\theta}(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)$ is the probability of trajectory $\tau$ under the policy. (Applying the score function trick.)

    • Since the dynamics terms do not depend on $\theta$, we have $\nabla_{\theta} \log p_{\theta}(\tau) = \sum_t \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)$. So:
      $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\tau}\Big[\sum_t R(\tau)\, \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)\Big]$$
    • We further notice that for a fixed $t$, the rewards $R_1, \dots, R_t$ received up to time $t$ do not depend on $A_t$. By expr 1 above, $\mathbb{E}[R_k\, \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t)] = 0$ for every $k \le t$, so these terms can be dropped. What remains is the return from time $t$ onwards, whose conditional expectation given $(S_t, A_t)$ is exactly $Q^{\pi_{\theta}}(S_t, A_t)$, which proves the theorem.
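The resulting sample-based estimator is the classic REINFORCE update. Here is a minimal sketch on a hypothetical one-state problem (a bandit, so each trajectory is a single step and the return is the immediate reward), with a tabular softmax policy; the reward means, noise level, and step size are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# One-state 'bandit' stand-in for an MDP: expected reward per action
true_means = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)
alpha = 0.1

for _ in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = true_means[a] + rng.normal(scale=0.1)   # sampled return R(tau)
    score = -pi
    score[a] += 1.0                             # grad log pi(a) = e_a - pi
    theta += alpha * r * score                  # stochastic gradient ascent

print(softmax(theta))   # most probability mass on the best action
```

Each update is an unbiased sample of the policy gradient, so on average the policy drifts toward the highest-return action.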

5. Actor-Critic Algorithm

  1. Most basic Q Actor-Critic
  • The policy gradient estimate still has high variance
  • We can use a critic to estimate the action-value function: $Q_w(s,a) \approx Q^{\pi_{\theta}}(s,a)$
  • Actor-critic algorithms maintain two sets of parameters
    • Critic: updates action-value function parameters $w$
    • Actor: updates policy parameters $\theta$, in the direction suggested by the critic
  • Actor-critic algorithms follow an approximate policy gradient:
    $$\nabla_{\theta}J(\theta) \approx \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, Q_w(S,A)]$$
  • The critic is solving a familiar problem: policy evaluation, which can now be solved using value-based methods.
  • However, this approximation of the policy gradient introduces bias, and a biased policy gradient may not find the right solution. If we choose the value function approximation carefully, this bias can be removed. This is possible because of the Compatible Function Approximation Theorem below:
    • If the following two conditions are satisfied:
      1. The value function approximator is compatible with the policy: $\nabla_w Q_w(s,a) = \nabla_{\theta} \log \pi_{\theta}(s,a)$
      2. The value function parameters $w$ minimise the mean-squared error $\varepsilon = \mathbb{E}_{\pi_{\theta}}[(Q^{\pi_{\theta}}(S,A) - Q_w(S,A))^2]$
    • Then $\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, Q_w(S,A)]$ exactly.
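A small numerical sketch of the theorem for a single-state tabular softmax policy (the "true" action values are invented): with compatible features $\phi(a) = \nabla_{\theta} \log \pi_{\theta}(a)$ and $w$ solving the $\pi$-weighted least-squares problem, the approximate gradient matches the exact one.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.4, -0.3, 1.1])
pi = softmax(theta)
q_true = np.array([2.0, -1.0, 3.5])          # invented 'true' action values

# Condition 1 (compatible features): phi(a) = grad_theta log pi(a) = e_a - pi
phi = np.eye(3) - pi                         # row a is phi(a)

# Condition 2: w minimises the pi-weighted mean-squared error
W = np.diag(pi)
w, *_ = np.linalg.lstsq(phi.T @ W @ phi, phi.T @ W @ q_true, rcond=None)
q_w = phi @ w

# Exact and approximate policy gradients coincide
g_true = phi.T @ (pi * q_true)
g_approx = phi.T @ (pi * q_w)
print(g_true, g_approx)
```

The reason is visible in the normal equations: minimising the weighted MSE forces the approximation error $Q^{\pi} - Q_w$ to be orthogonal (under $\pi$) to the features, which are exactly the score directions used by the gradient.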
  2. Advantage Actor-Critic
  • Recall expr 1: we can again subtract an action-independent baseline from $Q^{\pi_{\theta}}(s,a)$ to further reduce the variance introduced by large values.
  • Consider $A^{\pi_{\theta}}(s,a) = Q^{\pi_{\theta}}(s,a) - V^{\pi_{\theta}}(s)$. This is called the advantage function. Since $V^{\pi_{\theta}}(s)$ does not depend on the action, it plays no role in the expectation, and applying expr 1 gives:
    $$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(S,A)\, A^{\pi_{\theta}}(S,A)]$$
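A quick numerical sketch of why the baseline helps (the action values, with a large common offset, are invented): both estimators have the same expectation, but the advantage-weighted one has much lower variance when the values share a large constant component.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

q = np.array([10.2, 10.5, 10.9])      # action values with a large common offset
theta = np.array([0.3, -0.1, 0.4])
pi = softmax(theta)
v = pi @ q                            # V(s) = E_pi[Q(s, A)], the baseline

plain, baselined = [], []
for _ in range(20000):
    a = rng.choice(3, p=pi)
    score = -pi
    score[a] += 1.0                   # grad log pi(a) = e_a - pi
    plain.append(q[a] * score)            # Q-weighted estimate
    baselined.append((q[a] - v) * score)  # advantage-weighted estimate

plain, baselined = np.array(plain), np.array(baselined)
print(plain.mean(axis=0), baselined.mean(axis=0))    # roughly equal (unbiased)
print(plain.var(axis=0).sum(), baselined.var(axis=0).sum())
```

Subtracting $V(s)$ removes the common offset before it gets multiplied by the score, which is where most of the variance came from.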
  3. TD Actor-Critic
  • We now apply approximation again on $A^{\pi_{\theta}}$ using an estimated TD error: $\delta^{\pi_{\theta}} = R + \gamma V^{\pi_{\theta}}(S') - V^{\pi_{\theta}}(S)$ is an unbiased estimate of the advantage function, so it can be used in its place in the policy gradient.
  • In practice we can use an approximate TD error $\delta_v = R + \gamma V_v(S') - V_v(S)$, where $V_v$ is the critic's estimate of the value function.
  • So now the critic only needs to learn $V_v$, updated by $\Delta v = \beta\, \delta_v\, \nabla_v V_v(S)$, while the actor follows $\Delta\theta = \alpha\, \delta_v\, \nabla_{\theta} \log \pi_{\theta}(S,A)$.
  • Note that in this variant the critic can estimate the value function from targets at many different time-scales (MC, TD(0), forward/backward TD(λ)).
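Putting the pieces together, here is a minimal tabular TD(0) actor-critic sketch on a made-up two-state MDP (all dynamics, rewards, discount, and step sizes are assumptions for illustration): the critic tracks $V_v$ with TD(0), and the actor ascends $\delta_v \nabla_{\theta} \log \pi_{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 2-state, 2-action MDP (invented dynamics and rewards)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])   # P[s, a, s']
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])                 # R[s, a]
gamma = 0.9

theta = np.zeros((2, 2))   # actor: tabular softmax preferences per state
v = np.zeros(2)            # critic: tabular state values
alpha, beta = 0.05, 0.1    # actor / critic step sizes
s = 0
for _ in range(20000):
    pi = softmax(theta[s])
    a = rng.choice(2, p=pi)
    s2 = rng.choice(2, p=P[s, a])
    r = R[s, a]
    delta = r + gamma * v[s2] - v[s]       # TD error, stand-in for advantage
    v[s] += beta * delta                   # critic: TD(0) update
    grad = -pi
    grad[a] += 1.0                         # grad log pi(a|s) = e_a - pi
    theta[s] += alpha * delta * grad       # actor: policy gradient step
    s = s2

print(softmax(theta[0]), softmax(theta[1]))
```

In this toy MDP the rewarding action is action 1 in state 0 and action 0 in state 1, and the learned policy concentrates on those.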
  4. Natural Actor-Critic



Zhenlin Wang
