
Reinforcement Learning - Theoretical Foundations: Part V

RL Continued - Policy Gradient

2021-01-20 · 7 min read · by Zhenlin Wang · updated 2022-01-31

Policy Gradient

1. General Overview

2. Model-free Policy-based RL

3. Policy Objective Functions

Here is a list of functions that can potentially be used to measure the quality of a policy $\pi_{\theta}$. Which one is appropriate depends on what we are concerned about:
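The usual candidates are the following (the notation here is the standard one, with $\mathcal{R}_s^a$ the expected immediate reward):

$$J_1(\theta) = V^{\pi_{\theta}}(s_1)$$

$$J_{avV}(\theta) = \sum_s d^{\pi_{\theta}}(s) V^{\pi_{\theta}}(s)$$

$$J_{avR}(\theta) = \sum_s d^{\pi_{\theta}}(s) \sum_a \pi_{\theta}(s,a) \mathcal{R}_s^a$$

In episodic environments the start value $J_1$ is natural; in continuing environments the average value $J_{avV}$ or the average reward per time-step $J_{avR}$ is used instead. Here $d^{\pi_{\theta}}(s)$ denotes the stationary distribution of the Markov chain induced by $\pi_{\theta}$.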

Since the main target is now to optimize $J(\theta)$, we can simply apply a gradient-based method to solve the problem (in this case gradient ascent, since we want to maximize $J$).
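Concretely, gradient ascent repeatedly updates the parameters in the direction of the gradient, with some step size $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta)$$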

4. Computing the policy gradient analytically

We know that the most important ingredient of any policy objective $J(\theta)$ is the policy itself, $\pi_{\theta}(s,a)$. Hence, assuming that $\pi_{\theta}(s,a)$ is differentiable in $\theta$ and non-zero wherever we evaluate it, we find:

$$\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a) \frac{\nabla_{\theta}\pi_{\theta}(s,a) }{\pi_{\theta}(s,a) } = \pi_{\theta}(s,a) \nabla_{\theta} \log \pi_{\theta}(s,a)$$

We then call $\nabla_{\theta} \log \pi_{\theta}(s,a)$ the score function of the policy $\pi_{\theta}$.
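As a quick illustration (a minimal sketch of my own, not from the original post), consider a linear softmax policy $\pi_{\theta}(s,a) \propto e^{\phi(s,a)^{\top}\theta}$, whose score function has the well-known closed form $\nabla_{\theta}\log\pi_{\theta}(s,a) = \phi(s,a) - \sum_b \pi_{\theta}(s,b)\,\phi(s,b)$. The feature matrix and sizes below are illustrative:

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi: (num_actions, num_features) feature matrix for one state.
    Returns action probabilities under a linear softmax policy."""
    logits = phi @ theta
    logits -= logits.max()            # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def score(theta, phi, a):
    """Score function grad_theta log pi_theta(s, a) for the softmax policy:
    phi(s, a) minus the policy-weighted average feature."""
    probs = softmax_policy(theta, phi)
    return phi[a] - probs @ phi

# Tiny example: 3 actions, 2 features per (state, action) pair.
rng = np.random.default_rng(0)
theta = rng.normal(size=2)
phi = rng.normal(size=(3, 2))
print(score(theta, phi, a=1))
```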

We further note that if $J(\theta)$ is an expectation under $\pi_{\theta}$, i.e., $J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)]$, we can always apply this score function inside the expectation to compute the gradient:

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[f(S,A)\,\nabla_{\theta} \log \pi_{\theta}(S,A)]$$
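A quick numerical sanity check of this identity, on a hypothetical 3-armed bandit where $f(S,A)$ reduces to a fixed reward per arm (the setup and numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, -0.2, 0.1])   # logits of a softmax policy over 3 arms
r = np.array([1.0, 2.0, 0.5])        # f(a): a fixed reward for each arm

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Analytic gradient of J(theta) = sum_a pi(a) * r(a):
# dJ/dtheta_k = pi_k * (r_k - J)
J = pi @ r
analytic = pi * (r - J)

# Score-function estimate: E[ f(A) * grad log pi(A) ],
# where grad_k log pi(a) = 1[k == a] - pi_k for a softmax policy.
samples = rng.choice(3, size=200_000, p=pi)
onehot = np.eye(3)[samples]
estimate = (r[samples][:, None] * (onehot - pi)).mean(axis=0)

print(analytic, estimate)  # the two agree up to Monte Carlo noise
```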

Now, considering $\nabla_{\theta}J(\theta)$ more formally, we have the following Policy Gradient Theorem:
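In its standard form, the theorem states that for any differentiable policy $\pi_{\theta}$ and any of the objective functions above,

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(s,a)\, Q^{\pi_{\theta}}(s,a)]$$

That is, the one-step score-function identity holds with $f(S,A)$ replaced by the long-term action value $Q^{\pi_{\theta}}(s,a)$. Estimating $Q^{\pi_{\theta}}$ is exactly the job of the critic in the Actor-Critic algorithms below.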

5. Actor-Critic Algorithm

  1. Most basic Q Actor-Critic (sketched below)
  2. Advantage Actor-Critic
  3. TD Actor-Critic
  4. Natural Actor-Critic
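As a concrete reference for the first variant, here is a minimal sketch of the basic Q Actor-Critic: the critic estimates $Q_w(s,a) = \phi(s,a)^{\top}w$ by linear TD(0), and the actor ascends $\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q_w(s,a)$. The toy chain MDP, one-hot features, and step sizes are illustrative assumptions, not part of the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_feats = 4, 2, 8

def phi(s, a):
    """One-hot feature for the (state, action) pair (illustrative choice)."""
    f = np.zeros(n_feats)
    f[s * n_actions + a] = 1.0
    return f

def policy(theta, s):
    """Linear softmax policy over actions in state s."""
    logits = np.array([phi(s, a) @ theta for a in range(n_actions)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy chain MDP: action 0 moves left, action 1 moves right;
    reward 1 for being at the right end."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

theta = np.zeros(n_feats)   # actor parameters
w = np.zeros(n_feats)       # critic parameters: Q_w(s,a) = phi(s,a) @ w
alpha, beta, gamma = 0.05, 0.1, 0.9

s = 0
a = rng.choice(n_actions, p=policy(theta, s))
for t in range(20_000):
    s2, r = step(s, a)
    a2 = rng.choice(n_actions, p=policy(theta, s2))
    q, q2 = phi(s, a) @ w, phi(s2, a2) @ w
    delta = r + gamma * q2 - q            # TD error
    w += beta * delta * phi(s, a)         # critic: linear TD(0) update
    avg_feat = policy(theta, s) @ np.array([phi(s, b) for b in range(n_actions)])
    theta += alpha * (phi(s, a) - avg_feat) * q   # actor: score * Q_w
    s, a = s2, a2

print(policy(theta, 0))  # should favour action 1 (moving right)
```

The other three variants refine this template: Advantage Actor-Critic subtracts a baseline from $Q_w$ to reduce variance, TD Actor-Critic uses the TD error itself in place of $Q_w$, and Natural Actor-Critic preconditions the actor update with the (inverse) Fisher information.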