Reinforcement Learning: Theoretical Foundations, Part V

A practical overview of actor-critic methods, deep reinforcement learning, replay buffers, target networks, PPO-style updates, and evaluation concerns.

2021.01.20 · 1 min read · by Zhenlin Wang

Introduction

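Actor-critic methods combine two ideas: direct policy optimization (the actor) and value estimation (the critic).

The actor learns how to act, while the critic's value estimates reduce the variance of the actor's policy updates.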

Actor and Critic

The actor is a parameterized policy:

$$ \pi_\theta(a \mid s) $$

The critic estimates either a state-value function:

$$ V_\phi(s) $$

or an action-value function:

$$ Q_\phi(s,a) $$

The actor uses the critic’s estimates to improve the policy.
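As a minimal sketch of these two objects, assume a small discrete problem where both can be stored as tables. The class names `TabularActor` and `TabularCritic` and the softmax parameterization are illustrative assumptions, not something the definitions above prescribe.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TabularActor:
    """pi_theta(a | s) with one logit per (state, action) pair (assumed form)."""

    def __init__(self, n_states, n_actions):
        self.theta = [[0.0] * n_actions for _ in range(n_states)]

    def probs(self, s):
        # Action probabilities in state s.
        return softmax(self.theta[s])

    def sample(self, s, rng=random):
        # Draw one action index according to pi_theta(. | s).
        p = self.probs(s)
        return rng.choices(range(len(p)), weights=p, k=1)[0]


class TabularCritic:
    """V_phi(s) with one scalar per state (assumed form)."""

    def __init__(self, n_states):
        self.v = [0.0] * n_states

    def value(self, s):
        return self.v[s]
```

With all logits at zero the policy is uniform, so `TabularActor(3, 2).probs(0)` gives equal probability to both actions. In deep RL the tables would be replaced by neural networks, but the interfaces stay the same.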

Temporal-Difference Learning

The critic often learns from temporal-difference targets:

$$ r + \gamma V_\phi(s') $$

The TD error is:

$$ \delta = r + \gamma V_\phi(s') - V_\phi(s) $$

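When the critic is accurate for the current policy, the expected TD error equals the advantage of the action taken, so the TD error can serve as a one-sample advantage estimate for actor updates.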
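The TD error and its two uses can be sketched as a single update step. This is an illustrative one-step actor-critic update for a tabular softmax policy, not a definitive implementation: the function name, the learning rates, and the `done` flag that zeroes the bootstrap term at episode end are all assumptions added here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.5):
    """Apply one TD(0) critic update and one policy-gradient actor update.

    theta: list of per-state logit lists (actor parameters), modified in place.
    v:     list of per-state values (critic parameters), modified in place.
    Returns the TD error delta.
    """
    # TD target: r + gamma * V(s'), with no bootstrap on terminal transitions.
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]

    # Critic: move V(s) toward the TD target.
    v[s] += alpha_critic * delta

    # Actor: for a softmax policy, d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * delta * grad_log

    return delta
```

A positive TD error raises the logit of the action that was taken and lowers the others; a negative one does the opposite.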

Deep RL Stabilizers

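Deep RL uses neural networks as function approximators. This makes the actor and critic expressive, but training can become unstable: bootstrapped targets shift every time the network is updated, and consecutive samples from an environment are strongly correlated.

Common stabilizers include replay buffers, which store past transitions and sample them in random minibatches, and target networks, which are slowly updated copies of a network used to compute TD targets.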
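Two stabilizers named in the overview, replay buffers and target networks, can be sketched in a few lines. The class and function names, the flat list-of-floats parameter representation, and the default `tau` are assumptions for illustration; real implementations operate on network weight tensors.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # A transition is any tuple, e.g. (s, a, r, s_next, done).
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

A small `tau` makes the target network lag the online network, so the TD targets change slowly between updates. Some algorithms use a periodic hard copy instead, which corresponds to `tau=1.0` applied every fixed number of steps.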

PPO-Style Updates

Proximal Policy Optimization (PPO) limits how much the policy changes in one update, most commonly by clipping the probability ratio between the new and old policies. The goal is to improve the policy without taking destructive steps.

This is why PPO became a practical default in many RL settings: it balances simplicity and stability.
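To make the "limited change" idea concrete, here is a minimal sketch of the clipped surrogate objective used in the common PPO-clip variant. The function name, the use of plain Python lists of log-probabilities, and the default `clip_eps=0.2` are assumptions for illustration; it computes the objective value only and performs no gradient step.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp:   log pi_new(a|s) for each sample
    old_logp:   log pi_old(a|s) for each sample (the data-collecting policy)
    advantages: advantage estimate for each sample
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new / pi_old, computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the advantage is positive, the objective stops growing once the ratio exceeds `1 + clip_eps`, so there is no incentive to push the policy further in that update. When the advantage is negative, the unclipped term is kept if it is worse, so large harmful changes are still penalized.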

Evaluation

RL evaluation is tricky: training is stochastic, and results can vary widely from one run to the next.

Track:

Always evaluate multiple random seeds. A single lucky run is not evidence.
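A multi-seed evaluation can be summarized as in the sketch below. Note that `run_experiment` is a hypothetical stand-in that returns a seeded random number rather than training anything, and the numbers it produces are synthetic; in practice it would train an agent with the given seed and return its evaluation return.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical placeholder for 'train with this seed, return final score'."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)  # synthetic score, not real data


def summarize(returns):
    """Aggregate per-seed results into mean, spread, and range."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)        # sample standard deviation
    sem = std / len(returns) ** 0.5        # standard error of the mean
    return {"mean": mean, "std": std, "sem": sem,
            "min": min(returns), "max": max(returns)}


seeds = [0, 1, 2, 3, 4]
returns = [run_experiment(seed) for seed in seeds]
summary = summarize(returns)

if __name__ == "__main__":
    print(f"{len(seeds)} seeds: mean={summary['mean']:.1f} "
          f"std={summary['std']:.1f} range=[{summary['min']:.1f}, {summary['max']:.1f}]")
```

Reporting the spread and the range alongside the mean makes it visible when one lucky seed is carrying the result.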

Practical Warning

RL is often overkill. Use it when actions influence future states and delayed reward matters. If the task is static prediction, supervised learning is usually simpler and better.

Closing

Actor-critic methods are the bridge between value estimation and direct policy optimization. They are powerful, but their training dynamics require careful evaluation, multiple seeds, and strong baselines.