Reinforcement Learning - Theoretical Foundations: Part I

2021-01-04Blogs

Introduction

Recently I've been learning about reinforcement learning from amazing lectures from David Silver. These provide an overview of the classical algorithms in RL and potential challenges for future researches, in the subsequent blogs, I'll talk about the major aspects of RL and provide some solid math details on how algorithms in RL is executed.

What is Reinforcement Learning

An RL agent may include one or more of these components:

Policy: agent’s behaviour function
- It is a map from state to action
- Deterministic policy: a =
- Stochastic policy:
Value function: how good is each state and/or action
- Value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions
Model: agent’s representation of the environment
- A model predicts what the environment will do next
- predicts the next state
- predicts the next (immediate) reward, this often takes the form of expectation

It is derived for a classical problem - Exploration vs Exploitation. There is no supervisor, only a reward signal. Sometimes, feedback is delayed, not instantaneous. Time really matters (sequential, non i.i.d data); and Agent's actions affect the subsequent data it receives.

Prediction vs Control

RL problems is often classified into a prediction problem or a control problem

Prediction: Given a policy, evaluate the future
Control: Find the best policy

Markov Decision Process (MDP)

Before venturing into the exact algorithms, let's lay out some fundamental math concepts here.

Prior Knowledge

Basic Probability Knowledge
Basic Stochastic Process:
- Markov Chain
- Transition Matrix

Problem setup

This is an RL setting where the environment is fully observable
The current state completely characterises the process
Almost all RL problems can be formalised as MDPs (Bandit are MDP with 1 state & finite/infinite actions)

Terminologies

Markov Reward Process

A Markov reward process is a Markov chain with values.
In addition to , we have reward function and a discount factor

Return

The return is the total discounted reward from time-step :
Intuitively, the discount factor favours future rewards at a nearer date

State-value function

The state value function of an MRP is the expected return starting from state

Bellman Equation

We apply one-step analysis on and observe that:
Therefore, we find that value function of any state is only dependent on its outgoing states (successors)
We can then construct a matrix equation: , which can be solved in (hence very expensive)
For large state set, more efficient algorithms utilising Dynamic programming (DP) is preferred

MDP

A Markov decision process (MDP) is a Markov reward process with decisions
In addition to states, we have a set of actions
The probability and reward functions now conditionally depend on and simultaneous as follows

Policy

A policy is a distribution over actions given states:
A policy is stationary (the probability does not change for different iterations)

State-value function and Action-value function for MDP (Differs from 3.)

State-value function: the expectation taken w.r.t the policy
Action-value function:
Note that

Applying Bellman equation on and

Note that we cannot do simple one-step analysis since there are 2 variables and now.
Instead, we try to apply OSA on w.r.t , and then do OSA on w.r.t to get Bellman Expectation Equation on , and swap the 2 variables’ order to derive equation for
First step (Bellman Expectation Equation):
Second step (Bellman Optimality Equation):

Optimality

We try to maximize and (Goal 1)
Theorem suggests existence of policy that achieves Goal 1 deterministically.
An optimal policy can be found by choosing actions for each state such that the choices maximise over

Solving for optimality

The Bellman equations in 8 is often nonlinear and has no closed form in general
We often need to apply sequential methods to solve it
Value iteration
Policy Iteration
Q-Learning
Sarsa

To read more on extensions, refer to Page 49 of this slides.

Reinforcement Learning - Theoretical Foundations: Part I

https://criss-wang.github.io/post/blogs/rl/reinfocement-1/

Author

Zhenlin Wang

Posted on

2021-01-04

Updated on

2022-08-19

Reinforcement Learning - Theoretical Foundations: Part I

Introduction

What is Reinforcement Learning

Prediction vs Control

Markov Decision Process (MDP)

Prior Knowledge

Problem setup

Terminologies

Author

Posted on

Updated on

Licensed under

Tags

Catalogue