# Reinforcement Learning - Theoretical Foundations: Part II

## Dynamic Programming in RL

### Introduction

- DP assumes full knowledge of the MDP
**A prediction problem**: The input is an MDP/MRP and a policy. The output is a value function . **A control problem**: The input is an MDP. The output is the optimal value functionand an optimal policy .

### Synchronous DP

The following table summarizes the type of problems that is solved synchronously via iteration/evaluation algorithms:

Problem | Bellman Equation | Algorithm |
---|---|---|

Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |

Control | Bellman Expectation Equation Policy Iteration + Greedy Policy Improvement | Policy Iteration |

Control | Bellman Optimality Equation | Value Iteration |

### Iterative Policy Evaluation

- Problem: evaluate a given policy
**Algo sketch**:- Assign each state with an initial value (for example:
) - Following the policy, compute the updated value function
using the Bellman Expectation Equation - Iterate until convergence (proven later)

- Assign each state with an initial value (for example:

### Policy Improvement

Upon Evaluation of a policy

, we can seek to greedily improve the policy such that we obtain . ( **expr 1**)The greedy approach acts as selecting

. ( **eq 1**)We can prove that

**eq 1**leads to**expr 1**as follows:In one step:

. Note that

is a deterministic policy. Observe that Hence

Basically, we find that this method is equivalent to solving the Bellman Optimality equation. So we obtain

as an optimal policy Note that this process of policy iteration always converges to

. **Drawback**: Policy Iteration always Evaluation an entire Policy before it starts to improve on the policy. This may be highly inefficient if the evaluation of a policy takes very long time (e.g. infinite MDP)To deal with the

**Drawback**, we utilise**DP**Value Iteration.

### Value Iteration

- We improve the value function
in each iteration - Note that we are
**only**improving the value function, where this value function is based on any explicit policy - Intuition: start with final rewards (again all 0 for example) and work backwards
- Now assume we know the solution to a subproblem
, then we can find by one-step look ahead: - Therefore, we can always update the value function in each iteration backwards until convergence.

### Contraction Mapping Theorem

- To be updated upon publishing the markdown
- Refer to page 28 - 42 (DP)

### Asynchronous DP

There are 3 simple ideas, which I haven't learning in detail:

- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming

Reinforcement Learning - Theoretical Foundations: Part II