Reinforcement Learning - Theoretical Foundations: Part II
Dynamic Programming in RL
Introduction
- DP assumes full knowledge of the MDP
- A prediction problem: the input is an MDP/MRP and a policy $\pi$. The output is the value function $v_\pi$.
- A control problem: the input is an MDP. The output is the optimal value function $v_*$ and an optimal policy $\pi_*$.
Synchronous DP
The following table summarizes the types of problems that are solved synchronously via iteration/evaluation algorithms:
Problem | Bellman Equation | Algorithm |
---|---|---|
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
Control | Bellman Optimality Equation | Value Iteration |
Iterative Policy Evaluation
- Problem: evaluate a given policy $\pi$
- Algo sketch (a code sketch follows this list):
  - Assign each state an initial value (for example, $v_1(s) = 0$ for all $s$)
  - Following the policy, compute the updated value function $v_{k+1}$ from $v_k$ using the Bellman Expectation Equation:
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$
  - Iterate until convergence (proven later via the Contraction Mapping Theorem)
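To make the sweep concrete, here is a minimal NumPy sketch of synchronous iterative policy evaluation. The array layout (`P[a, s, s']` for transition probabilities, `R[a, s]` for expected rewards, `policy[s, a]` for action probabilities) and the tolerance `theta` are assumptions for illustration, not something fixed by these notes.

```python
import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Evaluate a fixed policy with repeated synchronous Bellman expectation backups.

    Assumed shapes: P is (A, S, S), R is (A, S), policy is (S, A).
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                        # initialise every state value to 0
    while True:
        q = R + gamma * (P @ v)                   # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * v[s']
        v_new = np.einsum("sa,as->s", policy, q)  # average over actions under the policy
        if np.max(np.abs(v_new - v)) < theta:     # stop once a full sweep barely changes v
            return v_new
        v = v_new
```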
Policy Improvement
Upon evaluation of a policy $\pi$, we can seek to greedily improve the policy such that we obtain

$$v_{\pi'}(s) \ge v_\pi(s). \quad \text{(expr 1)}$$

The greedy approach acts by selecting

$$\pi'(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} q_\pi(s, a). \quad \text{(eq 1)}$$

We can prove that eq 1 leads to expr 1 as follows.

In one step: $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$. Note that $\pi'$ is a deterministic policy. Observe that

$$v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \le \cdots \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] = v_{\pi'}(s).$$

Hence $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$. If the improvement ever stops, i.e. $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$, then the Bellman Optimality Equation is satisfied. Basically, we find that this method is equivalent to solving the Bellman Optimality Equation, so we obtain $\pi$ as an optimal policy. Note that this process of policy iteration always converges to $v_*$.
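As a sketch of how the greedy step can be combined with the evaluation routine above (same assumed array layout; `iterative_policy_evaluation` is the function from the previous sketch, not a library call):

```python
import numpy as np

def greedy_improvement(P, R, v, gamma=0.9):
    """pi'(s) = argmax_a q_pi(s, a), returned as a one-hot (S, A) matrix."""
    q = R + gamma * (P @ v)                      # q[a, s] from a one-step look-ahead on v
    best = np.argmax(q, axis=0)                  # maximising action in each state
    n_actions, n_states = q.shape
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), best] = 1.0      # deterministic greedy policy
    return policy

def policy_iteration(P, R, gamma=0.9):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.ones((n_states, n_actions)) / n_actions        # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, policy, gamma)   # evaluate the current policy fully
        improved = greedy_improvement(P, R, v, gamma)
        if np.array_equal(improved, policy):                   # greedy step changes nothing => optimal
            return policy, v
        policy = improved
```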
Drawback: Policy Iteration always evaluates an entire policy before it starts to improve on it. This may be highly inefficient if evaluating a policy takes a very long time (e.g. an infinite MDP).
To deal with this drawback, we utilise another DP approach: Value Iteration.
Value Iteration
- We improve the value function $v$ in each iteration
- Note that we are only improving the value function; this value function is not based on any explicit policy
- Intuition: start with the final rewards (again, all 0 for example) and work backwards
- Now assume we know the solution to a subproblem $v_*(s')$; then we can find $v_*(s)$ by a one-step look-ahead:
  $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \Big)$$
- Therefore, we can keep updating the value function in each iteration, working backwards until convergence (a code sketch follows this list)
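Under the same assumed array layout as the earlier sketches, a minimal value-iteration sketch could look as follows; no explicit policy is maintained during the iterations, and a greedy policy is only read off at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality backup until the value function converges.

    Assumed shapes: P is (A, S, S), R is (A, S).
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                       # start from "final rewards" of 0 and work backwards
    while True:
        q = R + gamma * (P @ v)                  # one-step look-ahead for every (action, state) pair
        v_new = np.max(q, axis=0)                # Bellman optimality backup: best action per state
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    greedy = np.argmax(R + gamma * (P @ v), axis=0)  # extract a greedy (optimal) policy from v_*
    return v, greedy
```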
Contraction Mapping Theorem
- To be updated upon publishing the markdown
- Refer to pages 28-42 (DP)
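Until that update lands, here is a brief sketch of the standard statement this section will cover (my summary, not taken from the slides): the Bellman expectation operator is a $\gamma$-contraction in the max-norm, so the Banach fixed-point theorem gives a unique fixed point $v_\pi$ and convergence of iterative policy evaluation; the analogous argument with the Bellman optimality operator gives convergence of value iteration to $v_*$.

```latex
% Sketch (standard result): the Bellman expectation operator
%   (T^{\pi} v)(s) = \sum_{a} \pi(a \mid s) \big( \mathcal{R}_s^{a} + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} v(s') \big)
% is a gamma-contraction in the infinity norm:
\[
  \lVert T^{\pi} u - T^{\pi} v \rVert_{\infty} \;\le\; \gamma \, \lVert u - v \rVert_{\infty},
  \qquad 0 \le \gamma < 1 .
\]
% By the contraction mapping (Banach fixed-point) theorem, T^pi has a unique fixed point v_pi
% and v_{k+1} = T^pi v_k converges to it; replacing T^pi by the Bellman optimality operator T^*
% gives the analogous result for value iteration and v_*.
```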
Asynchronous DP
There are 3 simple ideas, which I haven't studied in detail:
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming