Reinforcement Learning - Theoretical Foundations: Part II

Dynamic Programming in RL


  • DP assumes full knowledge of the MDP
  • A prediction problem: The input is an MDP/MRP and a policy . The output is a value function .
  • A control problem: The input is an MDP. The output is the optimal value function and an optimal policy .

Synchronous DP

The following table summarizes the type of problems that is solved synchronously via iteration/evaluation algorithms:

Problem Bellman Equation Algorithm
Prediction Bellman Expectation Equation Iterative Policy Evaluation
Control Bellman Expectation Equation Policy Iteration + Greedy Policy Improvement Policy Iteration
Control Bellman Optimality Equation Value Iteration

Iterative Policy Evaluation

  • Problem: evaluate a given policy
  • Algo sketch:
    1. Assign each state with an initial value (for example: )
    2. Following the policy, compute the updated value function using the Bellman Expectation Equation
    3. Iterate until convergence (proven later)

Policy Improvement

  • Upon Evaluation of a policy , we can seek to greedily improve the policy such that we obtain . (expr 1)

  • The greedy approach acts as selecting . (eq 1)

  • We can prove that eq 1 leads to expr 1 as follows:

    • In one step: .

    • Note that is a deterministic policy. Observe that

    • Hence

  • Basically, we find that this method is equivalent to solving the Bellman Optimality equation. So we obtain as an optimal policy

  • Note that this process of policy iteration always converges to .

  • Drawback: Policy Iteration always Evaluation an entire Policy before it starts to improve on the policy. This may be highly inefficient if the evaluation of a policy takes very long time (e.g. infinite MDP)

  • To deal with the Drawback, we utilise DP Value Iteration.

Value Iteration

  • We improve the value function in each iteration
  • Note that we are only improving the value function, where this value function is based on any explicit policy
  • Intuition: start with final rewards (again all 0 for example) and work backwards
  • Now assume we know the solution to a subproblem , then we can find by one-step look ahead:
  • Therefore, we can always update the value function in each iteration backwards until convergence.

Contraction Mapping Theorem

  • To be updated upon publishing the markdown
  • Refer to page 28 - 42 (DP)

Asynchronous DP

There are 3 simple ideas, which I haven't learning in detail:

  • In-place dynamic programming
  • Prioritised sweeping
  • Real-time dynamic programming

Reinforcement Learning - Theoretical Foundations: Part II


Zhenlin Wang

Posted on


Updated on


Licensed under