Reinforcement Learning - Theoretical Foundations: Part II
Dynamic Programming in RL
Introduction
- DP assumes full knowledge of the MDP
- A prediction problem: the input is an MDP/MRP and a policy $\pi$. The output is the value function $v_\pi$.
- A control problem: the input is an MDP. The output is the optimal value function $v_*$ and an optimal policy $\pi_*$.
Synchronous DP
The following table summarizes the types of problems that are solved synchronously via iteration/evaluation algorithms:
Problem | Bellman Equation | Algorithm |
---|---|---|
Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
Control | Bellman Optimality Equation | Value Iteration |
Iterative Policy Evaluation
- Problem: evaluate a given policy $\pi$
- Algo sketch (a code sketch follows this list):
  - Assign each state an initial value (for example, $v_1(s) = 0$ for all $s$)
  - Following the policy, compute the updated value function $v_{k+1}$ from $v_k$ using the Bellman Expectation Equation:
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \Big)$$
  - Iterate until convergence (proven later via the Contraction Mapping Theorem)
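To make the sweep concrete, here is a minimal NumPy sketch of synchronous iterative policy evaluation. The array layout (`P[a, s, s']` for transition probabilities, `R[a, s]` for expected rewards, `policy[s, a]` for action probabilities) and the tolerance `theta` are assumptions for illustration, not something fixed by these notes.

```python
import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Evaluate a fixed policy with repeated synchronous Bellman expectation backups.

    Assumed shapes: P is (A, S, S), R is (A, S), policy is (S, A).
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                        # initialise every state value to 0
    while True:
        q = R + gamma * (P @ v)                   # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * v[s']
        v_new = np.einsum("sa,as->s", policy, q)  # average over actions under the policy
        if np.max(np.abs(v_new - v)) < theta:     # stop once a full sweep barely changes v
            return v_new
        v = v_new
```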
Policy Improvement
Upon evaluation of a policy $\pi$, we can seek to greedily improve the policy such that we obtain

$$v_{\pi'}(s) \ge v_\pi(s). \quad \text{(expr 1)}$$

The greedy approach acts by selecting

$$\pi'(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} q_\pi(s, a). \quad \text{(eq 1)}$$

We can prove that eq 1 leads to expr 1 as follows.

In one step: $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$. Note that $\pi'$ is a deterministic policy. Observe that

$$v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \le \cdots \le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] = v_{\pi'}(s).$$

Hence $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$. If the improvement ever stops, i.e. $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$, then the Bellman Optimality Equation is satisfied. Basically, we find that this method is equivalent to solving the Bellman Optimality Equation, so we obtain $\pi$ as an optimal policy. Note that this process of policy iteration always converges to $v_*$.
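As a sketch of how the greedy step can be combined with the evaluation routine above (same assumed array layout; `iterative_policy_evaluation` is the function from the previous sketch, not a library call):

```python
import numpy as np

def greedy_improvement(P, R, v, gamma=0.9):
    """pi'(s) = argmax_a q_pi(s, a), returned as a one-hot (S, A) matrix."""
    q = R + gamma * (P @ v)                      # q[a, s] from a one-step look-ahead on v
    best = np.argmax(q, axis=0)                  # maximising action in each state
    n_actions, n_states = q.shape
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), best] = 1.0      # deterministic greedy policy
    return policy

def policy_iteration(P, R, gamma=0.9):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.ones((n_states, n_actions)) / n_actions        # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, policy, gamma)   # evaluate the current policy fully
        improved = greedy_improvement(P, R, v, gamma)
        if np.array_equal(improved, policy):                   # greedy step changes nothing => optimal
            return policy, v
        policy = improved
```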
Drawback: Policy Iteration always evaluates an entire policy before it starts to improve on it. This may be highly inefficient if evaluating a policy takes a very long time (e.g. an infinite MDP).
To deal with this drawback, we utilise another DP approach: Value Iteration.
Value Iteration
- We improve the value function $v$ in each iteration
- Note that we are only improving the value function; this value function is not based on any explicit policy
- Intuition: start with the final rewards (again, all 0 for example) and work backwards
- Now assume we know the solution to a subproblem $v_*(s')$; then we can find $v_*(s)$ by a one-step look-ahead:
  $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \Big)$$
- Therefore, we can keep updating the value function in each iteration, working backwards until convergence (a code sketch follows this list)
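Under the same assumed array layout as the earlier sketches, a minimal value-iteration sketch could look as follows; no explicit policy is maintained during the iterations, and a greedy policy is only read off at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality backup until the value function converges.

    Assumed shapes: P is (A, S, S), R is (A, S).
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                       # start from "final rewards" of 0 and work backwards
    while True:
        q = R + gamma * (P @ v)                  # one-step look-ahead for every (action, state) pair
        v_new = np.max(q, axis=0)                # Bellman optimality backup: best action per state
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    greedy = np.argmax(R + gamma * (P @ v), axis=0)  # extract a greedy (optimal) policy from v_*
    return v, greedy
```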
Contraction Mapping Theorem
- To be updated upon publishing the markdown
- Refer to pages 28-42 (DP)
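Until that update lands, here is a brief sketch of the standard statement this section will cover (my summary, not taken from the slides): the Bellman expectation operator is a $\gamma$-contraction in the max-norm, so the Banach fixed-point theorem gives a unique fixed point $v_\pi$ and convergence of iterative policy evaluation; the analogous argument with the Bellman optimality operator gives convergence of value iteration to $v_*$.

```latex
% Sketch (standard result): the Bellman expectation operator
%   (T^{\pi} v)(s) = \sum_{a} \pi(a \mid s) \big( \mathcal{R}_s^{a} + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} v(s') \big)
% is a gamma-contraction in the infinity norm:
\[
  \lVert T^{\pi} u - T^{\pi} v \rVert_{\infty} \;\le\; \gamma \, \lVert u - v \rVert_{\infty},
  \qquad 0 \le \gamma < 1 .
\]
% By the contraction mapping (Banach fixed-point) theorem, T^pi has a unique fixed point v_pi
% and v_{k+1} = T^pi v_k converges to it; replacing T^pi by the Bellman optimality operator T^*
% gives the analogous result for value iteration and v_*.
```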
Asynchronous DP
There are 3 simple ideas, which I haven't studied in detail:
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming