Markov decision process (MDP)
(50 minutes to learn)
Summary
A Markov Decision Process (MDP) is a mathematical framework for search/planning problems in which the outcomes of actions are uncertain (non-deterministic). Solving an MDP means choosing actions so as to maximize the expected utility (equivalently, minimize the expected loss) over the course of the search/planning.
Context
This concept has the prerequisites:
- conditional probability (MDP transition probabilities are conditional probabilities)
- expectation and variance (the optimal policy for an MDP is defined in terms of an expected reward)
Goals
- Understand the four components that make up an MDP
- Understand the following terminology: states, actions, rewards, transition functions/probabilities, policy, utility, discount parameter, q-states, value
- Understand the recursive definition of value (see the equation sketch after this list)
- Understand the justification for discount parameters (especially for infinite-horizon MDPs)
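For reference, here is the recursive (Bellman) definition of value in one common notation; the symbols are illustrative and the resources below may use slightly different ones. V*(s) denotes the value of state s under the optimal policy, T(s, a, s') the probability of reaching s' when taking action a in s, R(s, a, s') the immediate reward, and γ the discount parameter.

```latex
% Recursive (Bellman) definition of the optimal value of a state:
% the best action's expected immediate reward plus the discounted
% value of the resulting next state.
V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]
```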
Core resources (read/watch one of the following)
-Free-
→ Berkeley Artificial Intelligence CS188 (2013)
Other notes:
- The lecture provides definitions, descriptions, and examples of MDPs; it briefly covers value iteration (a method for finding optimal policies) near the end
→ EdX Artificial Intelligence
Location:
Week 7 - Lecture 8: Markov Decision Processes (parts 1-3 and quizzes 1-2)
Other notes:
- Navigate between the lecture parts and quizzes using the slider at the top of the page
-Paid-
→ Artificial Intelligence: a Modern Approach
A textbook giving a broad overview of all of AI.
Location:
Section 17.1, pp. 645-651
Supplemental resources (the following are optional, but you may find them useful)
-Free-
→ Wikipedia: Markov decision process
See also
- There are several methods for finding the optimal policy for an MDP:
- Value iteration is an application of dynamic programming that recursively computes the value function. It does not scale well, since each iteration costs O(S^2 A) time for S states and A actions (see the value-iteration sketch after this list).
- Policy iteration is similar to value iteration, but it alternates between evaluating the value function for a fixed policy and choosing a greedy policy given a fixed value function. It tends to converge in far fewer iterations than value iteration (see the policy-iteration sketch after this list).
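Below is a minimal value-iteration sketch, assuming a small tabular MDP stored in plain Python dictionaries; the representation, function name, and parameters are illustrative rather than taken from any of the resources above.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration on a hypothetical MDP.

    Assumed representation (illustrative):
      T[(s, a)] -> list of (next_state, probability) pairs
      R[(s, a, next_state)] -> immediate reward for that transition
    """
    V = {s: 0.0 for s in states}                # start with all values at zero
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: value of the best q-state reachable from s
            q_values = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                         # values barely changed: converged
            return V
```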
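A corresponding policy-iteration sketch, under the same assumed representation; for simplicity, the evaluation step uses a fixed number of iterative sweeps rather than solving the linear system exactly.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=100):
    """Tabular policy iteration; `actions` is assumed to be a list."""
    policy = {s: actions[0] for s in states}    # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: estimate V for the fixed policy by repeated sweeps
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T[(s, a)])
        # Policy improvement: act greedily with respect to the evaluated values
        stable = True
        for s in states:
            def q(a, s=s):
                return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T[(s, a)])
            best_a = max(actions, key=q)
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                              # no action changed: policy has stabilized
            return policy, V
```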