Markov decision process (MDP)

(50 minutes to learn)


A Markov Decision Process (MDP) is a mathematical framework for handling search/planning problems where the outcome of actions are uncertain (non-deterministic). MDPs aim to maximize the expected utility (minimize the expected loss) throughout the search/planning.


This concept has the prerequisites:


  • Understand the four components that comprise an MDP
  • Understand the following terminology: states, actions, rewards, transition functions/probabilities, policy, utility, discount parameter, q-states, value
  • Understand the recursive definition of value
  • Understand the justification for discount parameters (especially with infinite time MDPs)

Core resources (read/watch one of the following)


Berkeley Artificial Intelligence CS188 (2013)
Author: Pieter Abbeel
Other notes:
  • The lecture provides definitions, descriptions, and examples of MDPs -- it briefly covers value iteration (a method for finding optimal policies) near the end of the lecture
EdX Artificial Intelligence
Location: Week 7 - Lecture 8: Markov Decision Processes (part 1-3 and quiz 1-2)
Authors: Pieter Abbeel,Dan Klein
Other notes:
  • navigate between lecture material using the slider at the top


Supplemental resources (the following are optional, but you may find them useful)

See also

  • There are several methods for finding the optimal policy for an MDP:
    • Value iteration is an application of dynamic programming that recursively computes the value function. It does not scale well as it has a complexity of O(S^2A) for S states and A actions.
    • Policy iteration is similar to value iteration, but it alternates between determining the value function given a fixed policy and choosing a policy given a fixed value function. It tends to converge much quicker than value iteration.
    The Bellman equations characterize the optimal values for MDPs and are ubiquitous in reinforcement learning.