# Markov decision process (MDP)

(50 minutes to learn)

## Summary

A Markov Decision Process (MDP) is a mathematical framework for handling search/planning problems where the outcome of actions are uncertain (non-deterministic). MDPs aim to maximize the expected utility (minimize the expected loss) throughout the search/planning.

## Context

This concept has the prerequisites:

- conditional probability (MDP transition probabilities are conditional probabilities)
- expectation and variance (the optimal policy for a MDP is defined in terms of an expected reward)

## Goals

- Understand the four components that comprise an MDP

- Understand the following terminology: states, actions, rewards, transition functions/probabilities, policy, utility, discount parameter, q-states, value

- Understand the recursive definition of value

- Understand the justification for discount parameters (especially with infinite time MDPs)

## Core resources (read/watch one of the following)

## -Free-

→ Berkeley Artificial Intelligence CS188 (2013)

Other notes:

- The lecture provides definitions, descriptions, and examples of MDPs -- it briefly covers value iteration (a method for finding optimal policies) near the end of the lecture

→ EdX Artificial Intelligence

Location:
Week 7 - Lecture 8: Markov Decision Processes (part 1-3 and quiz 1-2)

Other notes:

- navigate between lecture material using the slider at the top

## -Paid-

→ Artificial Intelligence: a Modern Approach

A textbook giving a broad overview of all of AI.

Location:
Section 17.1 p. 645-651

## Supplemental resources (the following are optional, but you may find them useful)

## -Free-

→ Wikipedia

## See also

- There are several methods for finding the optimal policy for an MDP:
- Value iteration is an application of dynamic programming that recursively computes the value function. It does not scale well as it has a complexity of O(S^2A) for S states and A actions.
- Policy iteration is similar to value iteration, but it alternates between determining the value function given a fixed policy and choosing a policy given a fixed value function. It tends to converge much quicker than value iteration.