What Is a Markov Decision Process and How Does It Work?

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under your control. It gives you a structured way to figure out the best action to take at every step when you can’t predict exactly what will happen next. MDPs are the foundation of reinforcement learning and show up in fields from robotics to finance to game design.

The Four Building Blocks

An MDP is built from four components that together describe a decision problem completely.

States represent every possible situation the decision-maker could be in. If you’re navigating a robot through a warehouse, each location on the floor is a state. In a board game, each board configuration is a state.

Actions are the choices available in each state. The robot might move forward, turn left, or turn right. A stock trader might buy, sell, or hold.

Transition probabilities capture the uncertainty. When you take an action in a given state, the transition function tells you the probability of ending up in each possible next state. A robot told to move forward might have a 90% chance of actually moving forward and a 10% chance of drifting slightly to one side. This is where the “Markov” part comes in: the probability of reaching the next state depends only on your current state and action, not on anything that happened before. The system has no memory of its history.

Rewards assign a numerical score to each state-action pair. Reaching the goal might earn +100, bumping into a wall might cost -10, and every ordinary step might cost -1 to encourage efficiency. The entire point of solving an MDP is to find the sequence of actions that earns the most total reward over time.
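The four components above can be written down directly as data. Here is a minimal sketch using a hypothetical two-state warehouse example; all state names, probabilities, and reward values are invented for illustration.

```python
# The four MDP building blocks for a tiny, invented "warehouse" example.

states = ["aisle", "dock"]
actions = ["forward", "turn"]

# transitions[state][action] -> list of (next_state, probability)
transitions = {
    "aisle": {
        "forward": [("dock", 0.9), ("aisle", 0.1)],  # 10% chance of drifting
        "turn":    [("aisle", 1.0)],
    },
    "dock": {
        "forward": [("dock", 1.0)],
        "turn":    [("aisle", 1.0)],
    },
}

# rewards[(state, action)] -> immediate numerical score
rewards = {
    ("aisle", "forward"): -1,    # ordinary step, small cost
    ("aisle", "turn"):    -1,
    ("dock", "forward"):  +100,  # reaching the goal
    ("dock", "turn"):     -1,
}

# Sanity check: each state-action pair's outcome probabilities sum to 1
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in transitions[s][a]) - 1.0) < 1e-9
```

Note the Markov property in the data structure itself: the transition table is keyed only by the current state and action, with no record of how the agent got there.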

How Rewards Get Weighed Over Time

In most MDPs, you don’t just add up rewards equally. A parameter called the discount factor (represented by the Greek letter gamma) controls how much you care about future rewards compared to immediate ones. It ranges between 0 and 1. A common choice in practice is 0.99.

When the discount factor is close to 1, the decision-maker is patient and plans far ahead, giving substantial weight even to rewards earned many steps in the future. When it’s closer to 0, the decision-maker becomes short-sighted, prioritizing immediate payoffs and mostly ignoring what happens later. Setting it below 0.5 is generally too myopic, because the agent barely considers long-term consequences at all. In environments with a lot of unpredictability, a large discount factor can actually backfire, because it forces the system to weigh distant, uncertain outcomes that are hard to estimate accurately.

The total reward an agent expects to collect from a given point forward, with each future reward multiplied by one additional factor of the discount factor per step, is called the return. If you earn rewards of 5, 3, and 8 over three steps with a discount factor of 0.9, the return from the first step is 5 + (0.9 × 3) + (0.81 × 8) = 14.18.
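The arithmetic in that example is just a weighted sum, which a few lines of Python make explicit:

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, shrinking step t's reward by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return([5, 3, 8], 0.9))  # 5 + 0.9*3 + 0.81*8 = 14.18 (approx.)
```

Raising the discount factor to higher and higher powers is what makes distant rewards matter less and guarantees the infinite sum stays finite when gamma is below 1.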

Policies and Value Functions

A policy is a rule that tells the decision-maker what action to take in every state. Finding the best policy is the whole goal of solving an MDP. Policies can be deterministic, always choosing the same action in a given state, or stochastic, choosing actions according to probabilities. A stochastic robot navigating a cluttered room, for example, might favor routes through open spaces rather than narrow corridors, because the randomness in its movement makes tight paths risky.

To evaluate how good a policy is, MDPs use two types of value functions. The state-value function answers: “If I’m in this state and follow this policy from here on, how much total reward can I expect?” The action-value function is more specific: “If I’m in this state, take this particular action, and then follow the policy afterward, how much total reward can I expect?” The action-value function is especially useful because comparing its values across actions in a state directly tells you which action is best.

These value functions are connected by a key relationship called the Bellman equation, which says the value of any state equals the immediate reward you expect plus the discounted value of whatever state you end up in next. This recursive structure is what makes MDPs computationally solvable: instead of simulating every possible future, you can work backward from the relationships between neighboring states.
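A single Bellman backup is easy to show concretely. The sketch below uses a hypothetical two-successor example with invented numbers: the value of taking an action equals the expected immediate reward plus the discounted, probability-weighted value of whatever state comes next.

```python
# One Bellman backup on invented numbers: Q(s0, a) = reward + gamma * E[V(next)].

gamma = 0.9
V = {"s1": 10.0, "s2": 4.0}          # current value estimates for the successors

# Taking action "a" in state "s0": reward -1, then 80% -> s1, 20% -> s2
reward = -1.0
next_probs = {"s1": 0.8, "s2": 0.2}

Q_s0_a = reward + gamma * sum(p * V[s] for s, p in next_probs.items())
# -1 + 0.9 * (0.8*10 + 0.2*4) = -1 + 0.9 * 8.8 = 6.92
```

Solving an MDP amounts to finding values that satisfy this relationship simultaneously in every state, which is exactly what the algorithms in the next section do.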

Solving an MDP

Two classical algorithms find the optimal policy for an MDP.

Value iteration starts with an arbitrary guess for the value of every state, then repeatedly updates each state’s value using the Bellman equation. Each pass through the states brings the estimates closer to the true values, and the process is mathematically guaranteed to converge. Once you have accurate values for every state, the best policy falls out naturally: in each state, pick the action that maximizes the expected immediate reward plus the discounted value of the next state.
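The whole loop fits in a few lines. This is a minimal sketch on a hypothetical three-state chain (states 0, 1, 2, where 2 is a terminal goal); all rewards and transition probabilities are invented for illustration.

```python
# Value iteration on an invented three-state chain MDP.

GAMMA = 0.9
states = [0, 1, 2]
actions = ["stay", "go"]

def step(s, a):
    """Return a list of (probability, next_state, reward) outcomes."""
    if s == 2:                       # terminal goal: absorbing, no more reward
        return [(1.0, 2, 0.0)]
    if a == "stay":
        return [(1.0, s, -1.0)]
    # "go" usually advances but sometimes slips and stays put
    reward = 10.0 if s + 1 == 2 else -1.0
    return [(0.9, s + 1, reward), (0.1, s, -1.0)]

def backup(s, a, V):
    """Bellman backup: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * V[ns]) for p, ns, r in step(s, a))

V = {s: 0.0 for s in states}         # arbitrary initial guess
for _ in range(200):                 # sweep until the estimates settle
    V = {s: max(backup(s, a, V) for a in actions) for s in states}

# Extract the policy: in each state, take the action with the best backup
policy = {s: max(actions, key=lambda a: backup(s, a, V)) for s in states}
print(policy)  # states 0 and 1 should both choose "go"
```

Two hundred sweeps is overkill for a problem this small; in practice you would stop once the largest value change in a sweep drops below a tolerance.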

Policy iteration takes a different approach. It starts with a random policy, fully evaluates how good that policy is (computing the value of every state under it), then improves the policy by switching each state’s action to whichever one the evaluation says is best. It repeats this evaluate-then-improve cycle until the policy stops changing. Policy iteration often converges in fewer rounds than value iteration, but each round is more expensive because it requires a full evaluation of the current policy.
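The evaluate-then-improve cycle can be sketched just as compactly. The two-state MDP below is hypothetical, with all dynamics and rewards invented; the interesting feature is that the optimal behavior turns out to be a cycle between the two states.

```python
# Policy iteration on an invented two-state MDP.
# dynamics[state][action] -> list of (probability, next_state, reward)

GAMMA = 0.9
dynamics = {
    "A": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 5.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 1.0)]},
}

def evaluate(policy, sweeps=300):
    """Fully evaluate a fixed policy by iterating its Bellman equation."""
    V = {s: 0.0 for s in dynamics}
    for _ in range(sweeps):
        V = {s: sum(p * (r + GAMMA * V[ns])
                    for p, ns, r in dynamics[s][policy[s]])
             for s in dynamics}
    return V

policy = {"A": "left", "B": "left"}          # start from an arbitrary policy
while True:
    V = evaluate(policy)                     # 1. evaluate the current policy
    improved = {s: max(dynamics[s],          # 2. greedily improve each state
                       key=lambda a: sum(p * (r + GAMMA * V[ns])
                                         for p, ns, r in dynamics[s][a]))
                for s in dynamics}
    if improved == policy:                   # 3. stop when nothing changes
        break
    policy = improved

print(policy)  # the loop A -> B -> A, earning +5 each cycle, wins out
```

On this toy problem the cycle stabilizes after a couple of rounds, but each round pays the full cost of `evaluate`, which is exactly the tradeoff against value iteration described above.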

Both methods work well when the number of states is manageable. For problems with enormous or continuous state spaces, like controlling a self-driving car where the “state” includes speed, position, and the location of every nearby object, these exact methods become impractical. That’s where approximate methods from reinforcement learning, such as deep Q-networks, step in to estimate value functions using neural networks instead of storing a value for every state individually.

Finite vs. Infinite Horizons

MDPs come in two main flavors depending on whether the decision-making has an endpoint. In a finite-horizon MDP, the process runs for a fixed number of steps and then stops. Board games are a natural example: the game ends when someone wins. In this setting, the optimal policy can change depending on how many steps remain. Early in a chess game you might play conservatively, but with only a few moves left, the best strategy shifts.

In an infinite-horizon MDP, there’s no predetermined endpoint. The agent keeps making decisions indefinitely, and the discount factor ensures that the total expected reward stays finite rather than growing without bound. Most reinforcement learning applications use this formulation. A robot managing inventory in a warehouse, for instance, doesn’t have a set retirement date.

You can actually convert a finite-horizon problem into an infinite-horizon one by adding an absorbing state, a dead end that the agent enters once time runs out and where it earns zero reward forever after. This trick lets you use the same solution methods for both types.
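One way to sketch this trick is to fold the step counter into the state and route everything into the absorbing state once the horizon is reached. The dynamics below are hypothetical and deterministic, purely to show the wrapping.

```python
# Wrapping an invented finite-horizon problem (horizon H) as an
# infinite-horizon one via an absorbing "done" state.

H = 3            # horizon of the original finite problem
DONE = "done"    # absorbing dead-end state

def base_step(s, a):
    """Hypothetical one-step dynamics of the finite problem: move, earn +1."""
    return s + a, 1.0

def wrapped_step(state, a):
    """Infinite-horizon wrapper: state is (s, t) or the absorbing DONE."""
    if state == DONE:
        return DONE, 0.0                 # zero reward forever after
    s, t = state
    ns, r = base_step(s, a)
    return (DONE, r) if t + 1 >= H else ((ns, t + 1), r)

# Roll the wrapped process forward: rewards stop after H steps
state, total = (0, 0), 0.0
for _ in range(6):
    state, r = wrapped_step(state, 1)
    total += r
print(total)  # 3.0 — only the first H = 3 steps earn reward
```

After the wrap, any infinite-horizon solver can be applied unchanged, since the absorbing state contributes nothing to the return.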

When You Can’t See the Full Picture

Standard MDPs assume you always know exactly what state you’re in. In many real problems, that’s not the case. A medical robot might know a patient’s visible symptoms but not the underlying disease. A poker-playing agent can see its own cards but not its opponents’. These situations call for a Partially Observable Markov Decision Process, or POMDP.

In a POMDP, the underlying world still follows the same rules as an MDP, but the agent receives indirect observations rather than seeing the true state. It must maintain a belief, essentially a probability distribution over which state it thinks it might be in, and update that belief as new observations come in. POMDPs can also plan information-gathering actions, like running a diagnostic test to narrow down possibilities, something standard MDPs can’t represent.
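The belief update itself is a Bayes rule computation. Here is a minimal sketch for a hypothetical two-disease diagnosis example; the prior and the observation probabilities are invented for illustration.

```python
# Bayes belief update for an invented two-state POMDP: the agent never
# sees the hidden state, only a noisy observation ("fever").

belief = {"flu": 0.5, "cold": 0.5}        # prior over hidden states

# P(observation = "fever" | hidden state)
obs_model = {"flu": 0.9, "cold": 0.2}

def update_belief(belief, obs_likelihood):
    """Reweight each state by how well it explains the observation, then normalize."""
    unnorm = {s: b * obs_likelihood[s] for s, b in belief.items()}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}

belief = update_belief(belief, obs_model)  # after observing "fever"
print(belief)  # flu becomes much more likely (roughly 0.82 vs 0.18)
```

A full POMDP update would also fold in the transition model when the agent acts; this sketch shows only the observation step, which is where the belief sharpens.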

The tradeoff is computational cost. Standard MDPs are relatively tractable to solve. POMDPs are enormously more expensive, and solving them optimally for large problems remains one of the harder challenges in AI research.

Where MDPs Are Used

MDPs show up anywhere a system needs to make a sequence of decisions under uncertainty. In robotics, they plan paths through environments where wheels might slip or sensors might misread. In networking, they route data packets through systems where link congestion fluctuates unpredictably. Elevator scheduling systems use them to decide which floors to visit next based on current and anticipated demand. Aircraft navigation, manufacturing process control, and travel route optimization are all natural MDP applications.

In finance, MDPs model portfolio decisions where market movements are uncertain but follow probabilistic patterns. In game design and game-playing AI, they provide the backbone for agents that learn strategies through trial and error. The explosion of reinforcement learning over the past decade, from AlphaGo to robotic manipulation, is built directly on the MDP framework.