What Is SARSA? On-Policy Reinforcement Learning

SARSA is a reinforcement learning algorithm whose name stands for the five elements it uses at each step: State, Action, Reward, State, Action. It teaches a software agent to make decisions by repeatedly trying actions in an environment, observing the results, and updating a score sheet that tracks how good each action is in each situation. SARSA belongs to a family of methods called temporal difference learning, and its defining trait is that it learns from the actions the agent actually takes rather than from hypothetically perfect choices.

The Five Letters Explained

Each letter in SARSA maps to one piece of information the algorithm needs before it can learn from a single experience:

  • S (State): Where the agent is right now.
  • A (Action): What the agent decides to do in that state.
  • R (Reward): The immediate feedback the environment gives back, positive or negative.
  • S (next State): Where the agent ends up after taking that action.
  • A (next Action): What the agent decides to do in the new state.

Once the algorithm has all five pieces, it updates its internal score for the original state-action pair. The “next state” and “next action” then become the current ones, and the cycle repeats until the task is done. This five-element tuple is what gives the algorithm its name.
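The cycle above can be sketched as a short loop. The environment interface (`env.reset`, `env.step`) and the `choose_action` rule are hypothetical stand-ins, not part of SARSA itself; the point is the order in which the five pieces are gathered and how the window shifts forward:

```python
def run_episode(env, choose_action):
    """Collect the (S, A, R, S', A') tuples for one episode."""
    history = []
    s = env.reset()                      # S: current state
    a = choose_action(s)                 # A: current action
    done = False
    while not done:
        r, s_next, done = env.step(a)    # R and next S from the environment
        a_next = choose_action(s_next)   # next A, chosen before any update
        history.append((s, a, r, s_next, a_next))  # one full SARSA tuple
        # ... the Q-value update for (s, a) would happen here ...
        s, a = s_next, a_next            # shift the window and repeat
    return history
```

Note that the next action is selected before the update happens; that ordering is exactly what makes the fifth element available.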

How the Score Updates Work

SARSA maintains a table of scores called Q-values. Each entry in the table corresponds to one combination of state and action. A higher Q-value means “this action in this state has historically led to better outcomes.” Every time the agent collects a full S-A-R-S-A tuple, it adjusts the Q-value for the current state-action pair using a simple formula.

The update works like this: the algorithm computes a target by adding the immediate reward to a discounted version of the Q-value for the next state-action pair, then compares that target to the current Q-value. If the target is higher, the Q-value gets nudged upward; if the target is lower, it gets nudged downward. Two settings control this process. The learning rate (often written as alpha), a value between 0 and 1, determines how large each nudge is: a learning rate of 0.81, for example, moves the Q-value 81% of the way toward the new target. The discount factor (gamma) controls how much the agent cares about future rewards versus immediate ones: a discount factor of 0.96 means future rewards are nearly as valuable as immediate ones, while a value closer to 0 would make the agent shortsighted.
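The update described above fits in a couple of lines of Python. The names here (`Q` as a dict keyed by state-action pairs, `alpha`, `gamma`) are illustrative choices, not a fixed API:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.81, gamma=0.96):
    """Adjust Q[(s, a)] given one full (S, A, R, S', A') tuple."""
    target = r + gamma * Q[(s_next, a_next)]   # reward plus discounted next Q-value
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # nudge alpha of the way toward the target
    return Q[(s, a)]
```

If the target exceeds the current value, the difference is positive and the Q-value rises; otherwise it falls, exactly as described above.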

Over hundreds or thousands of episodes, these small adjustments cause the Q-values to settle on numbers that accurately reflect how valuable each action is in each state. The agent can then pick the highest-scoring action in any given state and follow a strong policy.

On-Policy Learning: SARSA’s Core Feature

The most important thing that separates SARSA from other algorithms is that it is on-policy. This means the agent learns from the actions it actually takes, including the random exploratory moves. The policy being improved and the policy generating the data are the same.

Contrast this with Q-learning, the most well-known alternative. Q-learning is off-policy: when it updates its scores, it assumes the agent will always pick the best possible action in the next state, even if the agent didn’t actually do that. Q-learning uses the maximum Q-value from the next state regardless of what action was chosen. SARSA instead plugs in the Q-value of whatever action the agent really selected, random or not.
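The two targets can be written side by side to make the contrast concrete. Assuming a Q-table stored as a dict and a list of available actions (both illustrative):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.96):
    # On-policy: use the Q-value of the action the agent actually chose in s_next.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.96):
    # Off-policy: assume the best action will be taken in s_next,
    # regardless of what the agent actually does there.
    return r + gamma * max(Q[(s_next, a)] for a in actions)
```

When the agent's actual next action happens to be the greedy one, the two targets coincide; they diverge precisely on exploratory moves.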

This distinction has practical consequences. Because Q-learning optimistically assumes perfect future decisions, it tends to produce aggressive policies. An agent trained with Q-learning in a grid world with cliffs, for instance, will learn to walk right along the cliff edge because it assumes it will never accidentally step off. SARSA, knowing that the agent sometimes takes random exploratory steps, learns a safer path that stays further from danger. In environments where mistakes are costly, SARSA’s realistic accounting of exploration typically produces more stable behavior.

Balancing Exploration and Exploitation

SARSA commonly uses a strategy called epsilon-greedy to choose actions. Most of the time the agent picks the action with the highest Q-value (exploitation), but with a small probability, epsilon, it picks a completely random action instead (exploration). If epsilon is set to 0.1, the agent explores randomly 10% of the time and follows its best knowledge the other 90%.
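Epsilon-greedy selection is only a few lines. This sketch assumes the same dict-based Q-table as above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: pick the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])
```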

Exploration matters because the agent starts with no knowledge. If it only ever repeated its first successful strategy, it might miss a far better one. Random moves force the agent to try unfamiliar state-action pairs, filling in blank spots in the Q-table. Over time, epsilon is often reduced so the agent explores less and exploits more as its knowledge improves. SARSA is guaranteed to converge to an optimal policy provided every state-action pair is visited sufficiently often and epsilon gradually shrinks toward zero, for example by setting epsilon to 1 divided by the episode number.
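A decay schedule matching the 1-divided-by-episode example might look like this (the floor parameter is an added practical detail, commonly used to keep a little exploration alive):

```python
def epsilon_for(episode, minimum=0.01):
    """Epsilon shrinks as 1/episode, never dropping below `minimum`."""
    return max(minimum, 1.0 / episode)
```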

Walking Through One Episode

Imagine a robot navigating a simple maze. The algorithm begins by initializing all Q-values to zero (or small random numbers). At the start of an episode, the robot observes its current state, say the top-left corner of the maze, and picks an action using the epsilon-greedy rule. Suppose it moves right.

After moving, the robot receives a reward (maybe -1 for each step, to encourage finding the exit quickly) and lands in a new state. Before updating anything, the robot also selects its next action from this new state using the same epsilon-greedy rule. Now it has all five elements: original state, original action, reward, new state, new action. It runs the update formula, adjusting the Q-value for “top-left corner, move right.”

The new state and new action then become the current state and current action, the robot takes the step, and the cycle continues. When the robot reaches the maze exit (a terminal state), the episode ends. A new episode starts from the beginning, and the robot gradually builds better Q-values. After enough episodes, the Q-table encodes a reliable map of the best action to take in every position.
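The whole procedure can be assembled into a runnable sketch. To keep it self-contained, the environment here is a made-up one-dimensional corridor (states 0 through 4, exit at state 4, reward -1 per step) rather than the maze from the walkthrough; all names and hyperparameter values are illustrative:

```python
import random

def train_sarsa(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA on a toy corridor. Actions are "left"/"right";
    the episode ends when the agent reaches the rightmost state."""
    random.seed(seed)
    actions = ["left", "right"]
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}  # all zeros to start

    def choose(s):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def step(s, a):
        # Deterministic corridor dynamics: -1 per step, done at the right end.
        s2 = min(n_states - 1, s + 1) if a == "right" else max(0, s - 1)
        return -1, s2, s2 == n_states - 1

    for _ in range(episodes):
        s = 0
        a = choose(s)
        done = False
        while not done:
            r, s2, done = step(s, a)
            a2 = choose(s2)            # next action chosen before the update
            # SARSA update: the target uses the action actually chosen in s2.
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
        # A new episode starts from the beginning, reusing the same Q-table.
    return Q
```

After training, reading off the highest-scoring action in each state recovers the obvious policy for this corridor: always move right.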

Where SARSA Is Used

SARSA works well in environments with discrete states and actions where safety or realistic behavior matters. Classic applications include grid-world navigation problems, game playing, and simple robotic control tasks. It is often one of the first algorithms taught in reinforcement learning courses because it clearly illustrates how temporal difference methods work and because its on-policy nature makes the relationship between exploration and learning easy to observe.

For problems with very large or continuous state spaces, tabular SARSA becomes impractical because the Q-table would need too many entries. In those cases, function approximation methods (like neural networks) can replace the table, leading to variants sometimes called deep SARSA. The core update logic stays the same; only the way Q-values are stored and estimated changes.