SARSA

Q-learning asks what the best possible next action is, while SARSA asks what it will actually do next. That single difference changes everything about how the algorithm learns, what it converges to, and where an adversary can intervene.

The previous article in this series covered Q-learning, an algorithm that maintains a lookup table and teaches itself optimal decisions by always assuming it will act optimally in the future. SARSA drops that assumption. It updates its value estimates based on the action the agent actually takes, including the exploratory ones. The result is an algorithm that learns the value of the policy it is following, not the policy it wishes it were following.

This is the twelfth entry in the AI red teaming series. If Q-learning taught us how an RL agent learns an idealised strategy, SARSA teaches us what happens when the agent’s own uncertainty gets baked into the values it trusts.

The update rule, and why it matters

Both Q-learning and SARSA maintain a Q-table, which is a matrix mapping every state-action pair to an estimated value. Both update that table iteratively as the agent interacts with the environment. The difference is in what goes into the update.

Q-learning’s update rule:

Q(s, a) <- Q(s, a) + α * (r + γ * max Q(s', a*) - Q(s, a))

SARSA’s update rule:

Q(s, a) <- Q(s, a) + α * (r + γ * Q(s', a') - Q(s, a))

The distinction sits in the bootstrap term, the one multiplied by γ. Q-learning uses max Q(s', a*), where the maximum is taken over every action a* available in the next state: the highest Q-value on offer, regardless of what the agent will actually do. SARSA uses Q(s', a'): the Q-value of the action the agent actually selects in the next state, following whatever policy it is currently running.

In practical terms, Q-learning updates its values as though it will behave optimally from the next step onwards. SARSA updates its values based on what it will actually do, exploration and all. If the agent is running an epsilon-greedy policy with epsilon at 0.1, then 10% of the time it will take a random action. Q-learning ignores that randomness when updating. SARSA does not.
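The two rules can be compared side by side in a few lines of Python. This is a minimal sketch, assuming a Q-table stored as a dictionary keyed by (state, action) tuples. The function names, the state labels "start" and "edge", and the numbers are illustrative assumptions, not part of any library.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually chose in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def fresh_table():
    # Hypothetical starting values: from "edge", one action is good and one is a disaster.
    Q = defaultdict(float)
    Q[("edge", "forward")] = 10.0
    Q[("edge", "jump")] = -100.0
    return Q

actions = ["forward", "jump"]

# Same transition (start --forward--> edge, reward -1) fed to both rules.
q_table = fresh_table()
q_learning_update(q_table, "start", "forward", -1.0, "edge", actions)

# Assume exploration happened to pick the bad action "jump" in the next state.
sarsa_table = fresh_table()
sarsa_update(sarsa_table, "start", "forward", -1.0, "edge", "jump")

print(q_table[("start", "forward")])      # 4.0
print(sarsa_table[("start", "forward")])  # -45.5
```

With identical experience, Q-learning raises the value of moving toward the edge because it assumes the good action will follow. SARSA lowers it because, on this occasion, the exploratory action is what actually followed.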

On-policy vs off-policy

This is the formal distinction. Q-learning is off-policy because it learns the value of the optimal policy while following a different exploratory policy. In contrast, SARSA is on-policy, meaning it learns the value of the policy it is currently executing.

The implications for how each algorithm behaves are significant.

Q-learning tends toward optimistic value estimates. Because it always references the maximum Q-value in the next state, it evaluates every state as though the agent will make the best possible decision from that point forward. In environments with stochastic transitions or risky states near high-reward paths, this optimism can cause the agent to repeatedly walk into danger. It has learned that the optimal path is valuable, but it has not accounted for the fact that its own exploration will occasionally send it off a cliff.

SARSA produces more conservative estimates. Because it factors in the actual next action, including random exploratory moves, it learns values that reflect the true expected return under the current policy. If the current policy sometimes stumbles into a penalty state, SARSA’s Q-values for nearby states will be lower. The agent learns to give dangerous areas a wider berth.

The classic illustration is a cliff-walking grid. The optimal path runs along the edge of a cliff. Q-learning learns that the edge path is optimal and walks it, but the agent’s own epsilon-greedy exploration occasionally pushes it off the cliff during training. SARSA learns to avoid the cliff edge entirely, taking a longer but safer route, because its value estimates account for the probability that exploration will send it over the edge.
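The cliff-walking comparison is small enough to reproduce with the standard library. The sketch below assumes a 4 x 12 grid with the start in the bottom-left corner, the goal in the bottom-right corner, a reward of -1 per step, and -100 plus a reset to the start for stepping off the cliff. The hyperparameters and function names are assumptions chosen for illustration.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Move one cell, clamped to the grid. Cliff cells cost -100 and reset to START."""
    r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
    if r == 3 and 0 < c < 11:
        return START, -100, False
    return (r, c), -1, (r, c) == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = Q[state]
    return values.index(max(values))

def train(on_policy, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular SARSA (on_policy=True) or Q-learning (on_policy=False)."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * len(ACTIONS) for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        s = START
        a = epsilon_greedy(Q, s, epsilon, rng)
        total, done = 0, False
        while not done:
            s2, reward, done = step(s, a)
            # Simplification: the next action is drawn before the update for both
            # algorithms. This is exact for SARSA; for Q-learning it differs from
            # the textbook loop only when the agent bumps a wall and stays put.
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            if done:
                bootstrap = 0.0
            elif on_policy:
                bootstrap = Q[s2][a2]   # SARSA: the action actually selected
            else:
                bootstrap = max(Q[s2])  # Q-learning: the best action available
            Q[s][a] += alpha * (reward + gamma * bootstrap - Q[s][a])
            s, a = s2, a2
            total += reward
        returns.append(total)
    return Q, returns

sarsa_Q, sarsa_returns = train(on_policy=True)
qlearn_Q, qlearn_returns = train(on_policy=False)
print("SARSA mean return, last 100 episodes:", sum(sarsa_returns[-100:]) / 100)
print("Q-learning mean return, last 100 episodes:", sum(qlearn_returns[-100:]) / 100)
```

In runs of this kind one would typically expect SARSA to collect a better average return during training, because it pays for fewer falls, while Q-learning's greedy path hugs the edge. The exact figures depend on the seed and hyperparameters, so treat the printed numbers as indicative only.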

Where SARSA appears in practice

SARSA’s conservatism makes it a natural fit for environments where safety during training matters, not just safety of the final policy. Robotics is the most obvious application because a physical robot cannot afford to explore a strategy that involves collisions, even temporarily. SARSA’s on-policy learning means the agent avoids high-risk regions even while it is still learning, because the Q-values themselves encode the risk of the exploration policy.

Autonomous driving simulation, industrial control, and any RL application where the training environment is the production environment (or close to it) tend to favour on-policy methods. The learned policy is never more aggressive than the agent’s actual behaviour during training.

In security-adjacent contexts, SARSA-style on-policy learning shows up in adaptive intrusion detection systems, dynamic firewall rule optimisation, and automated incident response workflows where the agent cannot afford to “explore” by ignoring a genuine alert.

The adversarial surface

Understanding SARSA’s on-policy nature is where the red teaming angle sharpens. The attack surfaces are different from Q-learning, and in some cases more accessible.

Epsilon manipulation

SARSA’s learned Q-values are a direct function of the exploration rate. Change epsilon, and you change the policy the algorithm learns. If epsilon is high, the agent explores aggressively, and SARSA’s Q-values will reflect the frequent penalties from random actions. The resulting policy will be overly cautious. If epsilon is artificially low, the agent locks in on early good-enough paths and never discovers better ones.

An attacker who can influence the exploration rate does not need to touch the reward signal or the environment dynamics. They can degrade the learned policy purely by skewing how much the agent explores. This is a subtler attack than reward poisoning and harder to detect, because the training loop still looks normal even though the agent is learning the wrong thing.
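The dependence on epsilon can be made concrete by computing the average target SARSA bootstraps from when the next action is drawn epsilon-greedily. This is a sketch under assumptions: the helper name and the four next-state Q-values (one of which stands in for a cliff-like penalty) are invented for illustration. SARSA itself samples a single next action per update; the function returns the expectation of that sample.

```python
def expected_on_policy_target(next_q_values, reward, epsilon, gamma=1.0):
    """Average SARSA target when the next action is chosen epsilon-greedily.

    With probability (1 - epsilon) the greedy action is taken; with probability
    epsilon an action is drawn uniformly from all actions, greedy one included.
    """
    greedy_value = max(next_q_values)
    uniform_value = sum(next_q_values) / len(next_q_values)
    expected_next = (1 - epsilon) * greedy_value + epsilon * uniform_value
    return reward + gamma * expected_next

# Hypothetical next-state values: three reasonable actions and one disaster.
next_q = [-10.0, -12.0, -100.0, -14.0]

for eps in (0.0, 0.1, 0.5):
    print(eps, expected_on_policy_target(next_q, reward=-1.0, epsilon=eps))
# Targets are approximately -11.0, -13.4 and -23.0 respectively.
```

The Q-learning target for the same transition is -11.0 whatever epsilon is, because it only looks at the maximum. The on-policy target moves with epsilon, which is why control over the exploration rate is control over the learned values.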

Policy stagnation attacks

Because SARSA evaluates the policy it is currently following, it is susceptible to a feedback loop: a suboptimal policy generates Q-values that reinforce the suboptimal policy. If an attacker can nudge the agent into a poor early policy (through initial state manipulation, reward shaping in the first few episodes, or environmental setup that makes bad paths look acceptable early on), SARSA may never recover. Q-learning is more resilient to this because it evaluates the optimal action regardless of what the agent is currently doing.

On-policy reward poisoning

Reward poisoning works differently against SARSA than against Q-learning. With Q-learning, a poisoned reward that inflates one Q-value is picked up by the max operator in every update that bootstraps from that state, so it travels along the max Q-value propagation chain and can shift the entire learned policy toward a target state. With SARSA, the poisoned value propagates backward only through the actions the agent actually selects under its current policy. This means reward poisoning against SARSA requires more targeted placement: the attacker needs to poison rewards along paths the agent is actually taking, not paths it theoretically could take.

The flip side is that once SARSA’s policy is successfully poisoned, the on-policy feedback loop makes recovery harder. The agent’s Q-values are entangled with its exploration behaviour. Correcting the reward does not immediately correct the values, because the agent is still following the poisoned policy and therefore still visiting the poisoned state-action pairs.

Exploration strategy substitution

SARSA’s choice of exploration strategy (epsilon-greedy, softmax, or other methods) is part of the algorithm’s attack surface in a way it is not for Q-learning.

Epsilon-greedy with a fixed epsilon means the agent always has a flat probability of taking a random action, and SARSA’s Q-values permanently reflect that randomness. An attacker who can substitute the exploration function (replacing epsilon-greedy with a biased sampling method, for example) directly shapes the Q-values without touching the reward or transition dynamics.

Softmax exploration assigns each action a probability proportional to the exponential of its Q-value (usually scaled by a temperature parameter), which creates a tighter coupling between the learned values and the exploration behaviour. For an attacker, this means that small perturbations to early Q-values can cascade through the softmax distribution and shift exploration patterns for the rest of training. The feedback loop between softmax probabilities and SARSA’s on-policy updates makes this self-reinforcing.
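A short sketch shows how sensitive softmax exploration can be to a small change in one Q-value. It assumes the common Boltzmann form, in which each action's probability is proportional to exp(Q / temperature). The function name, the temperature of 0.5, and the Q-values are illustrative assumptions.

```python
import math

def softmax_policy(q_values, temperature=1.0):
    """Boltzmann action probabilities: p(a) is proportional to exp(Q(a) / temperature)."""
    scaled = [q / temperature for q in q_values]
    peak = max(scaled)  # subtracting the max avoids overflow without changing the result
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Three actions with identical values: exploration is uniform.
clean = softmax_policy([1.0, 1.0, 1.0], temperature=0.5)

# An early perturbation adds 0.5 to the second action's value.
nudged = softmax_policy([1.0, 1.5, 1.0], temperature=0.5)

print([round(p, 3) for p in clean])   # [0.333, 0.333, 0.333]
print([round(p, 3) for p in nudged])  # [0.212, 0.576, 0.212]
```

A nudge of 0.5 moves the second action from a third of the probability mass to roughly 58%. Under SARSA that action is then both sampled and updated more often, which is the self-reinforcing loop described above. Lower temperatures make the effect sharper.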

Convergence and the parameters that control it

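SARSA converges to an optimal policy under three conditions: the learning rate decays to zero over time, but slowly enough that the agent can still correct early errors (formally, the step sizes sum to infinity while their squares sum to a finite value); the exploration strategy visits every state-action pair infinitely often; and the policy becomes greedy in the limit, for example by decaying epsilon toward zero. With a fixed epsilon, the learned values keep reflecting the exploratory actions, so the result is at best a strong epsilon-greedy policy, not the optimal one. In practice, “infinitely often” means “enough times that the estimates stabilise.”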
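One textbook way to meet conditions of this kind is to decay the learning rate per state-action pair and to decay the exploration rate per episode, so that exploration never stops entirely but the policy becomes greedy in the limit. The schedules below (1/n and 1/k) are a sketch of that idea, not a recommendation for production training, and the names are assumptions.

```python
from collections import Counter

visits = Counter()  # how many times each (state, action) pair has been updated

def next_alpha(state, action):
    """alpha_n = 1/n for the n-th update of this pair.

    The sum of 1/n diverges while the sum of 1/n^2 is finite, which is the
    usual step-size requirement for stochastic approximation.
    """
    visits[(state, action)] += 1
    return 1.0 / visits[(state, action)]

def exploration_rate(episode):
    """epsilon_k = 1/k for episode k >= 1: always positive, but tends to zero."""
    return 1.0 / episode

# Hypothetical usage inside a training loop.
print(next_alpha("s0", "left"))   # 1.0
print(next_alpha("s0", "left"))   # 0.5
print(exploration_rate(10))       # 0.1
```

Schedules like these are often too slow to be practical, which is why deployed systems tend to use fixed or floor-limited values. That choice is exactly what leaves the exploration rate encoded in the learned values.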

Two parameters dominate the convergence behaviour:

The learning rate (α) controls how aggressively the agent updates its Q-values. High α means the agent reacts strongly to each new experience, which speeds learning but introduces instability. Low α means the agent updates slowly, which stabilises learning but can trap it in suboptimal policies if early experience is misleading.

The discount factor (γ) controls how much the agent values future rewards relative to immediate ones. A γ close to 1 means the agent plans long-term. A γ close to 0 means the agent is myopic, chasing immediate payoffs. In security applications, γ determines whether an RL-driven system optimises for the current alert or for the long-term security posture.

For an adversary, these parameters are levers. If the learning rate is too high and you can inject a few high-magnitude rewards early in training, those values dominate the Q-table before the agent has seen enough data to stabilise. If the discount factor is high and you can create an attractive nuisance several steps away from the current state, SARSA will propagate that false reward backward through the Q-table.
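Both levers reduce to simple arithmetic. The sketch below assumes a single injected reward of 100, a Q-value initialised at zero, and a lure whose intermediate steps carry zero reward. The function names and numbers are illustrative.

```python
def value_after_first_update(reward, alpha, initial_q=0.0):
    """One tabular update with no bootstrap term: Q <- Q + alpha * (r - Q)."""
    return initial_q + alpha * (reward - initial_q)

def lure_value_at_distance(reward, gamma, steps):
    """Value contributed by a reward that is discounted `steps` times on the way back."""
    return (gamma ** steps) * reward

# Learning-rate lever: how much of one poisoned reward lands in the table at once.
print(value_after_first_update(100.0, alpha=0.9))  # about 90.0
print(value_after_first_update(100.0, alpha=0.1))  # about 10.0

# Discount-factor lever: how visible a false reward is from five steps away.
print(lure_value_at_distance(100.0, gamma=0.99, steps=5))  # about 95.1
print(lure_value_at_distance(100.0, gamma=0.5, steps=5))   # 3.125
```

With a high learning rate, a single injected reward writes most of its magnitude into the table immediately. With a high discount factor, a false reward remains almost fully visible several steps upstream, so it can pull the policy from a distance.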

The assumptions SARSA inherits

Like Q-learning, SARSA assumes the environment satisfies the Markov property, which means the next state depends only on the current state and action rather than on the history of previous states and actions. It also assumes the environment is stationary, meaning transition probabilities and reward functions do not change over time.

Both assumptions are exploitable. The Markov assumption means that any environmental context the agent cannot observe in its current state is invisible to its decision-making. An attacker who can create state aliasing (making dangerous states look identical to safe ones in the agent’s state representation) can cause SARSA to assign identical Q-values to states with very different risk profiles.
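State aliasing is easy to demonstrate with a toy state representation. In the sketch below, the `observe` function, the traffic fields, and the reward of 1.0 for allowing benign traffic are all hypothetical. The point is only that two different true states collapse into one Q-table key.

```python
from collections import defaultdict

def observe(true_state):
    """Hypothetical feature extractor: keeps port and protocol, drops everything else."""
    return (true_state["port"], true_state["protocol"])

benign = {"port": 443, "protocol": "tcp", "payload_is_malicious": False}
hostile = {"port": 443, "protocol": "tcp", "payload_is_malicious": True}

Q = defaultdict(float)
alpha = 0.5

# The agent is rewarded three times for allowing benign traffic.
# Each step is treated as terminal, so there is no bootstrap term.
for _ in range(3):
    key = (observe(benign), "allow")
    Q[key] += alpha * (1.0 - Q[key])

# The hostile flow maps to the same key, so it inherits the same value.
print(observe(benign) == observe(hostile))  # True
print(Q[(observe(hostile), "allow")])       # 0.875
```

Everything the agent learned from benign traffic now applies to the hostile flow, because the distinguishing field never reaches the table.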

The stationarity assumption means SARSA does not expect the environment to change underneath it. An attacker who modifies the environment dynamics after training can invalidate the entire Q-table without the agent ever knowing. The agent continues to follow its learned policy, now operating on stale value estimates in a world that no longer matches them.

What SARSA teaches the red teamer

The Q-learning article in this series showed that an RL agent’s value estimates are the primary attack surface. SARSA adds a layer where those value estimates are entangled with the agent’s own behaviour. In Q-learning, you can separate “what the agent does” from “what the agent thinks is optimal.” In SARSA, those two things are the same.

This entanglement is SARSA’s defining property and its defining vulnerability. Every aspect of the agent’s exploration strategy, its parameter settings, and its early training experience gets permanently encoded into the Q-values it trusts. An attacker who understands on-policy learning does not need a dramatic intervention. A small, early nudge can compound through the feedback loop between policy and value estimates until the agent is confidently following a path the attacker designed.

The next time you encounter an RL system deployed in a safety-critical environment and someone tells you it uses on-policy learning for stability, ask which training episodes shaped its initial Q-values. The answer to that question tells you more about the system’s real security posture than any description of its architecture.
