Q-learning

An agent sits in a grid. It knows nothing about the grid. It picks a direction, moves, and receives a number. That number is the only feedback it will ever get. After thousands of iterations, the agent navigates the grid perfectly.

The previous article in this series covered reinforcement learning at the conceptual level. It explored agents, environments, rewards, and the fundamental vulnerability that RL optimises for whatever signal you give it. Q-learning is where that abstraction becomes concrete. It is the simplest algorithm that turns the RL loop into a working system, and understanding its internals reveals exactly how an attacker would go about corrupting one.

This is the eleventh entry in the AI red teaming series. We are moving from “what is reinforcement learning” to “how does a specific RL algorithm actually learn, and where does that process break.”

What Q-learning actually does

Q-learning is a model-free reinforcement learning algorithm. “Model-free” means the agent does not build an internal representation of how the environment works. It does not learn transition probabilities or predict what state will follow from a given action. It learns something narrower and more direct: an estimate of the cumulative reward it should expect for each action it can take in each state it can be in.

That expected reward is called the Q-value. The “Q” stands for quality. A high Q-value for a state-action pair means the agent has learned, through repeated experience, that taking that action in that state leads to good outcomes.

The distinction from model-based methods matters for red teaming. A model-based agent builds a map of the world and plans against it. A model-free agent has no map. It has a cheat sheet. If you corrupt the map, a model-based agent might detect inconsistencies. If you corrupt the cheat sheet, a model-free agent has nothing to cross-reference against. It just follows the corrupted values.

The Q-table

The Q-table is a two-dimensional structure. Rows are states. Columns are actions. Each cell holds the Q-value for that state-action pair.

For a grid world where a robot can move in four directions, it looks like this:

State	Up	Down	Left	Right
S1	-1.0	0.0	-0.5	0.2
S2	0.0	1.0	0.0	-0.3
S3	0.5	-0.5	1.0	0.0
S4	-0.2	0.0	-0.3	1.0

The agent’s entire policy lives in this table. At any given state, it looks up the row, finds the action with the highest Q-value, and takes it. In S2, the agent moves down (Q-value 1.0). In S3, it moves left (Q-value 1.0). In S4, it moves right (Q-value 1.0).
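The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the table is stored as a dictionary of rows; the names Q_TABLE, ACTIONS, and greedy_action are invented for this example and do not come from any particular library.

```python
# Q-table from the example above: one row per state, one value per action.
ACTIONS = ["Up", "Down", "Left", "Right"]
Q_TABLE = {
    "S1": [-1.0, 0.0, -0.5, 0.2],
    "S2": [0.0, 1.0, 0.0, -0.3],
    "S3": [0.5, -0.5, 1.0, 0.0],
    "S4": [-0.2, 0.0, -0.3, 1.0],
}


def greedy_action(q_table, state):
    """Return the action with the highest Q-value in the given state's row."""
    row = q_table[state]
    best_index = max(range(len(row)), key=lambda i: row[i])
    return ACTIONS[best_index]


# The whole policy is just one lookup per state.
policy = {state: greedy_action(Q_TABLE, state) for state in Q_TABLE}
print(policy)
# {'S1': 'Right', 'S2': 'Down', 'S3': 'Left', 'S4': 'Right'}
```

Nothing else is consulted at decision time, which is why the table contents matter so much in the sections that follow.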

From a red teaming perspective, the Q-table is the target. It is a flat lookup table, and every value in it is readable, interpretable, and (if you have access) writable. Change one cell, and you change one decision. Change a column, and you redirect the agent’s behaviour across every state. The Q-table is both the agent’s brain and its single point of failure.
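To make the cell-versus-column point concrete, here is a small sketch that tampers with copies of the example table and reports which states change their preferred action. It assumes an attacker with direct write access to the table; the overwritten values (2.0 and 5.0) and the helper name greedy_policy are arbitrary choices for illustration.

```python
import copy

ACTIONS = ["Up", "Down", "Left", "Right"]
Q_TABLE = {
    "S1": [-1.0, 0.0, -0.5, 0.2],
    "S2": [0.0, 1.0, 0.0, -0.3],
    "S3": [0.5, -0.5, 1.0, 0.0],
    "S4": [-0.2, 0.0, -0.3, 1.0],
}


def greedy_policy(q_table):
    """Map every state to the action with the highest Q-value in its row."""
    return {
        state: ACTIONS[max(range(len(row)), key=lambda i: row[i])]
        for state, row in q_table.items()
    }


clean = greedy_policy(Q_TABLE)

# Tamper with a single cell: make Left look attractive in S2 only.
one_cell = copy.deepcopy(Q_TABLE)
one_cell["S2"][ACTIONS.index("Left")] = 2.0

# Tamper with a whole column: make Up look attractive in every state.
one_column = copy.deepcopy(Q_TABLE)
for row in one_column.values():
    row[ACTIONS.index("Up")] = 5.0

changed_by_cell = [s for s in Q_TABLE if greedy_policy(one_cell)[s] != clean[s]]
changed_by_column = [s for s in Q_TABLE if greedy_policy(one_column)[s] != clean[s]]

print(changed_by_cell)    # ['S2']
print(changed_by_column)  # ['S1', 'S2', 'S3', 'S4']
```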

How Q-values update

The Q-table does not arrive fully formed. It starts empty (usually initialised to zeros) and is updated through experience. Each time the agent takes an action and observes the result, it adjusts the relevant Q-value using a specific update rule:

Q(s, a) = Q(s, a) + α * [r + γ * max(Q(s', a')) - Q(s, a)]

Five components, each with a distinct role.

Q(s, a) is the current Q-value for taking action a in state s. This is what the agent currently believes about this state-action pair.

α (alpha) is the learning rate. It controls how much weight the agent gives to new information versus old. An alpha of 0.1 means the agent adjusts its belief by 10% of the difference between what it expected and what it observed. An alpha of 1.0 means the agent completely overwrites the old value with the new one.

r is the immediate reward received after taking the action. This is the environment’s feedback signal.

γ (gamma) is the discount factor, between 0 and 1. It determines how much the agent values future rewards relative to immediate ones. A gamma of 0.9 means a reward one step in the future is worth 90% of the same reward received now. A gamma of 0.1 means the agent is almost entirely short-sighted.

max(Q(s', a')) is the highest Q-value available in the next state. This is the agent’s estimate of the best possible future from the state it just landed in.

The update rule works by calculating the difference between what the agent expected (the current Q-value) and what it actually experienced (the immediate reward plus the discounted best future). That difference, scaled by the learning rate, is added to the current value.

Here is the update in practice. The robot is in S1, takes action Right, and arrives in S2 with a reward of 0.5. Learning rate is 0.1, discount factor is 0.9.

Q(S1, Right) = 0.2 + 0.1 * [0.5 + 0.9 * 1.0 - 0.2]
Q(S1, Right) = 0.2 + 0.1 * 1.2
Q(S1, Right) = 0.32

The Q-value for (S1, Right) moves from 0.2 to 0.32. The agent now believes that moving right from S1 is slightly better than it previously thought.
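The same arithmetic can be written as a small function. This is a sketch of the update rule as stated above, not a reference implementation. The intermediate names td_target and td_error follow common temporal-difference terminology and are not used in the text.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update and return the new Q(s, a)."""
    # What the agent actually experienced: reward plus discounted best future.
    td_target = reward + gamma * max_q_next
    # Gap between experience and the agent's current belief.
    td_error = td_target - q_sa
    # Move the belief a fraction alpha of the way toward the experience.
    return q_sa + alpha * td_error


# Worked example from the text: (S1, Right) lands in S2, whose best value is 1.0.
new_q = q_update(q_sa=0.2, reward=0.5, max_q_next=1.0, alpha=0.1, gamma=0.9)
print(round(new_q, 2))  # 0.32
```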

Where the parameters become attack surface

Each of the five components in the update rule is a potential attack vector if the adversary can influence it.

Reward manipulation is the most direct. The previous article covered this at the conceptual level. In Q-learning, the mechanism is specific: every reward signal the environment returns directly modifies a Q-value. If an attacker can inject false rewards (or modify legitimate ones in transit), they can inflate Q-values for actions that serve their objective and deflate values for actions that don’t. The agent will converge on a policy the attacker designed, and it will do so through the algorithm’s own update mechanism. From the agent’s side, the corruption is indistinguishable from normal learning, because a poisoned reward passes through exactly the same update as a legitimate one.

Learning rate exploitation is subtler. A high alpha makes the agent responsive to recent experience, which means a short burst of manipulated rewards can overwrite thousands of episodes of legitimate learning. A low alpha makes the agent resistant to manipulation but also slow to adapt to genuine environmental changes. An attacker who knows the learning rate can calculate exactly how many corrupted episodes are needed to shift a Q-value by a specific amount.
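That calculation is simple when the attacker can hold the update target fixed. If every corrupted update sees the same value T for r + γ * max(Q(s', a')), then after n updates Q_n = T + (Q_0 - T) * (1 - α)^n. The sketch below counts updates by simulation under that simplifying assumption; the function name and the example numbers are hypothetical.

```python
def updates_to_reach(q_start, q_goal, poisoned_target, alpha, max_updates=100_000):
    """Count repeated updates toward a fixed poisoned target until q_goal is met.

    Assumes the attacker is inflating a value (q_start < q_goal < poisoned_target)
    and that the target stays constant across updates.
    """
    q = q_start
    for n in range(max_updates + 1):
        if q >= q_goal:
            return n
        q = q + alpha * (poisoned_target - q)
    return None  # goal not reached within max_updates


# Push a Q-value from 0.0 past 1.0 using a poisoned target of 2.0.
for alpha in (0.5, 0.1, 0.01):
    print(alpha, updates_to_reach(0.0, 1.0, 2.0, alpha))
# 0.5 1
# 0.1 7
# 0.01 69
```

The gap between one update and sixty-nine is the trade-off described above: a responsive agent is cheap to poison, and a stubborn one is slow to adapt.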

Discount factor abuse targets the agent’s time horizon. An agent with a high gamma (0.99) plans far ahead and is harder to redirect with short-term reward spikes. An agent with a low gamma (0.1) is myopic and can be lured into locally rewarding traps. If an attacker can modify gamma, they can collapse the agent’s planning horizon entirely, making it chase immediate rewards while ignoring long-term consequences.
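A rough way to see the effect is to compare a small immediate reward against a larger delayed one under different values of gamma. The scenario below (a bait worth 1.0 now versus a goal worth 10.0 five steps away) is invented for illustration, and it ignores everything except the discounting itself.

```python
def discounted_value(reward, delay, gamma):
    """Present value of a reward received `delay` steps in the future."""
    return (gamma ** delay) * reward


IMMEDIATE_BAIT = 1.0   # hypothetical reward the attacker places right next to the agent
DELAYED_GOAL = 10.0    # hypothetical legitimate reward
GOAL_DELAY = 5         # steps needed to reach the legitimate reward


def preferred(gamma):
    """Which option looks better to an agent with this discount factor."""
    goal_value = discounted_value(DELAYED_GOAL, GOAL_DELAY, gamma)
    return "goal" if goal_value > IMMEDIATE_BAIT else "bait"


for gamma in (0.99, 0.5, 0.1):
    value = discounted_value(DELAYED_GOAL, GOAL_DELAY, gamma)
    print(f"gamma={gamma}: goal is worth {value:.4f} now, agent prefers the {preferred(gamma)}")
```

With gamma at 0.99 the delayed goal keeps most of its value; at 0.5 and below, the bait already wins.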

State poisoning targets the input side. If the attacker can corrupt the agent’s perception of which state it is in (by manipulating sensor data, altering observations, or modifying state encoding), the agent will update Q-values for the wrong state. The table entry that gets modified has nothing to do with the agent’s actual situation. Over time, this corrupts the Q-table without ever touching the reward signal.
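The sketch below shows the mechanics under a simple assumption: the agent is really in S1, but a tampered observation reports S3. The update is computed correctly and written to the wrong row. The function and variable names are hypothetical.

```python
ACTIONS = ["Up", "Down", "Left", "Right"]
q_table = {
    "S1": [-1.0, 0.0, -0.5, 0.2],
    "S2": [0.0, 1.0, 0.0, -0.3],
    "S3": [0.5, -0.5, 1.0, 0.0],
    "S4": [-0.2, 0.0, -0.3, 1.0],
}


def apply_update(table, observed_state, action, reward, observed_next_state,
                 alpha=0.1, gamma=0.9):
    """Standard Q-learning update, keyed on whatever state the agent *observed*."""
    a = ACTIONS.index(action)
    best_next = max(table[observed_next_state])
    old = table[observed_state][a]
    table[observed_state][a] = old + alpha * (reward + gamma * best_next - old)


true_state = "S1"      # where the agent actually is
spoofed_state = "S3"   # what the tampered sensor reports

# The agent moves Right, lands in S2, and receives a genuine reward of 0.5.
apply_update(q_table, spoofed_state, "Right", 0.5, "S2")

print(q_table[true_state][ACTIONS.index("Right")])               # 0.2, never updated
print(round(q_table[spoofed_state][ACTIONS.index("Right")], 2))  # 0.14, was 0.0
```

The reward was honest and the arithmetic was correct, yet the experience from S1 is now recorded against S3.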

Next-state manipulation targets the max(Q(s', a')) term. If the attacker can influence which state the agent transitions to after an action (by modifying the environment’s dynamics), they control the future-reward estimate that feeds into every update. This is harder to achieve than reward manipulation, but more powerful: it corrupts the temporal credit assignment that makes Q-learning work.

The algorithm step by step

The full Q-learning loop runs as follows.

1. Initialise the Q-table (usually all zeros). Every state-action pair starts with no preference.
2. Observe the current state.
3. Choose an action using the exploration strategy (more on this below).
4. Execute the action, observe the reward and the new state.
5. Update the Q-value for the (state, action) pair using the Bellman update rule.
6. Set the current state to the new state.
7. Repeat from step 3 until the Q-values converge or a stopping condition triggers.
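The loop above can be sketched end to end on a toy problem. The environment here is an assumption made for illustration: a five-cell corridor where the agent starts at cell 0 and receives a reward of 1.0 on reaching cell 4. The hyperparameters, the fixed epsilon, the random tie-breaking, and the episode cap are all arbitrary choices, not recommendations.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2


def step(state, action_index):
    """Toy environment: reward 1.0 on reaching the goal cell, otherwise 0.0."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q_row, rng):
    """Epsilon-greedy choice, breaking ties between equal Q-values at random."""
    if rng.random() < EPSILON:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([i for i, q in enumerate(q_row) if q == best])


def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]           # initialise to zeros
    for _ in range(episodes):
        state = 0                                        # observe the start state
        for _ in range(max_steps):
            a = choose_action(q[state], rng)             # choose an action
            next_state, reward, done = step(state, a)    # execute and observe
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])  # update the Q-value
            state = next_state                           # move to the new state
            if done:                                     # stopping condition
                break
    return q


q = train()
for s in range(N_STATES - 1):
    print(s, [round(v, 3) for v in q[s]])
```

After training, the Right column should dominate the Left column in every non-goal cell, which is the learned policy of walking toward the reward.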

Convergence means the Q-values stop changing significantly between updates. At that point, the agent has learned a stable policy: for each state, it knows which action maximises expected cumulative reward.

The convergence guarantee assumes infinite exploration (every state-action pair is visited infinitely often) and a learning rate that decays toward zero, but slowly enough that later updates can still correct earlier ones. In practice, neither condition is perfectly met, which means real Q-learning agents converge to approximate policies. Those approximations are where edge cases live, and edge cases are where adversarial inputs have outsized impact.

Exploration versus exploitation

A Q-learning agent faces a recurring decision: should it take the action with the highest known Q-value (exploitation), or try something it knows less about (exploration)?

Pure exploitation converges fast but gets stuck in local optima. The agent finds a decent path and never discovers a better one. Pure exploration never converges at all. The agent keeps trying random actions and never settles on a strategy.

The standard solution is the epsilon-greedy strategy. With probability epsilon, the agent takes a random action. With probability 1 minus epsilon, it takes the greedy action (the one with the highest Q-value).

The value of epsilon is typically annealed over time. Early in training, epsilon is high (0.9 or above) and the agent explores aggressively. As it accumulates experience, epsilon decreases (toward 0.1 or lower) and the agent shifts toward exploiting what it has learned.
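A minimal sketch of epsilon-greedy with annealing follows. The linear schedule, the 0.9 and 0.1 endpoints, and the 1000-step decay window are assumptions chosen to match the ranges mentioned above; real systems often use exponential or stepwise decay instead.

```python
import random


def annealed_epsilon(step, start=0.9, end=0.1, decay_steps=1000):
    """Linear decay from `start` to `end` over `decay_steps`, then hold at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda i: q_row[i])


rng = random.Random(0)
row = [0.0, 1.0, 0.0, -0.3]  # the S2 row from the example table; index 1 is Down

for step in (0, 500, 1000):
    eps = annealed_epsilon(step)
    picks = [epsilon_greedy(row, eps, rng) for _ in range(1000)]
    greedy_share = picks.count(1) / len(picks)
    print(f"step={step} epsilon={eps:.2f} greedy share={greedy_share:.2f}")
```

The greedy share rises as epsilon falls, which is the shift from exploration to exploitation in numeric form.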

For an attacker, the exploration phase is the window of maximum influence. During early training, the agent is forming its initial Q-value estimates. These early values anchor future learning because the update rule adjusts relative to the current value. Corrupted early values propagate through subsequent updates, and because the learning rate typically decreases alongside epsilon, the agent becomes progressively less capable of correcting those initial corruptions.

An attacker who can influence the training environment during the exploration phase, even briefly, can embed persistent biases in the Q-table that survive long after epsilon has decayed to near zero.

What Q-learning assumes, and what breaks when those assumptions fail

Q-learning’s convergence proof relies on two properties of the environment.

The Markov property requires that the next state depends only on the current state and action, not on any prior history. The agent does not remember where it has been. Each decision is made in isolation, based solely on the current Q-table lookup.

In real systems, the Markov property rarely holds perfectly. Network traffic patterns are influenced by time of day, user behaviour has session-level dependencies, and security events are correlated across time. When Q-learning is applied to environments that violate the Markov property, the Q-values become inaccurate because the algorithm is modelling a simpler world than the one it operates in. An attacker who understands this gap can construct action sequences that exploit the agent’s inability to account for history.

Stationarity requires that the environment’s dynamics do not change over time. The same action in the same state should produce the same distribution of outcomes regardless of when it is taken.

Stationarity almost never holds in adversarial settings. The act of deploying a Q-learning agent changes the environment it operates in (other agents adapt, attackers adjust tactics, infrastructure evolves). When the environment shifts, the Q-table becomes stale. Values learned under old conditions guide decisions in new ones. An attacker who can trigger environmental shifts after training can render a converged Q-table obsolete without ever modifying the table itself.

Where Q-learning shows up in practice

Q-learning and its derivatives appear in several security-adjacent contexts. Network intrusion detection systems use Q-learning agents to adapt response strategies. Automated penetration testing frameworks use RL agents to learn attack sequences. Game-playing AI (which shares architectural patterns with adversarial simulation) relies heavily on Q-learning variants. Adaptive access control systems use RL to adjust authentication requirements based on observed behaviour patterns.

In each case, the attack surface is the same: the reward signal, the state representation, and the Q-table itself. An adaptive IDS that learns which responses are effective can be trained to deprioritise the correct response by feeding it misleading feedback. An automated pentest agent can be steered away from productive attack paths by manipulating the environment it trains in. An access control system that adapts based on observed patterns can be gradually conditioned to lower its thresholds for the attacker’s profile.

What defenders should actually do

For any production system built on Q-learning, the mitigations map directly to the attack vectors.

Protect the reward channel. Validate reward signals against independent ground truth. If the reward comes from a downstream system, verify that system’s integrity. Do not trust a single source of reward in any deployment where the signal could be intercepted or modified. Redundant reward channels with consensus mechanisms make single-channel poisoning detectable.
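One simple form of consensus is to take the median of several independent reward readings and flag any channel that strays from it. The sketch below assumes at least three channels and a fixed tolerance of 0.2; both numbers, and the function name, are illustrative rather than prescriptive.

```python
from statistics import median


def consensus_reward(readings, tolerance=0.2):
    """Return (agreed_reward, suspicious_channel_indices) for redundant readings."""
    if len(readings) < 3:
        raise ValueError("need at least three channels to outvote a single bad one")
    agreed = median(readings)
    suspicious = [i for i, r in enumerate(readings) if abs(r - agreed) > tolerance]
    return agreed, suspicious


# Channel 2 has been tampered with; the median ignores it and the check flags it.
reward, flagged = consensus_reward([0.5, 0.5, 5.0])
print(reward, flagged)  # 0.5 [2]
```

The agent learns from the agreed value only, and the flagged indices go to whatever alerting the deployment already has. An attacker who controls a majority of the channels defeats this check, so it raises the cost of poisoning rather than eliminating it.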

Audit the Q-table. In tabular Q-learning (as opposed to deep Q-networks, which we will cover in a later entry), the learned policy is fully inspectable. Run periodic checks for anomalous Q-values: cells that have shifted dramatically without corresponding environmental changes, actions that have become universally preferred or avoided, or value distributions that do not match expected reward patterns.
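A periodic audit can be as simple as comparing the current table against a trusted snapshot. The sketch below checks two of the signals listed above: cells that moved by more than a threshold, and a single action becoming the preferred choice in every state. The threshold of 0.5 and the function name are assumptions for illustration.

```python
def audit_q_table(previous, current, actions, max_shift=0.5):
    """Return a list of human-readable findings; an empty list means nothing flagged."""
    findings = []
    # Signal 1: individual cells that moved further than expected between snapshots.
    for state, row in current.items():
        for action, old, new in zip(actions, previous[state], row):
            if abs(new - old) > max_shift:
                findings.append(f"{state}/{action}: shifted {old:+.2f} -> {new:+.2f}")
    # Signal 2: one action has become the greedy choice in every state.
    greedy = {
        actions[max(range(len(row)), key=lambda i: row[i])]
        for row in current.values()
    }
    if len(current) > 1 and len(greedy) == 1:
        findings.append(f"every state now prefers {greedy.pop()}")
    return findings


ACTIONS = ["Up", "Down", "Left", "Right"]
baseline = {
    "S1": [-1.0, 0.0, -0.5, 0.2],
    "S2": [0.0, 1.0, 0.0, -0.3],
    "S3": [0.5, -0.5, 1.0, 0.0],
    "S4": [-0.2, 0.0, -0.3, 1.0],
}
# Simulated tampering: the Up column has been overwritten in every row.
tampered = {state: [5.0] + row[1:] for state, row in baseline.items()}

findings = audit_q_table(baseline, tampered, ACTIONS)
for line in findings:
    print(line)
```

Here the audit reports four shifted cells plus the universal preference for Up. The same check against an unchanged table returns nothing.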

Control the training environment. If the agent is trained online (learning during deployment), the training environment is the production environment, and the production environment is adversary-accessible. Consider whether offline training with periodic deployment is viable. If online training is necessary, add integrity monitoring to the state observations and implement anomaly detection on the reward stream.

Decay the learning rate aggressively post-convergence. Once the Q-table has stabilised, reduce alpha to near zero. This makes the agent resistant to late-stage reward manipulation at the cost of adaptability. If the environment changes legitimately, a controlled retraining cycle with validated data is safer than leaving the agent continuously responsive to new signals.

Monitor exploration behaviour. If epsilon-greedy is in use, track whether the agent’s exploration patterns are consistent with the annealing schedule. An agent whose preferred actions keep changing after epsilon has decayed may be receiving corrupted rewards that prevent convergence. An agent that has converged on an unexpected policy may have been influenced during the exploration window.
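One way to make that check measurable is to compare the observed share of non-greedy actions with what the current epsilon predicts. The sketch assumes exploration draws uniformly from all actions (so a random draw sometimes lands on the greedy action anyway) and that each state has a single greedy action. The slack of 0.05 and both function names are arbitrary.

```python
def expected_non_greedy_rate(epsilon, n_actions):
    """Share of steps where a uniform random draw lands on a non-greedy action."""
    return epsilon * (1 - 1 / n_actions)


def exploration_alert(non_greedy_count, total_actions, epsilon, n_actions, slack=0.05):
    """True when the observed non-greedy share strays from the schedule by more than slack."""
    observed = non_greedy_count / total_actions
    return abs(observed - expected_non_greedy_rate(epsilon, n_actions)) > slack


# With epsilon at 0.1 and four actions, about 7.5% of steps should be non-greedy.
print(exploration_alert(75, 1000, epsilon=0.1, n_actions=4))   # False
print(exploration_alert(300, 1000, epsilon=0.1, n_actions=4))  # True
```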

The real vulnerability

Q-learning is transparent in a way that most ML algorithms are not. The Q-table is a flat data structure that encodes the agent’s entire decision-making process. Every value in it is interpretable. Every update follows a known formula. There is no hidden layer, no activation function, no gradient computation.

That transparency is both its strength and its weakness. A defender can inspect the Q-table and understand exactly why the agent makes every decision. An attacker can do the same thing. And because the update rule is deterministic given its inputs, an attacker who controls the inputs can calculate exactly what the Q-table will look like after any sequence of corrupted episodes.
