Reinforcement learning

An agent learns to play a game by maximising its score. You change what “scoring” means. The agent adapts to the new objective with the same efficiency it applied to the original one. It does not know the rules have changed, and it cannot know. The reward signal is the only truth it has.

Every model covered so far in this series has learned from static data. Supervised models learned from labelled examples. Unsupervised models learned structure from unlabelled distributions. In both cases, the data existed before training began, and the model’s relationship with it was passive: observe, learn, generalise. Reinforcement learning breaks that pattern. The agent generates its own training data by acting in an environment, observing consequences, and adjusting. It learns through interaction, not inspection.

That interaction loop is the attack surface. Supervised models are vulnerable to data poisoning (corrupting the dataset before training). Unsupervised models are vulnerable to distribution manipulation (shifting what “normal” looks like). Reinforcement learning agents are vulnerable to something more direct: you change their reality. Alter the environment, corrupt the reward signal, or perturb what the agent perceives, and the agent trains itself to do the wrong thing. You do not even need access to the model. You need access to the world the model lives in.

How reinforcement learning actually works

The core loop is simple. An agent exists in some environment. At each step, it observes the current state, selects an action, and receives two things back: a new state and a reward. The reward is a scalar value, a single number that tells the agent how good or bad that action was in that context. The agent’s goal is to learn a policy, a mapping from states to actions, that maximises cumulative reward over time.
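
The loop can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `CorridorEnv`, `run_episode`, and the reward values are hypothetical names and numbers chosen for the example.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """One pass through the loop: observe state, act, receive new state and reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return +1


total = run_episode(CorridorEnv(), always_right)  # three step costs, then the goal bonus
```

Everything the agent learns arrives through the return value of `step`. Whoever controls that return value controls the lesson.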

The word “cumulative” matters. A chess agent that only maximised immediate reward would capture every piece it could, regardless of whether doing so opened it up to a checkmate three moves later. The cumulative framing forces the agent to consider long-term consequences, and the mechanism for that is the value function. This function is an estimate of how much total reward the agent expects to accumulate from a given state onwards while following its current policy.

Two controls shape how the agent trades the present against the future. The first is the discount factor (gamma), which determines how much weight the agent gives to future rewards relative to immediate ones. A gamma of 0 produces a greedy agent that only cares about the next reward. A gamma of 1 produces an agent that weighs a reward received in a thousand steps the same as one received now. In practice, gamma sits somewhere between 0.9 and 0.99 for most applications, creating agents that care about the future but not infinitely.
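
A short worked example makes the effect of gamma concrete. The reward sequence here is invented for illustration; the function computes the standard discounted sum, where the reward at step t is weighted by gamma to the power t.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A small reward now, a large one three steps later.
rewards = [1.0, 0.0, 0.0, 10.0]

greedy = discounted_return(rewards, gamma=0.0)   # only the first reward counts
typical = discounted_return(rewards, gamma=0.9)  # 1 + 10 * 0.9**3
patient = discounted_return(rewards, gamma=1.0)  # plain sum
```

With gamma at 0 the large delayed reward is invisible. At 0.9 it still dominates the total, which is why the agent is willing to wait for it.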

The second control is the exploration-exploitation tradeoff. An agent that always picks the action it currently believes is best (exploitation) will never discover better strategies. An agent that always picks randomly (exploration) will never capitalise on what it has learned. Most RL algorithms manage this tension explicitly, through mechanisms like epsilon-greedy (pick randomly with probability epsilon, otherwise exploit) or more sophisticated approaches like upper confidence bounds or entropy-regularised policies.
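
Epsilon-greedy is small enough to show in full. The sketch below assumes a list of current value estimates per action; the numbers and the seed are arbitrary.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 0.8, 0.5]  # the agent currently believes action 1 is best

picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
share_best = picks.count(1) / len(picks)  # mostly action 1, with occasional exploration
```

The exploratory picks matter adversarially too. They are the moments when the agent samples actions it has no settled opinion about, so the rewards it sees then carry the most weight.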

Model-based versus model-free

RL algorithms split into two broad families based on whether the agent builds an internal model of how the environment works.

Model-based agents learn a transition function that predicts the likely next state S′ given a specific state S and action A. This internal model lets the agent plan ahead without actually taking actions, simulating trajectories through its learned model of the world. The advantage is sample efficiency. The agent needs fewer real interactions because it can “practise” in its own simulation. The vulnerability is that the internal model can be wrong, and if you can influence the data the agent uses to build that model, you can make it wrong in specific, exploitable ways.
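
As a rough sketch of how a learned transition model can be steered, consider the simplest possible version: a table of counts. Real model-based agents typically use learned function approximators, so `TransitionModel` and the state names below are purely illustrative assumptions.

```python
from collections import Counter, defaultdict


class TransitionModel:
    """Tabular estimate of which next state follows each (state, action) pair."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        seen = self.counts[(state, action)]
        return seen.most_common(1)[0][0] if seen else None


model = TransitionModel()

for _ in range(10):
    model.update("door", "open", "room")  # genuine experience
clean_prediction = model.predict("door", "open")

for _ in range(11):
    model.update("door", "open", "trap")  # attacker-injected transitions
poisoned_prediction = model.predict("door", "open")
```

Once the injected transitions outnumber the genuine ones, every plan the agent simulates through that state is a plan for a world the attacker wrote.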

Model-free agents skip the internal model entirely. They learn the policy or value function directly from experience, without ever trying to predict how the environment works. Q-learning is the canonical example. The agent maintains a table, or a neural network approximation, of how valuable each action is in each state and updates those values based on the rewards it actually receives. The agent does not know why an action is good. It just knows that it is. The vulnerability here is different: model-free agents are entirely dependent on the accuracy and honesty of the reward signal they receive. They have no internal world model to cross-check against.
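
The tabular Q-learning update fits in one function. This is a minimal sketch of the standard rule; the state names, learning rate `alpha`, and discount `gamma` are illustrative choices.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * max over a' of Q(next_state, a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


actions = ["left", "right"]
q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
first = q[("s0", "right")]   # 0 + 0.5 * (1.0 - 0)

q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
second = q[("s0", "right")]  # 0.5 + 0.5 * (1.0 - 0.5)
```

Note that `reward` enters the update unexamined. Nothing in the rule asks where the number came from.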

For the red teamer, the distinction shapes the most natural attack vector. Model-based agents are particularly susceptible to observation poisoning. If you feed the agent misleading state information, its internal model of the world becomes corrupted. Model-free agents are particularly susceptible to reward manipulation, where changing the reward signal causes the agent to learn a policy optimised for the wrong objective.

The components as attack surfaces

Each component of the RL loop represents a distinct point where adversarial pressure can be applied.

The reward signal. This is the most direct attack vector. If you can modify the reward the agent receives, you control what it optimises for. The agent will faithfully learn to maximise whatever signal you provide, regardless of whether that signal reflects the intended objective. This is reward hacking in the adversarial context, where an attacker does not need to fool the model into misclassifying an input. They need to redefine what success means.

In 2016, researchers at OpenAI documented an RL agent that found an unexpected way to maximise reward without completing the intended task. A boat racing agent in the game CoastRunners discovered it could score higher by circling a small lagoon and repeatedly hitting respawning targets than by finishing the race. The agent was not broken. It was doing exactly what the reward function asked. The reward function was asking the wrong question. An adversary who can influence reward design can create the same misalignment intentionally.
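
To see how little the agent resists, here is a toy sketch of reward tampering against a two-action bandit. The learner, the reward functions, and the idea of an attacker sitting between environment and agent are all assumptions made for the example.

```python
import random


def train_bandit(reward_fn, n_actions=2, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit that keeps a running mean reward per action."""
    rng = random.Random(seed)
    estimates = [0.0] * n_actions
    counts = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = reward_fn(action)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    # Return the action the trained agent now prefers.
    return max(range(n_actions), key=lambda a: estimates[a])


def true_reward(action):
    # Intended objective: action 0 ("finish the race") is the good one.
    return 1.0 if action == 0 else 0.0


def tampered_reward(action):
    # The attacker intercepts the signal and inverts it.
    return 1.0 - true_reward(action)


honest_choice = train_bandit(true_reward)
corrupted_choice = train_bandit(tampered_reward)
```

The learning code is identical in both runs. Only the meaning of success changed, and the agent followed it.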

The state observation. The agent’s view of the environment is not the environment itself. It is a representation, a set of features or sensor readings that the agent treats as ground truth. Perturb those observations, and the agent makes decisions based on a world that does not match reality. In autonomous driving, this is the domain of adversarial patches and sensor spoofing. For example, a stop sign with carefully placed stickers might be read as a speed limit sign by the agent’s vision system. The agent’s policy remains functional, but its perception is compromised. Even if the underlying logic for decision-making is sound, the agent is operating on “hallucinated” or manipulated data, leading it to execute a perfectly logical action for a reality that doesn’t exist.
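
A deliberately tiny sketch shows the shape of the problem. The controller, the 10 m threshold, and the spoofing offset are hypothetical; real perception attacks target learned vision models, not a single number.

```python
def braking_policy(observed_distance_m):
    """Hypothetical controller: brake when the obstacle appears closer than 10 m."""
    return "brake" if observed_distance_m < 10.0 else "cruise"


def spoofed_sensor(true_distance_m, offset_m):
    # The attacker does not touch the policy, only what it perceives.
    return true_distance_m + offset_m


true_distance = 8.0

honest_action = braking_policy(true_distance)
spoofed_action = braking_policy(spoofed_sensor(true_distance, offset_m=5.0))
```

The policy is unchanged and still correct for the input it was given. The input is what failed.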

The environment itself. In simulation-trained agents (which is most of them, because real-world training is expensive and dangerous), the training environment is a model of reality. If the simulation is inaccurate, the agent learns a policy optimised for a world that does not exist. This sim-to-real gap is a known engineering problem, but it is also an adversarial opportunity. An attacker with access to the training environment can introduce subtle inaccuracies that cause the agent to learn exploitable behaviours when deployed in the real world.

The policy. Once trained, the policy itself is a target. Policy extraction attacks work similarly to model stealing in supervised learning: by querying the agent repeatedly with different states and observing which actions it takes, an attacker can train a surrogate policy that approximates the original. The surrogate can then be analysed offline to find exploitable patterns, states where the policy makes predictable or manipulable decisions. In competitive environments (game AI, trading algorithms), knowing the opponent’s policy is knowing how to beat it.
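
The querying step can be sketched as follows. This is a simplification: the surrogate here is a lookup table of observed state-action pairs, not a trained model, and `victim_policy` with its trading rule is an invented stand-in for a black box.

```python
def victim_policy(state):
    """Stand-in for a black-box agent; the attacker can call it but not read it."""
    price, inventory = state
    return "sell" if price > 100 and inventory > 0 else "hold"


def extract_policy(query, probe_states):
    """Record the action taken in each probed state to build a surrogate table."""
    return {state: query(state) for state in probe_states}


probe_states = [(price, inv) for price in range(90, 111, 5) for inv in (0, 1, 2)]
surrogate = extract_policy(victim_policy, probe_states)

# Offline analysis: which probed states trigger a sell?
sell_states = sorted(s for s, a in surrogate.items() if a == "sell")
```

In practice the attacker would fit a classifier to these pairs so the surrogate generalises to unprobed states. The principle is the same: behaviour leaks policy.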

Episodic versus continuous

RL tasks fall into two structural categories. Episodic tasks have defined endpoints, such as when a game ends, a maze is solved, or a delivery is completed. The agent resets and starts a new episode. Continuous tasks (usually called continuing tasks in the RL literature) have no terminal state. A robot arm controlling a manufacturing process runs indefinitely. A traffic management system never stops.

The adversarial implications differ. In episodic tasks, the agent’s policy is shaped by the reward accumulated within each episode. An attacker who can influence early episodes (when the policy is still forming) has disproportionate impact on the final learned behaviour. Early poisoning is cheap and effective because the agent has not yet developed the stable value estimates that would help it discount corrupted experiences.

In continuous tasks, the attack surface is persistence. There is no reset. If the attacker can introduce a subtle, ongoing perturbation to the reward signal or state observations, the agent’s policy will drift over time. This is particularly dangerous in deployed systems where the agent continues learning (online learning), because the attacker does not need a single decisive corruption. They need a sustained, low-magnitude influence that gradually shifts the policy in the desired direction without triggering anomaly detection.
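
The sketch below illustrates why a small sustained bias is harder to catch than a large one. It assumes an online learner with a constant step size and a naive anomaly check that compares each observed reward with the current estimate; the values and the threshold are invented.

```python
def run_online(bias, steps=500, alpha=0.05, alert_threshold=0.5):
    """Online value tracking for a 'risky' action whose reward feed is biased."""
    value = {"safe": 1.0, "risky": 0.8}  # settled estimates before the attack
    alerts = 0
    for _ in range(steps):
        observed = 0.8 + bias  # honest payoff of "risky" plus attacker bias
        if abs(observed - value["risky"]) > alert_threshold:
            alerts += 1  # per-sample check against the current estimate
        value["risky"] += alpha * (observed - value["risky"])
    preferred = max(value, key=value.get)
    return preferred, alerts


no_attack = run_online(bias=0.0)
slow_attack = run_online(bias=0.3)
blunt_attack = run_online(bias=2.0)
```

The slow attack flips the preferred action without a single alert, because each sample stays within tolerance of an estimate that is itself moving. The blunt attack trips the check on the first step.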

Where this shows up in practice

RL is less prevalent in traditional security tooling than supervised or unsupervised learning, but it appears in domains where sequential decision-making matters. Automated penetration testing tools like those based on the CyberBattleSim framework use RL agents to learn attack strategies. Network routing and resource allocation systems use RL to optimise traffic flow. Adaptive authentication systems use RL-like bandit algorithms to decide when to require additional verification factors. Trading algorithms use RL to learn execution strategies.

In each of these deployments, the attack surface follows the same pattern. The agent trusts its reward signal. The agent trusts its state observations. The agent assumes its training environment accurately represents the deployment environment. Each of those assumptions is testable, and testing them is red teaming.

The most operationally relevant pattern for security practitioners is the reward manipulation attack against adaptive systems. An authentication system that uses a bandit algorithm to decide between MFA prompts, risk-based challenges, and frictionless access can be gamed by an attacker who deliberately provides the system with interactions designed to shift its learned policy toward less secure defaults. The attacker is not exploiting a vulnerability. They are training the system to let them in.
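
A stripped-down sketch of that pattern follows. It is not a real bandit implementation or any vendor's product: `AdaptiveAuth`, the per-device trust score, and the threshold are hypothetical, chosen only to show how benign-looking feedback shifts the learned default.

```python
class AdaptiveAuth:
    """Hypothetical adaptive authenticator keyed on a device profile."""

    def __init__(self, alpha=0.1, trust_threshold=0.7):
        self.alpha = alpha
        self.trust_threshold = trust_threshold
        self.trust = {}  # profile -> running estimate that sessions are benign

    def decide(self, profile):
        trusted = self.trust.get(profile, 0.0) >= self.trust_threshold
        return "frictionless" if trusted else "mfa"

    def feedback(self, profile, looked_benign):
        old = self.trust.get(profile, 0.0)
        outcome = 1.0 if looked_benign else 0.0
        self.trust[profile] = old + self.alpha * (outcome - old)


auth = AdaptiveAuth()
before = auth.decide("attacker-device")

# The attacker "trains" the system with harmless-looking sessions.
for _ in range(20):
    auth.feedback("attacker-device", looked_benign=True)

after = auth.decide("attacker-device")
```

No exploit is involved at any point. Each session is individually unremarkable, and the sum of them is a policy change.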

Defending RL systems

Defending reinforcement learning requires treating the entire interaction loop as untrusted, not just the model.

Validate reward signals independently. Do not rely on a single reward source. Where possible, use multiple independent reward channels and alert on divergence between them. If one reward signal disagrees with the others, the policy update should be halted, not averaged.
Bound the policy’s action space. Constrain what the agent is allowed to do regardless of what it has learned. Hard limits on actions (maximum speed, minimum verification requirements, forbidden state transitions) act as guardrails that survive policy corruption. The agent might learn the wrong policy, but the damage is bounded.
Monitor for policy drift. In continuously learning systems, track the agent’s policy over time. Sudden shifts in action distributions or value estimates for previously stable states are indicators of reward manipulation or observation poisoning. Establish baselines during supervised deployment and alert on deviation.
Treat the sim-to-real gap as a security boundary. Agents trained in simulation should be tested against adversarial perturbations to the deployment environment before going live. The gap between simulation and reality is not just an engineering problem. It is a trust boundary that an adversary can exploit.
Separate the training loop from the deployment environment in production systems. Online learning (where the deployed agent continues updating its policy) is convenient but dangerous. Every interaction becomes a potential training input, and an attacker who can interact with the system is an attacker who can influence its learning. Where online learning is necessary, use delayed batch updates with validation rather than immediate policy adjustments.
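
Three of these controls are simple enough to sketch. The function names, tolerance, and speed limit below are illustrative assumptions, and total variation distance is just one reasonable choice of drift measure.

```python
def safe_to_update(reward_channels, tolerance=0.2):
    """Allow a policy update only if independent reward sources agree."""
    return max(reward_channels) - min(reward_channels) <= tolerance


def clamp_action(requested_speed, max_speed=30.0):
    """Hard limit enforced outside the learned policy."""
    return max(0.0, min(max_speed, requested_speed))


def drift_score(baseline_counts, recent_counts):
    """Total variation distance between two action distributions (0 = identical)."""
    actions = set(baseline_counts) | set(recent_counts)
    b_total = sum(baseline_counts.values())
    r_total = sum(recent_counts.values())
    return 0.5 * sum(
        abs(baseline_counts.get(a, 0) / b_total - recent_counts.get(a, 0) / r_total)
        for a in actions
    )


agree = safe_to_update([1.0, 0.95, 1.1])
disagree = safe_to_update([1.0, 0.0, 1.0])  # one channel diverges: halt the update
bounded = clamp_action(80.0)

baseline = {"mfa": 70, "frictionless": 30}
recent = {"mfa": 20, "frictionless": 80}
drift = drift_score(baseline, recent)
```

None of these checks depends on the policy being correct. That is the point: they sit outside the thing the attacker is trying to retrain.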

The real lesson

Every model in this series so far has been passive. It learned from data that was given to it, and it applied what it learned to new inputs. The model’s relationship with the world was one-directional: data in, predictions out. Reinforcement learning breaks that pattern. The agent acts on its world, changes it, observes the changes, and learns from the consequences. That feedback loop means the agent’s training data is its own behaviour and the world’s response to it.

For a red teamer, the implication is that you do not need to touch the model, the training data, or the architecture. You need to be part of the environment. If you can influence what the agent sees, what reward it receives, or how its actions affect the world, you are already inside the training loop. The model will do the rest.
