Deep learning

Every adversarial example you have ever seen in a research paper, every model inversion attack that reconstructed a face from a gradient, and every backdoor trigger embedded in a training pipeline depends on a single property of deep neural networks: the entire model is differentiable. You can compute the gradient of the output with respect to any input, and that gradient tells you which direction to move the input, at least locally, to change the output. The same mechanism that trains the network is the mechanism that breaks it. The previous entries in this series covered individual algorithms such as linear regression, logistic regression, decision trees, SVMs, ensemble methods, and reinforcement learning. Each had its own attack surfaces. Deep learning is where those surfaces multiply, because deep neural networks absorb all of those prior concepts into a single architecture and add the new dimension of depth: multiple layers of learned representations, none of them specified by the designer, each one an emergent abstraction that the model discovered on its own and that nobody fully understands.

This is the thirteenth entry in the AI red teaming series. We are crossing into the territory where adversarial machine learning became a research discipline, because deep learning is the architecture that forced the field into existence.

What a neural network actually is

A neural network is a function. It takes an input (an image, a block of text, a row of numerical features), multiplies it by a set of learned weights, applies a non-linear transformation, and passes the result forward. Repeat this process across multiple layers, and the network builds increasingly abstract internal representations of the data.

The architecture has three structural components.

The input layer receives the raw data. It performs no computation. Its job is to format the data into a numerical tensor that the rest of the network can process.

Hidden layers sit between input and output. Each hidden layer applies a linear transformation (matrix multiplication with learned weights, plus a bias term) followed by a non-linear activation function. In theory, a network with a single, sufficiently wide hidden layer can already approximate any continuous function, but the width required can be impractical. Stacking many hidden layers lets the network represent complex functions far more efficiently. That is the “deep” in deep learning: stacking layers to build hierarchical representations where each layer captures patterns at a different level of abstraction.

The output layer produces the final result. For classification, it typically outputs a probability distribution across classes. For regression, it outputs a continuous value.

Every connection between neurons carries a weight. These weights are the model’s knowledge. A trained neural network is nothing more than a very large collection of floating point numbers arranged in matrices.
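To make the three components concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, the weight values in `params`, and the helper names (`dense`, `forward`) are all made up for illustration; a real network would have learned these numbers during training.

```python
import math

def relu(values):
    # Non-linear activation applied element-wise in the hidden layer.
    return [max(0.0, v) for v in values]

def softmax(values):
    # Turns raw scores into a probability distribution (shifted for stability).
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    # Linear transformation: one row of weights per output neuron, plus a bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def forward(x, params):
    hidden = relu(dense(x, params["W1"], params["b1"]))   # hidden layer
    logits = dense(hidden, params["W2"], params["b2"])    # output layer scores
    return softmax(logits)                                # class probabilities

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 2 classes.
params = {
    "W1": [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]],
    "b1": [0.0, 0.1, -0.1],
    "W2": [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]],
    "b2": [0.0, 0.0],
}

probs = forward([1.0, 2.0], params)
print(probs)
```

Everything the model "knows" lives in the `params` dictionary, which is why stealing or tampering with those numbers is equivalent to stealing or tampering with the model.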

Activation functions

Without activation functions, a deep network would collapse into a single linear transformation regardless of how many layers you stack. Two matrix multiplications in sequence are equivalent to one matrix multiplication. The network would be no more expressive than logistic regression.
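The collapse is easy to check numerically. The sketch below uses two arbitrary weight matrices (`W1` and `W2` are illustrative values, not taken from any real model) and shows that applying them in sequence gives the same result as applying their product once, and that inserting a ReLU between them breaks the equivalence.

```python
def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: 2 inputs -> 3 units
W2 = [[0.5, 1.0, -2.0], [1.0, 0.0, 1.0]]     # layer 2: 3 units -> 2 outputs
x = [1.0, 2.0]

# Two linear layers applied in sequence...
two_layers = matvec(W2, matvec(W1, x))

# ...equal a single layer whose weights are the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, no single matrix reproduces the result.
with_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])

print(two_layers, one_layer, with_relu)
```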

Activation functions break that linearity. They introduce kinks, curves, and saturation points into the network’s computation, allowing it to model complex, non-linear relationships. The choice of activation function affects everything, including how fast the model trains, whether gradients flow through deep layers, and how the network responds to adversarial perturbations.

Sigmoid squashes inputs into the range (0, 1). It was the default activation in early neural networks and is still used in output layers for binary classification. Its problem is gradient saturation. For inputs far from zero, the gradient approaches zero, and even at its peak the sigmoid’s gradient is only 0.25. Backpropagation multiplies these small factors together layer by layer, so the layers closest to the input receive vanishingly small gradient signals during training. This is the vanishing gradient problem, and it is the reason sigmoid fell out of favour for hidden layers.

Tanh squashes inputs into (-1, 1). It is zero-centred, which helps with gradient flow compared to sigmoid, but it still saturates at extreme values. Same fundamental limitation.

ReLU (Rectified Linear Unit) returns zero for negative inputs and passes positive inputs through unchanged. It avoids saturation for positive inputs by maintaining a constant gradient of 1 there. ReLU is the default activation for hidden layers in most modern architectures. Its weakness is the “dying ReLU” problem: a neuron whose pre-activation is negative for every input it sees outputs zero, receives zero gradient, and so never updates. Such neurons can become permanently inactive.

For a red teamer, the activation function matters because it determines how gradients flow through the network, and gradient flow is the mechanism behind adversarial example generation. If the gradient is near zero (saturated sigmoid) or exactly zero (inactive ReLU), small perturbations to the input in that region produce little or no change in the output. Adversarial attacks need live gradients. Understanding which activation functions the target uses tells you where the model is sensitive to input manipulation and where it is numb.
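The live and numb regions can be seen directly by evaluating each activation’s derivative at a few points. This is a small illustrative sketch; the sample points are arbitrary, and the ReLU gradient at exactly zero is set to 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2, which peaks at 1 when z = 0.
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise (0 at z == 0 by convention).
    return 1.0 if z > 0 else 0.0

table = [(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
         for z in (-10.0, -1.0, 0.0, 1.0, 10.0)]

for z, sg, tg, rg in table:
    print(f"z={z:6.1f}  sigmoid'={sg:.2e}  tanh'={tg:.2e}  relu'={rg:.0f}")
```

At the extremes, the sigmoid and tanh derivatives shrink towards zero, while ReLU is either fully on or fully off. Those are the regions where a gradient-based attack gets little or no signal.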

Backpropagation

Training a neural network means adjusting its weights so that the output matches the desired target for a given input. This requires knowing how much each weight contributed to the error. Backpropagation computes that contribution.

The process has two phases. In the forward pass, input data flows through the network layer by layer, producing a prediction. The loss function compares that prediction to the ground truth and produces a scalar error value. In the backward pass, the chain rule of calculus propagates that error backward through every layer, computing the gradient of the loss with respect to each weight. These gradients tell the optimiser how to adjust each weight to reduce the error.
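The two phases can be written out by hand for a network small enough to fit on one screen. The sketch below assumes a toy model with one tanh hidden unit and a squared-error loss; the parameter values and the single training example are arbitrary. The analytic gradients from the backward pass are compared against finite-difference estimates as a sanity check.

```python
import math

def forward(params, x, y):
    # Forward pass: input -> hidden unit (tanh) -> output -> squared-error loss.
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, h, y_hat

def backward(params, x, y):
    # Backward pass: apply the chain rule from the loss back to each weight.
    w1, b1, w2, b2 = params
    _, h, y_hat = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)        # dL/dy_hat
    d_w2 = d_yhat * h                 # dL/dw2
    d_b2 = d_yhat                     # dL/db2
    d_h = d_yhat * w2                 # dL/dh
    d_z = d_h * (1.0 - h ** 2)        # through tanh: dL/dz
    d_w1 = d_z * x                    # dL/dw1
    d_b1 = d_z                        # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_grad(params, x, y, eps=1e-6):
    # Central finite differences, used only to check the analytic gradients.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, -0.1, 1.5, 0.2]        # arbitrary starting weights
x, y = 0.8, 1.0                       # one made-up training example

analytic = backward(params, x, y)
numeric = numerical_grad(params, x, y)

# One gradient descent step: move each weight against its gradient.
learning_rate = 0.1
updated = [p - learning_rate * g for p, g in zip(params, analytic)]
loss_before = forward(params, x, y)[0]
loss_after = forward(updated, x, y)[0]
print(loss_before, loss_after)
```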

This is elegant mathematics. It is also a weapon.

Adversarial example generation points the same mechanism at a different target. Instead of computing the gradient of the loss with respect to the model’s weights (to update the model), you compute the gradient of the loss with respect to the model’s input (to update the input). You freeze the weights and ask: how should I change this input to maximise the model’s error?

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy formalised this in 2014 with the Fast Gradient Sign Method (FGSM). The attack computes the gradient of the loss with respect to the input image, takes the sign of each gradient element, and adds a small perturbation in the direction that increases the loss:

x_adversarial = x + ε * sign(∇_x L(θ, x, y))

Where x is the original input, ε controls the perturbation magnitude, L is the loss function, θ is the model’s parameters, and y is the true label. For a small ε, the perturbation can be imperceptible to a human yet is often sufficient to flip the model’s classification.
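The formula translates almost line for line into code. The sketch below applies it to a stand-in model, a single sigmoid unit with frozen weights, because its input gradient has a simple closed form: (p - y) * w for binary cross-entropy loss. The weights `w`, bias `b`, input `x`, and `epsilon` are hypothetical values chosen so that the flip is visible. Against a real deep network you would obtain the same gradient from the framework’s automatic differentiation instead.

```python
import math

def predict(w, b, x):
    # Probability of class 1 from a single sigmoid unit with frozen weights.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    # Gradient of binary cross-entropy with respect to the INPUT, not the weights.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(value):
    return (value > 0) - (value < 0)

def fgsm(w, b, x, y, epsilon):
    # x_adversarial = x + epsilon * sign(gradient of loss w.r.t. x)
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -3.0, 1.0], 0.1   # hypothetical frozen model parameters
x, y = [0.3, 0.1, 0.2], 1      # input that the model classifies correctly
epsilon = 0.15                 # maximum change allowed per feature

x_adv = fgsm(w, b, x, y, epsilon)
p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
print(p_clean, p_adv)
```

Each feature moves by at most `epsilon`, yet the predicted probability crosses the 0.5 threshold and the label flips.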

This is not a clever hack. It is a direct consequence of how backpropagation works. Any system that uses gradient-based optimisation to learn is, by the same mathematics, vulnerable to gradient-based manipulation. The training algorithm and the attack algorithm are the same algorithm pointed in different directions.

Loss functions

The loss function quantifies how wrong the model’s prediction is. Training minimises this function. The choice of loss function determines what the model optimises for, which means it also determines what the model is willing to sacrifice.

Mean Squared Error (MSE) is the standard loss for regression. It penalises large errors quadratically, making the model sensitive to outliers. An attacker who can inject outlier training samples will have outsized influence on the model’s learned weights, because the loss function amplifies their contribution.

Cross-Entropy Loss is the standard for classification. It penalises confident wrong predictions more heavily than uncertain ones. A model trained with cross-entropy loss will develop sharp decision boundaries in regions where training data is dense. In regions where training data is sparse, the boundaries are softer and more susceptible to adversarial perturbation. The geometry of the loss surface directly determines where the model is robust and where it is fragile.
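A few lines of arithmetic show both behaviours. The numbers below are invented purely to illustrate the shape of each loss: one outlier among four regression targets, and two wrong classifications that differ only in confidence.

```python
import math

def mse(predictions, targets):
    # Mean squared error: each error is squared before averaging.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probs, true_class):
    # Negative log of the probability the model assigned to the correct class.
    return -math.log(probs[true_class])

# Regression: four predictions of 0.0 against clean targets, then one outlier.
clean_loss = mse([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
outlier_loss = mse([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 10.0])

# Classification: the true class is index 1 and both predictions are wrong.
hesitant_wrong = cross_entropy([0.60, 0.40], true_class=1)
confident_wrong = cross_entropy([0.99, 0.01], true_class=1)

print(clean_loss, outlier_loss)
print(hesitant_wrong, confident_wrong)
```

A single outlier multiplies the MSE many times over, which is the leverage a poisoning attacker is after. The confidently wrong prediction costs several times more than the hesitant one under cross-entropy.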

For a red teamer, the loss function is the model’s stated priorities. If you know the loss function, you know exactly what kind of inputs will produce the largest gradient signal, and therefore the most effective adversarial perturbations. You also know what the model explicitly does not penalise, which is often where the real vulnerability sits.

Optimisers

The optimiser uses the gradients computed by backpropagation to update the model’s weights. Different optimisers navigate the loss surface differently, and those differences have security implications.

Stochastic Gradient Descent (SGD) updates weights using the gradient from a single mini-batch of data at each step. It is noisy, which can help escape local minima but makes training less stable. The noise in SGD creates variance in the learned weights, which means training the same architecture on the same data twice can produce slightly different decision boundaries. For an attacker probing a model, this means the exact adversarial example that works against one training run might not transfer to another.

Adam combines momentum (tracking the running average of past gradients) with adaptive learning rates for each parameter. It often converges faster than SGD and handles sparse gradients well. It is sometimes argued that models trained with adaptive methods such as Adam settle into sharper minima in the loss surface, which may make them more susceptible to adversarial examples in the neighbourhood of the decision boundary, although the evidence is mixed. The intuition is that the sharper the minimum, the steeper the nearby loss surface, and the easier it is for a small perturbation to climb out of the correct prediction basin.

RMSprop normalises gradients by their recent magnitude, preventing parameters with large gradients from dominating the update. It handles non-stationary objectives well, which is why it shows up in reinforcement learning training loops (connecting back to the RL entries in this series).
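The difference between the three update rules shows up in the very first step. The sketch below implements the standard textbook form of each rule and applies it once to a made-up gradient whose two components differ by a factor of ten. The hyperparameter defaults are common choices, not values taken from any particular library, and the gradient is fixed rather than sampled from a mini-batch.

```python
import math

def sgd_step(w, g, state, lr=0.05):
    # Plain SGD: the step is proportional to the raw gradient.
    return [wi - lr * gi for wi, gi in zip(w, g)]

def rmsprop_step(w, g, state, lr=0.05, decay=0.9, eps=1e-8):
    # RMSprop: divide each gradient by a running root-mean-square of itself.
    state["v"] = [decay * v + (1.0 - decay) * gi * gi
                  for v, gi in zip(state["v"], g)]
    return [wi - lr * gi / (math.sqrt(v) + eps)
            for wi, gi, v in zip(w, g, state["v"])]

def adam_step(w, g, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum (m) plus per-parameter scaling (v), both bias-corrected.
    state["t"] += 1
    t = state["t"]
    state["m"] = [beta1 * m + (1.0 - beta1) * gi for m, gi in zip(state["m"], g)]
    state["v"] = [beta2 * v + (1.0 - beta2) * gi * gi
                  for v, gi in zip(state["v"], g)]
    m_hat = [m / (1.0 - beta1 ** t) for m in state["m"]]
    v_hat = [v / (1.0 - beta2 ** t) for v in state["v"]]
    return [wi - lr * mh / (math.sqrt(vh) + eps)
            for wi, mh, vh in zip(w, m_hat, v_hat)]

def first_step_size(step):
    # Apply one update from the same start and report how far each weight moved.
    w = [1.0, 1.0]
    g = [10.0, 1.0]   # illustrative gradient: one steep and one shallow direction
    state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
    new_w = step(w, g, state)
    return [abs(a - b) for a, b in zip(w, new_w)]

sgd_sizes = first_step_size(sgd_step)
rmsprop_sizes = first_step_size(rmsprop_step)
adam_sizes = first_step_size(adam_step)
print(sgd_sizes, rmsprop_sizes, adam_sizes)
```

SGD moves ten times further along the steep direction than the shallow one, while RMSprop and Adam move the same distance along both. That per-parameter normalisation is one reason the three optimisers follow different paths across the same loss surface.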

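The optimiser choice shapes the geometry of the solution the model converges to. Different optimisers find different minima, and those minima can have different adversarial robustness properties. Hochreiter and Schmidhuber, and later Keskar and colleagues, argued that flat minima tend to generalise better than sharp minima, and some work links flatter solutions to greater resistance to adversarial perturbation. The picture is not settled: Dinh, Pascanu, Bengio, and Bengio showed that sharpness depends on how the network is parameterised, so sharp minima can also generalise well. Either way, the optimiser is a factor in the model’s attack surface.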

Hyperparameters

Hyperparameters are the architectural and training choices that are fixed before the model sees any data. The learning rate controls how large each weight update is. Too high, and the model overshoots the minimum. Too low, and it converges slowly or gets stuck. The number of hidden layers and neurons per layer determines the model’s capacity, which is its ability to represent complex functions.
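The effect of the learning rate is visible even on the simplest possible loss. This sketch runs gradient descent on f(w) = w², whose minimum is at zero, with three illustrative learning rates. The function name `descend` and the specific rates are chosen only for the demonstration.

```python
def descend(learning_rate, steps=20, w=1.0):
    # Gradient descent on f(w) = w**2, whose gradient is 2 * w.
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

too_high = descend(1.1)     # every step overshoots the minimum and lands further away
reasonable = descend(0.1)   # steady progress towards zero
too_low = descend(0.001)    # barely moves in 20 steps

print(too_high, reasonable, too_low)
```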

From an adversarial perspective, the hyperparameters that matter most are those that affect overfitting. An overfit model has memorised its training data rather than learning general patterns. This makes it vulnerable to membership inference attacks (the attacker can determine whether a specific data point was in the training set by querying the model and observing its confidence), model inversion attacks (reconstructing training data from the model’s learned representations), and adversarial examples in the space between memorised data points where the model has no learned structure to fall back on.

Regularisation techniques like dropout (randomly deactivating neurons during training), weight decay (penalising large weights), and early stopping (halting training before the model overfits) all mitigate these risks. They are not security controls. They are training heuristics that happen to have security side effects.
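Each of the three techniques is only a few lines of logic. The sketch below shows one common formulation of each: inverted dropout, L2 weight decay folded into an SGD step, and patience-based early stopping. The function and class names, the `patience` parameter, and the validation losses in `history` are all hypothetical.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    # Inverted dropout: zero each activation with probability p_drop and
    # rescale survivors so the expected value is unchanged. No-op at inference.
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    # Adds weight_decay * w to each gradient, pulling large weights towards zero.
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

class EarlyStopping:
    # Stop once validation loss has failed to improve for `patience` epochs.
    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, p_drop=0.5, rng=rng)
at_inference = dropout(activations, p_drop=0.5, rng=rng, training=False)

decayed = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)

stopper = EarlyStopping(patience=3)
history = [1.0, 0.8, 0.7, 0.72, 0.75, 0.71]   # made-up validation losses
stopped_at = next(epoch for epoch, loss in enumerate(history)
                  if stopper.should_stop(loss))
print(dropped, decayed, stopped_at)
```

None of this code knows anything about attackers. It simply limits how closely the model can fit individual training points, which is why the security benefit is a side effect and not a guarantee.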

What this means for the red teamer

Deep learning is the first architecture in this series where the model’s training mechanism is directly reusable as an attack mechanism. With decision trees, the attack surface was the interpretability of the model’s logic. With SVMs, it was the geometry of the decision boundary. With ensemble methods, it was the diversity (or lack thereof) among base learners. Deep learning subsumes all of those attack surfaces and adds the gradient.

The gradient is the thread that connects every adversarial technique against deep neural networks. Adversarial examples use the gradient to craft malicious inputs. Model inversion uses the gradient to reconstruct private training data. Backdoor attacks exploit the gradient during training to embed hidden triggers. Membership inference uses the gradient (indirectly, through confidence scores) to probe the model’s memory of individual data points.

If you understand backpropagation, you understand the attack surface of every deep learning system deployed today. The mechanism does not change between a convolutional network classifying images and a transformer processing text. The architecture changes, but the gradient computation does not.
