The maths behind the models

Every anomaly detector, classifier, and language model you will encounter in this series runs on the same small set of mathematical operations. You do not need to derive them from scratch. You need to recognise them when they appear in a model’s documentation, loss function, or configuration, and understand what they are doing to your data.

This reference exists for that purpose. It is a companion to the broader AI for Security series on Kosokoking. Bookmark it. When a later article drops a symbol you have not seen in years, come back here.

Notation you will see in every paper

Before anything else, the notation. AI papers and model documentation reuse the same handful of conventions, and misreading one subscript can send you down the wrong path entirely.

Subscript notation (x_t)

A subscript indexes a variable by position, time step, or category. In security AI work, you will encounter this constantly in sequential data:

x_t = the value of x at time step t

When a network intrusion detection model processes packet sequences, each x_t is a feature vector for the t-th packet in the flow. The subscript tells you where you are in the sequence.

Superscript notation (x^n)

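On a number or scalar variable, a superscript denotes an exponent. x^2 is x multiplied by itself. This appears everywhere in distance calculations, error functions, and polynomial feature engineering. (Later sections reuse the superscript position for matrix and set operations: A^T is a transpose, A^{-1} an inverse, and A^c a complement.)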

x^2 = x * x

When your anomaly detection model computes squared error between predicted and observed traffic volume, that is superscript notation at work.

Summation (Σ)

The summation symbol tells you to add up a sequence of terms:

Σ_{i=1}^{n} a_i = a_1 + a_2 + ... + a_n

Loss functions are summations. When a malware classifier computes cross-entropy loss across a batch of 256 samples, it sums the individual losses for each sample (and typically divides by the batch size to get a mean) to produce one number the optimiser can act on.

Norms (||…||)

A norm measures the size of a vector. The Euclidean norm (L2) is the most common:

||v|| = sqrt(v_1^2 + v_2^2 + ... + v_n^2)

Two other norms appear frequently:

||v||_1 = |v_1| + |v_2| + ... + |v_n|       (L1 norm / Manhattan distance)
||v||_∞ = max(|v_1|, |v_2|, ..., |v_n|)     (L-infinity norm)

In practice: when a UEBA (user and entity behaviour analytics) system flags an account because its activity vector is “far” from its historical baseline, it is computing the norm of the difference between the two vectors. L1 and L2 norms also appear in model regularisation, where they penalise large weights to prevent overfitting. L1 regularisation tends to zero out irrelevant features entirely, which is useful when you want a sparse model that tells you which log fields actually matter.
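
As a minimal sketch of the "distance from baseline" idea, the following computes all three norms of the difference between an observed activity vector and a baseline. The feature names and values are invented for illustration; a real UEBA product would scale its features first and choose its own metric.

```python
import math

def l1_norm(v):
    """Sum of absolute components (Manhattan)."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Square root of the sum of squared components (Euclidean)."""
    return math.sqrt(sum(x * x for x in v))

def linf_norm(v):
    """Largest absolute component."""
    return max(abs(x) for x in v)

# Hypothetical per-hour features: [logins, MB uploaded, distinct hosts touched]
baseline = [10.0, 50.0, 3.0]
observed = [13.0, 54.0, 3.0]

# "Distance from baseline" is the norm of the difference vector.
diff = [o - b for o, b in zip(observed, baseline)]

print("L1:", l1_norm(diff))
print("L2:", l2_norm(diff))
print("L-infinity:", linf_norm(diff))
```

The three norms give different answers for the same deviation, which is why a detection threshold is only meaningful alongside the norm it was tuned for.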

Logarithms and exponentials

These two function families underpin information theory, probability, and nearly every loss function you will encounter.

Logarithm base 2 (log2)

log2(8) = 3    (because 2^3 = 8)

Log base 2 measures information in bits. Entropy, the core metric in information theory, is conventionally computed with log2. When a decision tree in a threat classification model uses entropy as its splitting criterion, it picks the feature that maximises information gain, measured in bits. If your IDS model reports an entropy value for DNS query distributions, it is most likely using log2; check the documentation, because some libraries use the natural logarithm and report entropy in nats instead.
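
To make the entropy calculation concrete, here is a small sketch that measures the entropy, in bits, of two made-up DNS query logs. The domain names and counts are hypothetical; the only point is that an evenly spread distribution scores higher than a skewed one.

```python
import math
from collections import Counter

def shannon_entropy_bits(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical query logs: one dominated by a single domain, one evenly spread.
skewed = ["a.example"] * 7 + ["b.example"]
uniform = ["a.example", "b.example", "c.example", "d.example"] * 2

print("skewed: ", shannon_entropy_bits(skewed))
print("uniform:", shannon_entropy_bits(uniform))
```

Four equally likely domains give exactly 2 bits, because log2(4) = 2. The skewed log scores well under 1 bit because most queries are predictable.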

Natural logarithm (ln)

ln(e^2) = 2

The natural logarithm uses Euler’s number (e ≈ 2.718) as its base. Cross-entropy loss, the standard loss function for classification models, is conventionally written with natural logarithms. For each sample, the loss is -ln(p), where p is the probability the model assigned to the correct class. When your phishing detection model outputs a probability of 0.95 that an email is malicious and the email really is malicious, the loss is -ln(0.95) ≈ 0.05, which is small. If the email is actually benign, the model gave the correct class only 0.05, and the loss is -ln(0.05) ≈ 3.0. A wrong prediction with high confidence produces a large loss. That feedback is what pushes the model to calibrate.
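
The asymmetry between a confident correct prediction and a confident wrong one is easy to check numerically. This is a sketch of per-sample binary cross-entropy, not any particular framework's implementation; the batch values are made up.

```python
import math

def binary_cross_entropy(p_malicious, is_malicious):
    """Loss for one sample: -ln of the probability given to the true class."""
    p_true = p_malicious if is_malicious else 1.0 - p_malicious
    return -math.log(p_true)

# The same 0.95 "malicious" output, scored against two different ground truths.
confident_and_right = binary_cross_entropy(0.95, True)
confident_and_wrong = binary_cross_entropy(0.95, False)

# Hypothetical batch of (predicted probability of malicious, true label).
batch = [(0.95, True), (0.10, False), (0.60, True)]
total_loss = sum(binary_cross_entropy(p, y) for p, y in batch)
mean_loss = total_loss / len(batch)

print(confident_and_right, confident_and_wrong, mean_loss)
```

The wrong-but-confident case costs roughly sixty times more than the right-and-confident case, which is the pressure that discourages overconfident mistakes.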

Exponential function (e^x)

e^2 ≈ 7.389

The exponential function is the inverse of the natural logarithm. It appears in the softmax function, which converts raw model outputs (logits) into probabilities. When your malware classifier outputs a vector of scores for [benign, trojan, ransomware, worm], softmax exponentiates each score and normalises them so they sum to 1. The exponential amplifies differences: logits of 5.0 and 4.0 differ by only 1, but after exponentiation the first score is e ≈ 2.718 times the second.
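
A short sketch of softmax over four class scores follows. The logit values are hypothetical, and the max-subtraction step is a common numerical-stability trick rather than part of the mathematical definition.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["benign", "trojan", "ransomware", "worm"]
logits = [5.0, 4.0, 1.0, 0.5]  # hypothetical model outputs
probs = softmax(logits)

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

The ratio between the first two probabilities is e, whatever the other logits are, because softmax only rescales the exponentiated scores by a shared total.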

Exponential function, base 2 (2^x)

2^3 = 8

Base-2 exponentials appear in binary encoding, hash function analysis, and information-theoretic metrics. When assessing password strength or brute-force resistance, you express the search space as 2^n where n is the number of bits of entropy. A 128-bit key means 2^128 possible values. That number is why brute force does not work against properly implemented symmetric encryption.
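
To put a number on the brute-force claim, here is a rough sketch that converts a keyspace size into years of exhaustive search. The guessing rate is an assumption chosen for illustration, not a measurement of any real attacker.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def years_to_exhaust(bits, guesses_per_second):
    """Years needed to try every value in a keyspace of 2**bits."""
    return (2 ** bits) / guesses_per_second / SECONDS_PER_YEAR

# Assumed attacker speed: one trillion (10**12) guesses per second.
RATE = 10 ** 12

print(f"40 bits:  {years_to_exhaust(40, RATE):.2e} years")
print(f"128 bits: {years_to_exhaust(128, RATE):.2e} years")
```

At this assumed rate a 40-bit space falls in about a second, while a 128-bit space takes on the order of 10^19 years. Each extra bit doubles the work.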

Vectors, matrices, and the operations that run neural networks

If logarithms power the loss functions, linear algebra powers everything else. Every layer in a neural network is a matrix operation. Understanding this section means understanding what the model is actually doing to your data at each step.

Matrix-vector multiplication (A * v)

A * v = [[1, 2], [3, 4]] * [5, 6] = [17, 39]

This is the fundamental operation of a neural network layer. The matrix A contains the learned weights. The vector v is your input (or the output of the previous layer). The multiplication transforms v into a new representation. When a network-based IDS processes a feature vector representing a single network flow, the first layer multiplies that vector by its weight matrix. The result is a new vector that encodes learned patterns.
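
The worked example above can be reproduced in a few lines. This is a plain-Python sketch for reading purposes; real models use optimised libraries for the same operation.

```python
def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2], [3, 4]]   # stands in for a layer's learned weights
v = [5, 6]             # stands in for one input feature vector

print(matvec(A, v))  # [17, 39]
```

Each output element is the dot product of one row of A with v: 1*5 + 2*6 = 17 and 3*5 + 4*6 = 39.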

Matrix-matrix multiplication (A * B)

A * B = [[1, 2], [3, 4]] * [[5, 6], [7, 8]] = [[19, 22], [43, 50]]

This is the batched form of the previous operation. Instead of processing one input vector at a time, models process matrices where each row is a separate input. When your SIEM’s ML pipeline ingests a batch of 512 log entries simultaneously, it is performing matrix-matrix multiplication: the input batch matrix (one row per log entry) multiplied by the weight matrix.
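
The sketch below reproduces the 2 x 2 example and then applies the same function to a small batch with one row per input. The batch and weight values are made up, and the row-per-input layout is one common convention, not the only one.

```python
def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m), both lists of rows."""
    cols_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_of_B]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]

# Batch view: 3 inputs with 2 features each, times a 2 x 2 weight matrix.
X = [[1, 0], [0, 1], [2, 1]]   # hypothetical batch, one row per input
W = [[1, 2], [3, 4]]           # hypothetical weights
batch_out = matmul(X, W)
print(batch_out)
```

Row i of the output depends only on row i of the batch, so processing 512 entries at once gives the same numbers as processing them one at a time, just faster.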

Transpose (A^T)

A = [[1, 2], [3, 4]]
A^T = [[1, 3], [2, 4]]

Transposition swaps rows and columns. It appears in attention mechanisms (the backbone of transformer models), dot product calculations, and data reshaping. When a transformer-based log analysis model computes self-attention, it transposes the key matrix before multiplying it with the query matrix. That transpose is what allows the model to compute similarity scores between every pair of positions in the input sequence.
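
As a toy illustration of why the transpose is needed, the following computes raw similarity scores between every pair of positions by multiplying a query matrix with a transposed key matrix. The matrices are tiny made-up examples, and scaling and softmax are left out.

```python
def transpose(M):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m)."""
    cols_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_of_B]
            for row in A]

print(transpose([[1, 2], [3, 4]]))  # [[1, 3], [2, 4]]

# Hypothetical queries and keys: 3 sequence positions, 2 dimensions each.
Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 0], [0, 1], [1, 1]]

# Q is 3 x 2 and K is 3 x 2, so K must be transposed to 2 x 3 first.
scores = matmul(Q, transpose(K))
print(scores)
```

The result is a 3 x 3 matrix where entry (i, j) is the dot product of query i with key j, one score per pair of positions.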

Inverse (A^{-1})

A = [[1, 2], [3, 4]]
A^{-1} = [[-2, 1], [1.5, -0.5]]

The inverse of a matrix A is the matrix that, when multiplied by A, produces the identity matrix. In security analytics, matrix inversion appears in Mahalanobis distance calculations, which measure how far a data point is from a distribution while accounting for correlations between features. If your anomaly detection model uses Mahalanobis distance to flag unusual authentication patterns, it is inverting the covariance matrix of normal behaviour.
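
Here is a two-feature sketch of the Mahalanobis calculation, using a closed-form 2 x 2 inverse. The covariance matrix and test points are invented; production code would use a linear algebra library and handle near-singular matrices more carefully.

```python
import math

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T * cov^{-1} * (x - mean)) for two features."""
    inv = inverse_2x2(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    tmp = [sum(inv[i][j] * d[j] for j in range(2)) for i in range(2)]
    return math.sqrt(sum(d[i] * tmp[i] for i in range(2)))

print(inverse_2x2([[1, 2], [3, 4]]))  # [[-2.0, 1.0], [1.5, -0.5]]

# Hypothetical baseline: two positively correlated features.
mean = [0.0, 0.0]
cov = [[2.0, 1.0], [1.0, 2.0]]

with_trend = mahalanobis([1.0, 1.0], mean, cov)      # both features rise together
against_trend = mahalanobis([1.0, -1.0], mean, cov)  # features move apart
print(with_trend, against_trend)
```

Both test points are the same Euclidean distance from the mean, but the one that breaks the usual correlation scores as more anomalous.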

Determinant (det(A))

A = [[1, 2], [3, 4]]
det(A) = 1*4 - 2*3 = -2

The determinant is a scalar value, defined for square matrices, that tells you whether a matrix is invertible (non-zero determinant) and how the matrix scales space. A determinant of zero means the matrix collapses at least one dimension of information, which signals that its rows or columns are linearly dependent. If you are building a feature set for a threat classifier and the covariance matrix of your features has a near-zero determinant, some of your features are likely redundant. Remove them before training.
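
A quick sketch of the 2 x 2 determinant, applied to the worked example and to two matrices with redundant information. The covariance values are hypothetical.

```python
def det_2x2(M):
    """Determinant of a 2 x 2 matrix: a*d - b*c."""
    (a, b), (c, d) = M
    return a * d - b * c

print(det_2x2([[1, 2], [3, 4]]))  # -2

# Second row is exactly twice the first: linearly dependent, determinant 0.
dependent = [[1, 2], [2, 4]]
print(det_2x2(dependent))

# Hypothetical covariance of two features that are almost perfectly correlated.
near_redundant_cov = [[4.0, 1.99], [1.99, 1.0]]
print(det_2x2(near_redundant_cov))
```

The last matrix is technically invertible, but its determinant is tiny relative to its entries, which is the practical warning sign for redundant features.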

Trace (tr(A))

A = [[1, 2], [3, 4]]
tr(A) = 1 + 4 = 5

The trace is the sum of the diagonal elements. It equals the sum of the eigenvalues and appears in matrix decomposition methods and some regularisation techniques. In covariance analysis for anomaly detection, the trace of the covariance matrix gives you the total variance across all features, a quick measure of how spread out normal behaviour is.

Eigenvalues and eigenvectors

These concepts feel abstract until you see what they do in practice. They decompose a transformation into its fundamental directions and magnitudes.

Eigenvalue (λ) and eigenvector (v)

A * v = λ * v

An eigenvector of a matrix A is a vector whose direction does not change when A is applied to it. It only gets scaled by the eigenvalue λ. In security AI, the primary application is principal component analysis (PCA). When you have 200 features extracted from network flow data, PCA uses eigenvalues and eigenvectors of the covariance matrix to identify which directions in that 200-dimensional space capture the most variance. You keep the top eigenvectors (the ones with the largest eigenvalues) and discard the rest. The result is a lower-dimensional representation that retains the signal and drops the noise.

This is how some IDS systems reduce a massive feature space into something a model can process efficiently without losing the patterns that distinguish normal traffic from attack traffic.
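
For a two-feature case the eigenvalues can be computed by hand, which makes the PCA idea easy to verify. The sketch below assumes a symmetric 2 x 2 covariance matrix with made-up values; real PCA on 200 features would use a numerical library.

```python
import math

def eigenvalues_sym_2x2(M):
    """Eigenvalues of a symmetric 2 x 2 matrix, largest first."""
    (a, b), (_, d) = M
    trace = a + d
    det = a * d - b * b
    root = math.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

cov = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical covariance of two features
lam1, lam2 = eigenvalues_sym_2x2(cov)

# For this particular matrix the top eigenvector points along [1, 1].
v1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]
Av1 = [sum(a * x for a, x in zip(row, v1)) for row in cov]

# Fraction of total variance kept if only the first component is retained.
explained = lam1 / (lam1 + lam2)

print(lam1, lam2, explained)
print(Av1, [lam1 * x for x in v1])
```

Applying the matrix to v1 only rescales it by lam1, which is the defining property A * v = λ * v. The eigenvalues also sum to the trace of the matrix, so the ratio above is a share of total variance.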

Probability and statistics

Machine learning models are probabilistic. They do not output certainty. They output distributions, likelihoods, and confidence intervals. Understanding this section is understanding what your model’s output actually means.

Conditional probability (P(x | y))

P(Output | Input)

The probability of x given that y is true. Every classification model outputs a conditional probability. P(malicious | features) is what your phishing detector computes: the probability that an email is malicious given the observed features (sender reputation, URL structure, header anomalies, language patterns). Bayesian spam filters were among the earliest security applications of conditional probability, and the principle has not changed.

Expectation (E[X])

E[X] = Σ x_i * P(x_i)

The expected value is the probability-weighted average of all possible outcomes. In reinforcement learning for automated penetration testing, the agent selects actions that maximise expected reward. In risk quantification, expected loss is the sum, across all threat scenarios, of each scenario’s probability multiplied by its impact.
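
As a sketch of the risk-quantification use, the following sums probability times impact over a few threat scenarios. The scenarios, dollar figures, and probabilities are entirely invented, and each scenario is treated as a separate event whose expected loss simply adds to the others.

```python
def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    return sum(value * p for value, p in outcomes)

# Hypothetical annual threat scenarios: (impact in dollars, probability).
scenarios = [
    (500_000, 0.02),  # ransomware outage
    (50_000, 0.10),   # business email compromise
    (5_000, 0.50),    # commodity malware cleanup
]

expected_loss = expected_value(scenarios)
print(f"Expected annual loss: {expected_loss:,.0f}")
```

Note that the rare, high-impact scenario contributes the most to the total even though it is the least likely.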

Variance (Var(X)) and standard deviation (σ)

Var(X) = E[(X - E[X])^2]
σ(X) = sqrt(Var(X))

Variance measures how spread out a distribution is. Standard deviation is the square root of variance and is expressed in the same units as the data, which makes it more interpretable. In anomaly detection, these are baseline metrics. If the mean number of failed login attempts per hour is 12 with a standard deviation of 3, an hour with 45 failures is more than 10 standard deviations from normal. Your model should flag that. If it does not, the problem is not the maths. It is the threshold.
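
The failed-login example can be checked directly. This sketch uses the population variance to match the formula above; the hourly counts and the threshold of 3 standard deviations are assumptions for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance: E[(X - E[X])^2]."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical failed-login counts for six normal hours.
hourly_failures = [9, 12, 15, 12, 9, 15]
mu = mean(hourly_failures)
sigma = math.sqrt(variance(hourly_failures))
print(mu, sigma)

# The example from the text: mean 12, standard deviation 3, observation 45.
z = z_score(45, 12, 3)
THRESHOLD = 3.0  # assumed alerting threshold, in standard deviations
print(z, z > THRESHOLD)
```

With the figures from the text the observation sits 11 standard deviations out, so any reasonable threshold should catch it.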

Covariance (Cov(X, Y))

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

Covariance measures how two variables move together. Positive covariance means they increase together. Negative means one increases as the other decreases. In security analytics, understanding covariance between features helps you spot redundancy and correlation. If bytes_sent and bytes_received are highly covariant in your training data, a model may over-weight that relationship and miss attacks that break the pattern.

Correlation (ρ)

ρ(X, Y) = Cov(X, Y) / (σ(X) * σ(Y))

Correlation normalises covariance to a range of -1 to 1, making it comparable across features with different scales. A correlation of 0.98 between two features means they carry nearly identical information. In feature engineering for security models, correlation analysis is how you prune redundant inputs before they waste model capacity.
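
Covariance and correlation follow directly from the two formulas above. In this sketch the byte counts are fabricated, and population (divide by n) statistics are used throughout.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance: E[(X - E[X])(Y - E[Y])]."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical per-flow byte counts.
bytes_sent = [100, 200, 300, 400]
bytes_received = [1100, 2050, 2950, 4100]

print(covariance(bytes_sent, bytes_received))
print(correlation(bytes_sent, bytes_received))
```

The covariance is a large number that depends on the units, while the correlation lands just below 1 and can be compared directly against any other pair of features.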

Sets and classification logic

Set theory maps directly onto how detection systems categorise events.

Cardinality (|S|)

S = {1, 2, 3, 4, 5}
|S| = 5

The number of elements in a set. In security: the cardinality of your IoC (indicator of compromise) set, the number of unique source IPs in an alert cluster, the size of a user’s typical application set for behavioural baselining.

Union (∪)

A = {1, 2, 3}, B = {3, 4, 5}
A ∪ B = {1, 2, 3, 4, 5}

All elements in either set. When correlating alerts from two detection engines, the union of their alert sets gives you the total coverage. If engine A flags {event1, event2, event3} and engine B flags {event3, event4, event5}, the union is five distinct events.

Intersection (∩)

A = {1, 2, 3}, B = {3, 4, 5}
A ∩ B = {3}

Elements common to both sets. The intersection of two detection engines’ alert sets tells you where they agree. High intersection means redundancy. Low intersection means complementary coverage. Both are useful to know.

Complement (A^c)

U = {1, 2, 3, 4, 5}, A = {1, 2, 3}
A^c = {4, 5}

Everything not in A. In detection terms: if A is the set of known-good processes on a host, A^c relative to all observed processes is your set of unknowns. Allowlisting is complement logic applied to endpoint security.
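
Python's built-in set type maps one-to-one onto these operations. The event and process names below are placeholders.

```python
# Alerts raised by two hypothetical detection engines.
engine_a = {"event1", "event2", "event3"}
engine_b = {"event3", "event4", "event5"}

combined = engine_a | engine_b   # union: total coverage
agreed = engine_a & engine_b     # intersection: where both engines fired

print(len(combined), sorted(agreed))

# Complement relative to what was actually observed on a host.
observed = {"sshd", "nginx", "cron", "xmrig"}
known_good = {"sshd", "nginx", "cron"}
unknown = observed - known_good  # everything observed that is not allowlisted

print(sorted(unknown))
```

Python has no universal set, so the complement is always written as a difference from an explicit "everything" set, which here is the set of observed processes.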

Functions you will see in model architectures

max and min

max(4, 7, 2) = 7
min(4, 7, 2) = 2

The ReLU activation function, used in most modern neural networks, is defined as max(0, x). If the input is negative, the output is zero. If positive, it passes through unchanged. This simple operation is what gives neural networks their non-linearity. Without it (or a function like it), stacking layers would be pointless because any number of linear transformations collapse into a single linear transformation.

The min function appears in clipping operations, learning rate schedules, and threshold logic.
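
Both functions are one-liners in code. The clipping bounds below are arbitrary example values.

```python
def relu(x):
    """ReLU activation: negative inputs become zero, positives pass through."""
    return max(0.0, x)

def clip(x, low, high):
    """Clamp x into the range [low, high] using min and max together."""
    return max(low, min(x, high))

print([relu(x) for x in [-2.0, 0.0, 3.5]])
print(clip(7.2, -1.0, 1.0), clip(-3.0, -1.0, 1.0), clip(0.4, -1.0, 1.0))
```

Clipping of this kind is what keeps a gradient or a score inside a fixed range before it is used downstream.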

Function notation (f(x))

f(x) = x^2 + 2x + 1

A function maps an input to an output. Every neural network is a function, a composition of many smaller functions (layers). f(x) = model(input_features) is the abstraction that unifies everything in this reference. The maths above describes what happens inside that function.

Reciprocal (1/x)

1/5 = 0.2

The reciprocal of x is 1 divided by x, and dividing by a value is the same as multiplying by its reciprocal. It appears in learning rate calculations, normalisation (dividing by the number of samples or the norm of a vector), and attention score scaling. In the transformer attention mechanism, raw dot product scores are divided by sqrt(d_k) (the square root of the key dimension) to prevent the scores from growing too large. That division is a multiplication by the reciprocal 1/sqrt(d_k), and without it, the softmax would saturate and gradients would vanish.
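
The effect of the sqrt(d_k) scaling on softmax can be seen with three numbers. The raw scores and the key dimension of 64 are assumed values picked to make the arithmetic clean.

```python
import math

def softmax(scores):
    """Convert scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                        # assumed key dimension
raw_scores = [16.0, 8.0, 0.0]   # hypothetical unscaled dot products

# Dividing by sqrt(d_k) is multiplying by the reciprocal 1 / sqrt(d_k).
scaled_scores = [s * (1 / math.sqrt(d_k)) for s in raw_scores]

unscaled_probs = softmax(raw_scores)
scaled_probs = softmax(scaled_scores)

print(unscaled_probs)
print(scaled_probs)
```

Without scaling, almost all of the probability mass lands on one position, which is the saturation the text describes. After scaling, the other positions keep a meaningful share.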

Comparison operators

These are straightforward but appear constantly in threshold-based detection logic and conditional model behaviour.

| Operator | Meaning | Security example |
|---|---|---|
| `>=` | Greater than or equal to | Alert if `risk_score >= 0.85` |
| `<=` | Less than or equal to | Suppress if `confidence <= 0.3` |
| `==` | Equal to | Match if `protocol == "DNS"` |
| `!=` | Not equal to | Flag if `expected_hash != observed_hash` |

Every detection rule you write uses these. Every model threshold is a comparison operator applied to a probability or score.
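
The four table rows can be combined into a toy triage function. The field names come from the table's examples, but the rule order, the return values, and the event itself are hypothetical.

```python
def triage(event):
    """Return an action for one event using simple comparison rules."""
    # != : a hash mismatch takes priority over everything else.
    if event["expected_hash"] != event["observed_hash"]:
        return "flag"
    # >= : a high risk score raises an alert (0.85 itself counts).
    if event["risk_score"] >= 0.85:
        return "alert"
    # <= : low-confidence detections are suppressed.
    if event["confidence"] <= 0.3:
        return "suppress"
    # == : remaining DNS events match a protocol-specific rule.
    if event["protocol"] == "DNS":
        return "match"
    return "ignore"

event = {
    "risk_score": 0.91,
    "confidence": 0.8,
    "protocol": "DNS",
    "expected_hash": "abc123",
    "observed_hash": "abc123",
}
print(triage(event))
```

The order of the checks matters as much as the thresholds: the same event can produce different actions if the rules are evaluated in a different sequence.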

Where this connects

None of these operations exist in isolation. A single forward pass through a malware classification model chains them together: matrix multiplications transform input features, ReLU (max) introduces non-linearity, softmax (exponentials and reciprocals) produces probabilities, cross-entropy loss (logarithms and summation) measures error, and the gradient (derivatives, not covered here, but coming in a later article) tells the optimiser which weights to adjust.

Understanding each piece means you can read a model’s architecture and know what it is doing to your data at every step. That is the difference between configuring a tool and understanding it.
