Naive Bayes

Append 25 carefully chosen words to a spam email and a Naive Bayes filter will classify it as legitimate. The classifier does not weigh those words against the existing content. It multiplies their probabilities independently, one by one, each token shifting the posterior. The model’s founding assumption, that features do not influence each other, is the mechanism that makes it trivially exploitable.

This is the seventh entry in the AI red teaming series. Previous articles laid the groundwork (AI fundamentals, the maths behind the models, supervised learning) before working through linear regression, logistic regression, and decision trees. Each of those models exposed a different flavour of assumption-as-attack-surface. In Naive Bayes, the algorithm’s core mathematical property is the exact same mechanism an attacker uses for evasion, making the pattern absurdly direct.

Bayes’ theorem in 30 seconds

Before getting to the classifier, you need the theorem it runs on. Bayes’ theorem is a formula for updating a belief when new evidence arrives:

P(A|B) = [P(B|A) * P(A)] / P(B)

The posterior, P(A|B), is the probability of A given that B happened. The likelihood, P(B|A), is the probability of observing B if A is true. The prior, P(A), is how likely A was before you saw any evidence, and the marginal, P(B), is the overall probability of B happening at all.

A concrete example makes this easier. Suppose a disease affects 1% of a population. A test for it has a 95% true positive rate and a 5% false positive rate. Someone tests positive. What is the actual probability they have the disease?

P(positive) = (0.95 * 0.01) + (0.05 * 0.99) = 0.059
P(disease | positive) = (0.95 * 0.01) / 0.059 ≈ 0.161

A 95% accurate test, and the probability is only 16.1%. The low prior (1% prevalence) dominates the result. This is the core mechanic that Naive Bayes classifiers inherit: priors and likelihoods combine into a posterior that can be counterintuitive until you work through the maths. For a red teamer, the takeaway is that probability updates are multiplicative, and any input that shifts the likelihood term shifts the output.
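The arithmetic above can be checked in a few lines. This is a minimal sketch: the function name `posterior` is illustrative, and the 30% prevalence in the second call is a hypothetical figure added only to show how much the prior moves the result.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(A | B) for a binary hypothesis A and a positive observation B."""
    # Marginal P(B): every way a positive result can occur.
    marginal = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / marginal


# Figures from the example: 1% prevalence, 95% true positive rate, 5% false positive rate.
print(round(posterior(0.01, 0.95, 0.05), 3))  # 0.161

# Same test, hypothetical 30% prevalence: the prior alone changes the answer.
print(round(posterior(0.30, 0.95, 0.05), 3))  # 0.891
```

The test itself is identical in both calls. Only the prior differs, which is the lever the rest of this article keeps returning to.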

How Naive Bayes classifies

A Naive Bayes classifier applies Bayes’ theorem to assign a class label to an input by computing the posterior probability of each possible class, then picking the highest one. In a spam filter, that means comparing the posterior probability of “spam” with the posterior probability of “ham” given the words in the message.

The process has four steps.

First, the model calculates prior probabilities from training data. If 20% of training emails are spam, P(spam) = 0.2. This is the baseline before any features are considered.

Second, it calculates likelihoods representing the probability of each feature appearing given each class. In text classification, this means computing how often each word occurs in spam versus ham emails. The word “free” might have P(free | spam) = 0.08 and P(free | ham) = 0.01.

Third, for a new input, it applies Bayes’ theorem by multiplying the prior by every feature’s likelihood for each class. This is where the “naive” part comes in. The model assumes every feature is conditionally independent given the class. It treats the probability of seeing “free” as entirely unrelated to the probability of seeing “viagra” in the same message, provided you already know the class label.

Fourth, it assigns the class with the highest posterior.

The independence assumption is mathematically convenient and computationally cheap. It reduces a complex joint probability calculation to a product of individual feature probabilities. For a vocabulary of 10,000 words, Naive Bayes needs to estimate 10,000 individual conditional probabilities per class instead of modelling every possible word combination.
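The four steps can be sketched in plain Python. Everything here is illustrative: the five-message corpus is made up, Laplace smoothing (`alpha`) is an assumption added so unseen words do not zero out a class, and words outside the training vocabulary are simply skipped. The scores are summed in log space, which is equivalent to multiplying the probabilities but avoids underflow.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (label, text). Returns (priors, word_counts, vocab)."""
    class_counts = Counter(label for label, _ in messages)
    word_counts = {label: Counter() for label in class_counts}
    for label, text in messages:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    # Step 1: priors are the class frequencies in the training data.
    priors = {c: n / len(messages) for c, n in class_counts.items()}
    return priors, word_counts, vocab


def log_posteriors(text, priors, word_counts, vocab, alpha=1.0):
    """Unnormalised log posterior per class for one message."""
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: ignore out-of-vocabulary words
            # Step 2: smoothed likelihood P(word | class).
            likelihood = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            # Step 3: "multiply" by adding logs, one independent term per token.
            score += math.log(likelihood)
        scores[label] = score
    return scores


def classify(text, model):
    scores = log_posteriors(text, *model)
    # Step 4: pick the class with the highest posterior.
    return max(scores, key=scores.get)


# Hypothetical toy corpus.
TRAIN = [
    ("spam", "free prize claim now"),
    ("spam", "free cash prize"),
    ("ham", "see you at work later"),
    ("ham", "call me when you get home"),
    ("ham", "lunch at work tomorrow"),
]
model = train(TRAIN)
print(classify("claim your free prize", model))  # spam
print(classify("see you at work", model))        # ham
```

Note that the inner loop never looks at more than one token at a time. That loop is the independence assumption in code form.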

Three variants and what they assume

The specific probability distribution assumed for each feature determines which variant of Naive Bayes is used:

Gaussian Naive Bayes assumes continuous features follow a normal distribution. If the model is classifying network traffic by packet size and inter-arrival time, it models those features as bell curves per class. This is common in anomaly detection and intrusion detection systems where the input features are numerical measurements.

Multinomial Naive Bayes works with discrete frequency counts. In text classification, the features are word occurrence counts. This is the variant behind most Bayesian spam filters and is the one that matters most in the evasion context, because its inputs (token frequencies) are directly manipulable by an attacker.

Bernoulli Naive Bayes uses binary features, meaning the word is either present or absent without any consideration of its frequency. Some document classifiers and simpler detection systems use this approach when the presence of a feature matters more than how many times it appears.

The choice of variant dictates what an attacker needs to manipulate. Against Multinomial Naive Bayes, you inject or inflate token counts. Against Gaussian Naive Bayes, you shift numerical feature values toward the target class’s distribution. Against Bernoulli, you add or remove specific feature flags.
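To make the difference concrete, here is a sketch of how each variant scores a feature under one class. The function names and all numbers are hypothetical, and the multinomial version drops the multinomial coefficient because it is the same for every class and so does not affect which class wins.

```python
import math


def gaussian_likelihood(x, mean, std):
    """Gaussian NB: density of a continuous value under the class's bell curve."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def multinomial_log_likelihood(counts, word_probs):
    """Multinomial NB: every occurrence of a word adds its log-probability again."""
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


def bernoulli_log_likelihood(present, word_probs):
    """Bernoulli NB: each vocabulary word contributes, whether present or absent."""
    return sum(
        math.log(p if w in present else 1.0 - p)
        for w, p in word_probs.items()
    )


# Hypothetical per-class parameters for the "spam" class.
spam_word_probs = {"free": 0.08, "work": 0.01}

# Attacker's lever, multinomial: repeat or add tokens to change the counts.
print(multinomial_log_likelihood({"free": 2, "work": 1}, spam_word_probs))

# Attacker's lever, Bernoulli: only presence or absence matters, repeats do nothing.
print(bernoulli_log_likelihood({"free"}, spam_word_probs))

# Attacker's lever, Gaussian: move a numeric feature toward the target class mean.
print(gaussian_likelihood(1400.0, mean=1200.0, std=300.0))
```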

Why red teamers should care

Naive Bayes classifiers are deployed in enough security-adjacent systems that understanding their failure mode is operationally useful:

Spam and phishing filters. Most Bayesian email filtering engines, including SpamAssassin, Bogofilter, and Rspamd, use Naive Bayes as a scoring component. Even when wrapped in larger ensemble systems, the Bayesian component’s score contributes to the final verdict.
Network intrusion detection. Naive Bayes models appear in academic and some commercial NIDS implementations, classifying network flows as malicious or benign based on extracted features like packet length, protocol flags, and payload entropy.
Malware classification. Static analysis pipelines sometimes use Naive Bayes to classify binaries based on extracted feature vectors: imported API calls, section entropy, header metadata.
Prompt injection detection. Recent research, including work published at the end of 2025, has explored using Naive Bayes classifiers as lightweight filters in front of large language models, classifying user prompts as benign or malicious before they reach the model.

In each case, the classifier sits on a decision boundary. An attacker who understands how that boundary is computed can move inputs across it.

The GoodWords attack

The GoodWords attack is the canonical evasion technique against Naive Bayes text classifiers, first formally described by Daniel Lowd and Christopher Meek at the 2005 Conference on Email and Anti-Spam.

The mechanism is simple. Because Naive Bayes computes posteriors as a product of independent feature likelihoods, every feature contributes to the final score independently. If an attacker knows which tokens are far more likely under the ham (legitimate) class than under spam, appending those tokens to a spam message will push the posterior toward ham, one multiplication at a time.

In a white-box scenario (where the attacker has access to the model’s parameters), the attack is almost trivially effective. Extract the feature log-probabilities. Sort by the ratio of ham likelihood to spam likelihood. Take the top-ranked tokens. Append them to the spam payload.
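Those four steps translate almost line for line into code. The sketch below is not Lowd and Meek's implementation: the probability table, the prior, and the function names are all hypothetical, and it assumes a multinomial model where unknown tokens are ignored.

```python
import math

# Hypothetical white-box parameters: P(token | class).
PROB = {
    "spam": {"free": 0.08, "prize": 0.05, "later": 0.002,
             "work": 0.003, "home": 0.004, "ask": 0.003},
    "ham":  {"free": 0.01, "prize": 0.002, "later": 0.03,
             "work": 0.05, "home": 0.03, "ask": 0.02},
}
PRIOR = {"spam": 0.2, "ham": 0.8}

# Per-token log-likelihood ratio: positive values favour ham.
LOG_RATIO = {t: math.log(PROB["ham"][t]) - math.log(PROB["spam"][t])
             for t in PROB["ham"]}


def log_odds_ham(tokens):
    """log P(ham | tokens) - log P(spam | tokens); positive means 'ham'."""
    score = math.log(PRIOR["ham"]) - math.log(PRIOR["spam"])
    return score + sum(LOG_RATIO.get(t, 0.0) for t in tokens)


def rank_good_words():
    """Tokens that favour ham, strongest first."""
    good = [t for t, r in LOG_RATIO.items() if r > 0]
    return sorted(good, key=LOG_RATIO.get, reverse=True)


def good_words_attack(tokens, max_words=25):
    """Append top-ranked ham tokens until the verdict flips or the budget runs out."""
    ranked = rank_good_words()
    padded = list(tokens)
    if not ranked:
        return padded
    i = 0
    while log_odds_ham(padded) <= 0 and i < max_words:
        padded.append(ranked[i % len(ranked)])  # repeats are allowed against count features
        i += 1
    return padded


payload = ["free", "prize", "free", "prize"]
evasive = good_words_attack(payload)
print(log_odds_ham(payload) > 0, log_odds_ham(evasive) > 0)  # False True
print(evasive[len(payload):])  # ['work', 'later', 'home', 'ask']
```

The original payload is untouched. With these made-up numbers, four appended tokens are enough to outweigh four spam tokens.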

Research by Lowd and Meek found that in a passive attack (no feedback from the filter), appending 150 carefully chosen words was enough to get 50% of blocked spam past a Naive Bayes filter. In an active attack (where the attacker can query the filter and observe the result), that number dropped to 30 words.

Hack The Box’s AI Evasion Foundations module operationalises this by providing a lab environment for hands-on practice. In their implementation, injecting approximately 25 ham-dominant tokens into a spam SMS achieves a 100% evasion rate against a Multinomial Naive Bayes classifier. Tokens like “later”, “work”, “ask”, and “home” are so much more likely under ham than under spam that the classifier’s spam posterior collapses entirely. The spam content is still there. The model cannot see it through the noise.

This works because of the independence assumption. The model does not ask “why are these unrelated words sitting at the end of an otherwise spammy message?” It cannot ask that question. Each token is evaluated in isolation, and the ham tokens overwhelm the spam tokens through sheer multiplicative weight.

Bayesian poisoning

GoodWords operates at inference time. Bayesian poisoning targets the training data itself.

The technique works by sending emails to the target that contain large volumes of legitimate text alongside spam content. If the user (or automated system) marks these messages as non-spam during training, the filter learns to associate spam-indicator words with the ham class. Over time, the filter’s probability estimates degrade.
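The effect on the learned probabilities can be shown with a toy estimate. This is an illustration only: the messages are invented, the estimator is a Laplace-smoothed frequency count (an assumption), and real filters use their own tokenisation and update rules.

```python
from collections import Counter


def likelihood(word, messages, vocab, alpha=1.0):
    """Laplace-smoothed P(word | class) from that class's training messages."""
    counts = Counter(w for m in messages for w in m.lower().split())
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))


# Hypothetical training data.
ham = ["see you at work later", "call me when you get home"]
spam = ["free prize claim now", "free cash prize"]

# Poisoned messages: spam words mixed with benign text, then marked as non-spam.
poison = ["free prize see you at work later call me"] * 3

vocab = {w for m in ham + spam + poison for w in m.lower().split()}

before = likelihood("free", ham, vocab)
after = likelihood("free", ham + poison, vocab)
print(round(before, 4), round(after, 4))  # 0.0385 0.0755
```

Three mislabelled messages roughly double the filter's estimate of how "ham-like" the word "free" is, which is exactly the degradation described above.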

There are two attack types. Type II attacks try to get spam delivered by attaching benign content. Type I attacks try to increase false positives by polluting the filter’s model of what “ham” looks like, eventually causing it to misclassify legitimate messages as spam. The Type I variant is the more disruptive of the two: a filter that blocks real email tends to get switched off.

At the 2004 MIT Spam Conference, John Graham-Cumming demonstrated that one machine learning spam filter could be used to defeat another by automatically learning which words to append. A separate Bayesian system, trained on the target filter’s feedback, iteratively identified effective poison words. The countermeasure is straightforward (disable remote tracking pixels that provide delivery confirmation), but the principle applies beyond email: any system that retrains on user-labelled data is exposed to the same feedback loop.

What makes Naive Bayes structurally fragile

The independence assumption is the root cause of every attack described above. But a few secondary properties amplify the problem:

Multiplicative likelihood accumulation. Because posteriors are products of individual likelihoods, injecting high-confidence tokens for the wrong class creates a compounding effect. Each additional GoodWord does not just add to the score; it multiplies it. This makes the evasion curve steep: a small number of well-chosen tokens produces a large shift.
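Writing the comparison between the two classes in log-odds form makes the compounding explicit. This is the standard Naive Bayes decomposition for a message with tokens w_1 to w_n; the marginal cancels because it is the same for both classes.

```latex
\log\frac{P(\text{ham}\mid w_1,\dots,w_n)}{P(\text{spam}\mid w_1,\dots,w_n)}
  = \log\frac{P(\text{ham})}{P(\text{spam})}
  + \sum_{i=1}^{n}\log\frac{P(w_i\mid\text{ham})}{P(w_i\mid\text{spam})}
```

Using the earlier figures, one occurrence of “free” contributes ln(0.01 / 0.08) ≈ -2.08 to the sum. A token that is, hypothetically, ten times likelier in ham than in spam contributes ln(10) ≈ +2.30, so a single such token more than cancels one “free”. Each appended token is a fixed step in log-odds, which is a fixed multiplier on the odds themselves.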

No sequence or context awareness. Naive Bayes treats a message as an unordered bag of words. It cannot detect that 25 benign tokens have been appended to the end of a message in a block. A human reading the email would notice immediately. The model has no mechanism to notice at all.

Transparent feature weighting. In a white-box scenario, the model’s entire decision logic is a lookup table of per-feature log-probabilities per class. There is no hidden layer, no nonlinear transformation, no embedding space to reverse-engineer. The model is fully legible, and full legibility means full exploitability.

Sensitivity to class distribution. If the training corpus is skewed (more ham than spam, or vice versa), the prior probability alone can dominate classification. An attacker who knows the training distribution can estimate how much feature manipulation is needed to flip a prediction without ever seeing the model’s parameters.

Defences (and their limits)

If you are defending a system that uses Naive Bayes, there are specific mitigations worth implementing. None of them eliminate the structural weakness, but they raise the cost of exploitation.

Feature hashing and n-gram analysis. Instead of individual word tokens, use multi-word sequences (bigrams, trigrams) as features. The phrase “free money” occurring in a message is harder to dilute than the independent tokens “free” and “money” because the attacker has to inject whole ham-associated n-grams, not isolated words, to offset it. This partially relaxes the independence assumption in a useful direction, at the cost of a larger feature space, which feature hashing can keep bounded.
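A minimal sketch of n-gram feature extraction, assuming whitespace tokenisation and lowercasing (both simplifications; the function names are illustrative):

```python
def ngrams(text, n):
    """Return the list of n-token sequences in text, in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Unigrams up to max_n-grams, concatenated into one feature list."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(text, n))
    return feats


print(ngrams("Free money now", 2))         # ['free money', 'money now']
print(ngrams("later work ask home", 2))    # ['later work', 'work ask', 'ask home']
```

The second call shows why a block of appended good words is less effective here: the bigrams it produces, such as "work ask", are word pairs the attacker did not choose and that may never have appeared in legitimate training mail.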

Rate-limited retraining with holdout validation. If the system retrains on user-labelled data, rate-limit how quickly new labels influence the model and validate against a held-out set. Sudden shifts in feature probability distributions are a signal that the training pipeline is being poisoned.
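One way to watch for such shifts is to compare per-feature probabilities before and after a retraining run. The sketch below is an assumption about how that check might look: the function name, the threshold of 1.0 (in natural-log units), and the sample probabilities are all hypothetical.

```python
import math


def shifted_features(old_probs, new_probs, threshold=1.0):
    """Return {feature: shift} where log P(feature | class) moved by more than threshold."""
    flagged = {}
    for feature, old_p in old_probs.items():
        new_p = new_probs.get(feature)
        if new_p is None:
            continue  # feature dropped from the model; handle separately
        shift = math.log(new_p) - math.log(old_p)
        if abs(shift) > threshold:
            flagged[feature] = shift
    return flagged


# Hypothetical P(word | ham) before and after a retraining batch.
old_ham = {"free": 0.01, "work": 0.040, "home": 0.030}
new_ham = {"free": 0.05, "work": 0.041, "home": 0.029}

print(sorted(shifted_features(old_ham, new_ham)))  # ['free']
```

A flagged feature is a prompt to inspect the batch and check held-out accuracy before promoting the new model, not proof of poisoning on its own.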

Ensemble classification. Do not rely on Naive Bayes as the sole classifier. Most mature spam filters wrap it into an ensemble with rule-based checks, header analysis, reputation scoring, and sometimes a secondary ML model with different assumptions. An attacker who evades one component still needs to evade the others.

Feature anomaly detection. Monitor for inputs with suspiciously high concentrations of low-information tokens. A message that contains 25 common English words appended in a block after the actual content should trigger a secondary inspection, even if the Naive Bayes score says “ham.”
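A crude version of that check scores the end of a message separately from the rest. This is a heuristic sketch under stated assumptions: `log_ratio` is a hypothetical table of per-token log-likelihood ratios (ham over spam), and `tail_len` and `margin` are arbitrary illustrative values.

```python
def tail_is_suspicious(tokens, log_ratio, tail_len=25, margin=3.0):
    """Flag messages whose body leans spam while the trailing block leans ham.

    log_ratio maps token -> log P(token | ham) - log P(token | spam).
    """
    if len(tokens) <= tail_len:
        return False
    head, tail = tokens[:-tail_len], tokens[-tail_len:]
    head_score = sum(log_ratio.get(t, 0.0) for t in head)
    tail_score = sum(log_ratio.get(t, 0.0) for t in tail)
    return head_score < -margin and tail_score > margin


# Hypothetical log-likelihood ratios.
log_ratio = {"free": -2.0, "prize": -3.0, "work": 2.5, "later": 2.5}

padded_spam = ["free", "prize"] * 3 + ["work", "later"] * 2
ordinary_ham = ["work", "later"] * 5

print(tail_is_suspicious(padded_spam, log_ratio, tail_len=4))   # True
print(tail_is_suspicious(ordinary_ham, log_ratio, tail_len=4))  # False
```

This only catches good words appended as a block. An attacker who interleaves them with the payload would pass, so it belongs alongside the other mitigations rather than in place of them.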

Disable feedback channels. For Bayesian poisoning specifically, the active attack variant depends on the attacker knowing whether their messages were delivered. Suppressing non-delivery reports, disabling read receipts, and blocking tracking pixels removes the feedback loop that makes active poisoning efficient.

Where this fits in the series

Previous articles established that model assumptions are attack surfaces. Linear regression’s assumption of linearity makes it steerable through outliers. Logistic regression’s sigmoid decision boundary is vulnerable to feature-space manipulation that shifts predictions across the 0.5 threshold. Decision trees’ axis-aligned splits make them exploitable one feature at a time. Naive Bayes follows the same pattern, except the assumption (feature independence) maps so directly to the attack technique (independent token injection) that the evasion almost writes itself.

The next article will cover support vector machines, where the attack surface shifts from feature-level manipulation to the geometry of the decision boundary itself.

The real lesson

Naive Bayes is still deployed in production security systems because it is fast, cheap, and accurate enough for many use cases. That “enough” is doing a lot of work. The model performs well on average inputs because most inputs are not adversarial. The moment an attacker understands the independence assumption, the model’s accuracy on adversarial inputs drops to whatever the attacker wants it to be. The question for any deployment is whether the threat model includes an adversary who has read this article.
