Spam classification: Naive Bayes filters

In 2005, Daniel Lowd and Christopher Meek demonstrated that appending around 150 well-chosen words to blocked spam emails was enough to get half of them past a Naive Bayes filter. The filter did not malfunction. It performed exactly as its mathematics demanded. The independence assumption that makes Naive Bayes fast and trainable on small datasets is the same assumption that makes it trivially exploitable once you understand what each token contributes to the posterior probability.

Earlier in this series, we covered Naive Bayes as a classification algorithm and mapped its independence assumption as an attack surface. This entry puts that theory into practice. Spam filtering is where Bayesian classification was first deployed at scale, and it is where the first formalised adversarial attacks against machine learning were tested. If you want to understand how adversarial ML works in the real world, this is the case study that started it all.

The classifier behind the inbox

Bayesian spam filtering entered mainstream use after Paul Graham published “A Plan for Spam” in 2002, though the foundational academic work came from Sahami, Dumais, Heckerman, and Horvitz in 1998. The idea is straightforward. The filter treats each email as a bag of tokens (words, header fragments, formatting artefacts) and maintains a probability table recording how often each token appears in spam versus legitimate mail. When a new email arrives, the filter calculates the probability that the message is spam given the tokens it contains, and compares that against the probability that it is legitimate.
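As a rough illustration of the probability table described above, here is a minimal sketch in Python. The function names, the five-message corpus, the whitespace tokeniser, and the add-one (Laplace) smoothing are all assumptions made for the example. Production filters tokenise headers and formatting as well, and each has its own estimator.

```python
from collections import Counter

def train_token_table(messages):
    """Count, per class, how many messages contain each token.

    `messages` is an iterable of (text, label) pairs, label "spam" or "ham".
    """
    doc_counts = {"spam": Counter(), "ham": Counter()}
    class_totals = Counter()
    for text, label in messages:
        class_totals[label] += 1
        # set(): record presence per message, not raw word frequency
        doc_counts[label].update(set(text.lower().split()))
    return doc_counts, class_totals

def token_likelihood(token, label, doc_counts, class_totals):
    """Estimate P(token | label) with add-one smoothing (an assumption)."""
    return (doc_counts[label][token] + 1) / (class_totals[label] + 2)

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    ("cheap watches buy now", "spam"),
    ("buy cheap pills now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("conference agenda attached", "ham"),
    ("research meeting notes", "ham"),
]

doc_counts, class_totals = train_token_table(corpus)
p_cheap_spam = token_likelihood("cheap", "spam", doc_counts, class_totals)
p_cheap_ham = token_likelihood("cheap", "ham", doc_counts, class_totals)
print(p_cheap_spam, p_cheap_ham)
```

Smoothing keeps a token that has never appeared in one class from forcing that class's likelihood to zero, which would otherwise let a single unseen word decide the classification.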

The calculation relies on Bayes’ theorem:

P(Spam | Features) = (P(Features | Spam) * P(Spam)) / P(Features)

P(Spam) is the prior probability that any given email is spam, learned from the training corpus. P(Features | Spam) is the likelihood of observing this particular set of tokens in a spam email. P(Features) is the total probability of seeing these tokens in any email, spam or otherwise. The output, P(Spam | Features), is the posterior probability, the updated belief that the email is spam after considering the evidence.

The naive assumption and why it matters

Calculating P(Features | Spam) directly is computationally expensive because it requires modelling the joint probability distribution of every possible combination of tokens. Naive Bayes sidesteps this by assuming that each token’s presence is statistically independent of every other token, given the class label. Under that assumption, the joint likelihood decomposes into a product of individual token probabilities:

P(Features | Spam) = P(token_1 | Spam) * P(token_2 | Spam) * ... * P(token_n | Spam)

This is mathematically convenient and computationally cheap. It is also wrong. Words in natural language are not independent. “Nigerian” and “prince” co-occur far more often in spam than chance would predict, and “meeting” and “agenda” cluster together in legitimate corporate email. But the classifier does not care about co-occurrence. It evaluates each token in isolation, multiplies the probabilities together, and outputs a classification.
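The product of per-token probabilities can be sketched in a few lines of Python. The token values below are hypothetical. The sketch also shows a common implementation detail not covered above: summing log-probabilities instead of multiplying raw ones, because a long message multiplies many small numbers and can underflow floating-point range.

```python
import math

def log_likelihood(tokens, token_probs):
    """Return log P(tokens | class) as a sum of per-token log-probabilities."""
    return sum(math.log(token_probs[t]) for t in tokens)

# Hypothetical P(token | Spam) values for three tokens.
spam_probs = {"nigerian": 0.05, "prince": 0.04, "transfer": 0.10}
tokens = ["nigerian", "prince", "transfer"]

# The naive product, exactly as in the formula.
direct_product = 1.0
for t in tokens:
    direct_product *= spam_probs[t]

# The same quantity computed in log space.
log_sum = log_likelihood(tokens, spam_probs)

# A long message: 200 tokens, each with likelihood 0.01.
underflowed = 1.0
for _ in range(200):
    underflowed *= 0.01  # true value is 1e-400, below double-precision range
long_log_sum = 200 * math.log(0.01)

print(direct_product, math.exp(log_sum))
print(underflowed, long_log_sum)
```

Both forms encode the same independence assumption. In log space each token simply adds a fixed amount to the class score, regardless of what else is in the message.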

For legitimate use, the independence assumption works well enough. Naive Bayes spam filters consistently achieve accuracy above 95% on standard corpora, which is why they remain embedded in tools like SpamAssassin, Bogofilter, and Rspamd. For an attacker, the same assumption creates a precise, exploitable mechanism.

Walking through the maths

Consider a simplified example. An email contains two features, F1 and F2, and the filter has learned the following from its training data:

Parameter	Value
P(Spam)	0.3
P(Not Spam)	0.7
P(F1 given Spam)	0.4
P(F2 given Spam)	0.5
P(F1 given Not Spam)	0.2
P(F2 given Not Spam)	0.3

Under the independence assumption, the joint likelihoods are:

P(F1, F2 | Spam)     = 0.4 * 0.5 = 0.20
P(F1, F2 | Not Spam) = 0.2 * 0.3 = 0.06

The total evidence term, P(F1, F2), sums the weighted likelihoods across both classes.

P(F1, F2) = (0.20 * 0.3) + (0.06 * 0.7) = 0.06 + 0.042 = 0.102

Now Bayes’ theorem gives the posterior for each class.

P(Spam | F1, F2)     = 0.06 / 0.102 ≈ 0.588
P(Not Spam | F1, F2) = 0.042 / 0.102 ≈ 0.412

The classifier picks the class with the higher posterior. In this case, 0.588 beats 0.412, and the email is classified as spam.
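The arithmetic above can be checked with a short Python sketch. The dictionary layout, the function name, and the label "ham" for legitimate mail are choices made for this example. The numbers are the ones from the table.

```python
# Values taken from the parameter table above.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"F1": 0.4, "F2": 0.5},
    "ham": {"F1": 0.2, "F2": 0.3},
}

def posterior(features, priors, likelihoods):
    """Return P(class | features) for each class under the naive assumption."""
    joint = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            score *= likelihoods[label][feature]  # independence: multiply
        joint[label] = score
    evidence = sum(joint.values())  # P(F1, F2)
    return {label: score / evidence for label, score in joint.items()}

post = posterior(["F1", "F2"], priors, likelihoods)
predicted = max(post, key=post.get)
print(round(post["spam"], 3), round(post["ham"], 3), predicted)
```

The evidence term is the same for both classes, so it only rescales the scores. The classification depends entirely on which prior-weighted product is larger.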

The critical thing to notice is how each token’s probability contributes multiplicatively to the final score. If an attacker can introduce tokens with high P(token | Not Spam) values, those tokens multiply into the legitimate side of the equation and drag the posterior away from spam. The classifier has no way to distinguish between a token that genuinely belongs in the email and one that was injected specifically to manipulate the posterior.

The GoodWords attack

This is exactly what Lowd and Meek formalised in their 2005 paper at the Conference on Email and Anti-Spam. The GoodWords attack works by appending tokens that are strongly associated with legitimate email (words like “meeting”, “university”, “conference”, “research”) to a spam message. Because Naive Bayes treats each token independently, the injected tokens shift the posterior probability toward “not spam” without affecting how the filter evaluates the original spam content.
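A toy Python sketch makes the mechanism concrete. Everything here is hypothetical: the token table, the probabilities, the prior, and the choice to ignore unknown tokens are invented for illustration and are not taken from Lowd and Meek's experiments.

```python
import math

# Hypothetical token table: token -> (P(token | spam), P(token | not spam)).
TOKEN_PROBS = {
    "cheap": (0.30, 0.02),
    "watches": (0.20, 0.01),
    "buy": (0.25, 0.05),
    "meeting": (0.01, 0.20),
    "university": (0.01, 0.15),
    "conference": (0.01, 0.15),
    "research": (0.02, 0.20),
}
P_SPAM = 0.3  # assumed prior

def spam_posterior(tokens):
    """Return P(spam | tokens) under the naive independence assumption."""
    log_spam = math.log(P_SPAM)
    log_ham = math.log(1.0 - P_SPAM)
    for token in tokens:
        if token in TOKEN_PROBS:  # tokens missing from the table are ignored
            p_spam, p_ham = TOKEN_PROBS[token]
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
    # Normalise the two class scores into a posterior for spam.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

spam = ["cheap", "watches", "buy"]
good_words = ["meeting", "university", "conference", "research"]

before = spam_posterior(spam)
# Posterior after appending the first k good words, k = 1..4.
trajectory = [spam_posterior(spam + good_words[:k])
              for k in range(1, len(good_words) + 1)]

print(f"original message: {before:.3f}")
for k, p in enumerate(trajectory, start=1):
    print(f"+{k} good words:    {p:.3f}")
```

In this toy table the posterior drops below 0.5 once the third word is appended. The spam tokens are still present and still scored exactly as before. They are simply outvoted.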

Lowd and Meek tested two variants. In a passive attack, the attacker has no access to the target filter and simply guesses which words are likely to score as legitimate, using dictionaries or word frequency lists. In an active attack, the attacker can send test messages to the filter and observe which ones get through, effectively reverse-engineering the probability table. Their results showed that passive attacks required around 150 injected words to get 50% of blocked spam through the filter. Active attacks achieved the same result with roughly 30 words, because the attacker could target the highest-value tokens directly.
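The active variant can be illustrated with a simplified Python sketch. It is not the algorithm from the paper. It assumes a hypothetical black-box oracle, is_blocked, and a message that is only barely blocked, then tests candidate words one at a time to see which ones flip the decision. The hidden scores are invented for the example.

```python
import math

# Hypothetical hidden filter state: per-token log(P(t | spam) / P(t | ham)).
# The attacker never reads this table and only calls is_blocked().
_HIDDEN_LOG_ODDS = {
    "cheap": 2.5, "watches": 3.0, "buy": 1.5, "free": 2.0,
    "meeting": -3.0, "research": -2.5, "agenda": -2.0,
    "the": 0.0, "hello": -0.2,
}
_PRIOR_LOG_ODDS = math.log(0.3 / 0.7)

def is_blocked(tokens):
    """Black-box oracle: True if the filter would block this message."""
    score = _PRIOR_LOG_ODDS
    score += sum(_HIDDEN_LOG_ODDS.get(t, 0.0) for t in set(tokens))
    return score > 0.0

def find_good_words(candidates, barely_spam):
    """Return the candidates that flip a barely-blocked message to delivered."""
    if not is_blocked(barely_spam):
        raise ValueError("probe message must start out blocked")
    return [w for w in candidates if not is_blocked(barely_spam + [w])]

# One query per candidate word against a message just over the threshold.
candidates = ["the", "meeting", "free", "research", "hello", "agenda"]
good = find_good_words(candidates, barely_spam=["buy"])

spam = ["cheap", "watches", "buy"]
print(good)
print(is_blocked(spam), is_blocked(spam + good))
```

Each query leaks a little about the hidden table, which is why the active attacker in this sketch needs only a handful of words where a passive attacker would have to guess many.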

Hack The Box’s AI Evasion lab, which implements this attack as a practical exercise, demonstrates that as few as 25 carefully selected tokens can achieve 100% evasion against a Naive Bayes spam classifier. The filter does not crash or error. It classifies correctly according to its own mathematics. The mathematics just happens to be exploitable.

Data poisoning

The GoodWords attack is an evasion attack, meaning it manipulates the input at test time. Poisoning attacks go further by corrupting the training data itself. In 2004, John Graham-Cumming demonstrated at the MIT Spam Conference that a spammer with access to a known spam feed (for example, a blacklisted IP whose messages are automatically labelled as spam for retraining) could craft emails that skew the filter’s probability tables over time. By sending spam messages loaded with legitimate-sounding tokens, the attacker gradually teaches the filter that those tokens are associated with spam, which causes it to start misclassifying genuine emails that contain them.
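The effect of poisoned retraining can be sketched with a toy Python model. The corpus, the add-one smoothing, and the six poison messages are assumptions chosen to make the shift visible. Real feeds are far larger and the drift is more gradual.

```python
import math
from collections import Counter

def train(messages):
    """Build per-class message counts for each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for tokens, label in messages:
        totals[label] += 1
        counts[label].update(set(tokens))
    return counts, totals

def spam_posterior(tokens, counts, totals):
    """P(spam | tokens) with add-one smoothing (an assumption of this sketch)."""
    log_odds = math.log(totals["spam"] / totals["ham"])  # prior odds
    for t in set(tokens):
        p_spam = (counts["spam"][t] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][t] + 1) / (totals["ham"] + 2)
        log_odds += math.log(p_spam) - math.log(p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical clean training corpus.
clean = [
    (["cheap", "watches"], "spam"),
    (["buy", "cheap", "pills"], "spam"),
    (["free", "pills"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "meeting"], "ham"),
    (["agenda", "attached"], "ham"),
    (["research", "notes"], "ham"),
]
# Attacker mail sent through a feed that labels it spam automatically.
poison = [(["cheap", "pills", "meeting", "agenda"], "spam")] * 6

genuine = ["meeting", "agenda"]
before = spam_posterior(genuine, *train(clean))
after = spam_posterior(genuine, *train(clean + poison))
print(f"before poisoning: {before:.3f}, after: {after:.3f}")
```

In this sketch the poison shifts the class prior as well as the token likelihoods, since both are re-estimated from the same labelled feed. Either way, a legitimate message that never changed ends up on the wrong side of the threshold.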

Miller, Hu, Xiang, and Kesidis at Penn State formalised this threat in 2018, showing that the poor representational capacity of a single Naive Bayes component makes it particularly vulnerable to poisoning. Their defence, a mixture-of-NB-models approach, isolates the poisoned component from the legitimate spam model, but it requires the defender to know that poisoning is occurring in the first place.

The asymmetry is significant. Evasion requires the attacker to modify every spam email they send. Poisoning modifies the filter once, and the corruption persists through every subsequent classification until the model is retrained on clean data.

Why this matters for the series

Spam filtering is the original adversarial machine learning problem. The attack taxonomy that the security community now applies to deep neural networks (evasion, poisoning, model extraction, data manipulation) was first developed and tested against Naive Bayes spam classifiers in the mid-2000s. The techniques are simpler, but the principles are identical.

The independence assumption that makes Naive Bayes vulnerable to token injection is structurally similar to the assumptions that make more complex models exploitable. Neural networks assume that learned feature representations generalise to unseen inputs. Decision trees assume that axis-aligned splits capture the decision boundary. Every model encodes assumptions, and every assumption is a potential attack surface. Spam classification is where that insight was first proven in production.

Defending Bayesian spam filters

Practical defences against these attacks exist, but each one trades something for the protection it provides.

Feature-level defences move beyond single tokens to n-grams (sequences of two or three words), which partially break the independence assumption by capturing local context. An attacker can no longer inject individual tokens without also matching the surrounding bigram and trigram distributions. The cost is a larger feature space and slower training.
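A minimal Python sketch of n-gram feature extraction shows why appended word lists become more conspicuous. The function name, the whitespace tokeniser, and the sample sentences are assumptions made for this example.

```python
def ngram_features(text, n_max=2):
    """Return all word n-grams of length 1..n_max, in order of n then position."""
    tokens = text.lower().split()
    features = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

# Hypothetical legitimate sentence the filter has been trained on.
ham_features = set(ngram_features("the research meeting agenda is attached"))

# A word list appended by an attacker.
injected = ngram_features("meeting university conference")

# Bigrams the injected list produces that never occur in the legitimate text.
unseen_bigrams = [f for f in injected if " " in f and f not in ham_features]

print(injected)
print(unseen_bigrams)
```

The single word "meeting" still matches legitimate mail, but the bigrams formed by the pasted word list do not, so part of the injected content stops contributing favourable evidence.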

Rate-limited retraining restricts how often the model updates from new data, which slows poisoning attacks but also slows the filter’s ability to adapt to new spam campaigns.

Ensemble wrapping places the Naive Bayes classifier behind a second model (often an SVM or random forest) that evaluates the classification in context, catching cases where the posterior was shifted by an implausible cluster of injected tokens.

Feedback channel suppression limits the attacker’s ability to observe filter decisions, which degrades active GoodWords attacks to the less effective passive variant. If the attacker cannot tell which test messages were blocked, they cannot efficiently identify high-value tokens.

None of these defences eliminate the underlying vulnerability. They raise the cost of the attack, which in practice is often enough.

The classifier that taught us to think adversarially

Naive Bayes spam classification is the simplest complete example of a machine learning system deployed in an adversarial environment. The model works. The maths is sound. The assumption is reasonable enough for honest data. And the assumption is precisely what breaks when the data is not honest. Every adversarial ML technique in use today, from FGSM perturbations against image classifiers to prompt injection against LLMs, traces a conceptual lineage back to someone appending “university conference research” to a message selling counterfeit watches. The sophistication of the attack has changed. The structure of the vulnerability has not.
