Preprocessing the spam dataset

A spammer sends “FR33 C4SH N0W!!!” and the filter lets it through, because someone decided during preprocessing that numbers should be stripped and case should be normalised. The message became “fr csh nw!!!” before the classifier ever saw it. The three word-level spam indicators were reduced to fragments that match nothing in the vocabulary, and only the exclamation marks survived to inference.

The previous entry in this series loaded the SMS Spam Collection dataset and framed Naive Bayes spam classification as an adversarial problem. This entry covers what happens between loading the data and training the model. Preprocessing is where you decide what the classifier is allowed to see, and every decision you make here either closes an evasion path or opens one. A red teamer who understands how text preprocessing works can craft payloads that survive the pipeline intact while legitimate spam signals get stripped away.

Setting up the toolkit

Before any text transformation, the required Natural Language Toolkit (NLTK) resources need to be available. NLTK is a Python library purpose-built for text processing, and it ships its data files separately from the package itself. Downloading them at the start of the pipeline avoids a LookupError later, when a tokeniser or stop word list is referenced but its data is missing.

import nltk

# Download the necessary NLTK data files
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")

# df is the DataFrame loaded in the previous entry of this series
print("=== BEFORE ANY PREPROCESSING ===")
print(df.head(5))

At this stage, the dataset contains raw SMS messages exactly as they were collected. Some are clean, some are riddled with abbreviations and symbols, and the classifier cannot work with any of them in their current form. The preprocessing pipeline that follows will transform this raw text into something a Naive Bayes model can reason about, and every step in that pipeline encodes an assumption about what matters.

Lowercasing

The first transformation converts all message text to lowercase.

# Convert all message text to lowercase
df["message"] = df["message"].str.lower()
print("\n=== AFTER LOWERCASING ===")
print(df["message"].head(5))

This ensures the classifier treats “Free” and “free” as the same token. Without lowercasing, the model would allocate separate probability estimates to each variant, fragmenting the evidence for what is functionally the same word. In a Naive Bayes context, where each token’s contribution to the posterior probability depends on how often it appeared in spam versus ham during training, splitting a single high-signal word across multiple case variants dilutes its discriminative power.
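
The fragmentation effect is easy to see on a toy example. The sketch below uses a handful of made-up messages (not rows from the SMS Spam Collection) and a plain whitespace split instead of the NLTK tokeniser, purely to count how many vocabulary entries one word occupies before and after lowercasing.

```python
from collections import Counter

# Hypothetical messages, invented for illustration only.
messages = [
    "Free entry today",
    "FREE prize inside",
    "Claim your free gift",
    "FrEe ringtone",
]

def token_counts(texts):
    """Count tokens using a plain whitespace split."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

raw_counts = token_counts(messages)
lowered_counts = token_counts(m.lower() for m in messages)

# Vocabulary entries that are case variants of "free"
raw_variants = {tok: n for tok, n in raw_counts.items() if tok.lower() == "free"}

print(raw_variants)            # four separate entries, one occurrence each
print(lowered_counts["free"])  # one entry holding all four occurrences
```

Before lowercasing, each variant has a single observation behind it; afterwards, one token carries all of the evidence.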

From an adversarial perspective, lowercasing closes one of the oldest evasion tricks in spam filtering. Mixed-case obfuscation, writing “FrEe” or “FREE” to dodge filters trained on lowercase “free”, stops working the moment the pipeline normalises case before classification. This is a net positive for the defender. But lowercasing is also lossy in ways that matter. ALL-CAPS text in SMS messages is often a behavioural signal for spam. An entirely uppercase message carries different intent than the same words in lowercase, and that distinction disappears after this step.
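
One way to keep that signal, sketched here as an assumption and not as part of the original pipeline, is to compute a capitalisation feature from the raw text before lowercasing and carry it alongside the message. The function name uppercase_ratio is hypothetical.

```python
def uppercase_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

print(uppercase_ratio("WINNER!! CLAIM NOW"))  # all letters uppercase
print(uppercase_ratio("see you at lunch"))    # no uppercase letters
print(uppercase_ratio("12345 !!!"))           # no letters at all
```

With the DataFrame used in this entry, something like df["caps_ratio"] = df["message"].apply(uppercase_ratio) would have to run before the lowercasing step; the column name caps_ratio is also hypothetical. A bag-of-words Naive Bayes model would not consume this column directly, so it would need to be combined with the text features in some other way.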

Removing punctuation and numbers

The next step strips characters that are unlikely to contribute to classification, but does so selectively.

import re

# Remove non-essential punctuation and numbers, keep useful symbols like $ and !
df["message"] = df["message"].apply(lambda x: re.sub(r"[^a-z\s$!]", "", x))
print("\n=== AFTER REMOVING PUNCTUATION & NUMBERS (except $ and !) ===")
print(df["message"].head(5))

This regex removes everything except lowercase letters, whitespace, dollar signs, and exclamation marks. The decision to preserve $ and ! is deliberate. Dollar signs appear disproportionately in spam messages that reference money, prizes, or financial offers, and exclamation marks correlate with the urgent, high-pressure tone that spam typically adopts. Stripping them would discard genuinely discriminative features.
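
To see exactly what this character class keeps and drops, the sketch below applies the same lowercasing and the same regex to a few invented strings. The helper name clean is hypothetical; it only wraps the two steps already shown.

```python
import re

def clean(text: str) -> str:
    """Lowercase, then drop everything except a-z, whitespace, $ and !."""
    return re.sub(r"[^a-z\s$!]", "", text.lower())

samples = [
    "W1N a C4SH PR1Z3!!!",
    "Win $1000 today!",
    "Don't miss out, call 0800-123-456.",
]

for s in samples:
    print(repr(s), "->", repr(clean(s)))
```

Two side effects are worth noticing in the output. Digits used as letter substitutes are deleted, not translated, so the surrounding letters are left as fragments. Removed characters are also not replaced with a space, so “don't” becomes “dont” and any words separated only by punctuation would be fused together.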

Numbers, on the other hand, are removed entirely. This is where the preprocessing pipeline starts making opinionated trade-offs. Phone numbers, prices, and PIN codes appear frequently in both spam and ham messages, but they rarely generalise well as features because specific numbers almost never repeat across messages. Removing them reduces dimensionality without losing much predictive signal for a bag-of-words model.

A red teamer would note the gap this creates. Spammers who encode their payload in numbers (“text 80808 to win” becomes “text to win” after stripping) can rely on the preprocessing pipeline to sanitise the most actionable part of the message, the call-to-action number, while leaving enough text for the message to still reach the recipient. The classifier never sees the number. The recipient does, because the number is in the original message that the messaging app displays, not the preprocessed version the model evaluates.
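
One possible mitigation, offered as an assumption and not something the original pipeline does, is to replace digit runs with alphabetic placeholder tokens before the character filter runs, so the classifier at least learns that a number was present. The placeholder names (num, shortcode, phonenum) and the length thresholds are hypothetical choices.

```python
import re

def mark_numbers(text: str) -> str:
    """Replace each digit run with a placeholder token chosen by its length."""
    def placeholder(match: re.Match) -> str:
        n = len(match.group())
        if n >= 7:
            return " phonenum "
        if n >= 4:
            return " shortcode "
        return " num "
    return re.sub(r"\d+", placeholder, text)

def clean(text: str) -> str:
    """Mark numbers, apply the pipeline's character filter, tidy whitespace."""
    marked = mark_numbers(text.lower())
    stripped = re.sub(r"[^a-z\s$!]", "", marked)
    return " ".join(stripped.split())

print(clean("Text 80808 to win"))
print(clean("Call 08001234567 now!"))
```

The placeholders are ordinary lowercase words, so they pass through the later steps like any other token. Whether they improve robustness on the real dataset is something to measure, not assume.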

Tokenisation

Tokenisation splits each message string into a list of individual words.

from nltk.tokenize import word_tokenize

# Split each message into individual tokens
df["message"] = df["message"].apply(word_tokenize)
print("\n=== AFTER TOKENIZATION ===")
print(df["message"].head(5))

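This step converts unstructured text into a sequence of discrete units that downstream operations can manipulate independently. NLTK’s word_tokenize function first splits the text into sentences with the Punkt sentence tokeniser, which is trained to handle English sentence boundaries and understands that a full stop after “Dr” is not the end of a sentence, and then splits each sentence into words with a Treebank-style word tokeniser. Because the previous step has already removed full stops, the sentence-splitting stage has little left to do here; the most visible effect is that $ and ! are typically emitted as separate tokens. For SMS text, which is informal and inconsistently punctuated, this is better than naive whitespace splitting but still imperfect.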

From an adversarial standpoint, tokenisation matters less for the splitting itself than for what it makes visible. Once a message is tokenised, every subsequent operation, from stop word removal to stemming, operates on individual tokens. An attacker who understands how the tokeniser, and the character filter that runs before it, treats unusual input can craft messages that split in unexpected ways. Unicode look-alike characters, zero-width characters, and unusual whitespace can yield tokens that look normal to a human reader of the original message but do not match any entry in the model’s vocabulary. Those unrecognised tokens are effectively invisible to the classifier.
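
How a given trick fares depends on the steps that run before the tokeniser. The sketch below uses invented strings and only the standard library to check three cases against the lowercase-and-filter step from earlier in this entry: a zero-width space, Cyrillic look-alike letters, and full-width Latin letters. The NFKC normalisation shown at the end is an assumption, not part of the original pipeline.

```python
import re
import unicodedata

def char_filter(text: str) -> str:
    """The pipeline's lowercasing followed by its character filter."""
    return re.sub(r"[^a-z\s$!]", "", text.lower())

zero_width = "fr\u200bee"                # zero-width space inside "free"
cyrillic = "fr\u0435\u0435"              # Cyrillic U+0435 in place of Latin "e"
full_width = "\uff46\uff52\uff45\uff45"  # full-width forms of "free"

print(repr(char_filter(zero_width)))  # zero-width character deleted, word restored
print(repr(char_filter(cyrillic)))    # look-alikes deleted, a fragment remains
print(repr(char_filter(full_width)))  # every character deleted

# NFKC folds compatibility forms such as full-width letters to ASCII,
# but it does not map Cyrillic letters to their Latin look-alikes.
print(repr(char_filter(unicodedata.normalize("NFKC", full_width))))
print(repr(char_filter(unicodedata.normalize("NFKC", cyrillic))))
```

In this pipeline the zero-width trick happens to fail, because deleting the invisible character rebuilds the original word. The look-alike and full-width variants succeed: the word the recipient reads never reaches the model. Normalisation recovers only the second of those two.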

Removing stop words

Stop words are high-frequency words like “the”, “and”, “is”, and “to” that carry grammatical function but little semantic meaning.

from nltk.corpus import stopwords

# Define a set of English stop words and remove them from the tokens
stop_words = set(stopwords.words("english"))
df["message"] = df["message"].apply(lambda x: [word for word in x if word not in stop_words])
print("\n=== AFTER REMOVING STOP WORDS ===")
print(df["message"].head(5))

Removing these tokens reduces noise. In a Naive Bayes model, every token contributes to the posterior probability calculation. Stop words appear in roughly equal proportions across spam and ham, which means they contribute almost nothing to the classification decision but inflate the feature space and add computational cost.

This is also the step most directly connected to the GoodWords attack described in the previous entry. Lowd and Meek showed in 2005 that appending high-frequency “ham” words to spam messages shifts the posterior probability away from the spam class. Stop words are the most readily available padding material, because they appear constantly in legitimate messages and cost the attacker nothing to add. If stop words were not removed during preprocessing, an attacker could trivially pad spam messages with “the the the the” to dilute the spam signal. Stop word removal closes this particular vector, forcing an attacker to find less obvious ham-correlated tokens to use as padding.
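
The padding effect can be illustrated with a toy calculation. The per-token likelihoods below are invented, not estimated from the dataset, and the scoring function is a bare sum of log-likelihood ratios with equal class priors assumed; it is a sketch of the mechanism, not the classifier built in this series.

```python
import math

# Hypothetical P(token | class) values, invented for illustration.
p_spam = {"win": 0.030, "prize": 0.020, "the": 0.040}
p_ham = {"win": 0.002, "prize": 0.001, "the": 0.060}

STOP_WORDS = {"the"}  # tiny stand-in for the NLTK stop word list

def spam_log_odds(tokens, remove_stop_words):
    """Sum of log(P(t|spam) / P(t|ham)); positive values favour spam."""
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return sum(math.log(p_spam[t] / p_ham[t]) for t in tokens)

message = ["win", "prize"]
padded = message + ["the"] * 20  # attacker appends ham-leaning filler

for remove in (False, True):
    print(
        "stop words removed:", remove,
        "| original:", round(spam_log_odds(message, remove), 2),
        "| padded:", round(spam_log_odds(padded, remove), 2),
    )
```

With stop words kept, each copy of the filler subtracts a little from the score and twenty copies are enough to push it below zero. With stop words removed, the padded and unpadded messages score identically.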

But the defence is not free. NLTK’s English stop word list is static: it contained 179 words in many releases of the data package, though the exact count depends on the version installed, and it does not adapt to the specific corpus. Some words that NLTK considers stop words may actually carry signal in SMS spam detection. The word “you”, for instance, is on NLTK’s stop word list, but it is central to the direct-address hooks spam relies on (“You have won!”, “You are selected!”) and may appear more often in spam than in typical ham. If it does, removing it discards a feature that, in this particular domain, carries genuine predictive value.
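
A simple way to audit a stop word list against a corpus is to compare how often each word occurs in each class. The sketch below does this on a tiny invented corpus; the messages, the candidate list, and the function name document_rates are all hypothetical.

```python
from collections import Counter

# Hypothetical labelled, tokenised messages (not taken from the real dataset).
corpus = [
    ("spam", ["you", "have", "won", "a", "prize"]),
    ("spam", ["you", "are", "selected", "claim", "now"]),
    ("spam", ["free", "entry", "for", "you"]),
    ("ham", ["are", "we", "meeting", "at", "the", "cafe"]),
    ("ham", ["i", "will", "call", "you", "later"]),
    ("ham", ["the", "train", "is", "late"]),
]

candidate_stop_words = {"you", "the", "are"}

def document_rates(corpus, words):
    """Per-class fraction of messages that contain each word."""
    totals = Counter(label for label, _ in corpus)
    hits = {w: Counter() for w in words}
    for label, tokens in corpus:
        present = set(tokens)
        for w in words:
            if w in present:
                hits[w][label] += 1
    return {
        w: {label: hits[w][label] / totals[label] for label in totals}
        for w in words
    }

rates = document_rates(corpus, candidate_stop_words)
for word in sorted(rates):
    print(word, rates[word])
```

Words whose rates differ sharply between classes are candidates for removal from the stop list, for example by filtering with stop_words - {"you"}. On the real dataset the rates should be computed from the training split only.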

Stemming

Stemming reduces each token to its root form using a rule-based algorithm.

from nltk.stem import PorterStemmer

# Stem each token to reduce words to their base form
stemmer = PorterStemmer()
df["message"] = df["message"].apply(lambda x: [stemmer.stem(word) for word in x])
print("\n=== AFTER STEMMING ===")
print(df["message"].head(5))

The Porter stemmer applies a sequence of suffix-stripping rules to map inflected words to a common base form. “Running” and “runs” both become “run”, and “winning” and “wins” both collapse to “win”. Because the rules only strip suffixes, irregular forms such as “ran” are left unchanged, and some derived forms, such as “winner”, also keep their own stem. This consolidation reduces the vocabulary size, which means the model sees more examples per token and can estimate probabilities more reliably, especially on small datasets like the SMS Spam Collection’s 5,572 messages.

Stemming is aggressive by design. The Porter algorithm does not check whether the resulting stem is a real English word, only that the transformation follows its rule set. “University” becomes “univers”, “presumably” becomes “presum”, and “happiness” becomes “happi”. This is fine for a bag-of-words classifier that treats tokens as opaque identifiers rather than meaningful words. The stem does not need to be readable. It just needs to be consistent.

The adversarial implication is that stemming merges tokens that a more careful analysis might want to keep separate. “Claim” and “claiming” collapse to the same stem, but “claim your prize” and “claiming expenses” have very different spam probabilities. By merging them, the model’s estimate for the stem “claim” becomes an average of its legitimate and spam-associated uses, weakening its discriminative power for both. An attacker benefits from this blurring whenever a strongly spam-associated word shares a stem with a common legitimate word.
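
The averaging effect can be shown with a few lines of arithmetic. The counts below are invented, and the spam share used here is a simplification that ignores class sizes and smoothing; it is meant only to show the direction of the effect.

```python
# Hypothetical token counts per class, invented for illustration.
counts = {
    "claim": {"spam": 90, "ham": 10},
    "claiming": {"spam": 5, "ham": 45},
}

def spam_share(spam: int, ham: int) -> float:
    """Fraction of a token's occurrences that fall in spam messages."""
    return spam / (spam + ham)

separate = {w: spam_share(c["spam"], c["ham"]) for w, c in counts.items()}

# After stemming, both surface forms are counted under the stem "claim".
merged_spam = sum(c["spam"] for c in counts.values())
merged_ham = sum(c["ham"] for c in counts.values())
merged = spam_share(merged_spam, merged_ham)

print(separate)
print(round(merged, 3))
```

Kept apart, one form is a strong spam indicator and the other a strong ham indicator. Merged, the stem sits in between and says much less about either class.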

Rejoining tokens

The final step converts the processed token lists back into space-separated strings.

# Rejoin tokens into a single string for feature extraction
df["message"] = df["message"].apply(lambda x: " ".join(x))
print("\n=== AFTER JOINING TOKENS BACK INTO STRINGS ===")
print(df["message"].head(5))

This is a formatting step rather than a transformation. Many vectorisation methods, including the TF-IDF vectoriser commonly used with Naive Bayes, expect raw strings as input and perform their own internal tokenisation. Rejoining the tokens restores compatibility with these tools while carrying forward the cleaning, filtering, and normalisation applied in the preceding steps. The vectoriser’s tokeniser applies its own rules, though, so what it actually keeps should be checked rather than assumed.
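
One detail worth checking at this point, assuming the vectoriser used in the next entry is scikit-learn’s TfidfVectorizer or CountVectorizer (this entry does not say), is that their default token_pattern keeps only runs of two or more word characters. The sketch below applies that default pattern with the standard re module to an invented rejoined string. The alternative pattern is a hypothetical suggestion, not the series’ configuration.

```python
import re

# Default token_pattern of scikit-learn's text vectorisers
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative that also keeps standalone $ and ! tokens
KEEP_SYMBOLS_PATTERN = r"(?u)\b\w\w+\b|[$!]"

rejoined = "win $ prize ! claim u"

print(re.findall(DEFAULT_PATTERN, rejoined))
print(re.findall(KEEP_SYMBOLS_PATTERN, rejoined))
```

Under the default pattern the $ and ! tokens, along with single-character tokens such as “u”, are dropped at vectorisation time, which would undo the decision to preserve them earlier. Passing a pattern like the second one through the vectoriser’s token_pattern argument is one way to keep them.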

At this point, each message in the dataset is a cleaned, normalised string containing only stemmed, lowercase, non-stop-word tokens with dollar signs and exclamation marks preserved. The data is ready for feature extraction and model training.

The preprocessing pipeline as an attack surface

Zooming out from individual steps, the pipeline as a whole is a sequence of irreversible lossy transformations. Each step discards information, and the classifier can only learn from what survives. For a defender, this means every preprocessing decision should be evaluated not just for its effect on accuracy metrics, but for what evasion paths it opens or closes.
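
A lightweight way to evaluate those decisions is to trace a message through the pipeline and record what each step removes. The sketch below uses simplified stand-ins for three of the stages (a three-word stop list and a whitespace split in place of the NLTK components, and no stemming), so it illustrates the auditing idea without reproducing the full pipeline. The names STEPS and trace are hypothetical.

```python
import re

STOP_WORDS = {"to", "the", "a"}  # tiny stand-in for the NLTK stop word list

# Simplified stand-ins for the pipeline stages; each maps a string to a string.
STEPS = [
    ("lowercase", str.lower),
    ("char filter", lambda s: re.sub(r"[^a-z\s$!]", "", s)),
    ("stop words", lambda s: " ".join(w for w in s.split() if w not in STOP_WORDS)),
]

def trace(text: str):
    """Run each step in order and record the text before and after it."""
    history = []
    for name, step in STEPS:
        after = step(text)
        history.append((name, text, after))
        text = after
    return history

for name, before, after in trace("Text WIN to 80808 NOW!"):
    print(f"{name:12} {before!r} -> {after!r}")
```

Running adversarial test messages through a trace like this makes it easy to spot the step at which an actionable element, such as the short code in this example, drops out of the model’s view.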

The choices made in this pipeline are reasonable for an introductory spam classifier, but they are not adversary-aware. An adversary-aware pipeline would consider domain-specific stop word lists instead of NLTK’s generic one, would evaluate whether number removal discards actionable signals, and would test whether the stemmer’s merging behaviour creates exploitable ambiguities. Research by Zhang, Chan, Biggio, Yeung, and Roli has shown that even the feature selection stage of a text classification pipeline can be targeted by evasion attacks, and that adversary-aware feature selection, where the expected attacker manipulation strategy is incorporated into the selection criterion, measurably improves classifier robustness.

The next entry in this series will move from preprocessing into feature extraction and model training, where the cleaned text becomes the numerical representation the classifier actually operates on.
