Feature extraction

The preprocessing pipeline from the previous entry stripped punctuation, removed stop words, stemmed tokens, and rejoined the result into clean strings. The classifier still cannot read any of it. Machine learning models operate on numbers, not words, which means every message must be converted into a numerical vector before training can begin. This conversion step, feature extraction, defines the exact mathematical surface the classifier will learn from, and the exact mathematical surface an attacker can manipulate.

A spam message reading “free prize now” and a ham message reading “meeting at noon” will each become a row of integers in a matrix, where each column corresponds to a term from the dataset’s vocabulary. The classifier will learn to separate these rows. But the vocabulary itself, the set of terms the model is even allowed to see, is built from choices made during feature extraction. Those choices determine what signals survive and what blind spots exist. From a red teaming perspective, every blind spot is an evasion path.

The bag-of-words model

The most straightforward way to represent text numerically is the bag-of-words model. It works by constructing a vocabulary of every unique term in the dataset, then representing each message as a vector of term counts. Each position in the vector corresponds to one term, and the value at that position records how many times the term appears in the message.

The name is literal. The model treats each document as a bag of words, discarding all information about word order. With single-word (unigram) features, “Claim your free prize” and “prize your free claim” produce identical vectors. For general text understanding, this is a significant loss. For spam classification, it is often acceptable because the presence of certain terms matters more than their arrangement. The word “free” appearing in a message is usually a stronger spam signal than the specific sentence structure surrounding it.
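The order-blindness described above can be checked with a few lines of plain Python. This is a minimal sketch rather than how scikit-learn does it internally; the helper name bag_of_words is made up for illustration.

```python
from collections import Counter


def bag_of_words(message, vocabulary):
    """Return one term count per vocabulary entry (hypothetical helper)."""
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocabulary]


messages = ["Claim your free prize", "prize your free claim"]

# Vocabulary: every unique lowercased term across the messages, sorted.
vocabulary = sorted({term for m in messages for term in m.lower().split()})

vectors = [bag_of_words(m, vocabulary) for m in messages]
print(vocabulary)  # ['claim', 'free', 'prize', 'your']
print(vectors)     # [[1, 1, 1, 1], [1, 1, 1, 1]]
```

Both messages map to the same vector, so nothing downstream of this representation can tell them apart.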

To recover some ordering information, the model can include bigrams, which are pairs of consecutive words. A unigram vocabulary might contain “free” and “prize” as separate features. A bigram vocabulary adds “free prize” as a distinct feature, capturing the fact that these two words appeared next to each other. The bigram “free prize” is typically a stronger spam indicator than either word alone, because it reflects a phrase pattern characteristic of promotional spam rather than a coincidental co-occurrence.

However, bigrams only capture local adjacency. The global structure of the sentence, its syntax, its rhetorical flow, its intent, is still lost. The bag-of-words model, even with bigrams, is a lossy compression of language into frequency counts.
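To make the bigram idea concrete, here is a small sketch that generates n-grams by hand. The function name ngrams is an illustrative assumption, not a library call.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a = "claim your free prize".split()
b = "prize your free claim".split()

bigrams_a = ngrams(a, 2)
bigrams_b = ngrams(b, 2)

print(sorted(a) == sorted(b))  # True: same unigrams, different order
print(bigrams_a)  # ['claim your', 'your free', 'free prize']
print(bigrams_b)  # ['prize your', 'your free', 'free claim']
```

The two messages share every unigram, but only the first contains the bigram “free prize”. That is the limited ordering information bigrams recover: adjacency, and nothing beyond it.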

CountVectorizer in practice

Scikit-learn’s CountVectorizer implements the bag-of-words approach in three stages. First, it tokenises each message into individual terms and bigrams based on the specified n-gram range. Second, it builds a vocabulary by filtering terms according to frequency thresholds. Third, it transforms each message into a vector of term counts using that vocabulary.

The filtering thresholds are where the security-relevant decisions happen.

from sklearn.feature_extraction.text import CountVectorizer

# Initialise CountVectorizer with bigrams and frequency thresholds
vectorizer = CountVectorizer(min_df=1, max_df=0.9, ngram_range=(1, 2))

# Fit and transform the message column
# (df is the preprocessed DataFrame from the previous entry)
X = vectorizer.fit_transform(df["message"])

# Convert labels to binary: 1 for spam, 0 for ham
y = (df["label"] == "spam").astype(int)

Three parameters control what enters the vocabulary:

min_df=1 means a term must appear in at least one document to be included. Setting this to 1 keeps every term, including rare ones that might appear in only a single message.
max_df=0.9 removes any term appearing in more than 90% of documents. These are words so common across both spam and ham that they provide no discriminative value.
ngram_range=(1, 2) includes both unigrams (individual words) and bigrams (consecutive word pairs), giving the model access to limited phrase-level patterns.

After this step, X is a sparse matrix where each row is a message and each column is a vocabulary term. The values are raw counts. This matrix is what the classifier trains on.

Walking through the vocabulary construction

Consider five short documents to see how the vocabulary is built and how term filtering works:

“The free prize is waiting for you”
“The spam message offers a free prize now”
“The spam filter might detect this”
“The important news says you won a free trip”
“The message truly is important”

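With max_df=0.9, any term appearing in more than 90% of documents is removed. With five documents, the threshold is 0.9 × 5 = 4.5, so a term must appear in all five to be excluded. CountVectorizer lowercases text by default, so “The” is counted as “the”; it appears in all five documents and gets dropped. Every other unigram survives because none exceeds the threshold.

The resulting unigram matrix records how many times each term occurs in each document. In this example no term repeats within a document, so every entry is 0 or 1. The column for “a” is kept for illustration; CountVectorizer’s default token_pattern only matches tokens of two or more characters and would drop it: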

Document	free	prize	is	waiting	for	you	spam	message	offers	a	now	filter	might	detect	this	important	news	says	won	trip	truly
1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	1	1	0	0	0	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	1	0	0	0	0	1	1	1	1	0	0	0	0	0	0
4	1	0	0	0	0	1	0	0	0	1	0	0	0	0	0	1	1	1	1	1	0
5	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	0	1

With ngram_range=(1, 2), the vocabulary expands to include bigrams. “Free prize” now appears as its own feature in Documents 1 and 2, separate from the individual unigrams “free” and “prize”. The bigram “spam filter” captures a phrase pattern absent from individual word counts, and “is important” links two words that individually carry weak signal but together indicate a specific type of ham message.

Document	free	prize	…	free prize	prize is	is waiting	spam message	spam filter	is important	…
1	1	1	…	1	1	1	0	0	0	…
2	1	1	…	1	0	0	1	0	0	…
3	0	0	…	0	0	0	0	1	0	…
4	1	0	…	0	0	0	0	0	0	…
5	0	0	…	0	0	0	0	0	1	…

The full bigram matrix is considerably wider than the unigram-only version. Every pair of consecutive words that survives the frequency filters becomes a new column. In a real dataset with thousands of messages, this expansion can produce vocabularies of tens or hundreds of thousands of features, most of which are zero for any given message. The resulting matrix is extremely sparse.
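The widening is visible even on the five example documents. The sketch below compares vocabulary sizes with and without bigrams, again assuming a widened token_pattern so that “a” is kept as in the tables above.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The free prize is waiting for you",
    "The spam message offers a free prize now",
    "The spam filter might detect this",
    "The important news says you won a free trip",
    "The message truly is important",
]

pattern = r"(?u)\b\w+\b"  # assumption: keep 1-character tokens such as "a"

unigrams = CountVectorizer(max_df=0.9, ngram_range=(1, 1), token_pattern=pattern)
both = CountVectorizer(max_df=0.9, ngram_range=(1, 2), token_pattern=pattern)
unigrams.fit(documents)
X_both = both.fit_transform(documents)

print(len(unigrams.vocabulary_), len(both.vocabulary_))  # 21 48

# Column for the bigram "free prize": present in Documents 1 and 2 only.
column = X_both[:, both.vocabulary_["free prize"]].toarray().ravel().tolist()
print(column)  # [1, 1, 0, 0, 0]

# The frequency filter is applied per feature, so a bigram containing
# the dropped unigram "the" can still survive.
print("the spam" in both.vocabulary_)  # True
```

Five short sentences already more than double the column count, which gives a sense of how quickly the vocabulary grows on a real corpus.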

The feature space as an attack surface

Everything described so far is standard machine learning practice. From a red teaming perspective, the interesting question is what this feature space looks like to an attacker.

The bag-of-words model creates a specific and well-studied vulnerability known as the good word attack, first formalised by Lowd and Meek in 2005. The attack is straightforward. Because the classifier treats each message as a bag of independent term counts, an attacker can shift a spam message’s feature vector toward the ham region of the feature space by injecting words that are strongly associated with legitimate messages. The classifier sees the counts, not the coherence. A message reading “FREE CASH NOW meeting agenda quarterly review budget” is nonsensical to a human reader, but to a bag-of-words classifier, it looks like a message that shares vocabulary with both spam and ham, and the injected ham words may be enough to tip the classification.

This attack works precisely because the bag-of-words representation discards word order and context, and counts each term independently. The classifier has no way to know that “meeting agenda quarterly review budget” was appended to the end of a spam message rather than being part of a genuine business communication. The features for those words activate identically regardless of context.
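A rough sketch of the good word attack follows. Everything here is an assumption for illustration: the six-message corpus is invented, the vectoriser uses default unigram settings, and a multinomial Naive Bayes model stands in for whichever classifier is actually deployed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration only.
spam = ["free cash now", "free prize now", "claim free cash prize"]
ham = [
    "meeting agenda for quarterly review",
    "budget review meeting at noon",
    "quarterly budget agenda attached",
]

vectorizer = CountVectorizer()  # unigram counts, default settings
X = vectorizer.fit_transform(spam + ham)
y = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

clf = MultinomialNB().fit(X, y)

original = "free cash now"
padded = original + " meeting agenda quarterly review budget"

# Column 1 of predict_proba is P(spam) because clf.classes_ is [0, 1].
probs = clf.predict_proba(vectorizer.transform([original, padded]))[:, 1]
p_original, p_padded = probs
print(f"P(spam) original: {p_original:.3f}")
print(f"P(spam) padded:   {p_padded:.3f}")
```

On this toy corpus the appended ham words pull the spam probability down far enough to change the predicted class. A real filter trained on thousands of messages will need different words and probably more of them, but the mechanism is the same.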

The frequency thresholds introduce their own attack surface. Setting min_df=1 means every term in the training corpus is included, even rare ones. If a specific rare word appears only in ham messages during training, it becomes a reliable ham signal that an attacker can co-opt. Conversely, max_df=0.9 removes universally common terms, which means the model is blind to any manipulation that operates purely through high-frequency vocabulary. An attacker who understands which words were filtered out during vocabulary construction knows which words they can use without triggering any unigram feature at all. The same holds for any word that never appeared in the training corpus. The filter is applied to each feature separately, though, so a bigram containing a filtered word can still survive: in the five-document example above, “the spam” remains a feature even though “the” does not.
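One way to see this blind spot is to compare the vocabulary with and without the max_df filter, then probe the fitted vectoriser with a message built only from filtered terms. The sketch reuses the five example documents and the default token_pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The free prize is waiting for you",
    "The spam message offers a free prize now",
    "The spam filter might detect this",
    "The important news says you won a free trip",
    "The message truly is important",
]

settings = dict(min_df=1, ngram_range=(1, 2))
filtered = CountVectorizer(max_df=0.9, **settings).fit(documents)
unfiltered = CountVectorizer(max_df=1.0, **settings).fit(documents)

# Terms that max_df=0.9 removed from the vocabulary.
dropped = set(unfiltered.vocabulary_) - set(filtered.vocabulary_)
print(dropped)  # {'the'}

# A message made only of dropped terms activates no feature at all.
probe = filtered.transform(["the the the"])
print(probe.nnz)  # 0
```

From the model’s point of view the probe message is an all-zero row, indistinguishable from an empty string.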

Bigrams add discriminative power, but they also add predictable structure. If the classifier has learned that “free prize” is a strong spam bigram, an attacker can break the bigram by inserting a word between “free” and “prize”, because “free your prize” produces the bigrams “free your” and “your prize” instead of “free prize”. The spam signal vanishes while the individual unigrams remain, and the unigrams alone may not carry enough weight to trigger classification. The inserted word has to survive preprocessing for this to work: if it is a stop word that the pipeline strips before vectorisation, “free” and “prize” become adjacent again and the bigram is restored.
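The bigram-breaking trick can be checked directly with the vectoriser’s analyser, which returns the unigrams and bigrams extracted from a string. This sketch feeds raw strings to the analyser and so assumes no stop-word removal happens beforehand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the tokenise-and-n-gram function the vectoriser uses.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

original = set(analyze("free prize now"))
evasive = set(analyze("free your prize now"))

print("free prize" in original)         # True
print("free prize" in evasive)          # False: the bigram is broken
print({"free", "prize"} <= evasive)     # True: both unigrams remain
print(sorted(evasive - original))       # new features created by the insert
```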

The sparse matrix problem

The feature matrices produced by CountVectorizer are overwhelmingly sparse. In a typical SMS dataset, a message might activate 10 to 30 features out of a vocabulary of 50,000 or more. That means over 99.9% of each message’s feature vector is zeros.
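The figure follows from simple arithmetic, and the same measurement can be taken on any matrix that CountVectorizer returns. The sketch below does both. The five-document corpus is far too small to be as sparse as a real dataset, so it only demonstrates how to measure sparsity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Figure from the text: at most 30 active features out of 50,000.
zero_fraction = 1 - 30 / 50_000
print(f"{zero_fraction:.2%}")  # 99.94%

documents = [
    "The free prize is waiting for you",
    "The spam message offers a free prize now",
    "The spam filter might detect this",
    "The important news says you won a free trip",
    "The message truly is important",
]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(documents)

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # share of entries that are non-zero
active_per_message = np.diff(X.indptr)  # non-zero features in each CSR row

print(f"density: {density:.1%}")
print(active_per_message)
```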

This sparsity has a direct security implication. An attacker does not need to manipulate many features to shift a message’s classification. In a high-dimensional sparse space, small changes to the non-zero entries, or strategic additions of new non-zero entries, can move a data point a significant distance in feature space relative to the decision boundary. Research by Zhang, Chan, Biggio, Yeung, and Roli has shown that classifiers with unevenly distributed feature weights are particularly vulnerable because a single high-weight feature can dominate the classification, and flipping that feature’s presence changes the outcome. Their work also demonstrated that adversary-aware feature selection, where the expected attacker manipulation strategy is incorporated into the feature selection criterion, measurably improves robustness.

The sparsity also means that the model’s effective decision is often based on a handful of features per message. For a red teamer, this is an invitation. If you can identify which features carry the most weight for a given classifier, you know exactly which terms to add or remove to flip the result. The feature extraction step has compressed each message into a numerical fingerprint, and that fingerprint can be reverse-engineered.

What the feature matrix encodes, and what it does not

After feature extraction, the variable X contains the numerical matrix that the classifier will train on, and y contains the binary labels. Every subsequent step in the pipeline (model selection, training, evaluation) operates on this representation. The model will never see the original text. It will only see term counts in the dimensions defined by the vocabulary.

This means that every assumption baked into the feature extraction step propagates forward. The choice to discard word order, the frequency thresholds that define the vocabulary boundaries, the decision to include bigrams but not trigrams, all of these become fixed properties of the classifier’s worldview. The model cannot learn patterns that the feature representation has already destroyed.

From an adversarial perspective, the preprocessing pipeline (covered in the previous entry) and the feature extraction step (covered here) together define the complete transformation chain from raw text to model input. An attacker who understands both stages can craft messages that survive preprocessing intact and land in the exact region of the feature space that the classifier associates with legitimate messages. The model is only as robust as the assumptions embedded in its feature representation, and every assumption is a potential exploit.

The next entry will train the Naive Bayes classifier on this feature matrix and examine what the model actually learns from these term counts, including how the learned decision boundary responds when an adversary starts pushing against it.
