Manipulating a model

Swap one word in a spam message and the classifier waves it through. Inject four mislabelled rows into the training set and the same classifier flags “Hello World” as malicious, with 99.6% confidence and barely a scratch on overall accuracy. These are the two faces of model manipulation, and they sit at positions one and two on the OWASP Machine Learning Security Top 10 for a reason.

What the OWASP ML Top 10 is telling us

The OWASP ML Security Top 10 ranks the most pressing security risks to machine learning systems. ML01 (Input Manipulation Attack) covers evasion at inference time, where an attacker crafts input that causes a deployed model to misclassify. ML02 (Data Poisoning Attack) targets the training pipeline itself, corrupting the data the model learns from so that the resulting model behaves the way the attacker wants it to. Both attacks exploit the same underlying reality: ML models are mathematical functions that map inputs to outputs, and those mappings can be gamed if you understand how the model weighs its features.

Input manipulation: tricking the model at inference time

An input manipulation attack targets a model that is already trained and deployed. The attacker’s goal is to craft an input that the model misclassifies while still achieving its intended purpose against the human victim. In the context of a spam classifier, that means getting a phishing message past the filter and into the inbox.

The baseline we are working from is a Naive Bayes classifier trained on a labelled SMS dataset. The training and evaluation code is straightforward:

model = train("./train.csv")
acc = evaluate(model, "./test.csv")
print(f"Model accuracy: {round(acc*100, 2)}%")

Model accuracy: 97.2%

An accuracy of 97.2% on the test set makes this a solid classifier. The question is how hard it is to get a malicious message past it.

Probing the decision boundary

Before crafting an evasion payload, you need to understand which features the model is sensitive to. With a Naive Bayes spam classifier, this is relatively straightforward because the model treats each word as an independent contributor to the final probability. You can isolate individual words or phrases and observe how the model’s confidence shifts.

The starting point is a trained classifier and a way to inspect its output probabilities rather than just the final label. The classify_messages function supports a return_probabilities keyword argument that gives you the raw confidence scores for both classes:

model = train("./train.csv")

message = "Hello World! How are you doing?"

predicted_class = classify_messages(model, message)[0]
predicted_class_str = "Ham" if predicted_class == 0 else "Spam"
probabilities = classify_messages(model, message, return_probabilities=True)[0]

print(f"Predicted class: {predicted_class_str}")
print("Probabilities:")
print(f"\t Ham: {round(probabilities[0]*100, 2)}%")
print(f"\tSpam: {round(probabilities[1]*100, 2)}%")

Running this against different fragments of a known spam message builds a map of the model’s decision boundary:

Input message	Spam probability	Ham probability
Congratulations!	64.97%	35.03%
Congratulations! You won a prize.	99.73%	0.27%
Click here to claim: https://bit.ly/3YCN7PF	99.34%	0.66%
https://bit.ly/3YCN7PF	87.29%	12.71%

The table tells you which words carry the strongest spam signal. “Congratulations” alone is already enough to tip the classifier past the 50% threshold, and combining it with prize language pushes confidence to near-certainty. Even the shortened URL in isolation scores 87% spam. This is the attacker’s reconnaissance phase, and it requires nothing more than repeated inference calls.
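The reconnaissance loop can be sketched in a few lines. This is a minimal illustration rather than the exercise's own code: rank_fragments is a hypothetical helper, and toy_spam_probability is a made-up keyword scorer that stands in for the real model so the sketch runs on its own. Against the actual classifier you would pass a small wrapper that returns classify_messages(model, message, return_probabilities=True)[0][1].

```python
from typing import Callable, Iterable


def rank_fragments(
    spam_probability: Callable[[str], float],
    fragments: Iterable[str],
) -> list[tuple[str, float]]:
    """Score every fragment and sort from most to least spam-like."""
    scored = [(fragment, spam_probability(fragment)) for fragment in fragments]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_spam_probability(message: str) -> float:
    """Hypothetical stand-in scorer (NOT the article's model): fixed keyword weights."""
    weights = {"congratulations": 0.3, "prize": 0.3, "claim": 0.2, "bit.ly": 0.2}
    lowered = message.lower()
    return min(1.0, 0.1 + sum(w for key, w in weights.items() if key in lowered))


fragments = [
    "Congratulations!",
    "Congratulations! You won a prize.",
    "Click here to claim: https://bit.ly/3YCN7PF",
    "https://bit.ly/3YCN7PF",
]

ranked = rank_fragments(toy_spam_probability, fragments)
for fragment, probability in ranked:
    print(f"{probability:6.1%}  {fragment}")
```

The only capability the attacker needs here is the ability to call the scorer repeatedly, which is why rate limiting and returning labels instead of raw probabilities both make this phase slower.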

Technique 1: rephrasing

The simplest evasion technique is to rewrite the message using vocabulary that the model associates with legitimate content rather than spam. If “Congratulations! You won a prize” scores 99.73% spam, you change the scenario entirely. Swap the prize narrative for an account security warning and feed it the same malicious URL:

Your account has been blocked. You can unlock your account in the next 24h: https://bit.ly/3YCN7PF

The classifier’s output flips:

Predicted class: Ham
Probabilities:
     Ham: 57.39%
    Spam: 42.61%

It barely clears the threshold, but it clears it. The phishing link is identical; only the social engineering wrapper changed.

This works because the classifier learned statistical associations between words and labels from its training data. Spam in training sets tends to cluster around prizes, winnings, and congratulatory language. Account security warnings, on the other hand, appear more frequently in legitimate communications. The attacker is surfing the distribution of the training data without ever needing to see it directly.

Technique 2: overpowering

Overpowering takes a different approach. Instead of replacing spam-associated words, you bury them under a mass of benign content. You keep the original malicious message intact and append enough legitimate-sounding text to shift the classifier’s aggregate probability toward ham.

With Naive Bayes, this is particularly effective because of the independence assumption. Each word contributes independently to the posterior probability, so flooding the input with ham-associated tokens can overwhelm the spam signal entirely. Appending the first sentence of a Lorem Ipsum translation to the original 100%-confidence spam message produces this input:

Congratulations! You won a prize. Click here to claim: https://bit.ly/3YCN7PF. But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness.

The classifier’s verdict:

Predicted class: Ham
Probabilities:
     Ham: 100.0%
    Spam: 0.0%

The original spam payload is still sitting right there in the text, but the model is now 100% confident the message is benign.
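The arithmetic behind this flip can be sketched directly. In Naive Bayes the posterior log-odds are the prior log-odds plus a sum of per-token log-likelihood ratios, so every ham-leaning token you append subtracts from the total. The weights below are invented for illustration and are not taken from the trained model; only the mechanism is the point.

```python
import math

# Hypothetical per-token values of log P(token | spam) - log P(token | ham).
# Positive pushes toward spam, negative toward ham. All numbers are made up.
LOG_RATIO = {
    "congratulations": 2.0, "won": 1.5, "prize": 2.5, "claim": 1.8,
    "explain": -1.2, "account": -0.4, "system": -1.0, "teachings": -1.5,
    "truth": -1.3, "happiness": -1.4, "pleasure": -1.1, "pain": -0.9,
}


def spam_probability(tokens, prior_log_odds=0.0):
    """Sum the independent token contributions, then squash with the logistic function."""
    log_odds = prior_log_odds + sum(LOG_RATIO.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-log_odds))


payload = ["congratulations", "won", "prize", "claim"]
padding = ["explain", "account", "system", "teachings",
           "truth", "happiness", "pleasure", "pain"]

p_before = spam_probability(payload)            # payload alone
p_after = spam_probability(payload + padding)   # same payload, buried in padding

print(f"payload only:      {p_before:.2%} spam")
print(f"payload + padding: {p_after:.2%} spam")
```

Nothing in the sum caps how much benign text can be added, so the attacker simply keeps appending until the total crosses zero.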

The practical power of this technique comes from contexts where the appended content can be hidden from the human reader. In HTML emails, the padding text can sit inside comments or invisible elements. The spam filter processes the raw text including the comments, while the victim sees only the phishing message rendered in their email client.

Skylight Cyber demonstrated a production-scale version of this principle in 2019 when they bypassed BlackBerry Cylance’s ML-based antivirus. The researchers found that CylancePROTECT had a strong bias toward code from a specific whitelisted gaming application. By extracting strings from that game’s executable and appending them to malware samples, they flipped the engine’s verdict from malicious to benign. The technique worked against 100% of the top 10 malware samples for May 2019 and roughly 90% of a broader set of 384 samples. The underlying mechanism is the same as text overpowering: flood the feature space with enough benign indicators and the model’s learned associations do the rest.

Data poisoning: corrupting the model before it deploys

If input manipulation is about tricking a model that already exists, data poisoning is about breaking the model before it is even built. The attacker injects carefully crafted samples into the training data so that the resulting model learns the wrong associations.

How little data it takes

The instinct is to assume that poisoning requires massive volumes of fake data. In practice, it takes surprisingly little. To demonstrate this on a manageable scale, you extract a small subset of the training data to work with:

head -n 101 train.csv > poison.csv

Training on this reduced set of 100 samples (the 101 lines are the header plus the first 100 rows) still yields 94.4% accuracy, which is impressive for so little data. With so few samples, though, each row carries more weight, so the model is more sensitive to changes in the training distribution. That sensitivity is exactly what makes the demonstration visible.

The classifier trained on the clean 100-sample set classifies “Hello World! How are you doing?” as ham with 98.7% confidence. To flip that prediction, you inject two mislabelled entries into poison.csv:

spam,Hello World
spam,How are you doing?

After retraining, the model’s output for the same input shifts dramatically:

Predicted class: Spam
Probabilities:
     Ham: 20.34%
    Spam: 79.66%

Two rows were enough. To push confidence higher, you add two more entries that overlap the target phrase’s word combinations while avoiding exact duplicates (the pipeline deduplicates before training):

spam,Hello World! How are you
spam,World! How are you doing?

The result after retraining:

Predicted class: Spam
Probabilities:
     Ham: 0.4%
    Spam: 99.6%

The critical part is what happened to overall accuracy. Adding the evaluation code back in reveals the damage was almost invisible:

Model accuracy: 94.0%
Predicted class: Spam
Probabilities:
     Ham: 0.4%
    Spam: 99.6%

Four rows changed the model’s behaviour on the target input while only dropping overall accuracy from 94.4% to 94.0%, a decline of 0.4 percentage points that would not raise an eyebrow in any standard evaluation pipeline.

How this scales depends on the setting. For a small classifier like this one, a larger training set needs more poisoned samples to shift the decision boundary, though the number stays small relative to the total dataset size. For large language models, research by Souly et al. (2025) found that as few as 250 poisoned documents could successfully backdoor models across a range of parameter counts, from 600 million to 13 billion. The volume of poison needed is a function of the model architecture and the specificity of the target behaviour, not a fixed proportion of the training set.
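The four-row experiment can be reproduced in miniature without the exercise's helpers. The sketch below is a toy multinomial Naive Bayes with add-one smoothing written in pure Python; train_nb and spam_probability are hypothetical names, the ten clean rows are invented, and only the four poison rows come from the walkthrough above. It is not the classifier used in the exercise, so the exact probabilities will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def train_nb(rows):
    """rows: iterable of (label, text). Returns (per-class token counts, per-class row counts)."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    doc_counts = Counter()
    for label, text in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def spam_probability(model, text):
    """Posterior P(spam | text) with add-one smoothing; tokens never seen in training are ignored."""
    word_counts, doc_counts = model
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    scores = {}
    for label in ("ham", "spam"):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for token in tokenize(text):
            if token in vocab:
                score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    top = max(scores.values())  # subtract the max before exponentiating for stability
    odds = {label: math.exp(score - top) for label, score in scores.items()}
    return odds["spam"] / (odds["ham"] + odds["spam"])


# Invented clean training rows (assumption: stand-ins for the real SMS data).
clean_rows = [
    ("ham", "hello how are you doing today"),
    ("ham", "are you coming to the world cup party"),
    ("ham", "hello see you at lunch"),
    ("ham", "how is the new job going"),
    ("ham", "doing fine thanks how are you"),
    ("ham", "call me when you are home"),
    ("spam", "congratulations you won a prize"),
    ("spam", "click here to claim your free prize"),
    ("spam", "win cash now click here"),
    ("spam", "free entry claim your reward now"),
]

# The four mislabelled rows from the walkthrough.
poison_rows = [
    ("spam", "Hello World"),
    ("spam", "How are you doing?"),
    ("spam", "Hello World! How are you"),
    ("spam", "World! How are you doing?"),
]

target = "Hello World! How are you doing?"
clean_model = train_nb(clean_rows)
poisoned_model = train_nb(clean_rows + poison_rows)

p_clean = spam_probability(clean_model, target)
p_poisoned = spam_probability(poisoned_model, target)
print(f"clean model:    {p_clean:.2%} spam")
print(f"poisoned model: {p_poisoned:.2%} spam")
```

On this toy data the target flips from ham to spam, while messages that do not share the seeded words keep their original labels, which is the same targeted pattern the accuracy figures above reflect.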

Why it is hard to detect

Data poisoning is difficult to catch for several reasons that compound in production environments.

First, the overall accuracy drop is negligible. A decline of 0.4 percentage points on a test set is well within normal variance for most ML systems, especially during retraining cycles where the training data changes between runs anyway. Standard evaluation metrics will not flag it.

Second, the poisoned behaviour is targeted. The model performs normally on the vast majority of inputs. Only the specific patterns the attacker seeded into the training data trigger the misclassification. Unless you happen to test against exactly those patterns, the poison stays hidden.

Third, the poisoned entries look plausible in isolation. A training sample that says “Hello World” labelled as spam is unusual but not obviously malicious to a human reviewer scanning thousands of rows. In larger and more complex datasets, where training data is scraped from the web, pulled from user submissions, or aggregated from multiple sources, the attack surface for injection is enormous.

A 2024 study cited by Hartle et al. (2025) tested poisoned medical misinformation data against fifteen clinicians and found that the reviewers could not distinguish poisoned responses from clean baselines. When concept-specific data was poisoned at just 0.001%, harmful content increased by 4.8%. The poison was invisible to domain experts working directly with the output.

Targeted vs. indiscriminate poisoning

Data poisoning attacks fall into two broad categories. Indiscriminate (or availability) attacks aim to degrade the model’s overall performance, effectively making it unusable. Targeted (or integrity) attacks aim to cause specific misclassifications on chosen inputs while keeping overall performance intact. The spam classifier exercise above is a targeted attack: the goal was to make one particular phrase misclassify without destroying the model’s general accuracy. Targeted attacks are harder to detect and more useful to an attacker with specific objectives, which is why they feature so prominently in adversarial ML research.

What this means for defensive practice

The defence against input manipulation is not to build a better single classifier. It is to assume the classifier will be evaded and layer accordingly. Ensemble methods that combine multiple model architectures reduce the attacker’s ability to find a single evasion vector that works universally. Input preprocessing, such as stripping HTML comments before classification or normalising URLs, removes the channels attackers use to hide overpowering content.
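The preprocessing step can be sketched with Python's standard html.parser module. The class below is a hypothetical helper, not part of the exercise: it drops comments, script and style blocks, and elements hidden with the hidden attribute or an inline display:none style. Real email clients hide content in many more ways (external CSS, zero-size fonts, matching foreground and background colours), so treat this as a starting point under those assumptions.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text a reader would plausibly see; comments are dropped by default."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._hidden_depth = 0  # > 0 while inside a hidden element

    @staticmethod
    def _is_hidden(tag, attrs):
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        return tag in {"script", "style"} or "hidden" in attr or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth or self._is_hidden(tag, attrs):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self._chunks.append(data)

    # handle_comment is intentionally not overridden: the base class ignores comments.


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser._chunks).split())


html_email = (
    "<p>Congratulations! You won a prize.</p>"
    "<!-- But I must explain to you how all this mistaken idea was born -->"
    '<div style="display: none">the master-builder of human happiness</div>'
)
visible = visible_text(html_email)
print(visible)
```

Classifying the output of visible_text rather than the raw markup means the model scores roughly what the victim reads, so hidden padding no longer counts toward the ham side.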

For data poisoning, the priority is training data integrity. That means provenance tracking for every data source, anomaly detection on incoming training samples before they enter the pipeline, and evaluation against targeted test cases rather than relying solely on aggregate accuracy. Ensemble-based defences that train multiple models on disjoint subsets of the training data can limit the impact of poisoned samples, since the poison needs to land in multiple subsets to affect the ensemble’s consensus.
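The disjoint-subset idea can be sketched as follows. Every name here is hypothetical, and memorising_learner is a deliberately trivial base learner chosen only so the example runs on its own; in practice train_fn would train a real classifier on each bucket. Rows are assigned to buckets by hashing their text, so one injected row can influence only one member of the ensemble.

```python
import hashlib
from collections import Counter
from typing import Callable, Sequence

Sample = tuple[str, str]            # (label, text)
Classifier = Callable[[str], str]   # text -> label


def partition(samples: Sequence[Sample], k: int) -> list[list[Sample]]:
    """Deterministically split samples into k disjoint buckets by hashing the text."""
    buckets: list[list[Sample]] = [[] for _ in range(k)]
    for label, text in samples:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        buckets[int.from_bytes(digest[:8], "big") % k].append((label, text))
    return buckets


def train_ensemble(samples, k, train_fn) -> list[Classifier]:
    """Train one classifier per bucket."""
    return [train_fn(bucket) for bucket in partition(samples, k)]


def majority_vote(classifiers: Sequence[Classifier], text: str) -> str:
    votes = Counter(classifier(text) for classifier in classifiers)
    return votes.most_common(1)[0][0]


def memorising_learner(bucket: Sequence[Sample]) -> Classifier:
    """Toy base learner (assumption for the demo): flags only texts it saw labelled spam."""
    spam_texts = {text for label, text in bucket if label == "spam"}
    return lambda text: "spam" if text in spam_texts else "ham"


clean = [("ham", f"message {i}") for i in range(20)]
poison = [("spam", "Hello World")]

ensemble = train_ensemble(clean + poison, k=5, train_fn=memorising_learner)
spam_votes = sum(classifier("Hello World") == "spam" for classifier in ensemble)
verdict = majority_vote(ensemble, "Hello World")
print(f"{spam_votes} of {len(ensemble)} members vote spam, ensemble says {verdict}")
```

The protection is bounded: an attacker who submits enough distinct variants to reach a majority of buckets still wins, so this raises the cost of poisoning without eliminating it.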

The harder truth is that automated retraining pipelines, where models ingest new data and retrain on a schedule without human review, are the highest-risk surface for poisoning. If the pipeline pulls from user-submitted data, web scrapes, or any source the attacker can influence, the injection point is already open.
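One way to put a check into such a pipeline is a promotion gate that tests a fixed set of canary messages as well as aggregate accuracy. The sketch below is hypothetical: failed_canaries and should_promote are illustrative names, the 0.9 accuracy threshold is arbitrary, and the two predict functions are stubs standing in for a clean and a poisoned model.

```python
from typing import Callable, Mapping


def failed_canaries(predict: Callable[[str], str],
                    canaries: Mapping[str, str]) -> dict[str, str]:
    """Return {message: predicted_label} for each canary the model gets wrong."""
    failures = {}
    for message, expected in canaries.items():
        predicted = predict(message)
        if predicted != expected:
            failures[message] = predicted
    return failures


def should_promote(predict, canaries, accuracy, min_accuracy=0.9) -> bool:
    """Promote a retrained model only if accuracy holds AND every canary passes."""
    return accuracy >= min_accuracy and not failed_canaries(predict, canaries)


# Known-good expectations that must survive every retraining run.
CANARIES = {
    "Hello World! How are you doing?": "ham",
    "Congratulations! You won a prize.": "spam",
}


def clean_predict(message: str) -> str:
    """Stub for a healthy model."""
    return "spam" if "prize" in message.lower() else "ham"


def poisoned_predict(message: str) -> str:
    """Stub for a model that has learned to flag the seeded phrase."""
    lowered = message.lower()
    return "spam" if "prize" in lowered or "hello" in lowered else "ham"


promote_clean = should_promote(clean_predict, CANARIES, accuracy=0.944)
promote_poisoned = should_promote(poisoned_predict, CANARIES, accuracy=0.940)
print(f"clean model promoted:    {promote_clean}")
print(f"poisoned model promoted: {promote_poisoned}")
```

Canaries only catch the behaviours you thought to pin down in advance, so they complement provenance tracking and data review rather than replace them.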

The uncomfortable arithmetic

Four rows in a hundred changed a classifier’s mind about a specific input. The accuracy drop was 0.4 percentage points, small enough to pass for noise between training runs. That is the uncomfortable arithmetic of model manipulation: the attacker does not need to break the model. They just need to bend it in the right place, and the standard metrics you rely on to validate performance will tell you everything is fine.
