Datasets and data quality

A model is only as reliable as the data it learned from, and that sentence gets repeated so often in machine learning courses that it has lost its teeth. To a red teamer, it is an operational thesis rather than a platitude. If you can write to the training data, you effectively own the model. If you understand how the data was cleaned, you know where the assumptions live, and assumptions are where defences break. This entry steps back from algorithms to look at what feeds them: datasets, their structure, their quality attributes, and the preprocessing pipeline that transforms raw data into something a model can consume. Every stage of that pipeline is a surface you can probe.

What a dataset is

A dataset is a structured collection of data points used for analysis and model training. The structure varies depending on the problem domain.

Tabular data is organised into rows and columns, the format you see in spreadsheets, CSVs, and relational databases. Most classical machine learning, the scikit-learn territory from the previous entry, operates on tabular data. Image data represents visual information as multi-dimensional arrays of pixel values. Text data is unstructured, composed of tokens, sentences, and documents. Time series data adds a temporal dimension, where the ordering and spacing of observations carry meaning.

From an adversarial standpoint, the data type determines the attack surface. Poisoning tabular data means injecting or modifying rows in a CSV or database table, a task that requires write access to a data pipeline or the ability to influence the data collection process. Poisoning image data might mean manipulating a handful of training images with imperceptible pixel perturbations, a technique that has been demonstrated to work even at very low poisoning rates. Poisoning at web scale became disturbingly practical when researchers showed that datasets distributed as lists of URLs, such as the image-text dataset LAION-400M, could be contaminated by purchasing expired domains that the dataset’s URLs pointed to and re-hosting modified content, so that anyone who downloads the dataset afterwards retrieves the attacker’s version.

What makes a dataset “good”, and what makes it exploitable

The conventional wisdom lists several properties that define a quality dataset: relevance, completeness, consistency, accuracy, representativeness, balance, and sufficient size. Each of these properties is a defensive assumption, and each one has a corresponding offensive implication.

Relevance means the data should relate to the problem at hand. Irrelevant features introduce noise. But an attacker can exploit this in reverse. If a model includes features that correlate with the target label only in the training distribution but not in the real world, poisoning those spurious features is cheaper and harder to detect than attacking the genuine signal.

Completeness means the dataset should have minimal missing values. Defenders typically use imputation to fill gaps, replacing missing entries with the column mean, the median, or a value predicted by k-nearest neighbours. An attacker who knows the imputation strategy can craft partially missing inputs that, once imputed, land exactly where they want in feature space. The imputer becomes an unwitting accomplice.
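The effect of a known imputation rule can be sketched in a few lines. This is a minimal illustration, assuming plain mean imputation with pandas; the column name bytes_transferred and all values are made up for the example and do not come from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical training column: bytes transferred for five events.
train = pd.DataFrame({"bytes_transferred": [100.0, 200.0, 300.0, 400.0, 500.0]})

# Mean imputation learns a single fill value from the training data.
fill_value = train["bytes_transferred"].mean()

# Two incoming records: one omits the field, one reports an extreme value.
incoming = pd.DataFrame({"bytes_transferred": [np.nan, 9000.0]})
filled = incoming["bytes_transferred"].fillna(fill_value)

print(fill_value)        # the learned fill value
print(filled.tolist())   # the missing entry now sits at the training mean
```

The record that withheld the field ends up at the centre of the training distribution for that feature, while the record that reported an honest extreme stays an outlier. Anyone who can guess the fill rule can choose which of the two a downstream model sees.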

Consistency means uniform formatting: dates in the same format, categories spelled the same way, numerical columns free of stray strings. Inconsistencies that survive cleaning can cause silent errors in model training, and an attacker who can introduce subtle formatting mismatches into a data pipeline may be able to corrupt training without triggering validation checks.
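A small sketch shows how quietly a formatting mismatch changes what a model sees. It assumes a hypothetical protocol column with casing and whitespace differences; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical protocol column with casing and trailing-whitespace mismatches.
protocol = pd.Series(["TCP", "tcp", "TCP ", "UDP", "udp"])

# Without normalisation, every spelling counts as its own category.
raw_categories = protocol.nunique()

# Normalise: strip surrounding whitespace, then upper-case.
cleaned = protocol.str.strip().str.upper()
clean_categories = cleaned.nunique()

print(raw_categories, clean_categories)
```

Nothing in the raw column is missing or malformed in a way a null check would flag, yet a categorical encoder would turn two protocols into five distinct features.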

Representativeness means the dataset should reflect the population it models. A facial recognition system trained predominantly on one demographic will underperform on others, and an attacker can exploit that gap deliberately. But representativeness also creates a subtler problem. If the training data does not cover the adversarial distribution, the regions of input space where attacks live, the model has no learned defence against them. This is a large part of why adversarial examples work. The model was never trained on inputs that look like real data but have been perturbed to cross a decision boundary.

Balance means classes should be proportionally represented. Imbalanced datasets produce models that default to predicting the majority class, which means minority-class attacks, rare event types disguised as normal traffic, are more likely to slip through. An attacker targeting an intrusion detection system does not need to fool the model on every input. They need to fool it on the inputs that matter, and those inputs usually fall in the minority class.
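To see why headline accuracy hides this, consider a degenerate classifier that always predicts the majority class. The sketch below assumes a hypothetical label column with 95 normal events and 5 high-threat events; the proportions are chosen only to make the arithmetic obvious.

```python
import pandas as pd

# Hypothetical labels: 95 normal events (0) and 5 high-threat events (2).
labels = pd.Series([0] * 95 + [2] * 5)

# A degenerate "model" that always predicts the majority class.
majority = labels.mode()[0]
predictions = pd.Series([majority] * len(labels))

# Overall accuracy looks healthy.
accuracy = (predictions == labels).mean()

# Recall on the high-threat class: share of true 2s that were predicted as 2.
high_threat = labels == 2
recall_high = (predictions[high_threat] == 2).mean()

print(accuracy, recall_high)
```

The classifier is right 95% of the time and catches none of the events that matter, which is why per-class recall is a more useful number than accuracy when the classes are skewed.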

A practical dataset

To ground these ideas, consider a dataset of network log entries, the kind of data that feeds intrusion detection systems and threat classification models. Each row describes a network event with fields like source IP, destination port, protocol, bytes transferred, and a threat level label (0 for normal, 1 for low threat, 2 for high threat).

This structure is typical of security analytics datasets, and it carries every preprocessing challenge at once. The data mixes numerical columns (bytes transferred, destination port) with categorical ones (protocol, source IP). Some numeric columns contain non-numeric strings that slipped in during collection. The threat level column includes unknown values like ? and -1 that do not map to any defined category. Missing values appear across multiple columns.

Each of these flaws is a realistic artefact of how security data gets collected in production environments. Log pipelines aggregate from multiple sources with different schemas, ingest rates, and error handling. The messiness is not a bug in the dataset. It is a faithful representation of what operational data actually looks like.

Loading and exploring data with pandas

Pandas is the standard Python library for tabular data manipulation, and learning its API is not optional if you are working with datasets in any capacity. Loading a CSV into a pandas DataFrame gives you a labelled, queryable structure that supports filtering, transformation, and aggregation.

import pandas as pd

data = pd.read_csv("./demo_dataset.csv")

The first thing to do with any dataset is look at it. data.head() shows the first five rows, which immediately reveals formatting issues, unexpected column names, and obvious data type problems.

print(data.head())

data.info() prints a summary of column names, data types, and non-null counts. This is where you spot columns that should be numeric but are typed as objects (strings), which usually means non-numeric values have contaminated the column. Because info() writes the summary itself and returns None, call it directly rather than wrapping it in print().

data.info()

data.isnull().sum() counts missing values per column, giving you a map of where the gaps are and how severe they are.

print(data.isnull().sum())

These three commands, head(), info(), and isnull().sum(), are the minimum viable inspection of any new dataset. They take seconds to run and prevent hours of debugging later when a model fails to converge or produces nonsensical predictions because of a data issue that could have been caught upfront.
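Once info() has flagged a column as object-typed, a natural next step is to find out how many entries are actually unparseable. The following is a minimal sketch using pd.to_numeric with errors="coerce"; the frame called sample, its column names, and its values are hypothetical stand-ins rather than the contents of demo_dataset.csv.

```python
import pandas as pd

# Hypothetical stand-in for the network log; names and values are illustrative only.
sample = pd.DataFrame({
    "bytes_transferred": ["1024", "2048", "NON_NUMERIC", None, "512"],
    "destination_port": ["443", "80", "22", "STRING_PORT", "8080"],
})

unparseable = {}
for column in ["bytes_transferred", "destination_port"]:
    # errors="coerce" turns anything that cannot be parsed into NaN.
    converted = pd.to_numeric(sample[column], errors="coerce")
    # Present in the raw data but NaN after conversion: a stray string, not a gap.
    unparseable[column] = int((sample[column].notna() & converted.isna()).sum())
    sample[column] = converted

print(unparseable)
print(sample.dtypes)
```

Counting the coerced entries separately from the values that were already missing keeps two different problems apart: gaps in collection and contamination of the column.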

Why preprocessing is the quiet part of the attack surface

Data preprocessing sits between raw collection and model training. It includes cleaning (removing or fixing invalid entries), encoding (converting categorical values to numbers), scaling (normalising feature ranges), and imputation (filling missing values). In a well-run ML pipeline, preprocessing is deterministic and reproducible. In practice, it is often ad hoc, undocumented, and full of implicit decisions that nobody audits.

Consider the network log dataset. The threat level column contains ? and -1 alongside valid labels. How should a preprocessing pipeline handle those values? If it drops the rows, the training set shrinks and may lose rare but informative samples. If it maps unknown values to 0 (normal traffic), it teaches the model that ambiguous events are benign. If it maps them to the majority class, it reinforces existing biases. Each choice has a different effect on the model’s behaviour, and an attacker who understands the pipeline’s handling of edge cases can exploit whichever assumption was made.
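The three options can be compared side by side. This is a sketch under stated assumptions: the threat-level values below are invented, and the column is treated as strings because a stray ? prevents it from loading as integers.

```python
import pandas as pd

# Hypothetical threat-level column, read as strings because of the stray "?".
threat = pd.Series(["0", "0", "0", "1", "2", "?", "-1", "0"])
is_valid = threat.isin({"0", "1", "2"})

# Option 1: drop rows with unknown labels.
dropped = threat[is_valid].astype(int)

# Option 2: map unknown labels to 0 (normal traffic).
as_normal = threat.where(is_valid, "0").astype(int)

# Option 3: map unknown labels to the majority class among valid labels.
majority = dropped.mode()[0]
as_majority = threat.where(is_valid, str(majority)).astype(int)

for name, result in [("drop", dropped), ("normal", as_normal), ("majority", as_majority)]:
    print(name, result.value_counts().sort_index().to_dict())
```

In this toy column the majority class is also 0, so options 2 and 3 produce identical labels. That coincidence is likely wherever normal traffic dominates, and it means both rules quietly relabel ambiguous events as benign.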

This is the real lesson of dataset quality for red teamers. The danger is rarely in obviously corrupted data. Obvious corruption gets caught by validation checks. The danger is in the quiet decisions: how missing values were filled, which rows were dropped, how categorical variables were encoded, and whether the scaling preserved or destroyed the statistical relationships between features. These decisions are rarely logged, rarely versioned, and almost never tested against adversarial inputs.
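Scaling is a good example of a quiet decision with loud consequences. The sketch below assumes min-max scaling, written out by hand, and a single injected extreme value; the helper min_max and all numbers are hypothetical.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Rescale a column to the [0, 1] range using its own min and max."""
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical clean feature column.
clean = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

# The same column with one injected extreme value appended.
poisoned = pd.concat([clean, pd.Series([1_000_000.0])], ignore_index=True)

scaled_clean = min_max(clean)
scaled_poisoned = min_max(poisoned)

# How much of the [0, 1] range the five legitimate values occupy.
spread_clean = scaled_clean.max() - scaled_clean.min()
spread_poisoned = scaled_poisoned.iloc[:5].max() - scaled_poisoned.iloc[:5].min()

print(spread_clean, spread_poisoned)
```

One row is enough to squeeze every legitimate value into a sliver of the scaled range, and the scaler raises no error while doing it.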

Research published in 2025 by Souly and colleagues demonstrated that as few as 250 poisoned documents were sufficient to compromise text-based models across multiple use cases and varying training set sizes. In a separate study on medical language models, researchers injected poisoned content at a rate of just 0.001% of the training data and measured a 4.8% increase in harmful output, while clinical reviewers were unable to distinguish the poisoned responses from clean ones. The poisoning worked because it targeted the same pipeline assumptions that preprocessing relies on. The malicious data looked normal to every automated check. It just shifted the decision boundary in the direction the attacker wanted.

The dataset is the model’s memory

Thirteen entries into this series, we have covered the algorithms that learn from data. This entry covers the data itself, the thing that determines what those algorithms actually learn. A model does not know anything its training data did not contain, and it believes everything its training data did contain. For a defender, that means dataset integrity is the foundation of model integrity. For a red teamer, it means the dataset is the softest target in the entire pipeline. You do not need to reverse-engineer the architecture or steal the weights. You just need to change what the model remembers.
