Data preprocessing

Your intrusion detection model missed a command-and-control callback because six months ago, during preprocessing, someone replaced every missing threat_level value with 0. The model learned that ambiguity means safety, and the attacker learned that too. The previous entry in this series covered datasets as attack surfaces: how the structure and quality of training data determine what a model believes. This entry goes deeper into the mechanics. Data preprocessing is where raw observations become training signal, and every transformation applied during that process encodes an assumption about the world. For a red teamer, each of those assumptions is a potential lever.

What preprocessing actually does

Raw data is rarely fit for direct consumption by a machine learning algorithm. Values are missing, formats are inconsistent, scales vary by orders of magnitude, and entire columns may contain entries that violate the domain’s basic constraints. Preprocessing is the set of operations that transforms this mess into something a model can learn from.

The standard pipeline has four stages. Data cleaning handles missing values, removes duplicates, and smooths noise. Data transformation normalises, encodes, and scales features into ranges the algorithm expects. Data integration merges data from multiple sources and resolves conflicts between them. Data formatting converts types and reshapes structures so the data fits the model’s input requirements.
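As a rough illustration of those four stages, the sketch below chains them on a toy DataFrame. The column names follow the examples used later in this entry. The `assets` table, the `segment` column, and the function names are hypothetical, and min-max scaling is only one of many possible transformations.

```python
import pandas as pd

def clean(df):
    # Cleaning: remove exact duplicates and rows with no port at all.
    return df.drop_duplicates().dropna(subset=["destination_port"])

def transform(df):
    # Transformation: min-max scale bytes_transferred into [0, 1].
    # (Assumes the column is not constant, otherwise hi - lo is zero.)
    out = df.copy()
    lo, hi = out["bytes_transferred"].min(), out["bytes_transferred"].max()
    out["bytes_scaled"] = (out["bytes_transferred"] - lo) / (hi - lo)
    return out

def integrate(df, other):
    # Integration: attach metadata from a second source, keyed on source_ip.
    return df.merge(other, on="source_ip", how="left")

def format_types(df):
    # Formatting: cast the port to the integer type the model expects.
    out = df.copy()
    out["destination_port"] = out["destination_port"].astype(int)
    return out

logs = pd.DataFrame({
    "source_ip": ["192.168.1.5", "192.168.1.5", "192.168.1.9", "192.168.1.7"],
    "destination_port": [443, 443, None, 22],
    "bytes_transferred": [1000, 1000, 500, 3000],
})
assets = pd.DataFrame({
    "source_ip": ["192.168.1.5", "192.168.1.7"],
    "segment": ["office", "servers"],
})

result = format_types(integrate(transform(clean(logs)), assets))
```

Four input rows become two. The duplicate and the row with the missing port never reach the model, and nothing in the output records that they existed.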

Each stage makes decisions: what counts as missing, what counts as invalid, what to do with ambiguous entries, and how to represent categorical information as numbers. Those decisions are rarely treated as security-relevant, which is exactly what makes them useful to an attacker.

Validation as a boundary definition

Before cleaning or transforming anything, the pipeline needs to determine what counts as valid data. In a network security dataset, that means defining explicit boundaries for every column and flagging everything that falls outside them.

An IP address is valid if it matches the IPv4 format, four octets between 0 and 255 separated by dots. A port number is valid if it falls within the range 0 to 65535. A protocol value is valid if it appears in a known set like TCP, UDP, HTTP, DNS, SSH, and so on. Bytes transferred must be numeric and non-negative. Threat levels must fall within whatever discrete range the labelling scheme defines.

In Python with Pandas, these checks are straightforward:

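import re
import pandas as pd

# Compile the IPv4 pattern once instead of on every call.
IPV4_PATTERN = re.compile(
    r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
)

def is_valid_ip(ip):
    # fullmatch also rejects a trailing newline, which match() with '$' accepts.
    return bool(IPV4_PATTERN.fullmatch(str(ip)))

def is_valid_port(port):
    try:
        return 0 <= int(port) <= 65535
    except (ValueError, TypeError, OverflowError):
        # Non-numeric strings, None, NaN, and infinity all count as invalid.
        return False

valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']

# `data` is the raw DataFrame of network events.
# Each result holds the rows that FAIL the corresponding check.
invalid_ips = data[~data['source_ip'].astype(str).apply(is_valid_ip)]
invalid_ports = data[~data['destination_port'].apply(is_valid_port)]
invalid_protocols = data[~data['protocol'].isin(valid_protocols)]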

Each validation function draws a line between data the pipeline will accept and data it will reject. From a red teaming perspective, those lines are the model’s perimeter. Anything that passes validation enters the training set and shapes the model’s understanding of normal. Anything that fails validation gets dropped or imputed, which means the model never learns from it at all.

The adversarial question is not whether the validation logic is correct. It is whether an attacker can craft inputs that are technically valid according to these checks but semantically misleading. An IP address of 10.0.0.1 passes the regex, but if every legitimate entry in the training set comes from the 192.168.x.x range, that valid-looking address may distort the model’s learned distribution of normal source addresses.
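One way to make that gap visible is to pair the format check with a distribution check. The sketch below is a minimal illustration, assuming legitimate traffic is expected to originate from 192.168.0.0/16. The `expected_net` value and the `classify_source` helper are hypothetical and are not part of the pipeline above.

```python
import ipaddress

# Assumption: legitimate training traffic originates from this network.
expected_net = ipaddress.ip_network("192.168.0.0/16")

def classify_source(ip):
    """Return 'invalid', 'valid_expected', or 'valid_unexpected' for an IPv4 string."""
    try:
        addr = ipaddress.IPv4Address(ip)
    except ipaddress.AddressValueError:
        return "invalid"
    return "valid_expected" if addr in expected_net else "valid_unexpected"

samples = ["192.168.1.20", "10.0.0.1", "999.1.1.1"]
labels = {ip: classify_source(ip) for ip in samples}
```

Here 10.0.0.1 comes back as valid_unexpected. It would pass the regex, but the second check flags it for review before it can shift what the model considers a normal source.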

The drop-or-impute decision

Once invalid entries are identified, the pipeline has to decide what to do with them. This decision has a direct effect on the model’s behaviour, and each option creates a different kind of vulnerability.

Dropping invalid entries

The simplest approach is to discard every row that contains an invalid value:

data = data.drop(invalid_ips.index, errors='ignore')
data = data.drop(invalid_ports.index, errors='ignore')
data = data.drop(invalid_protocols.index, errors='ignore')

In a dataset of 100 entries where 23 fail validation, dropping leaves 77 clean rows. The remaining data is accurate, but the model has learned from a smaller and potentially less representative sample.

For a red teamer, the vulnerability here is selective attrition. If an attacker can introduce entries that fail validation in a targeted way, perhaps by corrupting records associated with a specific traffic pattern, they can cause the pipeline to drop exactly the samples that would have taught the model to detect that pattern. The model does not learn what it never sees.
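A simple way to look for selective attrition is to compare drop rates across groups before discarding anything. The sketch below assumes a hypothetical `traffic_class` column and toy values. In a real dataset the grouping key could be any feature or label that should not correlate with validation failures.

```python
import pandas as pd

# Toy data: 'traffic_class' is a hypothetical column used only for illustration.
toy = pd.DataFrame({
    "traffic_class": ["web", "web", "web", "beacon", "beacon", "beacon"],
    "destination_port": [443, 80, 443, "STRING_PORT", "STRING_PORT", 8443],
})

def port_ok(port):
    try:
        return 0 <= int(port) <= 65535
    except (ValueError, TypeError):
        return False

# Mark the rows that validation would discard.
toy["dropped"] = ~toy["destination_port"].apply(port_ok)

# Share of rows per class that would be lost.
drop_rate = toy.groupby("traffic_class")["dropped"].mean()
```

In this toy example two thirds of the beacon rows would be dropped and none of the web rows. A skew like that is worth investigating before the rows disappear.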

Imputing invalid entries

The alternative is to replace invalid values with estimates derived from the remaining data. This preserves sample size but introduces synthetic values that the model will treat as real observations.

The first step is to standardise all invalid markers into a single representation. Entries like MISSING_IP, INVALID_IP, STRING_PORT, NON_NUMERIC, and ? all get converted to NaN:

import numpy as np

invalid_markers = ['INVALID_IP', 'MISSING_IP', 'STRING_PORT',
                   'UNUSED_PORT', 'NON_NUMERIC', 'NEGATIVE', '?']
df.replace(invalid_markers, np.nan, inplace=True)

df['destination_port'] = pd.to_numeric(df['destination_port'], errors='coerce')
df['bytes_transferred'] = pd.to_numeric(df['bytes_transferred'], errors='coerce')
df['threat_level'] = pd.to_numeric(df['threat_level'], errors='coerce')

Once everything is NaN, the imputation strategy determines what values replace them.
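Before choosing a strategy, it helps to know how much of each column is about to become synthetic. The sketch below uses toy values and relies on `pd.to_numeric(..., errors='coerce')` turning any non-numeric marker into NaN. The `missing_share` name is just for illustration.

```python
import pandas as pd

# Toy numeric columns containing the kinds of markers listed above.
toy = pd.DataFrame({
    "destination_port": ["443", "STRING_PORT", "80", "?"],
    "bytes_transferred": ["1200", "NEGATIVE", "NON_NUMERIC", "560"],
    "threat_level": ["1", "2", "?", "0"],
})

# Coerce every column to numeric; anything unparseable becomes NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Fraction of each column that imputation would have to invent.
missing_share = numeric.isna().mean()
```

Half of the port and byte values in this toy frame would be invented by the imputer, along with a quarter of the threat levels.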

Imputation strategies and what they teach the model

Simple imputation fills missing numeric values with the column’s median or mean, and missing categorical values with the most frequent category:

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
df[['destination_port', 'bytes_transferred', 'threat_level']] = (
    num_imputer.fit_transform(df[['destination_port', 'bytes_transferred', 'threat_level']])
)

cat_imputer = SimpleImputer(strategy='most_frequent')
df[['protocol']] = cat_imputer.fit_transform(df[['protocol']])

Median imputation fills every missing port number with the same value. If port 443 is the median, the model learns that ambiguous traffic defaults to HTTPS. Applied to bytes_transferred, median imputation (like mean imputation) replaces every missing value with a single central value, smoothing out exactly the kind of variance that might distinguish malicious traffic from benign. Most-frequent imputation for protocol assigns the dominant protocol to every gap, reinforcing the majority class at the expense of rarer but potentially more informative entries.

Each of these choices is a bet about what the missing data would have looked like if it were present. An attacker who knows which strategy is in use can predict what the imputed values will be and craft inputs that exploit those predictions.
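The predictability is easy to demonstrate. The sketch below uses the pandas `median` and `fillna` methods as a stand-in for the median strategy, on the assumption that both compute the median of the observed values. The port values are made up.

```python
import numpy as np
import pandas as pd

# Five observed ports and three gaps.
ports = pd.Series([22, 53, 443, 443, 8080, np.nan, np.nan, np.nan])

# The fill value depends only on the observed values, so anyone who can
# estimate the column's distribution can predict it.
fill_value = ports.median()
imputed = ports.fillna(fill_value)

# Spread before and after: the gaps all collapse onto one point.
std_before = ports.std()   # NaN values are skipped by default
std_after = imputed.std()
```

Every gap becomes 443, and the standard deviation of the column shrinks. An attacker who withholds the port field on chosen rows knows in advance what the model will see in its place.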

KNN imputation

A more sophisticated approach uses K-nearest neighbours to fill missing values based on the values of similar rows:

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)
df[['destination_port', 'bytes_transferred', 'threat_level']] = (
    knn_imputer.fit_transform(df[['destination_port', 'bytes_transferred', 'threat_level']])
)

KNN imputation considers the relationship between features, so a missing port number gets filled based on what similar traffic patterns typically use rather than the global median. This produces more contextually plausible values, but it also means the imputed values are a function of the local neighbourhood in feature space.

The adversarial implication is that an attacker who can position poisoned samples near the neighbours of a target row can influence what values get imputed. Research published in 2024 demonstrated that adversarial missingness attacks, where an attacker strategically introduces missing entries rather than corrupted values, can exploit specific remediation strategies to change statistical outcomes. The study showed that attacks against mean imputation, complete case analysis, and regression-based imputation could alter p-values from significant to insignificant using less than 20% missingness.
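The neighbourhood effect can be sketched with a handful of rows. This is a toy construction and does not reproduce the cited study. The column order is destination_port, bytes_transferred, threat_level, and the `clean`, `poison`, and `imputed_threat` names are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Legitimate rows: traffic on port 443 near 1000 bytes carries threat_level 2.
clean = np.array([
    [443, 1000, 2], [443, 1100, 2], [443, 900, 2],
    [80, 5000, 0], [80, 5200, 0],
], dtype=float)

# Target row whose threat_level is missing.
target = np.array([[443, 1010, np.nan]])

# Attacker-supplied rows placed right next to the target, labelled 0.
poison = np.array([[443, 1009, 0], [443, 1010, 0], [443, 1011, 0]], dtype=float)

def imputed_threat(rows):
    # Fit on the given rows plus the target, return the target's filled value.
    filled = KNNImputer(n_neighbors=3).fit_transform(np.vstack([rows, target]))
    return filled[-1, 2]

before = imputed_threat(clean)
after = imputed_threat(np.vstack([clean, poison]))
```

With only the clean rows, the three nearest neighbours all carry threat_level 2, so the gap is filled with 2. Once the three poisoned rows sit closer than any legitimate one, the same gap is filled with 0.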

Domain constraints as a final gate

After imputation fills the gaps, domain-specific validation acts as a sanity check:

valid_protocols = ['TCP', 'TLS', 'SSH', 'POP3', 'DNS', 'HTTPS', 'SMTP', 'FTP', 'UDP', 'HTTP']
df.loc[~df['protocol'].isin(valid_protocols), 'protocol'] = df['protocol'].mode()[0]

df['source_ip'] = df['source_ip'].fillna('0.0.0.0')
df['destination_port'] = df['destination_port'].clip(lower=0, upper=65535)

Port values get clipped to the valid range. Invalid protocols get replaced with the mode. Missing IPs default to 0.0.0.0.

The 0.0.0.0 default is worth examining. It is technically valid and passes every format check, but it is not a real source address. If the model encounters enough rows with 0.0.0.0 as the source IP, it learns that this address is part of normal traffic. An attacker spoofing packets from 0.0.0.0 would match the learned distribution of “normal” sources, because the preprocessing pipeline taught the model to expect it.
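One hedge against this is to record which rows were filled before the sentinel goes in, so the model or a later audit can tell a real address from a placeholder. The sketch below uses toy values, and the `source_ip_missing` column is a hypothetical addition rather than part of the pipeline above.

```python
import ipaddress
import pandas as pd

toy = pd.DataFrame({"source_ip": ["192.168.1.5", None, "192.168.1.9", None]})

# Record missingness first, then fill with the sentinel.
toy["source_ip_missing"] = toy["source_ip"].isna()
toy["source_ip"] = toy["source_ip"].fillna("0.0.0.0")

# How much of the column is now the sentinel value.
share_sentinel = (toy["source_ip"] == "0.0.0.0").mean()

# The standard library treats 0.0.0.0 as the unspecified address.
is_placeholder = ipaddress.ip_address("0.0.0.0").is_unspecified
```

Here half of the source addresses are placeholders. Without the indicator column, the model would see them as ordinary traffic from 0.0.0.0.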

Clipping port values to the valid range has a similar effect. Any value outside the range is mapped onto the nearest boundary, so a value above 65535 becomes exactly 65535 and a negative value becomes 0. The model now has training examples where port 65535 appears with the statistical profile of whatever was actually out of range or missing from the data, not with the profile of real traffic on that port.
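Counting what the clip changes is a cheap way to see how many boundary values are artefacts. A minimal sketch with made-up port values:

```python
import pandas as pd

# Toy ports: one above the range, one below, one genuinely at the upper bound.
ports = pd.Series([443.0, 70000.0, -1.0, 65535.0, 8080.0])

clipped = ports.clip(lower=0, upper=65535)

# Rows whose value was altered by the clip.
was_clipped = ports != clipped
n_clipped = int(was_clipped.sum())

# Rows that end up at the upper bound, real or not.
n_at_upper = int((clipped == 65535).sum())
```

Two rows end up at 65535, but only one of them was real traffic on that port. After clipping, the two are indistinguishable unless the `was_clipped` mask is kept.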

Why this matters for red teaming

The preprocessing pipeline is typically written once, tested against the initial dataset, and then left in place as the model is retrained on new data. It is rarely versioned with the same discipline as the model architecture. It is almost never tested against adversarial inputs.

In 2025, researchers at Sonatype documented over 18,000 malicious open-source packages targeting AI ecosystems including PyTorch, TensorFlow, and Hugging Face. Some of those packages targeted data preprocessing tools specifically, because compromising a normalisation function or an imputation routine is a more durable form of poisoning than corrupting individual training samples. The malicious logic persists across retraining cycles.

For a red teamer studying this series, the lesson from preprocessing is structural. You do not need to understand the model’s architecture to influence its behaviour. You need to understand the pipeline that sits between the raw world and the model’s training loop. Every fillna, every clip, every SimpleImputer(strategy='median') is an assumption encoded as code, and assumptions that nobody audits are assumptions that nobody defends.
