Python libraries for AI red teaming

Every algorithm we have covered in this series, from linear regression through to SARSA, eventually needs to run somewhere. Theory gets you the intuition, but the moment you want to train a classifier, poison a dataset, or craft an adversarial input, you need code. Two Python libraries dominate the space, and understanding their APIs is the difference between reading about attacks and executing them.

The two-library split

The AI tooling ecosystem has settled into a clean division of labour. Scikit-learn handles classical machine learning, the algorithms that operate on structured, tabular data and often produce models that are easier to interpret. PyTorch handles deep learning, where models are large neural networks trained on raw signals like images, audio, and text. If you have worked through this series in order, every algorithm from entries 3 through 9 (supervised learning, linear regression, logistic regression, decision trees, anomaly detection, SVMs, ensemble methods) maps directly to scikit-learn. The reinforcement learning entries (Q-learning, SARSA) lean more towards PyTorch territory, though simple tabular implementations can live in pure NumPy.

For a red teamer, both libraries are operational necessities. Scikit-learn is where you build a malicious classifier to sort exfiltrated data, train a model to evade a spam filter, or replicate a target’s decision boundary from stolen predictions. PyTorch is where you craft adversarial examples against image classifiers, run gradient-based evasion attacks, or extract a neural network’s weights through carefully chosen queries. You will use both, often in the same engagement.

Scikit-learn: the classical ML workhorse

Scikit-learn is built on NumPy and SciPy (Matplotlib is needed only for its optional plotting helpers), and it provides a consistent API across every algorithm it implements. That consistency is the library’s most important property from an operational standpoint, because once you learn the pattern for one model, you can swap in any other without changing the surrounding code.

The API follows a three-step cycle. You instantiate a model, call fit() with training data, and call predict() on new inputs.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Swap LogisticRegression for RandomForestClassifier, SVC, or GradientBoostingClassifier, and the rest of the code stays identical. This uniformity matters when you are iterating through multiple model types during an attack, testing which architecture best replicates a target model’s behaviour or which classifier most reliably evades a detection system.
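
As a rough illustration of that swap, the sketch below loops over several estimators using the same fit and score calls. The synthetic dataset from make_classification and the hyperparameters are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; a real engagement would use collected samples.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical shortlist of candidate model types.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": SVC(),
    "gboost": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                   # same call for every estimator
    results[name] = model.score(X_test, y_test)   # mean accuracy on held-out data

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

Only the dictionary of candidates changes between model types; the loop body never does.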

Preprocessing as attack surface

Before data reaches a model, it passes through preprocessing. Scikit-learn provides scalers, encoders, and imputers for this step, and each one is a potential point of manipulation.

StandardScaler removes the mean and scales features to unit variance. MinMaxScaler compresses values into a fixed range, typically 0 to 1. RobustScaler uses median and interquartile range, making it resistant to outliers. For a red teamer running a data poisoning attack, understanding which scaler the target pipeline uses determines how much you can shift a feature’s distribution before the perturbation gets normalised away.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
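
To make the difference between the three scalers concrete, the following sketch applies each one to a single feature that contains one extreme value. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature: five ordinary values and one extreme value (hypothetical data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(X),
    "minmax": MinMaxScaler().fit_transform(X),
    "robust": RobustScaler().fit_transform(X),
}

for name, values in scaled.items():
    print(name, np.round(values.ravel(), 3))
```

With this data, MinMaxScaler squeezes the five ordinary values into a narrow band below roughly 0.05 because the extreme value defines the top of the range, while RobustScaler keeps them spread out around zero and leaves the extreme value far away.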

Categorical encoding is equally relevant. OneHotEncoder expands categories into binary columns, and OrdinalEncoder maps them to integers (LabelEncoder does the same job, but it is intended for the target labels rather than the input features). If you are injecting poisoned samples into a training set, knowing the encoding scheme tells you exactly which feature columns to target and what values are valid.
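
A small sketch of one-hot encoding follows. The protocol values are hypothetical, and handle_unknown="ignore" is one possible configuration; a real pipeline may be set up differently.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature: the protocol field of a network record.
X = np.array([["tcp"], ["udp"], ["icmp"], ["tcp"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # default output is a sparse matrix

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)

# A category never seen during fit becomes an all-zero row with this setting.
unseen = encoder.transform(np.array([["gre"]])).toarray()
print(unseen)
```

The learned categories define which columns exist, so a value outside that set either raises an error or, as configured here, is silently mapped to zeros.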

Missing value handling introduces another angle. SimpleImputer fills gaps with a strategy like mean, median, or most frequent value. KNNImputer uses k-nearest neighbours to estimate missing entries. An attacker who understands the imputation strategy can craft incomplete inputs that, once imputed, land in a specific region of feature space.
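
The effect of mean imputation can be sketched in a few lines. The training values below are invented; the point is only that a missing entry is replaced with a statistic learned from the training data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with two features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# An input that omits the second feature gets that column's training mean.
incomplete = np.array([[2.5, np.nan]])
completed = imputer.transform(incomplete)

print(imputer.statistics_)  # per-column fill values learned during fit
print(completed)
```

Leaving a field blank is therefore equivalent to submitting the stored fill value for that column, which is what makes the imputation strategy worth knowing.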

Model selection and evaluation

Scikit-learn’s train_test_split divides data into training and testing subsets. cross_val_score runs k-fold cross-validation, training and evaluating the model across multiple data splits for a more reliable performance estimate.

from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scores = cross_val_score(model, X, y, cv=5)

Evaluation metrics like accuracy_score, precision_score, recall_score, and f1_score quantify how well a model performs. In an adversarial context, these same metrics measure the effectiveness of your attack. If you are running a model extraction attack, the F1 score between your stolen model’s predictions and the target’s predictions tells you how faithful the copy is. If you are running an evasion attack, the drop in the target’s recall on your crafted inputs measures your success rate.
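
The sketch below shows both uses of the metrics on small hand-written prediction lists. All of the values are hypothetical; in the fidelity case the target's predictions are passed in the position normally reserved for ground truth.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical predictions on the same eight probe inputs.
target_pred = [1, 1, 0, 0, 1, 0, 1, 1]     # what the target system returned
surrogate_pred = [1, 0, 0, 0, 1, 0, 1, 1]  # what the local copy returned

# Fidelity: the target's output plays the role of ground truth.
agreement = accuracy_score(target_pred, surrogate_pred)
fidelity_f1 = f1_score(target_pred, surrogate_pred)

# Evasion: five malicious samples, scored before and after perturbation.
y_true = [1, 1, 1, 1, 1]
detected_before = [1, 1, 1, 1, 0]
detected_after = [1, 0, 0, 1, 0]
recall_drop = recall_score(y_true, detected_before) - recall_score(
    y_true, detected_after
)

print(f"agreement={agreement:.3f} f1={fidelity_f1:.3f} recall_drop={recall_drop:.2f}")
```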

PyTorch: deep learning and gradient access

PyTorch, developed originally by Meta’s AI research team, is the framework of choice for deep learning. Where scikit-learn abstracts away the internals behind a clean fit/predict interface, PyTorch gives you direct access to the computational graph, the gradients, and every weight in the network. That low-level access is precisely what makes it the primary tool for adversarial machine learning.

Tensors and GPU acceleration

The fundamental data structure in PyTorch is the tensor, a multi-dimensional array similar to a NumPy array but with two additional capabilities that matter for offensive work. First, tensors can run on GPUs, which accelerates the kind of iterative optimisation loops that adversarial attacks require. Second, tensors track their computational history, enabling automatic gradient calculation.

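import torch

# Pick the device first, then create the tensor there. Calling .to('cuda') on a
# tensor that already has requires_grad=True returns a non-leaf copy, and
# gradients would not accumulate in that copy's .grad attribute.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)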

That requires_grad=True flag is the key. It tells PyTorch to record every operation performed on this tensor so that gradients can be computed later through backpropagation. When you craft an adversarial example against an image classifier, you set requires_grad=True on the input image, run it through the model, and compute the loss against a chosen label: the true class for an untargeted attack, or the class you want the model to predict for a targeted one. The gradient of that loss with respect to the input then tells you how to perturb each pixel, either to push the loss up (untargeted) or to pull it down towards the chosen class (targeted). The entire attack flows from this single flag.
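
A minimal sketch of that flow, using the single-step fast gradient sign method (FGSM) in its untargeted form, might look like the following. The small untrained network, the random input, the label, and the epsilon value are all placeholders chosen for illustration, not a real target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in classifier: 784 inputs (a flattened 28x28 image), 10 classes.
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

image = torch.rand(1, 784)   # placeholder input with pixel values in [0, 1]
label = torch.tensor([3])    # assumed true class of the input
epsilon = 0.05               # assumed perturbation budget per pixel

image.requires_grad_(True)   # track gradients with respect to the input
loss = nn.functional.cross_entropy(model(image), label)
loss.backward()              # populates image.grad

# Step in the direction that increases the loss, then clamp to the valid pixel range.
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0).detach()

print((adversarial - image.detach()).abs().max())
```

Each pixel moves by at most epsilon, so the perturbation stays bounded while following the sign of the input gradient.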

Building models

PyTorch provides two ways to define neural networks. The Sequential API stacks layers linearly, which works for straightforward architectures.

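import torch.nn as nn

# The model outputs raw logits. nn.CrossEntropyLoss, used in the training loop
# later on, applies log-softmax internally, so no Softmax layer is added here.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)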

For anything more complex, you subclass nn.Module and define the forward pass explicitly. This is the pattern you will see in most real-world models and most adversarial research code.

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.relu(self.layer1(x))
        return self.layer2(x)

Understanding the forward() method matters because adversarial attacks often hook into intermediate layers. If you want to run a feature-space attack or extract internal representations from a model, you need to know where to tap in.
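
One way to tap into an intermediate layer is a forward hook, sketched below on the same CustomModel architecture. The dictionary name captured and the function save_activation are illustrative choices; register_forward_hook is the standard nn.Module method.

```python
import torch
import torch.nn as nn


class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.relu(self.layer1(x))
        return self.layer2(x)


model = CustomModel()
captured = {}


def save_activation(module, inputs, output):
    # Store a detached copy of the hidden representation after the ReLU.
    captured["relu"] = output.detach()


handle = model.relu.register_forward_hook(save_activation)
logits = model(torch.rand(4, 784))  # the hook fires during this forward pass
handle.remove()                     # detach the hook once it is no longer needed

print(captured["relu"].shape, logits.shape)
```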

The training loop

Unlike scikit-learn’s single fit() call, PyTorch requires you to write the training loop explicitly. This verbosity is a feature, not a limitation, because every step of the loop is a point where an adversary can intervene.

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for x_batch, y_batch in dataloader:
        y_pred = model(x_batch)
        loss = loss_fn(y_pred, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The forward pass computes predictions. The loss function measures the error. loss.backward() computes gradients through the entire network. optimizer.step() updates the weights. A poisoning attack can manipulate the data that enters the forward pass. A backdoor attack can modify the loss function to optimise for a hidden trigger alongside the legitimate objective. A model inversion attack can repurpose the gradient computation to reconstruct training data from the model’s weights.
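
As an example of the first intervention point, the sketch below flips the labels of a fraction of a synthetic training set before it reaches an otherwise unmodified loop. The data, the 10% poison rate, and the small network are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic two-class data (hypothetical): the label depends on the first feature.
X = torch.randn(200, 5)
y = (X[:, 0] > 0).long()

# Label-flipping poisoning: invert the labels of an assumed 10% of samples.
poison_rate = 0.1
n_poison = int(poison_rate * len(y))
poison_idx = torch.randperm(len(y))[:n_poison]
y_poisoned = y.clone()
y_poisoned[poison_idx] = 1 - y_poisoned[poison_idx]

dataloader = DataLoader(TensorDataset(X, y_poisoned), batch_size=32, shuffle=True)
model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# The loop itself is unchanged; only the data entering it differs.
for epoch in range(5):
    for x_batch, y_batch in dataloader:
        loss = loss_fn(model(x_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print(f"flipped {int((y != y_poisoned).sum())} of {len(y)} labels")
```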

Data loading and model persistence

PyTorch’s Dataset and DataLoader classes manage batching and shuffling. The Dataset subclass defines how individual samples are accessed, while DataLoader wraps it with batch sizing and parallel loading.

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataloader = DataLoader(CustomDataset(data, labels), batch_size=32, shuffle=True)

Model saving and loading uses torch.save() and torch.load(). A saved model file (.pth) contains the learned weights, and loading it into a compatible architecture restores the model completely. From a red team perspective, if you can access a saved model file, you have the model. You can run it locally, inspect every weight, compute gradients, and craft attacks at your leisure without making a single query to the target system.

torch.save(model.state_dict(), 'model.pth')

model = CustomModel()
model.load_state_dict(torch.load('model.pth'))
model.eval()
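
To show how much a weights file reveals on its own, the following sketch saves a state dict to a temporary path, loads it back without instantiating any model class, and lists the parameter names and shapes. The layer sizes mirror the earlier examples, and the temporary file stands in for a file obtained from a target.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; in practice the file would come from the target environment.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

# Load only the stored tensors; no model class is needed to inspect them.
state = torch.load(path, map_location="cpu")
shapes = {name: tuple(tensor.shape) for name, tensor in state.items()}

for name, shape in shapes.items():
    print(name, shape)
```

The parameter shapes alone expose the layer widths, which is usually enough to rebuild a compatible architecture and call load_state_dict on it.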

Where the two libraries meet the adversary

IBM’s Adversarial Robustness Toolbox (ART) is worth knowing about because it wraps both scikit-learn and PyTorch models with a unified interface for running evasion, poisoning, extraction, and inference attacks. ART does not replace the need to understand the underlying libraries, but it does give you pre-built implementations of attacks like FGSM, PGD, and Carlini-Wagner that you can point at a target model and fire.

The pattern in practice tends to look like this. You use scikit-learn when the target is a classical ML system, a fraud detection model running logistic regression, a spam filter using random forests, or an anomaly detector built on isolation forests. You use PyTorch when the target is a neural network, an image classifier, a natural language model, or any system where gradient access unlocks the attack.

Knowing both libraries also matters for model extraction attacks, where you query a target system’s API, collect input-output pairs, and train a local surrogate model that approximates the target. If the target is a simple classifier, your surrogate lives in scikit-learn. If the target is a deep network, your surrogate lives in PyTorch. Either way, the quality of your extraction depends on understanding the training API well enough to iterate quickly.
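
The extraction workflow can be sketched end to end with scikit-learn. Everything about the target here is simulated: query_target is a hypothetical stand-in for a remote prediction API that returns labels only, and the random probe inputs are an assumption about what the attacker is able to submit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Simulated target model, hidden behind a query function.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
_target = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)


def query_target(inputs):
    """Hypothetical stand-in for the remote prediction API: labels only."""
    return _target.predict(inputs)


# Step 1: send probe inputs and record the responses.
rng = np.random.default_rng(0)
queries = rng.normal(size=(1000, 6))
responses = query_target(queries)

# Step 2: train a local surrogate on the collected input-output pairs.
surrogate = DecisionTreeClassifier(max_depth=8, random_state=0)
surrogate.fit(queries, responses)

# Step 3: measure agreement with the target on fresh probe inputs.
holdout = rng.normal(size=(500, 6))
fidelity = float(np.mean(surrogate.predict(holdout) == query_target(holdout)))
print(f"fidelity on fresh probes: {fidelity:.3f}")
```

Swapping DecisionTreeClassifier for another estimator is a one-line change, which is where the uniform training API pays off during iteration.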
