Training and evaluation

A random forest trained on the NSL-KDD dataset scores 0.99 accuracy on validation data. The confusion matrix tells you something more useful: it misclassifies privilege escalation attempts as normal traffic at a rate that would make any red teamer smile. Accuracy is the metric organisations put in slide decks. The confusion matrix is the metric attackers actually use.

The previous entries in this series covered the preprocessing and feature engineering stages of building a machine learning model. This entry covers what comes next: training the model and evaluating its performance. For a defender, evaluation answers the question “how good is this model?” For a red teamer, evaluation answers a different question entirely, which is “where exactly does this model fail, and can I make it fail there on purpose?”

We are training a random forest classifier on the NSL-KDD dataset for multi-class network anomaly detection. The model needs to distinguish between normal traffic and four attack categories: denial of service (DoS), network probing (Probe), privilege escalation (Privilege), and unauthorised remote access (Access). The four attack categories correspond to the original NSL-KDD labels DoS, Probe, U2R, and R2L respectively.

Training the random forest

The training step itself is straightforward:

from sklearn.ensemble import RandomForestClassifier

# Train RandomForest model for multi-class classification
rf_model_multi = RandomForestClassifier(random_state=1337)
rf_model_multi.fit(multi_train_X, multi_train_y)

RandomForestClassifier with default parameters builds 100 decision trees, each trained on a bootstrapped sample of the training data. Setting random_state=1337 makes the process reproducible, which matters for red teaming because reproducibility means you can replay the exact training process when testing adversarial inputs later.

The adversarial perspective starts with the default parameters. No max_depth limit lets trees grow until every leaf is pure or contains fewer than two samples. No min_samples_leaf constraint means the model can memorise rare patterns in the training data. For a defender, deep trees mean high accuracy on the training distribution. For an attacker, deep trees mean the model has overfit to the specific feature distributions it saw during training, and anything that deviates from those distributions in a controlled way has a chance of evading classification.
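To make the effect of the defaults concrete, the following sketch compares an unconstrained forest with a depth-limited one. It runs on a synthetic dataset from make_classification as a stand-in for the NSL-KDD features, and the constrained settings (max_depth=5, min_samples_leaf=10) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the NSL-KDD features (assumption: not the real data)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=1337)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1337)

# Default forest: 100 trees, no depth limit, leaves may hold a single sample
default_rf = RandomForestClassifier(random_state=1337).fit(X_train, y_train)

# Constrained forest: illustrative limits, not tuned values
shallow_rf = RandomForestClassifier(max_depth=5, min_samples_leaf=10,
                                    random_state=1337).fit(X_train, y_train)

def describe(name, model):
    """Print the deepest tree and the train/validation accuracy of a forest."""
    depths = [tree.get_depth() for tree in model.estimators_]
    print(f"{name}: max tree depth={max(depths)}, "
          f"train acc={model.score(X_train, y_train):.3f}, "
          f"val acc={model.score(X_val, y_val):.3f}")

describe("default", default_rf)
describe("constrained", shallow_rf)
```

A large gap between training and validation accuracy for the default forest would be one sign of the memorisation described above. How large that gap is depends on the data, so treat the numbers from this synthetic example as a demonstration of the method only.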

Evaluation metrics and what they actually tell you

After training, we evaluate on the validation set:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict and evaluate the model on the validation set
multi_predictions = rf_model_multi.predict(multi_val_X)
accuracy = accuracy_score(multi_val_y, multi_predictions)
precision = precision_score(multi_val_y, multi_predictions, average='weighted')
recall = recall_score(multi_val_y, multi_predictions, average='weighted')
f1 = f1_score(multi_val_y, multi_predictions, average='weighted')
print("Validation Set Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

Four metrics appear here, and each one tells a different story about the model’s weaknesses.

Accuracy is the proportion of all predictions that were correct. On imbalanced datasets like NSL-KDD, where normal traffic vastly outnumbers privilege escalation attempts, accuracy is misleading. A model that never predicts the privilege escalation class at all would lose almost nothing in accuracy, because that class contributes so few samples to the denominator.
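A small worked example shows how this plays out. The labels below are hypothetical, invented to show the arithmetic, and the "model" is a degenerate one that never raises an alert.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical validation labels: 990 normal flows, 10 privilege escalation attempts
y_true = ["Normal"] * 990 + ["Privilege"] * 10

# A degenerate "model" that never raises an alert
y_pred = ["Normal"] * 1000

accuracy = accuracy_score(y_true, y_pred)
# Recall restricted to the Privilege class: caught attacks / actual attacks
privilege_recall = recall_score(y_true, y_pred,
                                labels=["Privilege"], average=None)[0]

print(f"Accuracy: {accuracy:.2f}")                  # Accuracy: 0.99
print(f"Privilege recall: {privilege_recall:.2f}")  # Privilege recall: 0.00
```

The model is right 990 times out of 1000, so accuracy is 0.99, yet it detects none of the ten attacks.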

Precision measures how many of the samples the model flagged as a given class actually belonged to that class. Low precision for DoS means the model generates false alarms, flagging legitimate traffic as attacks. For a defender, false alarms erode trust in the system. For an attacker, a high false-positive rate is useful because it trains the SOC team to ignore alerts.

Recall measures how many of the actual samples in a class the model correctly identified. Low recall for privilege escalation means the model misses real U2R attacks. This is the metric a red teamer cares about most, because low recall on a specific class means that class is where evasion is easiest.

F1-score is the harmonic mean of precision and recall. It penalises models that sacrifice one for the other. A model with perfect precision but terrible recall (it only flags attacks it is absolutely sure about, and misses everything else) will have a low F1. The average='weighted' parameter weights each class by its support count, which means rare classes like U2R contribute less to the overall score, and their weaknesses get buried in the aggregate number.
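The effect of the averaging mode can be checked directly. The sketch below uses hypothetical labels, invented for illustration, and compares average='weighted' with average='macro', which gives every class equal weight regardless of support.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 95 normal samples, 5 U2R samples
y_true = ["Normal"] * 95 + ["U2R"] * 5
# The model gets every normal sample right but catches only 1 of the 5 U2R samples
y_pred = ["Normal"] * 95 + ["U2R"] + ["Normal"] * 4

per_class = f1_score(y_true, y_pred, average=None, labels=["Normal", "U2R"])
weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")

print(f"Per-class F1: Normal={per_class[0]:.3f}, U2R={per_class[1]:.3f}")
print(f"Weighted F1:  {weighted:.3f}")
print(f"Macro F1:     {macro:.3f}")
```

Here U2R has precision 1.0 and recall 0.2, so its F1 is 2 × 0.2 / 1.2, or about 0.33. The weighted average hides most of that because U2R accounts for only 5% of the samples, while the macro average exposes it.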

The confusion matrix

The aggregate metrics compress all of this into single numbers. The confusion matrix preserves the detail that matters.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Confusion matrix for validation set
conf_matrix = confusion_matrix(multi_val_y, multi_predictions)
class_labels = ['Normal', 'DoS', 'Probe', 'Privilege', 'Access']
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels,
            yticklabels=class_labels)
plt.title('Network Anomaly Detection - Validation Set')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

The confusion matrix is a grid where each row represents the actual class and each column represents the predicted class. The diagonal shows correct predictions. Everything off the diagonal is a mistake, and every mistake has a direction.

The validation set confusion matrix from the NSL-KDD dataset reveals a familiar pattern in network anomaly detection. The model handles Normal, DoS, and Probe traffic well, because those classes have thousands of training examples. Privilege and Access classes, which correspond to U2R and R2L attacks, are where performance degrades. These are the rarest attack types in the dataset, and the model has seen too few examples to generalise reliably.

From a red teaming perspective, this is a map. A privilege escalation attempt that the model classifies as normal traffic is a false negative, and the confusion matrix tells you how often that happens. If the model confuses U2R with normal traffic at a meaningful rate, that confusion is an evasion path. You do not need to craft sophisticated adversarial perturbations if the model already struggles to recognise the attack category under clean conditions.
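One way to turn the matrix into that map is to compute, for each attack class, the fraction of its samples that land in the Normal column. The counts below are hypothetical, invented for illustration, and are not results from NSL-KDD. The sketch also assumes Normal is the first class, matching the label order used above.

```python
import numpy as np

class_labels = ["Normal", "DoS", "Probe", "Privilege", "Access"]

# Hypothetical confusion matrix (rows = actual, columns = predicted).
# These counts are invented for illustration, not results from NSL-KDD.
conf_matrix = np.array([
    [13400,   20,   30, 0,  10],
    [   15, 9150,   10, 0,   0],
    [   25,    5, 2300, 0,   1],
    [    6,    0,    0, 4,   1],
    [   40,    0,    2, 1, 160],
])

normal_idx = class_labels.index("Normal")
row_totals = conf_matrix.sum(axis=1)

# Fraction of each actual class that the model labelled as Normal
miss_rate = conf_matrix[:, normal_idx] / row_totals

for label, rate in zip(class_labels, miss_rate):
    if label != "Normal":
        print(f"{label:<10} classified as Normal: {rate:.1%}")
```

In this invented matrix, 6 of the 11 Privilege samples are labelled Normal, a far higher miss rate than any other class, even though the absolute count is tiny compared with the thousands of correct DoS predictions.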

The classification report adds per-class granularity:

from sklearn.metrics import classification_report

# Classification report for validation set
print("Classification Report for Validation Set:")
print(classification_report(multi_val_y, multi_predictions, target_names=class_labels))

This report breaks precision, recall, and F1 out for each individual class, along with the support count (how many samples of that class appeared in the evaluation set). When the support for U2R is orders of magnitude smaller than the support for DoS, you can see exactly why the aggregate metrics look healthy while specific attack categories go underdetected.
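The same report is also available as a dictionary through output_dict=True, which makes it easy to rank classes by recall without reading the printed table. The labels below are hypothetical, and the three-class setup is a simplification for illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical integer-encoded labels (0=Normal, 1=DoS, 2=U2R), invented for illustration
y_true = [0] * 50 + [1] * 40 + [2] * 10
y_pred = [0] * 50 + [1] * 38 + [0] * 2 + [2] * 3 + [0] * 7
names = ["Normal", "DoS", "U2R"]

report = classification_report(y_true, y_pred, target_names=names,
                               output_dict=True)

# Rank classes from weakest to strongest recall, keeping support alongside
ranked = sorted(names, key=lambda name: report[name]["recall"])
for name in ranked:
    row = report[name]
    print(f"{name:<8} recall={row['recall']:.2f} support={int(row['support'])}")
```

The class at the top of this ranking is the first place to look for an evasion path, and its support count indicates how much evidence the recall figure rests on.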

Test set evaluation

The validation set is drawn from the same distribution as the training data. The test set in NSL-KDD is deliberately harder, containing attack variants and traffic patterns not present in the training split. This is where we find out whether the model has learned genuine patterns or just memorised the training distribution:

# Final evaluation on the test set
test_multi_predictions = rf_model_multi.predict(test_X)
test_accuracy = accuracy_score(test_y, test_multi_predictions)
test_precision = precision_score(test_y, test_multi_predictions, average='weighted')
test_recall = recall_score(test_y, test_multi_predictions, average='weighted')
test_f1 = f1_score(test_y, test_multi_predictions, average='weighted')
print("\nTest Set Evaluation:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")

# Confusion matrix for test set
test_conf_matrix = confusion_matrix(test_y, test_multi_predictions)
sns.heatmap(test_conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels,
            yticklabels=class_labels)
plt.title('Network Anomaly Detection - Test Set')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Classification report for test set
print("Classification Report for Test Set:")
print(classification_report(test_y, test_multi_predictions, target_names=class_labels))

The gap between validation and test performance is the number a red teamer should focus on. If accuracy drops significantly on the test set, the model is brittle, and brittleness against distribution shift is exactly what adversarial examples exploit. Because the NSL-KDD test set includes attack types that never appear in the training data, the performance gap serves as a rough proxy for how well the model would handle novel evasion techniques.

Compare the two confusion matrices side by side. Where the test set performance degrades relative to validation, you are looking at the exact attack categories where the model’s learned representations are weakest. Those are the categories where crafted traffic has the highest probability of bypassing detection.
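The side-by-side comparison can also be reduced to a per-class recall drop. This is a minimal sketch using two hypothetical matrices, invented for illustration. It assumes both matrices use the same class order and that every class appears at least once in each set.

```python
import numpy as np

class_labels = ["Normal", "DoS", "Probe", "Privilege", "Access"]

def per_class_recall(conf_matrix):
    """Recall per class: diagonal (correct) divided by row total (actual count)."""
    conf_matrix = np.asarray(conf_matrix, dtype=float)
    return np.diag(conf_matrix) / conf_matrix.sum(axis=1)

# Hypothetical matrices (rows = actual, columns = predicted), invented for illustration
val_cm = np.array([
    [980,  10,  10,  0,  0],
    [ 10, 990,   0,  0,  0],
    [ 20,   0, 480,  0,  0],
    [  4,   0,   0, 16,  0],
    [ 20,   0,   0,  0, 80],
])
test_cm = np.array([
    [970,  10,  15, 0,  5],
    [100, 900,   0, 0,  0],
    [ 60,  10, 430, 0,  0],
    [ 14,   0,   0, 6,  0],
    [ 85,   0,   0, 0, 15],
])

recall_drop = per_class_recall(val_cm) - per_class_recall(test_cm)

# Largest drop first: these are the classes that generalise worst
for idx in np.argsort(recall_drop)[::-1]:
    print(f"{class_labels[idx]:<10} recall drop: {recall_drop[idx]:+.2f}")
```

Sorting by the drop rather than by raw test recall separates classes that were always weak from classes that only fall apart under distribution shift.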

Saving the model and the risks of serialisation

The final step is persisting the trained model to disk:

import joblib

# Save the trained model to a file
model_filename = 'network_anomaly_detection_model.joblib'
joblib.dump(rf_model_multi, model_filename)
print(f"Model saved to {model_filename}")

joblib.dump serialises the entire random forest, all 100 trees with their split thresholds, feature indices, and leaf values, into a single file. This is convenient. It is also a security decision that most ML pipelines make without thinking about it.

Joblib uses Python’s pickle protocol under the hood, and pickle deserialisation is one of the most well-documented code execution vectors in the Python ecosystem. When you call joblib.load() to reload a model, pickle reconstructs the Python objects stored in the file, and in doing so it calls whatever callables the file specifies, including any that an attacker planted through a crafted __reduce__ method. A 2025 CCS paper demonstrated that malicious pickle-based models on platforms like Hugging Face could execute arbitrary code during loading, with payloads including system fingerprinting, credential theft, and reverse shells. A subsequent study found that roughly half of popular Hugging Face repositories still contained pickle-format models with no safetensors alternative.
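The mechanism is easy to demonstrate with a deliberately harmless payload. In the sketch below, the Payload class is hypothetical, and os.getcwd stands in for whatever an attacker would actually run. It uses plain pickle rather than joblib to keep the example minimal.

```python
import os
import pickle

class Payload:
    """Hypothetical stand-in for an attacker-controlled object in a model file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        # os.getcwd is a harmless stand-in for an attacker's chosen function.
        return (os.getcwd, ())

blob = pickle.dumps(Payload())

# The loader does not rebuild a Payload object; it calls what the file names
result = pickle.loads(blob)

print(f"Type returned by load: {type(result).__name__}")
print(f"Value produced by code that ran during load: {result}")
```

The call happens inside pickle.loads itself, before any of your own code gets a chance to inspect the loaded object. That is why checks performed after loading come too late.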

For a red teamer, a serialised model file is a target in two ways. First, it contains the full decision logic of the classifier, which means that anyone who obtains the .joblib file can inspect every split threshold and reconstruct the exact decision boundary, making white-box evasion trivial. Second, if an attacker can replace the model file in the pipeline with a tampered version, they gain code execution on whatever system loads it.
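The first point can be illustrated with scikit-learn's own attributes. The sketch below trains a small forest on synthetic data as a stand-in for a model recovered from a .joblib file, then reads split rules from the fitted tree_ structure. The dataset and forest size are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic stand-in model (assumption: in practice this is the loaded model)
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, random_state=1337)
model = RandomForestClassifier(n_estimators=10, random_state=1337).fit(X, y)

# Every fitted tree exposes its structure through the tree_ attribute
tree = model.estimators_[0].tree_
internal = tree.children_left != -1  # leaf nodes have no children (-1)

print(f"Trees in forest: {len(model.estimators_)}")
print(f"Nodes in first tree: {tree.node_count} ({internal.sum()} splits)")

# First few split rules: a sample goes left if feature value <= threshold
for node_id in internal.nonzero()[0][:5]:
    print(f"node {node_id}: feature {tree.feature[node_id]} "
          f"<= {tree.threshold[node_id]:.3f}")
```

With every feature index and threshold readable like this, an attacker does not need to probe the model through queries. They can work out offline which feature values push a sample down the path that ends in a Normal leaf.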

The practical defence is to treat model files like executables. Verify integrity with cryptographic hashes before loading. Restrict filesystem permissions so the model file cannot be written by unprivileged processes. If you are loading models from external sources, use formats that do not permit arbitrary code execution during deserialisation, such as safetensors for tensor weights or, for scikit-learn estimators like this one, alternatives such as ONNX or skops.
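A hash check before loading might look like the following sketch. The helper names sha256_of_file and load_verified are hypothetical, and the sketch assumes the expected hash was recorded at training time and is stored somewhere the attacker cannot modify along with the model file.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_verified(path, expected_sha256, loader):
    """Call loader(path) only if the file's SHA-256 matches the expected value."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"Refusing to load {path}: hash mismatch")
    return loader(path)

# Usage sketch (EXPECTED_SHA256 is a placeholder for the hash recorded at training time):
# import joblib
# model = load_verified("network_anomaly_detection_model.joblib",
#                       EXPECTED_SHA256, joblib.load)
```

This check only helps if the file cannot be swapped between hashing and loading, which is one reason the filesystem permissions matter as much as the hash itself.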

What the evaluation tells you if you know how to read it

Training and evaluating a model is the point at which you learn what it can and cannot see. The aggregate metrics will almost always look good on a well-structured dataset like NSL-KDD, because the majority classes dominate the numbers. The confusion matrix reveals the gaps, and those gaps are concentrated in the classes with the fewest training examples and the most operational significance. Privilege escalation and remote access attacks are rare in the dataset and dangerous in production, which means the model is weakest precisely where it matters most.
