Preprocessing and splitting the dataset

A raw CSV full of network traffic tells you nothing useful until you force it into a shape a model can learn from. The NSL-KDD dataset ships with 41 features per connection record, a mix of categorical labels, continuous counters, and rate-based statistics, and every one of them needs deliberate handling before a random forest can do anything meaningful with it. Get the preprocessing wrong, leak test data into training, or encode your categories carelessly, and the model you build will look accurate on paper while failing on traffic it has never seen.

This entry walks through the full preprocessing pipeline for our random forest anomaly detection model, from target creation through encoding, feature selection, and dataset splitting.

Creating a binary classification target

The simplest way to frame network intrusion detection is as a binary problem. Is this connection normal, or is it an attack? The NSL-KDD dataset labels every record with a specific attack name or the string normal, so we collapse that into a single flag.

# Binary classification target
# Maps normal traffic to 0 and any type of attack to 1
df['attack_flag'] = df['attack'].apply(lambda a: 0 if a == 'normal' else 1)

This creates a new column attack_flag where 0 means normal traffic and 1 means any kind of attack. The lambda function checks the attack column for each row and assigns the label accordingly.
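The same flag can be produced without a row-by-row lambda. The sketch below is an equivalent vectorised form, run on a four-row toy frame (the toy data is an assumption for illustration, not part of NSL-KDD), followed by a quick class-balance check.

```python
import pandas as pd

# Toy stand-in for the NSL-KDD frame; only the 'attack' column matters here
df = pd.DataFrame({"attack": ["normal", "neptune", "normal", "guess_passwd"]})

# Vectorised equivalent of the lambda: 1 where the label is not 'normal'
df["attack_flag"] = (df["attack"] != "normal").astype(int)

# Class balance check: how many normal (0) and attack (1) records there are
flag_counts = df["attack_flag"].value_counts().to_dict()
print(flag_counts)
```

Checking the counts straight after creating the flag is a cheap way to spot a label column that was read incorrectly, for example one where every row ends up as 1.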

If you look at the raw data, the label appears as the second-to-last field of each record, just before the trailing difficulty score:

0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.0,1.0,0.0,0.0,0.05,0.07,0.0,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19

The first record is labelled normal. The second is neptune, a SYN flood denial-of-service attack. Both become machine-readable the moment we map them to 0 and 1, but the binary label throws away the distinction between a SYN flood and a password-guessing attempt. That matters when you are trying to understand what your model is actually catching.

Creating a multi-class classification target

Binary detection answers the question “is something wrong?” but it does not tell you what is wrong. A defender who knows an alert was triggered by a probe scan responds differently from one facing privilege escalation, and a model that can distinguish between attack categories gives you that operational context.

We define four attack groups based on the categories in the NSL-KDD documentation:

# Multi-class classification target categories
dos_attacks = ['apache2', 'back', 'land', 'neptune', 'mailbomb', 'pod', 
               'processtable', 'smurf', 'teardrop', 'udpstorm', 'worm']
probe_attacks = ['ipsweep', 'mscan', 'nmap', 'portsweep', 'saint', 'satan']
privilege_attacks = ['buffer_overflow', 'loadmodule', 'perl', 'ps', 
                     'rootkit', 'sqlattack', 'xterm']
access_attacks = ['ftp_write', 'guess_passwd', 'http_tunnel', 'imap', 
                  'multihop', 'named', 'phf', 'sendmail', 'snmpgetattack', 
                  'snmpguess', 'spy', 'warezclient', 'warezmaster', 
                  'xclock', 'xsnoop']

  • DoS attacks like neptune and smurf flood the target with traffic to exhaust resources
  • Probe attacks like nmap and satan scan the network to map open ports and services
  • Privilege escalation attacks like buffer_overflow and rootkit attempt to gain admin-level control after an initial foothold
  • Access attacks like guess_passwd and ftp_write try to breach access controls directly

A mapping function assigns each record an integer label based on which group its attack type falls into:

def map_attack(attack):
    if attack in dos_attacks:
        return 1
    elif attack in probe_attacks:
        return 2
    elif attack in privilege_attacks:
        return 3
    elif attack in access_attacks:
        return 4
    else:
        return 0

# Assign multi-class category to each row
df['attack_map'] = df['attack'].apply(map_attack)

Normal traffic maps to 0. DoS is 1, Probe is 2, Privilege Escalation is 3, and Access is 4. This gives the model a richer target to learn from and gives the defender a more actionable output when the model flags something.
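One consequence of the else branch in map_attack is that any attack name missing from the four lists, including a misspelt one, is silently labelled 0 and treated as normal traffic. A minimal sketch of a guard against that follows. The lists are abbreviated, and attack_to_class and find_unmapped are hypothetical helper names introduced here for illustration.

```python
# Abbreviated category lists; the full lists are defined earlier in this entry
dos_attacks = ["neptune", "smurf", "back"]
probe_attacks = ["nmap", "satan"]
privilege_attacks = ["buffer_overflow", "rootkit"]
access_attacks = ["guess_passwd", "ftp_write"]

# Build a single lookup table: attack name -> integer class
groups = {1: dos_attacks, 2: probe_attacks, 3: privilege_attacks, 4: access_attacks}
attack_to_class = {name: label for label, names in groups.items() for name in names}
attack_to_class["normal"] = 0  # make the normal class explicit instead of a fallback

def find_unmapped(labels):
    """Return the labels that have no explicit class, in sorted order."""
    return sorted(set(labels) - set(attack_to_class))

# Labels as they might appear in the 'attack' column
observed = ["normal", "neptune", "nmap", "loadmodule"]
unmapped = find_unmapped(observed)
print(unmapped)  # ['loadmodule'] with these abbreviated lists
```

Running a check like this against df['attack'].unique() before assigning attack_map would surface any label that is about to fall through to class 0.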

Encoding categorical variables

Machine learning models work with numbers, not strings. Two of the categorical features in the NSL-KDD dataset are encoded here: protocol_type (tcp, udp, icmp) and service (http, ftp, smtp, and dozens more). These describe the nature of each network connection, but they must be converted into numeric form before a model can process them. The connection-status flag column (SF and S0 in the sample records above) is also categorical, but it is not part of this feature set.

We use one-hot encoding through the pandas get_dummies function:

# Encoding categorical variables
features_to_encode = ['protocol_type', 'service']
encoded = pd.get_dummies(df[features_to_encode])

One-hot encoding creates a separate binary column for every unique value in the original feature. A record using TCP gets a 1 in the protocol_type_tcp column and a 0 in protocol_type_udp and protocol_type_icmp. This avoids a common mistake with simpler encoding methods like label encoding, where assigning tcp = 0, udp = 1, and icmp = 2 would imply that udp is somehow “between” tcp and icmp. One-hot encoding treats each protocol as its own independent signal, which is what we want.

The trade-off is dimensionality. The service feature alone has over 60 unique values in the full NSL-KDD dataset, so one-hot encoding turns a single column into 60+ columns. For a random forest this is manageable because tree-based models handle high-dimensional sparse features well, but it is worth being aware of the expansion.
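To make the expansion concrete, here is a small sketch of get_dummies on a four-row toy frame. The rows are invented for illustration; the column names match the two features encoded above.

```python
import pandas as pd

# Four invented connection records with the two categorical features
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "ftp_data"],
})

# One indicator column per distinct value: 3 protocols + 4 services = 7 columns
encoded_toy = pd.get_dummies(toy[["protocol_type", "service"]])

print(encoded_toy.columns.tolist())
print(encoded_toy.shape)

# Every row has exactly one active protocol column and one active service column
active_per_row = encoded_toy.sum(axis=1)
print(active_per_row.tolist())
```

Two input columns become seven here. On the real data the same call produces one column per distinct protocol and service value, which is where the 60+ columns come from.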

Selecting numeric features

Beyond the categorical columns, we use 34 numeric features that describe different aspects of each connection. These range from basic volume metrics like duration, src_bytes, and dst_bytes to statistical rate features like serror_rate and dst_host_srv_diff_host_rate that capture patterns across groups of connections.

# Numeric features that capture various statistical properties of the traffic
numeric_features = [
    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 
    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 
    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate'
]

The feature selection here is deliberate. Features like num_failed_logins and root_shell are direct indicators of suspicious behaviour, while rate features like serror_rate (the fraction of connections to the same host that produced SYN errors, from 0.0 to 1.0) capture patterns that individual connection records would miss. A single SYN error means nothing, but a serror_rate of 1.0 across 255 connections to the same host is a clear signature of a SYN flood.

The combination of raw counts and derived rates gives the random forest two different scales of information to split on, which is where tree-based models tend to perform well.

Combining features into the training set

With the encoded categorical variables and the numeric features prepared separately, we join them into a single DataFrame that the model will train on:

# Combine encoded categorical variables and numeric features
train_set = encoded.join(df[numeric_features])

# Multi-class target variable
multi_y = df['attack_map']

The train_set DataFrame now contains every feature the model needs, with categorical values expanded into binary indicators alongside the original numeric columns. The target variable multi_y holds the multi-class labels we created earlier.
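Because join aligns rows on the DataFrame index, it is worth confirming that the features and the target still line up before splitting. The sketch below runs those checks on a three-row toy frame with only two numeric features; the data is invented and the checks are a suggestion, not part of the original pipeline.

```python
import pandas as pd

# Toy frame with the same column roles as the real dataset
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "private"],
    "duration": [0, 2, 0],
    "src_bytes": [491, 146, 0],
    "attack_map": [0, 0, 1],
})
numeric_features = ["duration", "src_bytes"]

encoded = pd.get_dummies(df[["protocol_type", "service"]])
train_set = encoded.join(df[numeric_features])  # aligned on the shared index
multi_y = df["attack_map"]

# Sanity checks: same rows, same order, no gaps introduced by the join
rows_match = len(train_set) == len(multi_y)
index_match = train_set.index.equals(multi_y.index)
has_missing = bool(train_set.isna().any().any())
print(train_set.shape, rows_match, index_match, has_missing)
```

If rows had been filtered or the index reset on only one of the two frames, the join would introduce missing values and has_missing would be True.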

Splitting the dataset

A model that only ever sees the data it was trained on will tell you nothing about how it performs in the real world. We split the dataset into three distinct subsets, each with a specific purpose, to make sure our evaluation is honest.

Training and test split

The first split reserves 20% of the data for final testing. This test set is not touched during training or tuning and exists solely to give us an unbiased measure of how the model generalises.

# Split data into training and test sets for multi-class classification
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    train_set, multi_y, test_size=0.2, random_state=1337
)

The random_state=1337 parameter fixes the random seed so the split is reproducible. Anyone running this code will get the same partition, which matters when you are comparing results across experiments.
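The reproducibility claim is easy to check. This sketch splits the same toy data twice with the same seed and compares the results; the 100-item lists are stand-ins for the real feature matrix and labels.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with alternating binary labels
X = list(range(100))
y = [i % 2 for i in X]

# Two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1337)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1337)

same_split = list(a_test) == list(b_test)
print(same_split, len(a_train), len(a_test))
```

With a different random_state, or none at all, the two test sets would generally differ, and so would any metric computed on them.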

Creating the validation set

We then split the remaining training data again, carving out 30% as a validation set:

# Further split the training set into separate training and validation sets
multi_train_X, multi_val_X, multi_train_y, multi_val_y = train_test_split(
    train_X, train_y, test_size=0.3, random_state=1337
)

The validation set is where we tune hyperparameters. If we used the test set for tuning, we would be optimising the model to perform well on that specific subset, and our “final” evaluation would be contaminated. The validation set absorbs that optimisation pressure so the test set remains a clean, independent measure.
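As a rough sketch of that workflow, the example below compares two candidate settings on the validation set and touches the test set only once, at the end. It uses synthetic data from make_classification rather than NSL-KDD, and the candidate n_estimators values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed feature matrix and labels
X, y = make_classification(n_samples=500, n_features=10, random_state=1337)

# Same two-stage split as in this entry: 20% test, then 30% of the rest for validation
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1337)
fit_X, val_X, fit_y, val_y = train_test_split(train_X, train_y, test_size=0.3, random_state=1337)

# Compare candidate hyperparameters using the validation set only
val_scores = {}
for n_estimators in (10, 50):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=1337)
    model.fit(fit_X, fit_y)
    val_scores[n_estimators] = model.score(val_X, val_y)

best_n = max(val_scores, key=val_scores.get)

# The test set is used exactly once, after the choice has been made
final_model = RandomForestClassifier(n_estimators=best_n, random_state=1337).fit(fit_X, fit_y)
test_accuracy = final_model.score(test_X, test_y)
print(best_n, round(test_accuracy, 3))
```

The important property is the order of operations: every decision is made from val_scores, and test_accuracy is computed after those decisions are final.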

What we end up with

After both splits, we have three distinct subsets:

  • multi_train_X / multi_train_y is the training subset that the model learns from directly
  • multi_val_X / multi_val_y is the validation subset used for hyperparameter tuning
  • test_X / test_y is the held-out test set for final, unbiased evaluation

The split ratios (56% train, 24% validation, 20% test) are a practical balance: enough training data for the model to learn the patterns in each attack category, enough validation data to tune confidently, and enough test data to trust the final numbers.
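Those percentages follow from applying the two test_size values in sequence: the second split takes 30% of the 80% that remains after the first. A short worked check, where the variable names are illustrative:

```python
# test_size values used in the two train_test_split calls
test_fraction = 0.2
val_fraction_of_remainder = 0.3

# What is left after the test set is removed
remainder = 1 - test_fraction

# The second split divides that remainder 70/30
val_fraction = remainder * val_fraction_of_remainder
train_fraction = remainder * (1 - val_fraction_of_remainder)

print(round(train_fraction, 2), round(val_fraction, 2), round(test_fraction, 2))
# 0.56 0.24 0.2
```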

Where this fits

The preprocessing decisions made here directly shape what the model can and cannot learn. If we had used label encoding instead of one-hot encoding for protocol_type, the model would treat protocol as an ordinal variable and learn spurious relationships between categories. If we had skipped the multi-class target, we would lose the ability to distinguish a SYN flood from a brute-force login attempt, which is the difference between blocking a port and locking an account. And if we had evaluated on the same data we trained on, the accuracy number would look impressive and mean nothing.

With the data now cleaned, encoded, and properly partitioned, we have a dataset that is ready for a random forest to train on. The next entry will build that model.
