{"id":480,"date":"2026-06-09T00:00:00","date_gmt":"2026-06-08T23:00:00","guid":{"rendered":"https:\/\/kosokoking.com\/?p=480"},"modified":"2026-05-26T20:16:01","modified_gmt":"2026-05-26T19:16:01","slug":"preprocessing-and-splitting-the-dataset","status":"publish","type":"post","link":"https:\/\/kosokoking.com\/index.php\/technology\/preprocessing-and-splitting-the-dataset\/","title":{"rendered":"Preprocessing and splitting the dataset"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A raw CSV full of network traffic tells you nothing useful until you force it into a shape a model can learn from. The NSL-KDD dataset ships with 41 features per connection record, a mix of categorical labels, continuous counters, and rate-based statistics, and every one of them needs deliberate handling before a random forest can do anything meaningful with it. Get the preprocessing wrong, leak test data into training, or encode your categories carelessly, and the model you build will look accurate on paper while failing on traffic it has never seen.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This entry walks through the full preprocessing pipeline for our random forest anomaly detection model, from target creation through encoding, feature selection, and dataset splitting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating a binary classification target<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The simplest way to frame network intrusion detection is as a binary problem. Is this connection normal, or is it an attack? 
The NSL-KDD dataset labels every record with a specific attack name or the string&nbsp;<code>normal<\/code>, so we collapse that into a single flag.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Binary classification target\n# Maps normal traffic to 0 and any type of attack to 1\ndf&#91;'attack_flag'] = df&#91;'attack'].apply(lambda a: 0 if a == 'normal' else 1)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This creates a new column&nbsp;<code>attack_flag<\/code>&nbsp;where 0 means normal traffic and 1 means any kind of attack. The lambda function checks the&nbsp;<code>attack<\/code>&nbsp;column for each row and assigns the label accordingly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you look at the raw data, the distinction is visible in the final fields of each record:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20\n0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.0,1.0,0.0,0.0,0.05,0.07,0.0,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The first record is labelled&nbsp;<code>normal<\/code>. The second is&nbsp;<code>neptune<\/code>, a SYN flood denial-of-service attack. Both become machine-readable the moment we map them to 0 and 1, but the binary label throws away the distinction between a SYN flood and a password-guessing attempt. That matters when you are trying to understand what your model is actually catching.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating a multi-class classification target<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Binary detection answers the question &#8220;is something wrong?&#8221; but it does not tell you what is wrong. 
A defender who knows that an alert was triggered by a probe scan responds differently to one triggered by privilege escalation, and a model that can distinguish between attack categories gives you that operational context.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We define four attack groups based on the categories in the NSL-KDD documentation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Multi-class classification target categories\ndos_attacks = &#91;'apache2', 'back', 'land', 'neptune', 'mailbomb', 'pod', \n               'processtable', 'smurf', 'teardrop', 'udpstorm', 'worm']\nprobe_attacks = &#91;'ipsweep', 'mscan', 'nmap', 'portsweep', 'saint', 'satan']\nprivilege_attacks = &#91;'buffer_overflow', 'loadmodule', 'perl', 'ps', \n                     'rootkit', 'sqlattack', 'xterm']\naccess_attacks = &#91;'ftp_write', 'guess_passwd', 'http_tunnel', 'imap', \n                  'multihop', 'named', 'phf', 'sendmail', 'snmpgetattack', \n                  'snmpguess', 'spy', 'warezclient', 'warezmaster', \n                  'xclock', 'xsnoop']\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DoS<\/strong>\u00a0attacks like\u00a0<code>neptune<\/code>\u00a0and\u00a0<code>smurf<\/code>\u00a0flood the target with traffic to exhaust resources<\/li>\n\n\n\n<li><strong>Probe<\/strong>\u00a0attacks like\u00a0<code>nmap<\/code>\u00a0and\u00a0<code>satan<\/code>\u00a0scan the network to map open ports and services<\/li>\n\n\n\n<li><strong>Privilege escalation<\/strong>\u00a0attacks like\u00a0<code>buffer_overflow<\/code>\u00a0and\u00a0<code>rootkit<\/code>\u00a0attempt to gain admin-level control after an initial foothold<\/li>\n\n\n\n<li><strong>Access<\/strong>\u00a0attacks like\u00a0<code>guess_passwd<\/code>\u00a0and\u00a0<code>ftp_write<\/code>\u00a0try to breach access controls directly<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">A mapping function assigns each record an integer label based on which group its attack type falls 
into:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def map_attack(attack):\n    if attack in dos_attacks:\n        return 1\n    elif attack in probe_attacks:\n        return 2\n    elif attack in privilege_attacks:\n        return 3\n    elif attack in access_attacks:\n        return 4\n    else:\n        return 0\n\n# Assign multi-class category to each row\ndf&#91;'attack_map'] = df&#91;'attack'].apply(map_attack)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Normal traffic maps to 0. DoS is 1, Probe is 2, Privilege Escalation is 3, and Access is 4. Note that the final branch also assigns 0 to any label that appears in none of the four lists, so a misspelled or missing attack name would be silently counted as normal traffic. This gives the model a richer target to learn from and gives the defender a more actionable output when the model flags something.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Encoding categorical variables<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Machine learning models work with numbers, not strings. We encode two categorical features from the NSL-KDD dataset:&nbsp;<code>protocol_type<\/code>&nbsp;(tcp, udp, icmp) and&nbsp;<code>service<\/code>&nbsp;(http, ftp, smtp, and dozens more). The dataset has other symbolic fields as well, such as the connection status flag (<code>SF<\/code>,&nbsp;<code>S0<\/code>), which this pipeline does not encode. The two we keep describe the nature of each network connection, but they need to be converted into numeric form before a model can process them.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We use one-hot encoding through the pandas&nbsp;<code>get_dummies<\/code>&nbsp;function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Encoding categorical variables\nfeatures_to_encode = &#91;'protocol_type', 'service']\nencoded = pd.get_dummies(df&#91;features_to_encode])\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">One-hot encoding creates a separate binary column for every unique value in the original feature. A record using TCP gets a 1 in the&nbsp;<code>protocol_type_tcp<\/code>&nbsp;column and a 0 in&nbsp;<code>protocol_type_udp<\/code>&nbsp;and&nbsp;<code>protocol_type_icmp<\/code>. 
This avoids a common mistake with simpler encoding methods like label encoding, where assigning tcp = 0, udp = 1, and icmp = 2 would imply that udp is somehow &#8220;between&#8221; tcp and icmp. One-hot encoding treats each protocol as its own independent signal, which is what we want.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The trade-off is dimensionality. The&nbsp;<code>service<\/code>&nbsp;feature alone has over 60 unique values in the full NSL-KDD dataset, so one-hot encoding turns a single column into 60+ columns. For a random forest this is manageable because tree-based models handle high-dimensional sparse features well, but it is worth being aware of the expansion.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Selecting numeric features<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond the categorical columns, the NSL-KDD dataset contains 34 numeric features that describe different aspects of each connection. These range from basic volume metrics like&nbsp;<code>duration<\/code>,&nbsp;<code>src_bytes<\/code>, and&nbsp;<code>dst_bytes<\/code>&nbsp;to statistical rate features like&nbsp;<code>serror_rate<\/code>&nbsp;and&nbsp;<code>dst_host_srv_diff_host_rate<\/code>&nbsp;that capture patterns across groups of connections.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Numeric features that capture various statistical properties of the traffic\nnumeric_features = &#91;\n    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', \n    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', \n    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', \n    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', \n    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', \n    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', \n    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', \n    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', \n    
'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', \n    'dst_host_srv_rerror_rate'\n]\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The feature selection here is deliberate. Features like&nbsp;<code>num_failed_logins<\/code>&nbsp;and&nbsp;<code>root_shell<\/code>&nbsp;are direct indicators of suspicious behaviour, while rate features like&nbsp;<code>serror_rate<\/code>&nbsp;(the percentage of connections to the same host that produced SYN errors) capture patterns that individual connection records would miss. A single SYN error means nothing, but a&nbsp;<code>serror_rate<\/code>&nbsp;of 1.0 across 255 connections to the same host is a clear signature of a SYN flood.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The combination of raw counts and derived rates gives the random forest two different scales of information to split on, which is where tree-based models tend to perform well.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Combining features into the training set<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With the encoded categorical variables and the numeric features prepared separately, we join them into a single DataFrame that the model will train on:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Combine encoded categorical variables and numeric features\ntrain_set = encoded.join(df&#91;numeric_features])\n\n# Multi-class target variable\nmulti_y = df&#91;'attack_map']\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<code>train_set<\/code>&nbsp;DataFrame now contains every feature the model needs, with categorical values expanded into binary indicators alongside the original numeric columns. 
The target variable&nbsp;<code>multi_y<\/code>&nbsp;holds the multi-class labels we created earlier.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Splitting the dataset<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A model that only ever sees the data it was trained on will tell you nothing about how it performs in the real world. We split the dataset into three distinct subsets, each with a specific purpose, to make sure our evaluation is honest.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Training and test split<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The first split reserves 20% of the data for final testing. This test set is not touched during training or tuning and exists solely to give us an unbiased measure of how the model generalises.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Split data into training and test sets for multi-class classification\ntrain_X, test_X, train_y, test_y = train_test_split(\n    train_set, multi_y, test_size=0.2, random_state=1337\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<code>random_state=1337<\/code>&nbsp;parameter fixes the random seed so the split is reproducible. Anyone running this code will get the same partition, which matters when you are comparing results across experiments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Creating the validation set<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We then split the remaining training data again, carving out 30% as a validation set:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Further split the training set into separate training and validation sets\nmulti_train_X, multi_val_X, multi_train_y, multi_val_y = train_test_split(\n    train_X, train_y, test_size=0.3, random_state=1337\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The validation set is where we tune hyperparameters. 
If we used the test set for tuning, we would be optimising the model to perform well on that specific subset, and our &#8220;final&#8221; evaluation would be contaminated. The validation set absorbs that optimisation pressure so the test set remains a clean, independent measure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What we end up with<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">After both splits, we have three distinct subsets:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>multi_train_X<\/code>\u00a0\/\u00a0<code>multi_train_y<\/code>\u00a0is the training subset that the model learns from directly<\/li>\n\n\n\n<li><code>multi_val_X<\/code>\u00a0\/\u00a0<code>multi_val_y<\/code>\u00a0is the validation subset used for hyperparameter tuning<\/li>\n\n\n\n<li><code>test_X<\/code>\u00a0\/\u00a0<code>test_y<\/code>\u00a0is the held-out test set for final, unbiased evaluation<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The split ratios (56% train, 24% validation, 20% test) are a practical balance: enough training data for the model to learn the patterns in each attack category, enough validation data to tune confidently, and enough test data to trust the final numbers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where this fits<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The preprocessing decisions made here directly shape what the model can and cannot learn. If we had used label encoding instead of one-hot encoding for&nbsp;<code>protocol_type<\/code>, the model would treat protocol as an ordinal variable and learn spurious relationships between categories. If we had skipped the multi-class target, we would lose the ability to distinguish a SYN flood from a brute-force login attempt, which is the difference between blocking a port and locking an account. 
And if we had evaluated on the same data we trained on, the accuracy number would look impressive and mean nothing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With the data now cleaned, encoded, and properly partitioned, we have a dataset that is ready for a random forest to train on. The next entry will build that model.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Preparing the NSL-KDD dataset for random forest anomaly detection, from binary and multi-class targets to encoding, feature selection, and honest splitting.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[630,626,51,740,669,136,764,763,726,762],"class_list":["post-480","post","type-post","status-publish","format-standard","hentry","category-technology","tag-ai-red-teaming","tag-anomaly-detection","tag-cybersecurity","tag-data-preprocessing","tag-feature-engineering","tag-machine-learning","tag-network-intrusion-detection","tag-nsl-kdd","tag-python","tag-random-forest"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/480","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/comments?post=480"}],"version-history":[{"count":1,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/480\/revisions"}],"predecessor-version":[{"id":481,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/480\/revisions\/481"}],"wp:attachment":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/media?parent=480"}],"wp:term":[{"tax
onomy":"category","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/categories?post=480"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/tags?post=480"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}