{"id":468,"date":"2026-06-03T00:00:00","date_gmt":"2026-06-02T23:00:00","guid":{"rendered":"https:\/\/kosokoking.com\/?p=468"},"modified":"2026-05-24T17:04:55","modified_gmt":"2026-05-24T16:04:55","slug":"bayesian-spam-classification-the-dataset","status":"publish","type":"post","link":"https:\/\/kosokoking.com\/index.php\/technology\/bayesian-spam-classification-the-dataset\/","title":{"rendered":"Bayesian spam classification: the dataset"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Every classifier is only as honest as the data it learned from. Feed a Bayesian model a corpus full of duplicates, unlabelled noise, or messages scraped from a single demographic, and the probabilities it computes will be confident, consistent, and wrong. Before we write a single line of classification logic, we need a dataset we can trust, and we need to verify that trust with code rather than assumptions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this entry, we will work with the <a href=\"https:\/\/www.kaggle.com\/datasets\/uciml\/sms-spam-collection-dataset\" title=\"\">SMS Spam Collection dataset<\/a>, a labelled corpus of text messages that gives us a clean, well-understood foundation for building a Naive Bayes spam classifier. The goal is to understand what we are loading, where it came from, and what condition it is in before we let a model anywhere near it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The SMS Spam Collection<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The SMS Spam Collection dataset was assembled by Tiago A. Almeida and Akebo Yamakami at the University of Campinas in Brazil, alongside Jose Maria Gomez Hidalgo at the R&amp;D department of Optenet in Spain. 
Their paper, &#8220;<a href=\"https:\/\/www.researchgate.net\/publication\/221353226_Contributions_to_the_study_of_SMS_spam_filtering_new_collection_and_results\" title=\"\">Contributions to the Study of SMS Spam Filtering: New Collection and Results<\/a>,&#8221; was presented at the 2011 ACM Symposium on Document Engineering, and it addressed a gap that mattered at the time. Most spam filtering research focused on email, but SMS spam was a growing problem with different characteristics: shorter messages, different language patterns, and a delivery mechanism that bypassed traditional mail filters entirely.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The authors pulled messages from multiple sources, including the Grumbletext website, the NUS SMS Corpus, and <a href=\"http:\/\/etheses.bham.ac.uk\/253\/1\/Tagg09PhD.pdf\" title=\"\">Caroline Tagg&#8217;s PhD thesis<\/a>. The resulting corpus contains 5,574 text messages, each annotated as either\u00a0<strong>ham<\/strong>\u00a0(legitimate) or\u00a0<strong>spam<\/strong>\u00a0(unwanted). In this context, ham covers messages from known contacts, subscriptions, or newsletters that hold value for the recipient, while spam represents unsolicited content that typically offers no benefit and may pose risks to the user.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For our purposes, this dataset is useful precisely because it is small, well-labelled, and widely benchmarked. We are not trying to build a production spam filter. We are trying to understand how Bayesian classification works at a mechanical level, and a clean, manageable corpus lets us focus on the algorithm rather than wrestling with data engineering problems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Downloading the dataset<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The first step is to pull the dataset programmatically. 
We will download it directly from the UCI Machine Learning Repository, which hosts the original archive.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import requests\nimport zipfile\nimport io\n\n# URL of the dataset\nurl = \"https:\/\/archive.ics.uci.edu\/static\/public\/228\/sms+spam+collection.zip\"\n\n# Download the dataset (TLS certificate verification stays enabled by default)\nresponse = requests.get(url)\nif response.status_code == 200:\n    print(\"Download successful\")\nelse:\n    print(\"Failed to download the dataset\")\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We use the&nbsp;<strong>requests<\/strong>&nbsp;library to send an HTTP GET request to the dataset URL and check the status code to confirm the download succeeded. A status code of 200 means the server returned the file without errors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once the download completes, we extract the contents. The dataset ships as a&nbsp;<code>.zip<\/code>&nbsp;archive, and Python&#8217;s&nbsp;<strong>zipfile<\/strong>&nbsp;and&nbsp;<strong>io<\/strong>&nbsp;modules handle decompression without writing a temporary file to disk.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Extract the dataset\nwith zipfile.ZipFile(io.BytesIO(response.content)) as z:\n    z.extractall(\"sms_spam_collection\")\n    print(\"Extraction successful\")\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here,&nbsp;<code>response.content<\/code>&nbsp;holds the raw binary data of the downloaded archive. 
We wrap it in&nbsp;<code>io.BytesIO<\/code>&nbsp;to create an in-memory file-like object that&nbsp;<code>zipfile.ZipFile<\/code>&nbsp;can read directly, and&nbsp;<code>extractall<\/code>&nbsp;writes the contents into a local directory called&nbsp;<code>sms_spam_collection<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To confirm the extraction worked and see what files we are dealing with, we list the directory contents.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import os\n\n# List the extracted files\nextracted_files = os.listdir(\"sms_spam_collection\")\nprint(\"Extracted files:\", extracted_files)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<code>os.listdir<\/code>&nbsp;function returns the names of all files and directories at the specified path, which lets us verify that the&nbsp;<code>SMSSpamCollection<\/code>&nbsp;file is present and ready to load.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Loading the dataset<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With the archive extracted, we can load the data into a&nbsp;<strong>pandas<\/strong>&nbsp;DataFrame for inspection and analysis. The SMS Spam Collection is stored as a tab-separated values file, which means each row contains a label and a message separated by a tab character.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Load the dataset\ndf = pd.read_csv(\n    \"sms_spam_collection\/SMSSpamCollection\",\n    sep=\"\\t\",\n    header=None,\n    names=&#91;\"label\", \"message\"],\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We specify&nbsp;<code>sep=\"\\t\"<\/code>&nbsp;to tell pandas that columns are delimited by tabs rather than commas. 
Since the file contains no header row, we set&nbsp;<code>header=None<\/code>&nbsp;and provide column names manually through the&nbsp;<code>names<\/code>&nbsp;parameter, giving us a clean two-column DataFrame with&nbsp;<code>label<\/code>&nbsp;and&nbsp;<code>message<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Inspecting the data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before doing anything with this data, we need to understand its shape, its types, and whether it contains any problems that would corrupt downstream analysis. This is not a formality. Dirty data produces misleading probability distributions, and in a Bayesian classifier, every probability matters because the entire prediction depends on them.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Display basic information about the dataset\nprint(\"-------------------- HEAD --------------------\")\nprint(df.head())\nprint(\"-------------------- DESCRIBE --------------------\")\nprint(df.describe())\nprint(\"-------------------- INFO --------------------\")\n# info() prints its summary directly and returns None, so it is not wrapped in print()\ndf.info()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><code>df.head()<\/code>&nbsp;shows us the first few rows so we can visually confirm that labels and messages loaded into the correct columns.&nbsp;<code>df.describe()<\/code>&nbsp;provides a statistical summary, which for a text-based dataset tells us the number of unique labels and the most common one.&nbsp;<code>df.info()<\/code>&nbsp;gives a concise overview of column data types and non-null counts, letting us spot structural issues before they surface as bugs further down the line.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we check for missing values. 
A null entry in the label column would mean an unlabelled message slipping into the training data, and a null message would mean the classifier trying to compute word probabilities from nothing.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Check for missing values\nprint(\"Missing values:\\n\", df.isnull().sum())\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<code>isnull<\/code>&nbsp;method returns a boolean DataFrame of the same shape as the original, marking&nbsp;<code>True<\/code>&nbsp;wherever a value is missing. Calling&nbsp;<code>sum<\/code>&nbsp;on that result counts the missing entries per column, giving us a clear picture of data completeness.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we handle duplicates. Duplicate messages are a real problem for a Bayesian classifier because they artificially inflate the frequency of certain word patterns, which skews the conditional probabilities the model relies on for prediction.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Check for duplicates\nprint(\"Duplicate entries:\", df.duplicated().sum())\n\n# Remove duplicates if any\ndf = df.drop_duplicates()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<code>duplicated<\/code>&nbsp;method returns a boolean Series that flags every row repeating an earlier one, leaving the first occurrence unflagged, so&nbsp;<code>sum<\/code>&nbsp;counts only the extra copies. We then use&nbsp;<code>drop_duplicates<\/code>&nbsp;to remove those copies, keeping only the first occurrence of each unique label and message pair.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What we have now<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">At this point, we have a clean, deduplicated DataFrame with two columns: a label indicating ham or spam, and the raw message text. 
The dataset is small enough to work with interactively but large enough to produce meaningful probability estimates when we build the classifier.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The work we have done here might look routine, but it is the foundation everything else rests on. When we compute prior probabilities and likelihoods in the next entry, those calculations assume the training data is complete, correctly labelled, and free of duplicates that would distort word frequencies. If any of those assumptions were wrong, we would not get an error message. We would get a classifier that looked perfectly functional but made quietly incorrect predictions, and that is the kind of failure that matters most in adversarial contexts, where an attacker is actively trying to find the gap between what your model thinks and what is actually true.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Preparing the SMS Spam Collection dataset for Bayesian classification, covering download, extraction, loading, and cleaning through an adversarial 
lens.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[630,752,51,740,753,136,650,751,726,653],"class_list":["post-468","post","type-post","status-publish","format-standard","hentry","category-technology","tag-ai-red-teaming","tag-bayesian-classification","tag-cybersecurity","tag-data-preprocessing","tag-dataset-preparation","tag-machine-learning","tag-naive-bayes","tag-natural-language-processing","tag-python","tag-spam-filtering"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/468","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/comments?post=468"}],"version-history":[{"count":2,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/468\/revisions"}],"predecessor-version":[{"id":470,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/468\/revisions\/470"}],"wp:attachment":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/media?parent=468"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/categories?post=468"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/tags?post=468"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}