{"id":462,"date":"2026-06-01T00:00:00","date_gmt":"2026-05-31T23:00:00","guid":{"rendered":"https:\/\/kosokoking.com\/?p=462"},"modified":"2026-05-23T13:10:38","modified_gmt":"2026-05-23T12:10:38","slug":"metrics-for-evaluating-a-model","status":"publish","type":"post","link":"https:\/\/kosokoking.com\/index.php\/technology\/metrics-for-evaluating-a-model\/","title":{"rendered":"Metrics for evaluating a model"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A malware classifier reports 99.5% accuracy and the security team signs off. Three months later, a post-incident review reveals that the model missed every novel payload variant that entered the network during that period, because 99.5% of all traffic was benign and the model had learned to predict &#8220;clean&#8221; almost unconditionally. The accuracy metric told the truth and lied at the same time. If you are building, attacking, or defending machine learning models, the evaluation metrics are where confidence is manufactured or earned. Understanding what each metric actually measures, and more importantly where each one fails, is the difference between trusting a model that works and trusting one that merely looks like it does.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What the numbers actually represent<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In earlier entries of this series, we introduced accuracy, precision, recall, and F1-score as ways to measure classifier performance. Each metric quantifies a different relationship between the model&#8217;s predictions and the known ground-truth labels, and each one has a blind spot that the others are designed to cover.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before walking through them individually, it helps to fix the vocabulary. 
Every prediction a binary classifier makes falls into one of four categories: true positives (correctly identified threats), true negatives (correctly cleared benign items), false positives (benign items incorrectly flagged), and false negatives (threats that slipped through). The confusion matrix is just a 2&#215;2 table of these four counts, and every metric discussed here is a different way of reading that table.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Accuracy<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Accuracy is the simplest metric. It measures the proportion of correct predictions out of all predictions made.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>accuracy = (true positives + true negatives) \/ total instances\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A model reporting accuracy of 0.9950 is correct 99.50% of the time. That sounds authoritative until you consider the base rate of the problem. In network intrusion detection, malicious traffic might represent less than 1% of all packets. A model that labels everything as benign will then achieve better than 0.99 accuracy without detecting a single attack. The metric rewards the model for getting the easy class right and says nothing about whether it can identify the class you actually care about.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Accuracy is useful when the classes are roughly balanced and the cost of each type of error is similar. 
In most security applications, neither of those conditions holds, which is why accuracy alone is almost never sufficient.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Precision<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Precision answers a specific question: when the model says something is positive, how often is it right?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>precision = true positives \/ (true positives + false positives)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A precision of 0.9949 means that when the model flags something, it is correct 99.49% of the time. In operational terms, high precision means a smaller share of alerts are false alarms, and that matters because every false positive consumes analyst time. A SOC team drowning in false alerts will eventually start ignoring them, which is functionally the same as having no detection at all. Alert fatigue is one of the most documented failure modes in security operations, and precision is the metric that tracks it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The trade-off is that a model can achieve perfect precision by being extremely conservative. If it only flags the cases it is almost certain about, it will miss many real threats. Precision tells you nothing about what the model failed to catch.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Recall<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Recall measures the opposite concern: of all the actual positives in the dataset, how many did the model find?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>recall = true positives \/ (true positives + false negatives)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A recall of 0.9950 means the model detects 99.50% of all genuine positives. In threat detection, recall is the metric that tracks whether malicious activity is slipping through. 
A phishing classifier with low recall means malicious emails are reaching inboxes, regardless of how clean its flagged items are.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The trade-off mirrors precision in the opposite direction. A model can achieve perfect recall by flagging everything, but doing so makes precision collapse. Every legitimate email ends up in quarantine, and the system becomes unusable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">F1-score<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The F1-score is the harmonic mean of precision and recall, which means it penalises imbalance between the two more heavily than a simple average would.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>F1 = 2 * (precision * recall) \/ (precision + recall)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">An F1-score of 0.9949 indicates that both precision and recall are strong and reasonably aligned. The harmonic mean ensures that if either metric is low, the F1-score drops sharply, so a model cannot hide behind one strong number while the other collapses.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For security classification tasks where both false positives and false negatives carry operational costs, the F1-score provides a single number that reflects whether the model is doing a balanced job. It will not tell you which direction the errors lean, so you still need to inspect precision and recall individually, but it is a useful summary when comparing models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Beyond the core four<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">These four metrics cover most classification evaluation scenarios, but several additional measures are worth knowing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Specificity<\/strong>&nbsp;measures how effectively the model identifies negatives (true negatives divided by all actual negatives). 
In a malware classifier, specificity tells you how often clean files are correctly cleared.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>AUC-ROC<\/strong>&nbsp;(area under the receiver operating characteristic curve) evaluates the model&#8217;s ability to distinguish between classes across all possible decision thresholds, not just the one currently set. A model with a high AUC can be tuned to favour either precision or recall depending on the operational context, which makes this metric particularly useful during model selection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Matthews Correlation Coefficient<\/strong>&nbsp;(MCC) accounts for all four quadrants of the confusion matrix and returns a value between -1 and +1. Unlike accuracy, MCC remains informative even when classes are heavily imbalanced, because a model that predicts only the majority class scores zero rather than near one (the formula&#8217;s denominator vanishes in that case, and the score is conventionally reported as 0).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where metrics deceive<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When evaluating a model&#8217;s metrics (for example, accuracy 0.9750, precision 0.9300, recall 0.9100, F1-score 0.9200), the numbers themselves are only meaningful in context. Several questions determine whether those values reflect genuine performance or favourable test conditions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, consider whether the dataset reflects operational reality. A malware classifier trained and tested on a dataset where 50% of samples are malicious will report very different metrics, precision in particular, than one evaluated against real-world traffic where the malicious proportion is closer to 0.1%. If the test set does not match the deployment distribution, the metrics are measuring performance against a problem the model will never actually face.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, consider whether the metrics are consistent across segments of the data. 
A model might achieve strong aggregate precision while performing poorly on specific malware families or attack techniques. Aggregate numbers can mask localised failures, and those failures are often exactly where an adversary will probe.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Third, consider the asymmetric cost of errors. In threat detection, a missed intrusion (false negative) can result in data exfiltration or ransomware deployment, while a false alarm (false positive) costs analyst time. These costs are not equal, and a model should be evaluated against the cost structure of the environment it will operate in. Some contexts call for favouring recall over precision, accepting more false alarms to avoid missing anything. Others demand precision because the analyst team is small and every false alarm diverts resources from real incidents.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The adversarial angle<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For anyone in this series studying how models can be attacked, metrics reveal specific targets. If you know a model was optimised for precision, you know the developers prioritised reducing false positives, which likely means the decision boundary is conservative. An adversary can exploit this by crafting inputs that sit just below the model&#8217;s confidence threshold, samples that are technically malicious but sufficiently similar to benign data that the model declines to flag them. The model&#8217;s precision stays high because it never flagged the adversarial input, and its recall quietly degrades.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Conversely, a model tuned for high recall will flag aggressively, and an adversary can weaponise the resulting alert volume. 
By generating a large number of borderline inputs that trigger false positives, the attacker can bury genuine malicious activity in noise, exploiting the operational consequence of alert fatigue rather than any weakness in the model&#8217;s architecture.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Understanding what a defender optimised for tells you where the gaps are.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Reading the whole picture<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">No single metric captures model performance completely, and the temptation to report whichever number looks strongest is one of the quieter problems in ML evaluation. The confusion matrix is the most honest representation because it shows exactly where the model is right and where it is wrong, without collapsing those details into a single scalar. Every derived metric is a lossy compression of that matrix, and each one throws away the information that makes a different metric useful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how accuracy, precision, recall, and F1-score work in practice, where each metric deceives, and how adversaries exploit the gaps they leave 
behind.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[640,630,131,746,747,51,749,136,745,748],"class_list":["post-462","post","type-post","status-publish","format-standard","hentry","category-technology","tag-adversarial-ai","tag-ai-red-teaming","tag-artificial-intelligence","tag-classification-metrics","tag-confusion-matrix","tag-cybersecurity","tag-data-science","tag-machine-learning","tag-model-evaluation","tag-precision-and-recall"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/462","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/comments?post=462"}],"version-history":[{"count":2,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/462\/revisions"}],"predecessor-version":[{"id":465,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/462\/revisions\/465"}],"wp:attachment":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/media?parent=462"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/categories?post=462"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/tags?post=462"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}