{"id":390,"date":"2026-04-16T00:00:00","date_gmt":"2026-04-15T23:00:00","guid":{"rendered":"https:\/\/kosokoking.com\/?p=390"},"modified":"2026-04-12T15:25:59","modified_gmt":"2026-04-12T14:25:59","slug":"linear-regression","status":"publish","type":"post","link":"https:\/\/kosokoking.com\/index.php\/multifarious\/linear-regression\/","title":{"rendered":"Linear regression"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A model that assumes the world moves in straight lines is a model that can be steered. Linear regression is the simplest supervised learning algorithm in production, and it is everywhere. It is in anomaly scoring in SIEMs, baseline prediction in fraud detection, trend estimation in threat intelligence feeds. It is also, by virtue of its transparency, the easiest model to manipulate if you understand how it learns.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before you can poison a training pipeline or craft adversarial inputs against a machine learning system, you need to understand what the model is actually doing with the data it receives. Linear regression is where that understanding starts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What regression actually means<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Regression is a category of supervised learning where the output is a continuous number, not a label. The model is not deciding &#8220;malicious or benign.&#8221; It is estimating a value: how much, how many, how likely on a sliding scale.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In security contexts, regression models show up in places most analysts never inspect. A SIEM that scores alerts by predicted severity is often running a regression model underneath. A fraud system that estimates transaction risk as a percentage is doing regression. Network baseline tools that predict expected traffic volume for a given time window, then flag deviations, are regression. The output is not a binary classification. 
It is a number on a range, and that number drives downstream decisions: which alerts surface, which transactions get held, which network segments get scrutinised.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The distinction matters for red teaming because the attack surface is different. With classification models, the goal is usually to flip a label (make &#8220;malicious&#8221; read as &#8220;benign&#8221;). With regression models, the goal is to shift a number: push a risk score below a threshold, nudge an anomaly prediction into the expected range, or inflate a baseline so that genuinely anomalous behaviour looks normal by comparison.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Simple linear regression<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In its simplest form, linear regression models the relationship between one input variable and one output variable as a straight line:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>y = mx + c\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Where&nbsp;<code>y<\/code>&nbsp;is the predicted output,&nbsp;<code>x<\/code>&nbsp;is the input,&nbsp;<code>m<\/code>&nbsp;is the slope (how much&nbsp;<code>y<\/code>&nbsp;changes for each unit change in&nbsp;<code>x<\/code>), and&nbsp;<code>c<\/code>&nbsp;is the y-intercept (the value of&nbsp;<code>y<\/code>&nbsp;when&nbsp;<code>x<\/code>&nbsp;is zero).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The model&#8217;s entire job is to find the values of&nbsp;<code>m<\/code>&nbsp;and&nbsp;<code>c<\/code>&nbsp;that produce the best-fitting line through the training data. &#8220;Best-fitting&#8221; means the line that minimises the total error between what the model predicts and what actually happened.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A concrete example: suppose a network baseline model predicts expected bytes transferred per hour based on the hour of the day. The model learns that at 3am, traffic is typically low; at 10am, it peaks. 
The slope and intercept encode that pattern. Anything that deviates significantly from the predicted value gets flagged.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For a red teamer, the interesting question is: what happens if you can influence the training data? If you generate synthetic traffic at 3am for long enough, the model&#8217;s slope shifts. The baseline adjusts upward. Your actual exfiltration at 3am now falls within the &#8220;expected&#8221; range. The alert never fires.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Multiple linear regression<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When the model uses more than one input variable, the equation extends:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>y = b0 + b1*x1 + b2*x2 + ... + bn*xn\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Each coefficient (<code>b1<\/code>,&nbsp;<code>b2<\/code>, etc.) represents how much each input variable contributes to the prediction, and&nbsp;<code>b0<\/code>&nbsp;is the intercept.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most production models use multiple inputs. A fraud detection regression might take transaction amount, time of day, geographic distance from last transaction, and merchant category as inputs, then output a risk score. Each coefficient tells you how heavily the model weighs that factor.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is operationally useful to an attacker because the coefficients are interpretable. Unlike a neural network where the decision logic is opaque, a linear regression model&#8217;s weights are readable. If you can extract or infer the coefficients, you know exactly which input to manipulate and by how much. If the model weights geographic distance heavily, you route your fraudulent transactions through a location close to the victim&#8217;s last legitimate purchase. 
If it weights transaction amount, you keep each transaction below the amount at which that coefficient pushes the score over the alert threshold.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The interpretability that makes linear regression trustworthy for defenders is the same property that makes it transparent to attackers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How the model learns<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The standard method for fitting a linear regression model is Ordinary Least Squares (OLS). The process is mechanical:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>For each data point, calculate the residual, which is the difference between the actual value and the model&#8217;s predicted value.<\/li>\n\n\n\n<li>Square each residual, so negative and positive errors are treated equally and large errors are penalised more heavily.<\/li>\n\n\n\n<li>Sum all the squared residuals to produce the Residual Sum of Squares (RSS).<\/li>\n\n\n\n<li>Choose the coefficients that minimise that sum.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">The result is the line (or hyperplane, in multiple regression) that minimises total squared error across the training data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From a red teaming perspective, the squared-error objective creates a specific vulnerability. Because OLS penalises large errors quadratically, outliers have disproportionate influence on the fitted model. A single extreme data point can pull the regression line significantly. This is the mechanism behind data poisoning attacks against linear models: you do not need to corrupt the entire dataset. 
A small number of carefully placed outliers can shift the coefficients enough to change downstream decisions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Published research bears this out. <a href=\"https:\/\/arxiv.org\/abs\/1703.04730\" title=\"Koh and Liang (2017)\">Koh and Liang (2017)<\/a> used influence functions to trace a model&#8217;s predictions back to the individual training points that affect them most. <a href=\"https:\/\/gangw.cs.illinois.edu\/class\/cs562\/papers\/poison-sp18.pdf\" title=\"\">Jagielski et al. (2018)<\/a> then demonstrated optimisation-based poisoning attacks against linear regression on health care, loan, and house pricing datasets, showing that a small fraction of poisoned training samples can substantially degrade the model&#8217;s predictions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The assumptions, and why attackers care about them<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Linear regression makes four assumptions about the data it is trained on. Each assumption, when violated, degrades the model&#8217;s reliability. For a defender, that means checking assumptions before trusting the model. For a red teamer, it means understanding which assumption to violate.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Linearity.<\/strong>&nbsp;The model assumes the relationship between inputs and output is a straight line. If the real relationship is curved or otherwise nonlinear, the model will systematically mispredict in certain regions. An attacker operating in those mispredicted regions benefits from the model&#8217;s blind spot. The model is confident in its prediction. The prediction is wrong.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Independence.<\/strong>&nbsp;Each observation in the training data is assumed to be independent of every other observation. In time-series security data (logs, traffic flows, alert sequences), this assumption is almost always violated. Events are correlated. 
An attacker who understands the temporal structure of the data can exploit the gap between what the model assumes and what the data actually contains.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Homoscedasticity.<\/strong>&nbsp;The variance of the errors should be constant across all levels of the input variables. In plain terms: the model should be equally accurate whether it is predicting a low value or a high value. When this fails (heteroscedasticity), the model&#8217;s confidence intervals are wrong. It might report high confidence in a region where it is actually unreliable. For an attacker, this means operating in the region where the model&#8217;s error variance is highest, because that is where predictions are least trustworthy and anomalies are hardest to distinguish from noise.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Normality of errors.<\/strong>&nbsp;The residuals should follow a normal distribution. This matters less for prediction accuracy and more for statistical inference: confidence intervals, p-values, and hypothesis tests all depend on this assumption. In a security context, if residuals are not normally distributed, the model&#8217;s uncertainty estimates are wrong. A risk score reported as &#8220;95% confident&#8221; might actually be far less reliable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What this means for the series<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Linear regression is the first model in this series for a reason. It is the simplest supervised learning algorithm with real production deployment, and its simplicity makes the attack surface legible. Every concept introduced here (coefficient manipulation, data poisoning via outlier injection, assumption violation as an attack vector) scales up to more complex models. The difference is that in linear regression, you can see the mechanism working. 
In a neural network, the same vulnerabilities exist but the internals are opaque.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you are building an AI red teaming methodology, start here. Understand how OLS fits a line. Understand why outliers move it. Understand what happens when the assumptions break. Then carry that understanding forward into gradient-boosted trees, support vector machines, and eventually deep learning. The principles transfer. The complexity increases, but the logic of manipulation does not change: find what the model trusts, and subvert it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Linear regression powers SIEM scoring, fraud detection, and baselines. Here is how it works, and why red teamers need to understand it before anything else.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[640,630,51,635,637,136,638,639,631,636],"class_list":["post-390","post","type-post","status-publish","format-standard","hentry","category-multifarious","tag-adversarial-ai","tag-ai-red-teaming","tag-cybersecurity","tag-data-poisoning","tag-linear-regression","tag-machine-learning","tag-model-security","tag-siem","tag-supervised-learning","tag-threat-modelling"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/comments?post=390"}],"version-history":[{"count":2,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/390\/revisions"}]
,"predecessor-version":[{"id":399,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/390\/revisions\/399"}],"wp:attachment":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/media?parent=390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/categories?post=390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/tags?post=390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}