{"id":439,"date":"2026-05-04T00:00:00","date_gmt":"2026-05-03T23:00:00","guid":{"rendered":"https:\/\/kosokoking.com\/?p=439"},"modified":"2026-04-26T21:44:43","modified_gmt":"2026-04-26T20:44:43","slug":"generative-ai","status":"publish","type":"post","link":"https:\/\/kosokoking.com\/index.php\/technology\/generative-ai\/","title":{"rendered":"Generative AI"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A neural network trained to classify will tell you what something is. A generative model will produce something that never existed and dare you to prove it is not real. That shift, from prediction to creation, is where the attack surface changes fundamentally. Adversarial examples against a classifier trick a model into mislabelling an input. Adversarial attacks against a generative model can produce synthetic content that is indistinguishable from reality, reconstruct private training data, or hijack the generation process itself to produce whatever the attacker wants.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The previous articles in this series covered the building blocks: how neural networks learn through gradient descent, how layers of connected perceptrons approximate complex functions, and why gradient-based learning is both the strength and the structural vulnerability of deep learning. Generative AI takes those same mechanics and points them at a different objective. Instead of learning to draw a boundary between classes, generative models learn to reproduce the statistical structure of their training data well enough to sample new instances from it. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What &#8220;generative&#8221; actually means<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A generative model learns the probability distribution of its training data. That sentence is dense, so let us unpack it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Every dataset has a shape. 
If you train on thousands of photographs of human faces, there are statistical regularities in that data: eyes tend to appear in certain positions relative to the nose, skin tones cluster in predictable ranges, backgrounds follow common patterns. A generative model learns those regularities well enough to produce new images that obey the same statistical rules. The generated face is not a copy of any training example. It is a new sample drawn from the learned distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is different from classification, where the model learns a decision boundary. A classifier trained on faces and non-faces learns where the boundary sits between them. A generative model trained on faces learns the internal structure of what makes a face a face, and then produces new ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For a red teamer, the distinction matters because the attack targets are different. Against a classifier, you are trying to push an input across a boundary. Against a generative model, you are either trying to manipulate what it produces, extract what it learned, or exploit the gap between its learned distribution and reality.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The four architectures you need to know<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Generative AI is not one technique. It is a family of architectures, each with a different mechanism for learning and sampling from data distributions. The four that matter are GANs, VAEs, autoregressive models, and diffusion models. Each has a distinct attack surface.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Generative adversarial networks<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">GANs are built on a competition. Two neural networks, the generator and the discriminator, are trained simultaneously in opposition. 
The generator produces synthetic data, initially random noise, and the discriminator evaluates whether each sample is real (from the training set) or fake (from the generator). The generator&#8217;s objective is to fool the discriminator. The discriminator&#8217;s objective is to catch fakes. As training progresses, the generator gets better at producing realistic outputs and the discriminator gets better at spotting them, until (ideally) the generator produces content that the discriminator cannot reliably distinguish from real data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The adversarial training loop is the defining feature. It is also the source of GANs&#8217; most exploitable behaviours.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Mode collapse<\/strong>&nbsp;is the failure condition a red teamer should understand first. It happens when the generator discovers a narrow set of outputs that consistently fools the discriminator and stops exploring. The result is a model that produces convincing but repetitive content, generating the same face with minor variations, the same malware signature with trivial modifications, the same synthetic log pattern on repeat. A GAN-based data augmentation pipeline suffering from mode collapse is generating less diverse training data than it appears to, which means any downstream model trained on that data has blind spots the augmentation was supposed to eliminate.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Training instability<\/strong>&nbsp;is the second. GANs are notoriously difficult to train because the generator and discriminator must improve at roughly the same rate. If the discriminator gets too good too fast, the generator receives no useful gradient signal and stops learning. If the generator outpaces the discriminator, it learns to exploit specific weaknesses in the discriminator&#8217;s evaluation rather than actually producing better content. 
Both failure modes produce models with predictable blind spots.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From a red teaming perspective, GANs have been used extensively in both offensive and defensive contexts. On the offensive side, deepfake generation is the obvious application, but GANs have also been used to generate adversarial examples that transfer across models. The DG-GAN framework, published in <a href=\"https:\/\/www.nature.com\/articles\/s41598-024-83444-x\" title=\"\">a 2025 paper<\/a> in Scientific Reports, demonstrated a bidirectional approach where the same architecture can both generate adversarial examples and defend against them, effectively mapping the relationship between clean and adversarial inputs through a paired generator-encoder structure. On the defensive side, GANs are used in cybersecurity for synthetic data generation (augmenting sparse attack datasets for training intrusion detection systems) and for adversarial training (hardening classifiers by exposing them to GAN-generated adversarial inputs during training).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Variational autoencoders<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">VAEs take a different approach. Instead of an adversarial game, they learn a compressed representation of the data called a latent space, and use it to generate new samples.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The architecture has two halves. The encoder compresses input data into a low-dimensional latent representation. The decoder reconstructs data from that latent representation. The critical detail is that the latent space is structured as a probability distribution (typically Gaussian), not as a set of fixed points. This means you can sample new points from the latent space and decode them into novel outputs that are consistent with the training data&#8217;s statistical properties.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The latent space is the map. 
Similar data points cluster together in latent space, and smooth interpolation between points produces smooth transitions in the output. If you train a VAE on faces, moving along one axis of the latent space might smoothly transition from light to dark hair. Moving along another might shift from young to old.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For a red teamer, the latent space is both the asset and the vulnerability. It is a compressed, navigable representation of everything the model learned from its training data. If you can access or reverse-engineer the latent space, you can systematically explore what the model knows, generate targeted outputs by choosing specific regions of the space, and potentially reconstruct training data by probing the boundaries of the learned distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Model inversion attacks exploit this directly. <a href=\"https:\/\/openreview.net\/forum?id=TvhEoz1nim\" title=\"\">A 2024 paper<\/a> at ICLR explored replacing GAN generators in model inversion attacks with single-step generators distilled from diffusion models, demonstrating that the quality of reconstructed training data improves significantly when the generative prior is stronger. The implication is that as generative models improve at producing realistic content, they also become more effective tools for extracting private information from other models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Autoregressive models<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Autoregressive models generate content sequentially, one element at a time, where each new element is conditioned on all previous elements. You have used one. Every large language model, from GPT to Claude to Gemini, is an autoregressive model. 
It predicts the next token (a word fragment, roughly) based on all preceding tokens.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The generation process is straightforward: given a sequence of tokens, the model outputs a probability distribution over all possible next tokens, one is selected (through sampling, with temperature and top-p controls shaping the randomness), it is appended to the sequence, and the process repeats. The entire output is built token by token, left to right.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This sequential generation creates the attack surface that the <a href=\"https:\/\/genai.owasp.org\/llm-top-10\/\" title=\"\">OWASP Top 10 for LLM Applications<\/a> was built to catalogue. Prompt injection, the top risk on that list, works because the model cannot distinguish between instructions from the system prompt and instructions embedded in user input. Both are just tokens in the sequence. The model processes them identically because, architecturally, it has no concept of privilege levels within its input context.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The 2025 OWASP list identifies ten categories of LLM vulnerability: prompt injection, sensitive information disclosure, supply chain risks, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Every one of these is rooted in the autoregressive architecture&#8217;s fundamental properties. The model generates text by predicting the statistically most likely continuation. It has no internal model of truth, no access control mechanism within the context window, and no inherent concept of whether its output is being used as a database query, a shell command, or a friendly chat message.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Indirect prompt injection is the variant that should concern practitioners most. An attacker does not need to interact with the model directly. 
They embed adversarial instructions in a document, a web page, or a database record that the model will later retrieve and process. When the model ingests that content as part of a retrieval-augmented generation (RAG) pipeline, the injected instructions become part of the context and influence the output. Microsoft&#8217;s AI Red Team documented cases where hidden instructions in images, resumes, and web pages successfully manipulated LLM copilots during their <a href=\"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2025\/01\/13\/3-takeaways-from-red-teaming-100-generative-ai-products\/\" title=\"\">assessment of over 100 generative AI products<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Diffusion models<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Diffusion models are the newest major architecture, and they work by learning to reverse a noise process. Training involves two processes. The forward process gradually adds Gaussian noise to each training image until nothing but static remains. For the reverse process, a neural network is trained to undo that noise, one small step at a time, recovering the original image from the static.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Generation works by starting from pure random noise and running the learned reverse process. Each step removes a small amount of noise, gradually resolving a coherent image from chaos. The quality of the output depends on how well the model learned the noise-to-signal mapping during training.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Stable Diffusion, DALL-E, and Midjourney all use this architecture (or variants of it) for image generation. Text-to-image models add a conditioning mechanism: the denoising process is guided by a text embedding, so the model generates images that match a text prompt.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The attack surface of diffusion models is broad and still being mapped. 
An <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3721479\" title=\"\">ACM Computing Surveys paper<\/a> catalogued attacks across four categories: adversarial attacks (perturbing inputs to cause misgeneration), membership inference attacks (determining whether a specific image was in the training data), backdoor injection (embedding hidden triggers that cause specific outputs when activated), and multi-modal threats (exploiting the interaction between text and image modalities).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Backdoor injection is particularly concerning. An attacker who can poison a small portion of the training data can insert a trigger pattern, a specific watermark, phrase, or pixel arrangement, that causes the model to generate a specific output when the trigger is present in the prompt. The model behaves normally on all other inputs. The OWASP Top 10 for LLMs flags data and model poisoning as a core risk, and diffusion models are especially susceptible because they are commonly fine-tuned on user-contributed datasets where quality control is minimal.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The concepts that connect them<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Certain ideas recur across all four architectures. Understanding them as a red teamer means understanding where the common vulnerabilities live.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Latent space as attack surface<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Every generative model (except the simplest autoregressive ones) operates through some form of latent representation. This compressed internal space is where the model&#8217;s knowledge is encoded, and it is where many of the most effective attacks operate. Model inversion reconstructs training data by probing the latent space. Adversarial attacks craft inputs that target specific regions. 
Backdoor triggers map to specific latent space coordinates that the attacker has conditioned the model to associate with a desired output.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Sampling as a controllable process<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Generation is not deterministic. It involves sampling from a learned distribution, and the sampling parameters (temperature, top-p, noise schedules, guidance scales) control the tradeoff between quality and diversity. From a security perspective, these parameters are controls that affect the attack surface. A model with high temperature generates more diverse but less predictable outputs. A model with low temperature is more deterministic and more predictable. Red teamers should understand these controls because they determine how reproducible an attack is and how much variance there is in the model&#8217;s failure modes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Overfitting as information leakage<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When a generative model overfits, it memorises its training data rather than learning the underlying distribution. This is a privacy catastrophe. An overfit model can be prompted to reproduce training examples verbatim, which means any sensitive data in the training set (personal information, proprietary code, medical records, credentials) is potentially extractable. Membership inference attacks test for exactly this condition: given a data point, can you determine whether it was in the training set by analysing the model&#8217;s behaviour? 
Research has demonstrated that both GANs and diffusion models are susceptible to these attacks, and that the risk increases with model size and training duration.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Evaluation metrics and their blind spots<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The standard metrics for evaluating generative AI, Inception Score (IS) for image quality, Fr&#233;chet Inception Distance (FID) for distribution similarity, and BLEU for text generation, all measure statistical properties of the output. None of them measure safety. A GAN with an excellent FID score might be producing photorealistic deepfakes. An LLM with strong BLEU scores might be generating plausible-sounding misinformation. These metrics tell you the model is generating statistically convincing content. They tell you nothing about whether that content is harmful, private, or manipulated.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">NIST&#8217;s <a href=\"https:\/\/nvlpubs.nist.gov\/nistpubs\/ai\/NIST.AI.100-2e2025.pdf\" title=\"\">AI 100-2e2025 report<\/a> on adversarial machine learning separates attacks on generative systems into training-time (poisoning, backdoors) and deployment-time (evasion, privacy extraction) categories. The evaluation metrics in common use address neither. 
A red teamer who relies on standard evaluation metrics to assess a generative model&#8217;s safety is measuring the wrong thing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What this means for red teaming<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With generative AI, the red teamer&#8217;s objective extends beyond simple misclassification. It now includes forcing the model to produce specific content, to reveal protected data, or to behave in other ways that contradict its operators&#8217; intentions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The OWASP GenAI Security Project published its <a href=\"https:\/\/genai.owasp.org\/2026\/04\/14\/owasp-genai-exploit-round-up-report-q1-2026\/\" title=\"\">Q1 2026 exploit round-up<\/a> covering January through April 2026, and the pattern is clear. The attacks that land in production are not theoretical. They are prompt injections embedded in documents that RAG pipelines ingest, fine-tuning poisoning through community model hubs, and membership inference probes against models trained on sensitive data. The Carnegie Mellon SEI published <a href=\"https:\/\/doi.org\/10.1184\/R1\/29410136\" title=\"\">a report<\/a> arguing that AI red teaming needs to borrow methodologies from cybersecurity red teaming, specifically structured threat modelling, defined scoping, and standardised reporting, because the current approach in most organisations is ad hoc prompt fuzzing with no framework for measuring coverage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By breaking vulnerabilities into technique and tactic as well as weakness and impact, Microsoft&#8217;s AI Red Team provides a framework that fits the architectures in this article. 
One such application is a backdoor injection intended for misinformation that leverages poor data validation in the fine-tuning pipeline to give an attacker control over the model&#8217;s output.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Generative AI: How GANs, VAEs, autoregressive models, and diffusion models work, and the specific attack surfaces each architecture exposes to AI red teamers.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[668,630,51,709,706,73,662,707,710,708],"class_list":["post-439","post","type-post","status-publish","format-standard","hentry","category-technology","tag-adversarial-machine-learning","tag-ai-red-teaming","tag-cybersecurity","tag-diffusion-models","tag-gans","tag-generative-ai","tag-machine-learning-security","tag-model-inversion","tag-owasp-top-10-llm","tag-prompt-injection"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/439","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/comments?post=439"}],"version-history":[{"count":1,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/439\/revisions"}],"predecessor-version":[{"id":440,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/posts\/439\/revisions\/440"}],"wp:attachment":[{"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/media?parent=439"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/categories?post=439"
},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kosokoking.com\/index.php\/wp-json\/wp\/v2\/tags?post=439"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}