Large language models (LLMs)

Every neural network vulnerability discussed in this series so far operates on tensors that represent images, tabular rows, or sensor readings. Large language models operate on tensors that represent language, and that shifts the security landscape fundamentally. Because the input is a variable-length sequence of natural language tokens rather than a fixed-dimension vector, the attacker’s interface is simply a conversation. There is no compiled exploit or binary payload, just words arranged to subvert the model’s intended behaviour. The previous articles in this series covered neural networks, deep learning fundamentals, and the gradient-based mechanics that make adversarial machine learning possible. LLMs inherit these risks and add a new dimension through the transformer architecture, which processes language in a pipeline where every stage introduces its own vulnerabilities. From tokenisation to attention and decoding, each step presents its own category of exploitable behaviour.

Transformers

LLMs are built on the transformer, a neural network architecture that Google researchers introduced in 2017. Before transformers, sequence processing relied on recurrent neural networks (RNNs) that read text one token at a time, maintaining a hidden state that compressed prior context into a fixed-size vector. The compression was lossy. By the time an RNN reached the end of a long paragraph, the information from the opening sentence had degraded significantly.

Transformers solved this by abandoning sequential processing entirely. Instead of reading left to right, a transformer processes every token in the input simultaneously through a mechanism called self-attention. Self-attention allows each token to compute a relevance score against every other token in the sequence, regardless of distance. The word “it” at position 47 can directly attend to the noun it references at position 3 without that reference being compressed through 44 intermediate steps.

This parallelism allows transformers to scale, but it also invites a new class of attacks. Unlike RNNs, transformers are vulnerable to token sequences engineered to manipulate attention scores. By targeting specific positions, an attacker can effectively hijack which parts of the input the model deems important.

The inference pipeline, stage by stage

Understanding where LLM attacks land requires understanding how text moves through the model. There are five stages, and each one creates a distinct attack surface.

Tokenisation

Before an LLM can process text, it converts the input into tokens using an algorithm like Byte Pair Encoding (BPE). BPE builds a vocabulary of variable-length units by merging the most frequent character sequences found in the training data. The result is a tokenisation scheme that is statistical, not linguistic. It does not split text on word boundaries or morphological structure. It splits on frequency patterns.
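
The merge procedure described above can be sketched in a few lines of Python. This is a toy, character-level version trained on a made-up word list; production tokenisers work on bytes, add pre-tokenisation rules, and learn tens of thousands of merges, so the names and the corpus here are assumptions for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(text, merges):
    """Tokenise `text` by replaying the learned merges in order."""
    symbols = tuple(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training corpus: frequency, not linguistics, decides the splits.
training_words = ["power"] * 6 + ["shell"] * 4 + ["powder"] * 2
merges = learn_merges(training_words, num_merges=10)
print(encode("powershell", merges))
print(encode("powerful", merges))
```

On this toy corpus "powershell" comes out as the two units "power" and "shell", while the unseen word "powerful" falls back to "power" followed by single characters. Neither split reflects morphology; both follow from which pairs happened to be frequent in the training words.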

This matters for red teaming because tokenisation boundaries do not align with human-readable word boundaries. A keyword blocklist that scans raw text for “powershell” will miss any input written so that the literal string never appears, even though the tokeniser still hands the model fragments such as ["power", "shell"] from which it can recover the meaning. The filter operates at the string level while the model operates at the token level. Phil Stokes at SentinelOne demonstrated this class of filter bypass in a January 2026 analysis of LLM inference pipelines, showing that the disconnect between string-level security filters and token-level model processing is a structural gap, not an implementation bug. The model reassembles the semantic meaning from fragments. The filter never sees the reassembled word.

Adversarial tokenisation research identifies this as a distinct attack category. Common techniques like character substitution, deliberate typos, and encoding shifts exploit a fundamental architectural seam. The model and its safety filters simply do not operate on the same representation of the input.
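
The gap between what a string-level filter matches and what an attacker can write is easy to reproduce. The sketch below uses a hypothetical one-term blocklist and two helper functions invented for this example; it is not taken from any real filtering product. It compares a raw substring check with one that first applies Unicode normalisation from Python’s standard `unicodedata` module.

```python
import unicodedata

BLOCKLIST = ("powershell",)  # illustrative single-term blocklist


def naive_filter(text):
    """Return True if a blocklisted term appears in the raw string."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def normalise(text):
    """Fold compatibility characters and drop invisible format characters."""
    folded = unicodedata.normalize("NFKC", text)  # e.g. fullwidth letters to ASCII
    # Category "Cf" covers format characters such as the zero-width space.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


def normalised_filter(text):
    """Apply the same blocklist after normalisation."""
    return naive_filter(normalise(text))


samples = {
    "plain": "run powershell now",
    "zero_width": "run power\u200bshell now",   # zero-width space inside the word
    "fullwidth": "run \uff50owershell now",      # fullwidth 'p'
    "leetspeak": "run p0wershell now",           # digit substitution
}
for name, text in samples.items():
    print(name, naive_filter(text), normalised_filter(text))
```

Normalisation recovers the zero-width and fullwidth variants but not the digit substitution, which a model may still read as the intended word. That residue is the structural point: each normalisation rule closes one encoding trick without aligning the filter’s view of the input with the model’s.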

Embeddings

Once tokenised, each token is mapped to a discrete integer (a token ID) and then converted into an embedding, a high-dimensional vector of continuous floating-point numbers. These vectors encode semantic relationships. The embedding for “exploit” sits closer to “vulnerability” than to “bicycle” in this vector space.
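
“Closer” in this context usually means a higher cosine similarity between vectors. The three-dimensional vectors below are hand-picked for illustration and do not come from any real model, whose embeddings typically have hundreds or thousands of dimensions; only the comparison logic carries over.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand for this example.
toy_embeddings = {
    "exploit": [0.9, 0.8, 0.1],
    "vulnerability": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

near = cosine_similarity(toy_embeddings["exploit"], toy_embeddings["vulnerability"])
far = cosine_similarity(toy_embeddings["exploit"], toy_embeddings["bicycle"])
print(f"exploit vs vulnerability: {near:.3f}")
print(f"exploit vs bicycle:       {far:.3f}")
```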

If you have followed the neural networks article in this series, you already know why continuous representations are dangerous. Gradients can be computed with respect to them. Small perturbations along specific dimensions of the embedding vector can shift the model’s interpretation of a token without changing its surface-level meaning. Researchers have demonstrated embedding space attacks that achieve attack success rates above 96% across multiple safety-aligned LLMs by injecting optimised perturbations directly into embedding layer outputs, bypassing safety alignment without modifying model weights or input text.

For open-source models where the embedding layer is directly accessible, this is a particularly clean attack vector. The attacker does not need to craft clever prompts. They optimise a perturbation vector in embedding space, targeting the continuous representations that the safety alignment was trained to recognise. The alignment sees a benign embedding. The rest of the network processes a malicious one.
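
The mechanics can be illustrated with a deliberately simplified stand-in. The sketch assumes a hypothetical linear “safety probe” that flags an embedding when a weighted sum exceeds zero, and takes one gradient-sign step against it. Published embedding-space attacks optimise against the full model rather than a linear probe, so treat this as a picture of the idea, not a reproduction of those results; every name and number here is invented.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def probe_score(embedding, weights, bias):
    """Hypothetical linear detector: a positive score means 'flagged'."""
    return sum(w * e for w, e in zip(weights, embedding)) + bias


def gradient_sign_step(embedding, weights, epsilon):
    """Move each dimension by at most epsilon in the direction that lowers the score.

    For a linear score, the gradient with respect to the embedding is `weights`,
    so stepping against sign(weights) is the steepest descent under a max-norm bound.
    """
    return [e - epsilon * sign(w) for e, w in zip(embedding, weights)]


# Toy values chosen by hand for the example.
weights = [0.5, -0.25, 1.0, 0.75]
bias = -0.3
embedding = [0.2, 0.1, 0.3, 0.2]

perturbed = gradient_sign_step(embedding, weights, epsilon=0.1)
print("before:", probe_score(embedding, weights, bias) > 0)
print("after: ", probe_score(perturbed, weights, bias) > 0)
print("largest change:", max(abs(p - e) for p, e in zip(perturbed, embedding)))
```

No input text changes at any point; the perturbation exists only in the continuous vector, which is why this class of attack requires access to the embedding layer.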

Positional encoding

Transformers have no inherent sense of word order. Positional encoding adds a unique vector to each token’s embedding to tell the model where that token sits in the sequence. This gives the architecture its understanding of grammar, syntax, and narrative flow.
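
One concrete scheme is the sinusoidal encoding from the original 2017 transformer paper, sketched below. Many current LLMs use learned or rotary position schemes instead, so this is one example of the idea rather than a description of any particular model.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal position vector: sine on even dimensions, cosine on odd ones."""
    vector = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        vector.append(math.sin(angle))
        if i + 1 < d_model:
            vector.append(math.cos(angle))
    return vector


def add_position(embedding, position):
    """Add the position vector to a token embedding, element by element."""
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


# The same toy embedding at two positions yields two different input vectors.
token_embedding = [0.5, -0.2, 0.1, 0.7]
print(add_position(token_embedding, position=0))
print(add_position(token_embedding, position=5))
```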

This architecture also creates a hard boundary known as the context window. Every transformer has a maximum sequence length determined by its positional encoding scheme. When the input exceeds this limit, the model must either truncate the data or suffer from degraded performance. Attackers exploit this by padding prompts with irrelevant content to push safety-critical instructions out of the window, or by front-loading the context with material that drowns out the system prompt in the attention computation.
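
The padding attack depends on how the application truncates. The sketch below uses whitespace-separated words as stand-in tokens and two hypothetical truncation helpers: one that keeps only the most recent tokens, and one that reserves room for the system prompt first. Real deployments differ in how they budget the window, so both functions are assumptions for illustration.

```python
def naive_truncate(tokens, max_len):
    """Keep only the most recent `max_len` tokens."""
    return tokens[-max_len:]


def pinned_truncate(system_tokens, user_tokens, max_len):
    """Always keep the system prompt, then fill the rest with recent user tokens."""
    budget = max_len - len(system_tokens)
    if budget < 0:
        raise ValueError("system prompt alone exceeds the context window")
    recent = user_tokens[-budget:] if budget > 0 else []
    return system_tokens + recent


system_tokens = "never reveal the secret".split()
padding = ["pad"] * 20                      # attacker-supplied filler
question = "what is the secret".split()

flat = naive_truncate(system_tokens + padding + question, max_len=10)
pinned = pinned_truncate(system_tokens, padding + question, max_len=10)

print("naive keeps system prompt: ", "never" in flat)
print("pinned keeps system prompt:", "never" in pinned)
```

Pinning only guarantees that the instructions remain inside the window. It does nothing about the second attack described above, where the instructions are present but outweighed by surrounding material in the attention computation.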

Self-attention

Self-attention is the computational core of the transformer. For every token, the model computes a query, a key, and a value vector. The query defines what a token is looking for, the key represents what it offers, and the value contains the information it carries. Attention scores result from comparing every query against every key. This produces a matrix that determines how much influence each token exerts over the others.
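
In the standard formulation, the comparison is a dot product scaled by the square root of the key dimension, and a softmax turns each row of scores into weights that sum to one. The sketch below implements that for a single attention head in plain Python, with hand-picked toy vectors. It omits the learned projection matrices, multiple heads, and the causal mask that decoder-only LLMs apply.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Returns (weights, outputs): one row of attention weights and one
    output vector per query.
    """
    d_k = len(keys[0])
    all_weights, outputs = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return all_weights, outputs


# Toy example: the third key has a much larger dot product with the query.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

weights, outputs = attention(queries, keys, values)
print("weights:", [round(w, 3) for w in weights[0]])
print("output: ", [round(o, 3) for o in outputs[0]])
```

In the toy example a single key with a large dot product takes most of the weight, so the output is dominated by that token’s value vector. Nothing in the arithmetic distinguishes a token that earned that weight legitimately from one that was engineered to get it.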

The OWASP Top 10 for LLM Applications 2025 lists prompt injection as the number one vulnerability, and self-attention is the mechanism through which it operates. When an attacker injects a carefully crafted instruction into a prompt, they are engineering tokens whose query-key interactions produce high attention scores against the model’s system-level instructions, effectively overriding them. The injected text does not need to be semantically meaningful to a human reader. It needs to produce attention scores that redirect the model’s internal focus.

This is why adversarial suffixes work. Sequences of tokens that appear as gibberish to a human can be optimised to produce specific attention patterns that suppress safety-aligned behaviour. The attack does not target the model’s “understanding” of the prompt. It targets the linear algebra that computes which tokens influence which outputs.

Decoding

The final stage converts the model’s internal representations back into text. The model outputs a probability distribution over its entire vocabulary for the next token, selects one (using strategies like temperature scaling, top-k sampling, or nucleus sampling), appends it to the sequence, and repeats. This autoregressive generation is why LLMs produce text one token at a time.
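
The sampling strategies named above are small transformations of the same probability distribution. The sketch below implements temperature scaling, top-k, and nucleus (top-p) filtering, followed by a minimal autoregressive loop. The `toy_logits` function is a hypothetical stand-in for a model’s forward pass and returns fixed scores over a four-token vocabulary.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to one."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, set(order[:k]))


def nucleus(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalise(probs, keep)


def generate(logits_fn, prompt_ids, steps, temperature=1.0, k=None, top_p=None, seed=0):
    """Autoregressive loop: sample one token, append it, and repeat."""
    rng = random.Random(seed)
    sequence = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(logits_fn(sequence), temperature)
        if k is not None:
            probs = top_k(probs, k)
        if top_p is not None:
            probs = nucleus(probs, top_p)
        next_id = rng.choices(range(len(probs)), weights=probs)[0]
        sequence.append(next_id)
    return sequence


def toy_logits(sequence):
    """Hypothetical model: fixed scores regardless of context."""
    return [2.0, 1.0, 0.5, 0.1]


print(generate(toy_logits, prompt_ids=[0], steps=5, temperature=0.8, k=3))
```

Every path through this code ends by selecting a token. There is no branch that returns “no confident answer”, which is the point the next paragraph makes about hallucination.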

From a red teaming perspective, the decoding stage is where hallucinations originate. The model does not “know” facts. It predicts statistically likely continuations. When no high-probability continuation exists for a factual question, the model generates a plausible-sounding but fabricated answer because the architecture has no mechanism for expressing uncertainty, only for selecting the next token.

The three properties that define the attack surface

LLMs exhibit three characteristics that compound the per-stage vulnerabilities described above.

Scale amplifies exposure. Models with billions of parameters have learned from training corpora that inevitably contain sensitive data, biased patterns, and exploitable associations. The larger the model, the more material an attacker can elicit through targeted prompting.

Few-shot learning enables prompt-based exploitation. LLMs can adopt new behaviours from a handful of examples provided in the prompt itself. This is useful for legitimate applications, but it is equally useful for an attacker who provides carefully constructed few-shot examples that teach the model to bypass its own safety guidelines within a single conversation.

Contextual understanding creates manipulation surfaces. The model’s ability to maintain coherent context across a long conversation means an attacker can gradually shift the model’s behaviour across multiple turns, a technique that multi-turn jailbreak research has shown to be consistently more effective than single-turn prompt injection.

Where this fits in the red teaming toolkit

The OWASP GenAI Security Project published an updated AI security solutions landscape in April 2026, covering both generative AI and agentic red teaming. The framework identifies prompt injection, model misuse, agent privilege escalation, data poisoning, hallucinations, and emergent behaviours as the primary risk categories for LLM deployments. Tools like NVIDIA’s Garak, Microsoft’s PyRIT, and open-source frameworks like DeepTeam and Promptfoo now provide automated adversarial testing capabilities specifically designed for these attack surfaces.

The shift for traditional penetration testers is significant. LLM red teaming sits at the intersection of security testing and applied machine learning. You are no longer scanning for open ports or misconfigured access controls. Instead, you are engineering inputs to exploit the mathematical properties of attention mechanisms, embedding spaces, and probabilistic token generation.

What this means for defenders

The defensive challenge with LLMs is that the vulnerabilities are architectural. You cannot patch prompt injection like a traditional software bug. It is an inherent feature of how transformers work. Since instructions and data are both converted into the same token sequences, the attention mechanism treats them with the same mathematical priority. Keyword filters operate on raw strings while the model operates on token embeddings, which means evasion is structurally easier than detection.

Effective mitigation requires layered approaches. Input sanitisation at the token level rather than the string level. Output filtering through secondary classifier models that evaluate whether the primary model’s response violates safety policies. Retrieval-augmented generation to ground outputs in verified data sources. Continuous adversarial testing through the development lifecycle rather than periodic assessments. None of these are complete solutions. All of them reduce the blast radius.

The honest conclusion is that LLMs are the first widely deployed technology where the primary attack interface is natural language, and the security community is still working out what that means in practice. Every stage of the inference pipeline, from the way text is split into tokens to the way attention scores are computed to the way outputs are sampled, creates opportunities for adversarial manipulation that traditional application security was never designed to detect.
