The AI glossary
Most AI glossaries read like they were written by someone who attended a conference keynote and took notes on a napkin. They define terms by restating the term in longer words. They avoid opinions entirely, because opinions require understanding, and understanding requires having actually used the thing. This is a different kind of glossary. If you work in technology or security, these are the terms you will encounter, defined by someone who has watched these systems fail in ways that marketing pages never mention. Here is what they actually mean.
AGI
Artificial general intelligence. A term with no agreed definition, which should tell you everything about how close we are to building it.
OpenAI’s Sam Altman has described AGI as the equivalent of a median human coworker. OpenAI’s own charter defines it as systems that outperform humans at most economically valuable work. Google DeepMind frames it as AI matching human capability across most cognitive tasks. Three definitions from two organisations, and none of them come with a measurement framework.
The honest answer is that AGI is a moving goalpost used primarily in fundraising decks and long-term research roadmaps. Every time AI systems clear a benchmark that was once considered a marker of general intelligence (chess, Go, medical exams, coding competitions), the definition shifts. AGI is always five to ten years away. It has been five to ten years away for decades.
For practitioners, the term is largely irrelevant to daily work. What matters is whether a specific model can perform a specific task reliably enough to deploy. That question has a testable answer. AGI does not.
AI agent
An AI agent is a system that chains multiple operations together to complete a multi-step task with minimal human intervention. Where a chatbot takes a single prompt and returns a single response, an agent might receive a goal (“book the cheapest flight to Berlin next Tuesday”), decompose it into sub-tasks (search flights, compare prices, check calendar conflicts, enter payment details), and execute them sequentially or in parallel.
The concept is straightforward. The reality is messier. Most agent frameworks today are fragile orchestration layers on top of large language models, held together with prompt engineering and retry logic. They break in predictable ways: the model misinterprets an intermediate step, hallucinates a call to a tool that does not exist, or enters a loop. Error handling is primitive compared to traditional software engineering, because the execution path is non-deterministic.
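A minimal sketch makes those failure modes concrete. Everything here is hypothetical: `fake_model` stands in for an LLM call, `TOOLS` is an invented registry, and the step cap is an assumed guard rather than any particular framework’s API.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_flights": lambda query: f"3 flights found for {query}",
    "compare_prices": lambda flights: "cheapest option: 89 EUR",
}


def fake_model(goal: str, history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM call. Returns (tool_name, tool_argument)."""
    plan = ["search_flights", "compare_prices", "finish"]
    return plan[min(len(history), len(plan) - 1)], goal


def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):  # hard cap guards against endless loops
        tool, arg = fake_model(goal, history)
        if tool == "finish":
            return history
        if tool not in TOOLS:  # the model asked for a tool that does not exist
            history.append(f"error: unknown tool {tool!r}")
            continue
        history.append(TOOLS[tool](arg))
    raise RuntimeError("step budget exhausted without finishing")


result = run_agent("Berlin next Tuesday")
```

Even this toy version needs two guards that ordinary code would not: a check for tools that do not exist and a budget on the number of steps.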
The infrastructure to make agents genuinely reliable (robust tool APIs, sandboxed execution environments, proper state management, and meaningful observability) is still being built. The term “AI agent” is currently doing more work in product announcements than it is in production environments.
Chain of thought
Chain-of-thought reasoning is the practice of forcing a large language model to show its working before producing an answer. Instead of jumping directly to a conclusion, the model generates intermediate reasoning steps, which tends to improve accuracy on problems that require logic, arithmetic, or multi-step deduction.
The mechanism is worth understanding. LLMs predict the next token in a sequence. When you ask a direct question, the model generates an answer based on pattern matching against its training data. For simple factual recall, this works. For anything requiring inference across multiple premises, pattern matching alone produces confident nonsense.
Chain-of-thought prompting, and the reasoning models trained to use it by default, changes the distribution of tokens the model generates. By producing reasoning steps first, the model effectively conditions its own output on a longer, more structured context. The answer at the end of a chain-of-thought trace is often better because the trace itself constrains what tokens are plausible.
The trade-off is straightforward: reasoning models are slower and more expensive per query. They consume more tokens (both for processing and output), which means higher latency and higher cost. For tasks that genuinely require multi-step logic, this is worth it. For tasks that do not, you are paying a tax for reasoning the model did not need to do.
Compute
Compute is the raw processing capacity required to train and run AI models. In practice, the word is shorthand for GPU hours, because the modern AI stack runs almost entirely on graphics processing units repurposed for matrix multiplication.
The economics of compute define the AI industry more than any algorithmic breakthrough. Training a frontier model requires thousands of GPUs running continuously for weeks or months. The electricity bill alone can reach millions. Access to compute is the primary bottleneck for most AI labs, and the reason that a small number of companies with massive capital reserves (and the ability to secure NVIDIA hardware at scale) dominate the field.
When someone says a model required “more compute” than its predecessor, they mean it consumed more GPU hours during training. When a cloud provider sells “AI compute”, they are renting you GPU time. The entire economic layer of AI, from pricing to competitive advantage to geopolitical tension over chip exports, traces back to this single resource.
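The arithmetic behind “GPU hours” is simple enough to sketch. Every number below (cluster size, run length, power draw per GPU, electricity price) is an assumed round figure for illustration, not data about any real training run.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU hours = number of GPUs x wall-clock hours."""
    return num_gpus * days * 24


def electricity_cost_usd(num_gpus: int, days: float,
                         kw_per_gpu: float, usd_per_kwh: float) -> float:
    """Energy cost = GPU hours x kilowatts per GPU x price per kWh."""
    return gpu_hours(num_gpus, days) * kw_per_gpu * usd_per_kwh


# Assumed: 16,000 GPUs for 90 days, 1.0 kW per GPU including cooling
# overhead, 0.10 USD per kWh.
hours = gpu_hours(16_000, 90)                         # 34,560,000 GPU hours
bill = electricity_cost_usd(16_000, 90, 1.0, 0.10)    # about 3.46 million USD
```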
Deep learning
Deep learning is machine learning with neural networks that have many layers. That is the entire distinction. The word “deep” refers to the depth of the network architecture (the number of layers between input and output), not to any philosophical quality of the learning itself.
Each layer in a deep neural network transforms its input and passes the result to the next layer. Early layers tend to learn low-level features (edges in an image, phonemes in audio), while later layers combine these into higher-level representations (faces, words, concepts). This hierarchical feature extraction is what makes deep learning effective for tasks like image recognition, speech processing, and natural language understanding.
The practical trade-off is data and compute. Deep learning models are hungry. They need large datasets to generalise well, and they need significant processing power to train. A three-layer network can be trained on a laptop. A model with billions of parameters across hundreds of layers requires a data centre. The performance gains from depth are real, but they come at a cost that scales non-linearly.
Diffusion
Diffusion models generate data by learning to reverse a noise process. During training, clean data (an image, an audio clip) is progressively corrupted with random noise until the original signal is destroyed. The model then learns to run this process backwards: given a noisy sample, it predicts a slightly less noisy version, and repeating that step many times produces a clean output.
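The noising half of that process needs no learning at all and can be sketched in a few lines. The linear schedule and its endpoints below follow a common DDPM-style setup, which is an assumption here. The function names are illustrative, and the learned denoiser is left out entirely.

```python
import math
import random


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                    beta_end: float = 0.02) -> list[float]:
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0: list[float], alpha_bar: float,
              rng: random.Random) -> list[float]:
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
clean = [1.0, -1.0, 0.5, 0.0]
slightly_noisy = add_noise(clean, alpha_bars[10], rng)   # signal mostly intact
mostly_noise = add_noise(clean, alpha_bars[-1], rng)     # signal almost gone
```

The fraction of signal kept (`alpha_bars`) starts near 1 and decays towards 0, which is the “until the original signal is destroyed” part. The model is trained to undo one small step of this at a time.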
This is the mechanism behind Stable Diffusion, DALL-E, and most modern image generation tools. The reason diffusion works well is that the denoising task is easier to learn than direct generation. Instead of trying to produce a coherent image from nothing (which is a very hard optimisation problem), the model only needs to remove a small amount of noise at each step. Across hundreds of steps, these small corrections compound into a coherent result.
The cost is inference speed. Because generation requires many sequential denoising steps, diffusion models are slower than single-pass generators. Techniques like latent diffusion (operating in a compressed representation space rather than pixel space) and step-reduction methods have improved this, but the fundamental architecture remains iterative.
Distillation
Distillation transfers knowledge from a large model to a smaller one. The large model (the teacher) generates outputs on a dataset. The smaller model (the student) is then trained to replicate those outputs rather than learning from raw data directly.
The student model learns the teacher’s behaviour, including the soft probability distributions over possible answers, not just the final prediction. These soft targets contain information about which wrong answers the teacher considered plausible, and this information turns out to be valuable training signal. A student trained on soft targets from a strong teacher often outperforms an identically sized model trained from scratch on the same raw data.
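A small sketch shows what a soft target looks like and how a student can be scored against it. The logits are invented, and the temperature parameter is an assumption borrowed from the usual distillation recipe rather than anything specified above.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented logits for three classes, e.g. cat, dog, car.
teacher_logits = [6.0, 3.0, -2.0]
student_logits = [4.0, 1.0, 0.5]

hard = softmax(teacher_logits, temperature=1.0)   # nearly one-hot
soft = softmax(teacher_logits, temperature=4.0)   # "dog" stays visibly plausible
loss = kl_divergence(soft, softmax(student_logits, temperature=4.0))
```

At temperature 1 the teacher puts about 95% on the first class. At temperature 4 the second class rises to roughly 29%, and that extra information about plausible wrong answers is what the student is trained to match.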
This is how companies ship fast, cheap models that punch above their weight. It is also how some labs have been credibly accused of bootstrapping from competitors. If you distill from another company’s API, you are extracting their training investment through their own inference endpoint. Most terms of service prohibit this explicitly. Enforcement is another matter.
Fine-tuning
Fine-tuning takes a pre-trained model and continues training it on a narrower dataset to improve performance on a specific task or domain. The base model provides general language understanding. The fine-tuning data teaches it the patterns, terminology, and expected outputs for a particular use case.
The economics are favourable. Training a foundation model from scratch costs millions. Fine-tuning one costs thousands, sometimes less. This asymmetry is the reason most commercial AI products are fine-tuned variants of someone else’s base model, not models trained from zero.
The failure mode is also worth knowing. Fine-tuning on a small or poorly curated dataset can degrade the model’s general capabilities (a problem called catastrophic forgetting) or bake in biases present in the fine-tuning data. The model does not evaluate whether the new data is correct. It adjusts its weights to fit whatever you feed it. Garbage in, confidently stated garbage out.
GAN
A generative adversarial network pits two neural networks against each other. One (the generator) produces synthetic data. The other (the discriminator) tries to distinguish synthetic data from real data. The generator improves by trying to fool the discriminator. The discriminator improves by getting better at spotting fakes. Both networks sharpen each other through competition.
GANs were the dominant architecture for realistic image generation before diffusion models overtook them. They remain relevant for face generation, style transfer, data augmentation, and, notoriously, deepfakes. The adversarial training structure produces sharp, realistic outputs without the iterative denoising overhead of diffusion.
The weakness is training instability. GANs are difficult to train well. The two networks can fall out of equilibrium (the best-known failure is mode collapse, where the generator produces a narrow range of outputs that reliably fool the discriminator), and hyperparameter tuning is more finicky than with other architectures. Diffusion models won the generative AI race partly because they are easier to train reliably at scale.
Hallucination
Hallucination is the industry’s polite term for an AI model generating false information with full confidence. The model does not know it is wrong. It has no mechanism for knowing. It produces the statistically most likely sequence of tokens given its input, and sometimes that sequence is factually incorrect.
The mechanism is structural, not a bug to be patched. LLMs are trained on text, not on verified truth. They learn patterns of language, including patterns that look like authoritative statements. When a model encounters a prompt that falls outside its training distribution, or that sits in a region where its training data was sparse or contradictory, it does not say “I don’t know.” It generates the most plausible-sounding completion. Plausible-sounding and correct are different things.
This is the reason that every serious deployment of LLMs in production includes some form of grounding: retrieval-augmented generation (pulling from verified sources), tool use (letting the model call APIs for factual data), or human review. Running an LLM without grounding in any context where accuracy matters is an engineering decision you will regret.
Inference
Inference is the production workload of AI. Training builds the model. Inference runs it. Every time you send a message to ChatGPT, Claude, or any other AI assistant, you are triggering an inference pass: the model processes your input, runs it through its layers, and generates an output.
The cost profile of inference is different from training. Training is a one-time (or periodic) capital expense: a massive burst of compute to produce a set of weights. Inference is an ongoing operational expense: every user query costs money. For popular AI products serving millions of users, inference costs dwarf training costs over time.
Hardware optimisation for inference is a major area of investment. Techniques like quantisation (reducing the numerical precision of weights to lower memory and compute requirements), speculative decoding (a small draft model proposes several tokens that the large model verifies in a single parallel pass), and KV caching (storing intermediate computations so they do not need to be recomputed) all aim to reduce the per-query cost without meaningfully degrading output quality.
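Quantisation is the easiest of these to show in miniature. The sketch below uses symmetric per-tensor int8 quantisation, which is only one of several schemes and is an assumption here. The weights are invented.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantised]


weights = [0.42, -1.27, 0.05, 0.9]        # invented example weights
quantised, scale = quantise_int8(weights)  # -> [42, -127, 5, 90]
restored = dequantise(quantised, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of four (fp32) or two (fp16). The price is a rounding error of at most half a scale step per weight.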
Large language model (LLM)
A large language model is a neural network, typically a transformer, trained on a massive corpus of text to predict the next token in a sequence. That is the entire mechanism. Everything an LLM does, from writing code to answering questions to generating poetry, emerges from next-token prediction trained at sufficient scale.
The “large” refers to parameter count. Modern LLMs have billions of parameters (weights) spread across dozens or hundreds of layers. These parameters encode statistical relationships between tokens, learned from training data that typically includes books, articles, code repositories, and web crawls. The model does not store facts as a database does. It stores patterns of co-occurrence that allow it to generate plausible continuations of any text prefix.
This architecture explains both the strengths and weaknesses of LLMs. They are remarkably fluent and flexible because human language is full of statistical regularities, and the model has learned an enormous number of them. They hallucinate because they optimise for plausibility, not truth. They struggle with precise arithmetic because arithmetic is not a pattern-matching task. Understanding what an LLM actually is (a very large autocomplete engine trained on text) makes its behaviour predictable rather than magical.
Memory cache
KV caching is an optimisation technique for transformer-based models that stores the key and value tensors computed for previous tokens so they do not need to be recomputed when generating the next token. Without caching, a transformer generating a 1,000-token response would recompute attention over all previous tokens at every step. With KV caching, each step only computes attention for the new token against the stored keys and values.
The impact on inference performance is significant. KV caching reduces the attention cost of each generation step from quadratic in sequence length to linear. For long outputs, this is the difference between a response taking seconds and taking minutes.
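The saving can be counted directly. The sketch below tallies attention scores (query-key dot products) for a single generation step, assuming causal attention in one head of one layer. The function names are illustrative only.

```python
def scores_without_cache(seq_len: int) -> int:
    """Every position re-attends to itself and all earlier positions."""
    return sum(position + 1 for position in range(seq_len))


def scores_with_cache(seq_len: int) -> int:
    """Only the newest token attends to the cached keys, its own included."""
    return seq_len


uncached = scores_without_cache(1_000)   # 500,500 scores for this one step
cached = scores_with_cache(1_000)        # 1,000 scores for this one step
```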
The trade-off is memory. KV caches grow linearly with sequence length and batch size. For models serving many concurrent users with long contexts, cache memory can become the binding constraint on throughput. This is why context window length is not free: a model that supports a 128k token context window needs proportionally more memory per user session than one capped at 8k, even if most queries only use a fraction of that window.
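The memory side is plain multiplication. The model shape below (32 layers, 8 key-value heads, head dimension 128, fp16 values) is a hypothetical configuration chosen for round numbers, and the calculation assumes the cache is sized for the full window.

```python
def kv_cache_bytes(seq_len: int, batch_size: int = 1, num_layers: int = 32,
                   num_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Keys plus values, for every layer, head, position, and sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 2 ** 30
short_context = kv_cache_bytes(8 * 1024) / GIB     # 1.0 GiB per session
long_context = kv_cache_bytes(128 * 1024) / GIB    # 16.0 GiB per session
```

Under these assumptions each token costs 128 KiB of cache, so sixteen times the context means sixteen times the memory for every concurrent session.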
Neural network
A neural network is a computational graph of nodes (neurons) organised in layers, where each connection between nodes carries a learnable weight. Input data enters at one end, gets transformed through successive layers of weighted sums and non-linear activation functions, and produces an output at the other end. Training adjusts the weights to minimise the difference between the model’s output and the desired output.
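A forward pass through a tiny network shows the weighted sums and activations at work. The weights below are invented by hand rather than learned, and ReLU is an assumed choice of activation.

```python
def relu(x: float) -> float:
    """Non-linear activation: pass positives through, clamp negatives to zero."""
    return max(0.0, x)


def dense(inputs: list[float], weights: list[list[float]],
          biases: list[float], activation) -> list[float]:
    """One layer: a weighted sum per neuron, plus bias, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Invented weights: 2 inputs -> 2 hidden neurons -> 1 output.
HIDDEN_W, HIDDEN_B = [[0.5, -1.0], [1.5, 0.25]], [0.0, -0.5]
OUTPUT_W, OUTPUT_B = [[1.0, -2.0]], [0.1]


def forward(x: list[float]) -> float:
    hidden = dense(x, HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUTPUT_W, OUTPUT_B, lambda v: v)[0]


hidden = dense([2.0, 1.0], HIDDEN_W, HIDDEN_B, relu)   # [0.0, 2.75]
prediction = forward([2.0, 1.0])                       # approximately -5.4
```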
The name comes from a loose analogy with biological neurons, and the analogy is about fifty years past its usefulness. Modern neural networks bear about as much resemblance to the human brain as an aeroplane bears to a bird. Both fly. The mechanisms are entirely different.
What matters in practice is the architecture: how the layers are connected, what operations each layer performs, and how gradients flow during training. Convolutional networks excel at spatial data. Recurrent networks handle sequences (though transformers have largely replaced them). Transformers use attention mechanisms to process all positions in a sequence simultaneously. The choice of architecture determines what kinds of patterns the network can learn efficiently.
RAMageddon
The AI industry’s appetite for memory chips has created supply pressure across the entire semiconductor market. Training and running large models requires enormous quantities of high-bandwidth memory (HBM), and the companies building AI infrastructure are purchasing it in volumes that strain global manufacturing capacity.
The downstream effects are tangible. Console manufacturers have raised prices. Smartphone shipments have contracted. Enterprise buyers face longer lead times and higher costs for server memory. The root cause is straightforward: DRAM and HBM fabrication capacity is finite, and AI demand has grown faster than fab capacity can expand.
This is not a temporary blip. Building new semiconductor fabrication capacity takes years and billions in capital expenditure. AI demand shows no sign of plateauing. The supply constraint will persist until fabrication catches up, and the companies with the deepest pockets (the same hyperscalers driving the demand) are locking in supply contracts that further squeeze everyone else.
Training
Training is the process of feeding data through a neural network and adjusting its weights to minimise a loss function. Before training, the model is just noise: millions or billions of parameters set to random values. After training, those parameters encode the patterns present in the training data.
The process works through backpropagation. The model makes a prediction, the loss function measures how wrong it is, and the error signal propagates backwards through the network, adjusting each weight by a small amount in the direction that would have reduced the error. Repeat this across billions of data points, and the model converges on a set of weights that produce useful outputs.
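The predict, measure, adjust loop can be sketched on the smallest possible model, a single neuron fitting a straight line. The gradients here are worked out by hand for that one layer; backpropagation is the same chain rule applied layer by layer. The data, learning rate, and zero initialisation (instead of random values) are all assumptions made to keep the example deterministic.

```python
def train(data: list[tuple[float, float]], learning_rate: float = 0.05,
          epochs: int = 200) -> tuple[float, float]:
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            prediction = w * x + b
            error = prediction - target       # how wrong the model is
            # Gradients of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Synthetic data generated from y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = train(data)   # converges close to w = 3, b = 1
```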
The cost scales with three factors: model size (more parameters mean more computation per training step), dataset size (more data means more steps), and training duration (more passes over the data can improve quality, but with diminishing returns). Current frontier models train on trillions of tokens across thousands of GPUs for weeks. The electricity consumption alone rivals that of small towns.
Much of the discourse misses a key point: the model does not understand its training data in a human sense. It identifies statistical regularities. The difference matters when you are deciding how much to trust its outputs.
Tokens
Tokens are the atomic units of text that LLMs process. A token might be a word, a subword, a single character, or a punctuation mark, depending on the tokeniser. Most modern tokenisers use subword algorithms (like byte-pair encoding) that split common words into single tokens and rare words into multiple tokens.
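One merge step of byte-pair encoding is short enough to sketch. The three-word corpus and its counts are invented, and production tokenisers add byte-level handling and many thousands of merges on top of this.

```python
from collections import Counter


def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words: dict[tuple[str, ...], int],
               pair: tuple[str, str]) -> dict[tuple[str, ...], int]:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged: dict[tuple[str, ...], int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Invented corpus: each word split into characters, with its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
best = most_frequent_pair(corpus)    # ("w", "e"), seen 13 times
corpus = merge_pair(corpus, best)    # "lower" becomes l, o, we, r
```

Repeating this step builds a vocabulary in which frequent fragments become single tokens, which is why common words cost one token and rare words cost several.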
The practical implication is cost. AI providers charge per token, both input and output. A prompt that sends 1,000 tokens and receives 500 tokens in response costs you for 1,500 tokens. Reasoning models that generate internal chain-of-thought traces consume additional tokens that you may or may not be billed for, depending on the provider.
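The billing arithmetic is worth seeing with numbers. The prices below are hypothetical, and billing reasoning tokens at the output rate is an assumption, since providers differ on this.

```python
def query_cost_usd(input_tokens: int, output_tokens: int,
                   reasoning_tokens: int = 0,
                   usd_per_million_input: float = 3.0,
                   usd_per_million_output: float = 15.0) -> float:
    """Cost of one query; reasoning tokens are assumed billed as output."""
    billed_output = output_tokens + reasoning_tokens
    total = (input_tokens * usd_per_million_input
             + billed_output * usd_per_million_output)
    return total / 1_000_000


plain = query_cost_usd(1_000, 500)                                # about 0.0105 USD
with_reasoning = query_cost_usd(1_000, 500, reasoning_tokens=4_000)  # about 0.0705 USD
```

Under these assumed prices, the same 1,500 visible tokens cost almost seven times as much once a 4,000-token reasoning trace is billed alongside them.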
Context windows are also measured in tokens. A model with a 128k context window can process roughly 100,000 words in a single session. But context window size and effective context utilisation are different things. Most models degrade in performance when relevant information is buried in the middle of a very long context (a well-documented phenomenon called “lost in the middle”). The number on the specification sheet tells you the maximum. It does not tell you how well the model will use it.
Transfer learning
Transfer learning uses a model trained on one task as the starting point for training on a different but related task. Instead of initialising weights randomly, you start with weights that already encode useful representations of language, images, or whatever domain the original model was trained in.
This is the foundation of practically every commercial AI application built in the last five years. Very few companies train models from scratch. They take a pre-trained base model and adapt it (via fine-tuning or other methods) to their specific use case. The pre-trained weights provide a head start that reduces the data, compute, and time required to reach acceptable performance.
The limitation is domain distance. Transfer learning works well when the source and target tasks share structural similarities. A language model trained on English text transfers well to email classification. It transfers less well to protein folding. The further the target domain diverges from the source domain, the less useful the pre-trained weights become, and the more task-specific data you need to bridge the gap.
Weights
Weights are the numerical parameters that define a trained neural network. Every connection between neurons in the network carries a weight, and the set of all weights collectively determines the model’s behaviour: what it pays attention to in the input, how it transforms data through its layers, and what outputs it produces.
Before training, weights are initialised randomly. During training, they are adjusted through backpropagation to reduce prediction error. After training, the weights are fixed and define the model. When someone distributes a “model”, they are distributing a file containing these weights (along with the architecture specification needed to load them).
The number of weights in a model is its parameter count. GPT-4 is reported to use over a trillion parameters across a mixture-of-experts architecture. Llama 3 comes in variants from 8 billion to 405 billion parameters. More parameters generally means more capacity to encode patterns, but also more compute to train and run, more memory to store, and higher inference costs. The relationship between parameter count and actual capability is not linear. Architecture, training data quality, and training methodology all matter as much as, and sometimes more than, raw scale.
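Parameter count translates directly into storage. The sketch below estimates the size of the weights alone, using the Llama 3 sizes mentioned above. It ignores activations, KV cache, and file-format overhead, and the precision table is a simplification.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


small_fp16 = weight_memory_gb(8_000_000_000, "fp16")     # 16.0 GB
large_fp16 = weight_memory_gb(405_000_000_000, "fp16")   # 810.0 GB
large_int8 = weight_memory_gb(405_000_000_000, "int8")   # 405.0 GB
```

At 16-bit precision, the smallest variant fits on a single high-end GPU. The largest needs a multi-GPU server before it has processed a single token.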