AI Glossary

Your essential guide to Artificial Intelligence terminology. Explore definitions for machine learning, neural networks, LLMs, and more.

🤖 160 terms · 🧠 A to Z coverage

A (16 terms)

Accuracy

The proportion of correct predictions made by a classification model. It is calculated as (true positives + true negatives) divided by the total number of predictions. Although intuitive, accuracy can be misleading for class‑imbalanced datasets and should be considered alongside precision and recall.
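The formula above can be sketched in a few lines. This is an illustrative example only: the function name `accuracy` and the 990/10 class split are assumptions made up for the demonstration, chosen to show how a useless model can still score well on imbalanced data.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(true positives + true negatives) / total number of predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical imbalanced dataset: 990 negatives and 10 positives.
# A model that always predicts "negative" gets every negative right
# and misses every positive, yet still reaches 99% accuracy.
always_negative = accuracy(tp=0, tn=990, fp=0, fn=10)
print(always_negative)
```

The high score here says nothing about the positive class, which is why precision and recall are usually reported alongside accuracy.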

A/B Testing

A statistical technique for comparing two variants, A and B (or more, in multi‑variant extensions), to determine which performs better. In machine learning, it often compares model versions by randomly assigning users to different treatments and measuring a chosen metric, then assessing whether observed differences are statistically significant.

Ablation Study

An experimental method for assessing the importance of a model component or feature by removing it and retraining or evaluating the model. If performance degrades after removal, the component was important. Ablation studies help diagnose contributions of features, layers or preprocessing steps.

Activation Function

A mathematical function applied to the output of a neuron in a neural network to introduce non‑linearity. Common activation functions include sigmoid, tanh and ReLU. They allow neural networks to model complex relationships beyond linear combinations of inputs.
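As a minimal sketch, the three functions named above can be written with the standard library alone. The function names are chosen for illustration, and the two-branch form of `sigmoid` is an implementation choice made here to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Squashes x into (0, 1); the two branches avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    """Squashes x into (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Passes positive values through and clips negatives to zero."""
    return max(0.0, x)


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value))
```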

Active Learning

An approach where the model actively selects which data points should be labeled next. By querying an oracle (e.g., a human annotator) for labels on uncertain or informative samples, active learning aims to achieve comparable accuracy with fewer labeled examples.

AdaBoost

An ensemble algorithm that combines multiple weak learners (often decision stumps) into a strong classifier. AdaBoost iteratively adjusts the weights of training instances, focusing on those misclassified by previous learners, and aggregates the learners’ predictions through a weighted majority vote.

Adversarial Example

A specially crafted input designed to deceive a machine‑learning model into making incorrect predictions. Adversarial examples often involve small, imperceptible perturbations to legitimate data that cause neural networks to misclassify, highlighting vulnerabilities in model robustness.

Adversarial Machine Learning

The study of how to train models that are robust to malicious attacks and how to exploit weaknesses in models. It covers adversarial attacks, defenses, robustness testing and secure model deployment, particularly relevant in security‑sensitive applications such as malware detection and autonomous vehicles.

Agent

An autonomous or semi‑autonomous entity capable of perceiving its environment, making decisions and acting to achieve goals. AI agents can automate workflows, call tools or APIs, and coordinate multi‑step tasks. In multi‑agent systems, agents interact or cooperate to accomplish complex objectives.

AGI (Artificial General Intelligence)

A hypothetical form of artificial intelligence that can perform any intellectual task at or beyond human level across a broad range of domains. AGI remains speculative; current AI systems are considered “narrow” because they excel only in specific tasks.

Alignment

The process of ensuring that an AI system’s behaviour aligns with human values, intentions and safety requirements. Alignment research covers both technical techniques (e.g., reward modelling, Constitutional AI) and governance practices, aiming to avoid unintended or harmful model behaviour.

AlphaGo

An AI program developed by DeepMind that defeated top professional Go players, including Lee Sedol in 2016 and Ke Jie in 2017. It combines deep neural networks with Monte Carlo tree search and reinforcement learning. AlphaGo’s success showcased the potential of deep reinforcement learning for complex sequential decision problems.

Attention Mechanism

A neural network component that learns to focus on the most relevant parts of an input sequence when generating an output. Attention allows models such as transformers to weigh different tokens dynamically, improving performance on language, vision and multimodal tasks.
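One common formulation is scaled dot-product attention for a single query. The sketch below assumes that formulation; the definition above does not commit to a particular variant, and the helper names `softmax` and `attention` are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# The query matches the first key more closely, so the first value
# receives the larger weight.
weights, output = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
print(weights, output)
```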

Autoencoder

A type of neural network that learns to compress input data into a latent representation and then reconstruct the original input from this representation. Autoencoders are used for dimensionality reduction, denoising, pretraining and generative modelling.

Automatic Speech Recognition (ASR)

Technology that converts spoken language into text. ASR systems combine acoustic modelling, language modelling and decoding algorithms to interpret audio signals. Applications include voice assistants, transcription services and voice‑controlled interfaces.

AutoML (Automated Machine Learning)

Systems that automate the selection, training and tuning of machine‑learning models. AutoML tools search over model architectures, hyperparameters and preprocessing pipelines to produce competitive models with minimal human intervention, simplifying model development for non‑experts.

B (13 terms)

Backpropagation

An algorithm used to train neural networks by computing gradients of the loss function with respect to each weight via the chain rule. Gradients are propagated backward from the output to earlier layers, enabling weight updates through optimization methods like stochastic gradient descent.

Bagging (Bootstrap Aggregating)

An ensemble technique that trains multiple models on different bootstrap samples of the training data and combines their predictions, by averaging for regression or majority vote for classification, to reduce variance and improve robustness. Random forests apply bagging to decision trees and additionally sample a random subset of features at each split.

Batch Normalization

A technique that normalizes the inputs to each layer in a neural network during training. By maintaining zero mean and unit variance within mini‑batches, batch normalization stabilizes learning, allows higher learning rates and acts as a form of regularization.

Bayes’ Theorem

A fundamental result in probability theory that relates conditional probabilities: P(A | B) = [P(B | A) × P(A)] / P(B). In machine learning, Bayes’ theorem underpins Bayesian inference, allowing posterior probabilities to be updated as new data arrive.
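A worked example may help. The numbers below (a condition with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate) are hypothetical and chosen only for illustration; P(B) is expanded with the law of total probability.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(A | B) = P(B | A) * P(A) / P(B).

    P(B) is expanded as P(B | A) * P(A) + P(B | not A) * P(not A).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: P(A) = 0.01, P(B | A) = 0.90, P(B | not A) = 0.05.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p, 4))  # about 0.1538, i.e. 2/13
```

Even with a fairly accurate test, the posterior stays low because the prior is small.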

Bayesian Inference

A statistical approach that treats parameters as random variables with prior distributions. Upon observing data, the posterior distribution combines prior beliefs and likelihood. Bayesian methods support uncertainty quantification and regularization in complex models.

Beam Search

A heuristic search algorithm used in sequence prediction tasks (e.g., machine translation). Beam search explores only the top‑k most promising partial sequences at each step, trading off completeness for efficiency when finding high‑probability outputs.
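The procedure can be sketched against a toy next-token model. Everything about the model here is hypothetical: the `TOY_MODEL` table, the `</s>` end marker and the probabilities are invented so that a width-1 (greedy) search misses the best sequence while a width-2 search finds it.

```python
import math

EOS = "</s>"

# Hypothetical next-token distributions keyed by the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def toy_next_probs(prefix):
    """Any prefix not in the table ends the sequence with certainty."""
    return TOY_MODEL.get(prefix, {EOS: 1.0})


def beam_search(next_probs, beam_width, max_len):
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == EOS:
                candidates.append((tokens, score))  # finished: carry forward
                continue
            for token, prob in next_probs(tokens).items():
                if prob > 0:
                    candidates.append((tokens + (token,), score + math.log(prob)))
        # Keep only the top-k partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == EOS for tokens, _ in beams):
            break
    return beams


greedy = beam_search(toy_next_probs, beam_width=1, max_len=3)
wider = beam_search(toy_next_probs, beam_width=2, max_len=3)
print(greedy[0][0], math.exp(greedy[0][1]))
print(wider[0][0], math.exp(wider[0][1]))
```

With a width of 1 the search commits to "a" at the first step and ends with probability about 0.33; with a width of 2 it keeps "b" alive and finds the sequence with probability about 0.36.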

Benchmark

Standardized datasets, tasks or metrics used to evaluate and compare models. Benchmarks like ImageNet, GLUE, MMLU or HumanEval allow researchers to gauge progress across models and help identify strengths and weaknesses in specific capabilities.

Bias (Statistical)

Systematic deviation of an estimator’s expected value from the true parameter. Bias can arise from measurement errors, model assumptions or data imbalance. In AI ethics, bias refers to unfair or discriminatory outcomes stemming from imbalanced training data or biased design choices.

Bias–Variance Trade‑off

The fundamental tension between a model’s tendency to underfit (high bias) and overfit (high variance). Increasing model complexity reduces bias but may increase variance. Good generalization requires balancing these opposing sources of error through model choice and regularization.

Binary Classification

A classification problem with exactly two possible labels (e.g., spam versus not spam). Common evaluation metrics include accuracy, precision, recall, F1 score and ROC–AUC. Models include logistic regression, support vector machines and neural networks.

BLEU (Bilingual Evaluation Understudy)

A metric for evaluating machine‑translated text by comparing n‑gram overlap between the candidate translation and one or more reference translations. BLEU scores range from 0 to 1, with higher values indicating closer correspondence to the references.

Boltzmann Machine

A network of stochastic binary units that learn probability distributions over binary vectors. Boltzmann machines are used for generative modelling and representation learning; training is challenging due to the need for sampling from complex energy landscapes.

Bootstrap

A resampling method that draws samples with replacement from a dataset to estimate variability. Bootstrap methods provide confidence intervals, variance estimates and support techniques like bagging.
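As a minimal sketch, a percentile confidence interval for the mean can be estimated as follows. The percentile method is one of several bootstrap interval constructions and is an assumption here; the function name, the default of 1000 resamples and the sample data are illustrative.

```python
import random


def bootstrap_mean_ci(data, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Each resample draws n points from the data with replacement.
    means = sorted(
        sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


sample = [2, 4, 4, 5, 7, 9, 10, 12]
print(bootstrap_mean_ci(sample))
```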

C (21 terms)

Calibration (Model)

The degree to which a model’s predicted probabilities reflect true likelihoods. A well‑calibrated classifier outputs probability estimates that match observed frequencies (e.g., events predicted at 70 % occur approximately 70 % of the time). Techniques like Platt scaling or isotonic regression improve calibration.

Capsule Network

A neural network architecture that uses groups of neurons (capsules) to model hierarchical relationships and preserve spatial information. By encoding pose parameters, capsule networks aim to be more robust to rotations and translations than standard convolutional networks.

Categorical Cross‑Entropy

A loss function used for multi‑class classification. It measures the difference between the true one‑hot label distribution and the predicted probability distribution. Minimizing cross‑entropy encourages the model to assign high probability to the correct class.
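For a single example the loss reduces to the negative log of the probability assigned to the correct class. The sketch below assumes natural logarithms and clips probabilities at a small `eps` to avoid log(0); both are common conventions rather than part of the definition.

```python
import math


def categorical_cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """-sum(t_i * log(p_i)) over classes for one example."""
    return -sum(
        t * math.log(max(p, eps))
        for t, p in zip(true_one_hot, predicted_probs)
    )


# The true class is index 1. A confident correct prediction gives a
# lower loss than a hesitant one.
confident = categorical_cross_entropy([0, 1, 0], [0.05, 0.90, 0.05])
hesitant = categorical_cross_entropy([0, 1, 0], [0.30, 0.40, 0.30])
print(confident, hesitant)
```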

Chain‑of‑Thought Prompting

A prompting strategy for large language models that encourages the model to reason step‑by‑step before answering. By explicitly asking for intermediate reasoning, chain‑of‑thought prompts can improve accuracy on multi‑step problems such as arithmetic and logical reasoning, although the stated reasoning is not guaranteed to be correct.

Chatbot

A computer program designed to simulate human conversation through text or voice. Chatbots rely on natural language processing to parse user inputs and generate responses. They power virtual assistants, customer‑support bots and domain‑specific conversational agents.

Chunking (Text or Data)

The process of dividing long texts or sequences into smaller, manageable segments (chunks) for processing. In retrieval‑augmented systems, documents are split into chunks that can be embedded, indexed and retrieved based on relevance to a query.
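A simple word-based splitter illustrates the idea. Splitting on whitespace and overlapping adjacent chunks are assumptions made for this sketch; production systems often split by tokens, sentences or document structure instead.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of `chunk_size` words.

    Adjacent chunks share `overlap` words so that content cut at a
    boundary still appears intact in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
```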

Class Imbalance

A situation where certain classes appear much less frequently than others. Class imbalance can degrade model performance because many algorithms assume balanced data. Remedies include resampling, weighted loss functions and specialized algorithms.

Classification

The task of assigning input data to one of several discrete categories. Algorithms include logistic regression, decision trees, naïve Bayes, support vector machines and neural networks. Multi‑class and multi‑label classification extend this notion to multiple classes or simultaneous labels.

Clustering

An unsupervised learning technique that groups data points based on similarity. Common methods include k‑means, hierarchical clustering and DBSCAN. Clustering is used for customer segmentation, anomaly detection and exploratory data analysis.

CNN (Convolutional Neural Network)

A neural network architecture designed to process grid‑structured data such as images or audio spectrograms. CNNs use convolutional layers, pooling layers and nonlinear activations to learn spatial hierarchies of features. They dominate computer vision tasks like image classification and object detection.

Code Interpreter / Function Calling

An interface that allows language models to execute functions or code to augment their capabilities. By calling external tools or functions, models can perform operations like math, data retrieval or API requests, improving accuracy and reducing hallucination.

Cold Start

A challenge in recommendation systems where insufficient data are available about new users or items. Cold‑start methods leverage content features, demographic information or transfer learning to make initial recommendations.

Computational Graph

A directed graph representing the sequence of operations (nodes) and data dependencies (edges) in a computation. Frameworks like TensorFlow and PyTorch build computational graphs to automatically compute gradients via backpropagation.

Confusion Matrix

A tabular summary of model performance for classification problems. Rows correspond to true classes and columns to predicted classes. From the matrix one can derive metrics such as accuracy, precision, recall, specificity and F1 score.
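The layout described above (rows for true classes, columns for predicted classes) can be built by counting label pairs. The function name and the spam/ham labels are illustrative.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts examples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
for label, row in zip(["ham", "spam"], matrix):
    print(label, row)
```

The diagonal holds the correct predictions, so accuracy is the sum of the diagonal divided by the sum of all cells.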

Context Window

In large language models, the maximum number of tokens the model can consider at once when generating a response. A longer context window allows the model to maintain “memory” across longer conversations or documents.

Contrastive Learning

A self‑supervised learning technique where the model learns to pull related examples closer in representation space while pushing unrelated examples apart. Methods like SimCLR and CLIP use contrastive objectives to learn high‑quality embeddings without labeled data.

Convergence

In optimization, the point at which further training or iterations produce negligible changes in the loss function. Convergence criteria guide stopping conditions and help diagnose issues like vanishing gradients or overfitting.

Cost Function

Another term for loss function: a measure of how well a model’s predictions match the ground truth. Optimization algorithms aim to minimize the cost function by adjusting model parameters.

Cross‑Entropy Loss

A loss function commonly used for classification that measures the difference between the true probability distribution and the predicted distribution. Binary cross‑entropy applies to two classes; categorical cross‑entropy applies to multi‑class problems.

Cross‑Validation

A technique for assessing how a model generalizes to unseen data. The dataset is split into k folds; the model is trained on k − 1 folds and validated on the remaining fold. Repeating this so that each fold serves once as the validation set, then averaging the scores, gives a more reliable performance estimate than a single train/validation split.
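The fold splitting can be sketched as an index generator. This version does not shuffle, which is an assumption made to keep the output predictable; in practice data are usually shuffled (or stratified by class) before splitting.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=6, k=3):
    print("train:", train, "validation:", validation)
```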

Curriculum Learning

A training strategy where models are presented with easier examples first, gradually introducing harder examples. This mimics human learning and can lead to faster convergence and improved performance, especially in reinforcement learning and language modelling.

D (12 terms)

Data Augmentation

Techniques for artificially expanding a dataset by applying transformations like rotations, flips, noise addition or cropping. Data augmentation increases diversity, helps prevent overfitting and improves model generalization, especially in computer vision and audio tasks.

Data Drift

Changes in the statistical properties of input data over time that cause a model’s performance to degrade. Monitoring for data drift is a key MLOps task; techniques include the population stability index (PSI), drift detection algorithms and adaptive re‑training.

Data Imputation

The process of filling in missing values within a dataset. Methods range from simple strategies (mean or median substitution) to sophisticated approaches like k‑nearest neighbours, multiple imputation or model‑based imputation.

Data Leakage

The unintentional use of information in training that would not be available at inference time. Leakage can inflate performance metrics and lead to poor generalization. Careful pipeline design and proper cross‑validation prevent data leakage.

Data Pipeline

A series of steps that process raw data into a usable form for modelling. Pipelines include data ingestion, cleaning, transformation, feature extraction and storage. Automating pipelines is part of MLOps.

Deep Learning

A subfield of machine learning that uses neural networks with many layers to model complex patterns. Deep learning underlies breakthroughs in image recognition, speech processing, natural language understanding and generative modelling.

Deep Reinforcement Learning

An area combining deep neural networks with reinforcement learning. Agents learn policies or value functions by interacting with an environment and receiving rewards. Deep RL powers systems like AlphaGo, game‑playing bots and robotics controllers.

Denoising Autoencoder

An autoencoder trained to reconstruct clean inputs from noisy versions. By learning to remove noise, denoising autoencoders build robust representations and serve as a pretraining technique for feature extraction.

Density Estimation

The task of modelling the probability distribution of a dataset. Methods include kernel density estimation, Gaussian mixture models and autoregressive models. Density estimates are used for anomaly detection, compression and generative modelling.

Diffusion Model

A generative model that gradually adds noise to data and then learns to reverse this process. Trained to denoise, diffusion models can generate high‑quality images, text or other modalities by sampling from the learned reverse diffusion process.

Discriminative Model

A model that learns the decision boundary directly by modelling P(y | x), the conditional probability of labels given inputs. Examples include logistic regression, support vector machines and neural networks. Discriminative models contrast with generative models, which model the joint distribution P(x,y).

Dropout

A regularization technique where random units or connections in a neural network are temporarily “dropped” during training. Dropout prevents co‑adaptation of neurons, reduces overfitting and acts as an ensemble of thinned networks.
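A minimal sketch of unit dropout on a list of activations follows. It uses the "inverted dropout" convention, in which kept activations are scaled by 1 / (1 − p) during training so that nothing needs rescaling at inference time; that convention is an assumption here and is not stated in the definition above.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Zero each activation with probability `drop_prob` during training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # inference: pass everything through
    keep_prob = 1.0 - drop_prob
    # Kept units are scaled up so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


layer_output = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(layer_output, drop_prob=0.5, rng=random.Random(0)))
print(dropout(layer_output, drop_prob=0.5, training=False))
```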

E (9 terms)

Early Stopping

A regularization method that stops training when a monitored metric (e.g., validation loss) ceases to improve. Early stopping prevents overfitting by halting training before the model memorizes noise.
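The stopping rule is often implemented with a "patience" counter. The sketch below assumes that variant: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The parameter names and the loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 0-based epoch at which training would stop.

    Assumes `val_losses` is a non-empty list with one validation loss
    per epoch. An epoch counts as an improvement only if it beats the
    best loss so far by more than `min_delta`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: trained to the end


losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.50]
print(early_stopping_epoch(losses, patience=2))
```

In this made-up run the rule stops at epoch 4 and never sees the later improvement at epoch 5, which illustrates the trade-off in choosing a patience value.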

Edge AI

The deployment of AI models directly on edge devices such as smartphones, IoT sensors or embedded hardware. Edge AI reduces latency, preserves data privacy and can operate without continuous internet connectivity. Hardware accelerators like TPUs or NPUs support efficient edge inference.

Embedding

A vector representation of discrete items (words, images, nodes) in a continuous space such that similar items are mapped to nearby vectors. Embeddings enable models to capture semantic relationships and are fundamental to retrieval‑augmented systems, word2vec and graph neural networks.
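Nearness between embeddings is commonly measured with cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings are learned and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: "cat" and "dog" point in similar directions,
# "car" does not.
cat, dog, car = [0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9]
print(cosine_similarity(cat, dog), cosine_similarity(cat, car))
```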

Encoder–Decoder Architecture

A neural architecture consisting of an encoder that maps input data to a latent representation and a decoder that generates outputs from this representation. Used in machine translation, summarization and autoencoders, it allows variable‑length inputs and outputs.

Ensemble Method

Combining multiple models to improve predictive performance and robustness. Techniques include bagging, boosting and stacking. Ensembles often outperform single models by averaging out errors and capturing diverse hypotheses.

Epoch

One complete pass through the entire training dataset during model training. Multiple epochs are often required for convergence; too many epochs can lead to overfitting if not mitigated by early stopping or regularization.

Evaluation Metric

Quantitative criteria for assessing model performance. Common metrics include accuracy, precision, recall, F1 score, ROC–AUC, BLEU and perplexity. The choice of metric depends on the task, dataset imbalance and goals.

Evolutionary Algorithm

A family of optimization methods inspired by natural selection. Candidate solutions are encoded as chromosomes, and operators like mutation, crossover and selection evolve a population towards better fitness. Genetic algorithms and genetic programming are examples.

Explainable AI (XAI)

A set of techniques and tools that provide transparency into how AI models make decisions. Explainability methods include feature importance, saliency maps, SHAP values and counterfactual explanations; they enable trust, compliance and debugging.

F (10 terms)

Fairness

The principle that an AI system should produce equitable outcomes across different demographic groups. Fairness metrics include demographic parity, equal opportunity and equalized odds. Techniques such as re‑sampling, adversarial debiasing and fairness constraints help mitigate biases.

Federated Learning

A distributed learning paradigm where models are trained across multiple devices or organizations without sharing raw data. Each participant computes local updates; a central server aggregates them to improve a global model, preserving privacy.

Feature

A measurable variable used as input to a machine‑learning model. Features can be numerical (e.g., age), categorical (e.g., colour) or derived (e.g., embeddings). Feature engineering involves creating, transforming and selecting features to improve model performance.

Feature Engineering

The process of creating informative input variables from raw data. It may involve transformation (e.g., log scaling), interaction terms, encoding categorical variables and deriving domain‑specific metrics. Good feature engineering can markedly improve model accuracy.

Feature Importance

A measure of how much each feature contributes to a model’s predictions. Techniques include Gini importance, permutation importance and SHAP values. Understanding feature importance aids interpretability and identifies potential biases.

Feature Store

Centralized infrastructure to store, manage and serve features consistently across training and inference. Feature stores enable feature reuse, versioning and governance in production ML pipelines and are a core component of MLOps.

Few‑Shot Learning

Learning a new task from a small number of examples by leveraging prior knowledge or meta‑learning. Few‑shot techniques allow rapid adaptation and are widely used in LLMs for tasks like classification or instruction following.

F1 Score

The harmonic mean of precision and recall, computed as 2 × (precision × recall)/(precision + recall). The F1 score balances false positives and false negatives and is useful when classes are imbalanced.
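The formula can be computed directly from confusion-matrix counts. Returning 0.0 when a denominator is zero is a convention assumed for this sketch; the counts in the example are made up.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because it is a harmonic mean, the F1 score is pulled towards the lower of the two values.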

Fine‑Tuning

Adapting a pre‑trained model to a specific downstream task by continuing training on a smaller, task‑specific dataset. Fine‑tuning updates the model’s parameters but starts from a well‑trained initialization, enabling better performance with less data.

Foundation Model

Large AI models trained on broad data using self‑supervised learning. They can be adapted (via fine‑tuning or prompting) to perform numerous downstream tasks. Examples include GPT‑4, PaLM and Llama.

G (8 terms)

GAN (Generative Adversarial Network)

A generative model consisting of a generator network that produces synthetic data and a discriminator network that tries to distinguish real data from synthetic data. The two networks are trained against each other in a minimax game, which pushes the generator to create increasingly realistic samples.

Generalization

The ability of a model to perform well on new, unseen data. A well‑generalized model captures underlying patterns rather than memorizing the training data. Techniques like regularization, cross‑validation and complexity control help improve generalization.

Generative AI

Models that produce new content (text, images, audio) by learning patterns from training data. Techniques include autoregressive models (GPT), diffusion models and VAEs. Generative AI powers chatbots, text‑to‑image tools and music generation.

Gradient Descent

An optimization algorithm that iteratively updates model parameters in the opposite direction of the gradient of the loss function. Variants include stochastic gradient descent (SGD), mini‑batch GD, momentum and adaptive methods like Adam.
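A one-dimensional example shows the update rule. The quadratic f(x) = (x − 3)² and the learning rates are chosen purely for illustration; this is plain (full-batch) gradient descent without momentum or adaptive step sizes.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - learning_rate * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def grad_f(x):
    """Gradient of f(x) = (x - 3)**2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)


converged = gradient_descent(grad_f, x0=0.0, learning_rate=0.1, steps=100)
diverged = gradient_descent(grad_f, x0=0.0, learning_rate=1.1, steps=50)
print(converged)  # very close to 3
print(diverged)   # far from 3: the step size is too large
```

For this function each step multiplies the distance to the minimum by (1 − 2 × learning rate), so a rate of 0.1 shrinks the error while a rate of 1.1 makes it grow.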

Graph Neural Network (GNN)

A neural network designed to operate on graph‑structured data by aggregating information from neighbouring nodes. GNNs excel at tasks like social‑network analysis, molecule property prediction and recommendation systems.

Grounding

In retrieval‑augmented generation, the process of linking generated content to specific, verifiable sources. Grounding ensures that model outputs are supported by retrieved evidence, reducing hallucination and improving factual accuracy.

GPU (Graphics Processing Unit)

A specialized processor originally designed for graphics rendering but widely used for deep learning because of its ability to perform parallel computations. GPUs accelerate training and inference for large neural networks.

Guardrails

Controls designed to prevent AI systems from producing harmful or unsafe outputs. Guardrails include input validation, content filters and rule‑based constraints that restrict certain behaviour; red‑teaming is used to probe how well they hold up.

H (5 terms)

Hallucination

The phenomenon in which a language model generates content that is factually incorrect, unsupported by the input or outright fabricated. Hallucinations arise due to limited context, training data gaps or model misalignment.

Hidden Layer

Layers of neurons between the input and output layers in a neural network. Hidden layers learn hierarchical representations of the data. Their depth and width influence model capacity and computational cost.

Human-in-the-Loop

A model development process in which human feedback is integrated during training or inference. It can improve model accuracy, safety and alignment, and is often used in active learning and in reinforcement learning from human feedback (RLHF).

Hyperparameter

A configuration value that defines model structure or training behaviour (e.g., learning rate, number of layers, regularization strength). Hyperparameters are not learned from data and must be set through tuning or domain knowledge.

Hyperparameter Tuning

The process of selecting optimal hyperparameters using search strategies like grid search, random search, Bayesian optimization or population‑based methods. Proper tuning can substantially improve model performance.

I (3 terms)

Inference (Model)

The process of using a trained model to make predictions on new, unseen data. Inference may occur in batch mode or in real time and often requires optimizations for latency and throughput.

In‑Context Learning

The ability of large language models to learn new tasks from prompts containing examples, without updating model weights. By conditioning on demonstrations in the context window, models adapt their behaviour to perform the desired task.

Interpretability

The degree to which a model’s internal mechanisms and decisions can be understood by humans. Interpretability methods include model simplification, feature importance analysis and surrogate models. Higher interpretability supports trust and debugging.

K (3 terms)

Knowledge Base

A structured repository of facts or rules used by AI systems. Knowledge bases support reasoning, answer retrieval and retrieval‑augmented generation. Examples include Wikidata and enterprise knowledge graphs.

Knowledge Distillation

A technique for transferring knowledge from a larger “teacher” model to a smaller “student” model. The student learns to match the teacher’s output distributions, achieving similar performance with fewer parameters and reduced computational cost.

Knowledge Graph

A type of knowledge base represented as a graph of entities and their relationships. Knowledge graphs enable semantic search, recommendation systems and integration with language models through retrieval augmentation.

L (6 terms)

Large Language Model (LLM)

A deep learning model, typically based on the transformer architecture, that is trained on massive text datasets and can recognize, summarize, translate, predict and generate text and other content. Examples include GPT-4, Claude, and Llama.

Latent Space

An abstract representation where complex, high‑dimensional data are encoded into a lower‑dimensional manifold. Latent spaces allow models to interpolate, sample and perform arithmetic on abstract features (e.g., vector arithmetic in word embeddings).

Learning Rate

A hyperparameter that controls the step size in gradient‑based optimization. Too high a learning rate can cause divergence; too low slows convergence. Adaptive learning rate methods like Adam adjust step sizes per parameter.

LIME (Local Interpretable Model‑agnostic Explanations)

An explanation technique that fits a simple, interpretable model locally around a specific prediction to approximate how a complex model behaves. LIME provides human‑understandable reasons for individual predictions.

Long Short‑Term Memory (LSTM)

A recurrent neural network architecture designed to capture long‑range dependencies by using gating mechanisms (input, forget and output gates). LSTMs excel at processing sequential data like language, speech and time series.

Loss Function

A quantitative measure of model error used to guide learning. Examples include mean squared error, cross‑entropy and hinge loss. Optimization algorithms aim to minimize the loss function.

M (6 terms)

Machine Learning (ML)

The study of algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed. ML encompasses supervised, unsupervised, semi‑supervised and reinforcement learning, and underpins modern AI applications.

Meta Prompt / System Prompt

A hidden set of instructions that guides a language model’s behaviour before user input is processed. System prompts define tone, role or constraints (e.g., “You are a helpful assistant”) and influence subsequent responses.

Mixture of Experts

A model architecture that combines multiple specialized submodels (experts) and a gating network that routes inputs to appropriate experts. Mixture‑of‑experts models enable scaling of parameter counts while controlling computational cost.

Model Card

A standardized document describing a machine‑learning model’s intended use, performance, limitations, ethical considerations and training data characteristics. Model cards support transparency, accountability and regulatory compliance.

Model Drift

Degradation in model performance over time due to changes in data distribution or user behaviour. Detecting drift triggers retraining or model updates in production pipelines.

Multimodal Model

Models that process and integrate multiple data types (e.g., text, images, audio). Multimodal models enable tasks like image captioning, visual question answering and audio‑visual speech recognition.

N (3 terms)

Natural Language Processing (NLP)

A field of AI focused on enabling machines to understand, interpret and generate human language. NLP techniques include tokenization, parsing, named‑entity recognition, sentiment analysis, machine translation and summarization.

Neural Architecture Search (NAS)

Automated methods for discovering optimal neural network structures. NAS explores combinations of layers, connections and hyperparameters using search strategies like reinforcement learning, evolutionary algorithms or gradient‑based methods.

Neural Network

A computational model composed of interconnected layers of artificial neurons. Neural networks learn complex functions by adjusting weights and biases through training. Architectures include feedforward, recurrent, convolutional and transformer networks.

O (4 terms)

One‑Hot Encoding

A method for representing categorical variables as binary vectors. Each category corresponds to a vector with a single “1” at the index representing that category and “0”s elsewhere. One‑hot encoding enables categorical data to be used in models that require numerical inputs.
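A minimal sketch of the encoding follows. Sorting the categories to fix the column order is an assumption made here so that the output is deterministic; any fixed order works as long as it is reused at inference time.

```python
def one_hot(values, categories=None):
    """Return (categories, vectors) with one binary vector per value."""
    if categories is None:
        categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # a single 1 at the category's position
        vectors.append(vector)
    return categories, vectors


categories, vectors = one_hot(["red", "green", "red", "blue"])
print(categories)
print(vectors)
```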

OpenAI

An AI research organization and technology company known for developing large language models like GPT-4 and image generation models like DALL-E. It plays a significant role in the current generative AI boom.

Optimizer

An algorithm that updates model parameters to minimize the loss function. Optimizers include gradient descent, Adam, RMSprop and Adagrad. Choice of optimizer affects convergence speed and stability.

Overfitting

When a model learns patterns specific to the training data, including noise, resulting in poor generalization. Overfitting is mitigated through regularization, cross‑validation, simpler models, early stopping and data augmentation.

P8 terms

Parameter

A quantity in a model that is learned from data during training. Parameters include weights and biases in neural networks. They differ from hyperparameters, which are set externally.

Perceptron

One of the earliest artificial neural networks, consisting of a linear combination of inputs followed by a threshold activation. Although limited to linearly separable problems, the perceptron paved the way for modern deep learning.
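
The classic perceptron learning rule can be sketched in a few lines. The example below assumes two binary inputs and trains on the logical AND function, which is linearly separable; the helper names, learning rate and epoch count are illustrative choices.

```python
def predict(weights, bias, x):
    """Threshold activation applied to a linear combination of the inputs."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge weights toward the target whenever a sample is misclassified."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Logical AND is linearly separable, so the rule converges on it.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])
```

The same procedure would never settle on XOR, which is not linearly separable.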

Perplexity

A measure of how well a probabilistic language model predicts a sample. It is the exponential of the average cross‑entropy per token, so lower perplexity indicates better performance. Perplexity is widely used to evaluate language models.
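
To make the relationship to cross‑entropy concrete, here is a small sketch that assumes we already have the probability the model assigned to each actual token (the list `token_probs` is a made-up input, not real model output).

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that is always 25% sure behaves like a uniform guess over 4 options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

A perplexity of k can be read loosely as the model being as uncertain as a uniform choice among k tokens.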

Pooling

Down‑sampling operations (max, average or global) applied in convolutional neural networks to reduce spatial dimensions and aggregate local features. Pooling improves translational invariance and reduces computational load.
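
A minimal sketch of max pooling, assuming a 2×2 window with stride 2 over a feature map stored as a list of rows; trailing rows or columns that do not fill a window are simply dropped here, which is one of several possible conventions.

```python
def max_pool_2x2(feature_map):
    """Down-sample a 2-D feature map by taking the max of each 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 1, 1],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```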

Precision

In classification, the proportion of predicted positives that are truly positive: TP / (TP + FP). Precision quantifies how often positive predictions are correct and is important when false positives are costly.
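
The formula can be checked on a toy example. The label lists below are invented for illustration, with 1 as the positive class.

```python
def precision(y_true, y_pred):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
# Three positive predictions, two of them correct.
print(precision(y_true, y_pred))
```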

Pretraining

Training a model on a large, general dataset using self‑supervised or unsupervised objectives before fine‑tuning on a specific task. Pretraining enables models to learn generic representations that transfer to downstream tasks.

Prompt Engineering

The art and science of crafting prompts that guide language models to produce desired outputs. It involves setting roles, providing context, specifying formats and controlling sampling parameters, as well as guarding against prompt injection and jailbreak attacks.

Prompt Injection

A security vulnerability where malicious instructions are inserted into user inputs or retrieved content to override or subvert a model’s system prompt. Prompt injection attacks can cause LLMs to ignore safety guardrails or leak sensitive data. Mitigation includes input sanitization and strict prompt isolation.

R7 terms

RAG (Retrieval‑Augmented Generation)

A technique that combines a language model with an external knowledge retrieval mechanism. Documents relevant to the query are retrieved, typically through embedding‑based similarity search, and provided as context to the language model, grounding responses and reducing hallucination.
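
The retrieve-then-generate flow can be sketched without any model at all. The example below is a toy: it ranks documents by simple word overlap as a stand-in for a real retriever, and stops at building the prompt that would be sent to a language model. The function names and the prompt layout are assumptions made for illustration.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (a stand-in for a real retriever)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_prompt(query, documents):
    """Place the retrieved text ahead of the question so the model can ground its answer."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "the perceptron is limited to linearly separable problems",
    "softmax converts scores into a probability distribution",
]
prompt = build_prompt("what does softmax do", docs)
print(prompt)
```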

Recall

In classification, the proportion of true positives correctly identified out of all actual positives: TP / (TP + FN). Recall measures how well the model captures all positive instances.
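
A short sketch of the recall formula on invented labels, with 1 as the positive class:

```python
def recall(y_true, y_pred):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
# Four actual positives, two of them found.
print(recall(y_true, y_pred))  # 0.5
```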

Recommender System

An algorithmic system that suggests items (products, media) to users based on preferences, behaviour or similarity to other users/items. Techniques include collaborative filtering, content‑based filtering and hybrid methods.

Regularization

Techniques that discourage overly complex models to improve generalization. Common forms include L1 and L2 weight penalties, dropout, early stopping and data augmentation. Regularization mitigates overfitting and reduces variance.
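
As a rough illustration of weight penalties, the sketch below adds an L2 term to a given data loss. The `strength` coefficient and the sample numbers are arbitrary assumptions; real frameworks differ in how they scale the penalty.

```python
def l1_penalty(weights, strength):
    """Sum of absolute weights; tends to push some weights to exactly zero."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Sum of squared weights; shrinks large weights more strongly."""
    return strength * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, strength=0.01):
    """Total objective = fit to the data + penalty on model complexity."""
    return data_loss + l2_penalty(weights, strength)

weights = [0.5, -2.0, 1.0]
loss = regularized_loss(1.2, weights, strength=0.1)
print(loss)
```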

Reinforcement Learning (RL)

A learning paradigm where an agent interacts with an environment, receives rewards and learns to select actions that maximize expected cumulative reward. RL encompasses policy iteration, value iteration, Q‑learning and actor–critic methods.
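
Of the methods listed, Q‑learning has perhaps the most compact update rule. The sketch below applies one update to a hand-written Q‑table; the state and action names, the learning rate `alpha` and the discount `gamma` are illustrative assumptions.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")
print(new_value)
```

Here the target is 1.0 + 0.9 × 2.0 = 2.8, and the estimate moves halfway toward it.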

Recurrent Neural Network (RNN)

A class of neural networks designed for sequential data. An RNN maintains a hidden state that is updated at each time step from the current input and the previous state, allowing information from earlier in the sequence to influence later outputs.
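
The recurrence can be sketched with a single scalar hidden unit that is fed its own previous value at every step. The weights `w_x` and `w_h` are arbitrary fixed values chosen for illustration; in practice they would be learned.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias)."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states

# Only the first input is non-zero, yet later states still carry its trace.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```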

Reinforcement Learning from Human Feedback (RLHF)

A machine learning approach where a reward model is trained based on human feedback, and this reward model is then used to optimize an agent's policy via reinforcement learning. It is notably used to align large language models like GPT-4.

S7 terms

Self‑Attention

A mechanism in which each position in a sequence attends to all positions, weighting them based on relevance. Self‑attention enables transformers to capture long‑range dependencies and is foundational to modern language models.
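
A stripped-down sketch of scaled dot-product self‑attention follows. It assumes identity projections (queries, keys and values are all the raw input vectors), whereas real transformers learn separate projection matrices; the input sequence is made up.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Each position scores every position, then takes a weighted mix of their vectors."""
    dim = len(sequence[0])
    outputs, all_weights = [], []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, sequence))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sequence)
print(weights[0])
```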

Self‑Supervised Learning

Learning representations from unlabeled data by solving auxiliary tasks (e.g., predicting masked tokens, image rotations). Self‑supervised learning reduces reliance on labeled data and often precedes supervised fine‑tuning.

Sequence‑to‑Sequence (Seq2Seq)

A model architecture that maps an input sequence to an output sequence, often using encoder–decoder networks with attention. Seq2Seq models power machine translation, summarization and text generation.

Softmax Function

A function that converts a vector of real numbers into a probability distribution over classes. The softmax is differentiable and is used in the output layer of multi‑class classifiers and attention mechanisms.
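
A minimal sketch of the function is shown below. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Map real-valued scores to probabilities that sum to 1."""
    peak = max(logits)  # shift for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```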

Stochastic Gradient Descent (SGD)

An iterative optimization method that updates parameters using the gradient computed on a single sample or mini‑batch rather than the entire dataset. SGD reduces computation per iteration and introduces noise that can help escape local minima.
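
As an illustration, the sketch below fits the slope of a line y = w·x with one-sample updates on a squared-error loss. The data, learning rate and epoch count are assumptions picked so the toy problem converges.

```python
import random

def sgd_fit_slope(xs, ys, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x by updating w after every individual sample."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a fresh random order each epoch
        for i in indices:
            # Gradient of (w * x - y)^2 with respect to w, for one sample only.
            gradient = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w -= learning_rate * gradient
    return w

w = sgd_fit_slope([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(w)
```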

Support Vector Machine (SVM)

A supervised learning algorithm that finds the hyperplane separating classes with the largest margin in feature space. With kernel functions, SVMs handle nonlinear boundaries. SVMs also support regression (SVR).

Swarm Intelligence

Decentralized, self‑organizing collective behaviour inspired by biological swarms (ants, bees, birds). Algorithms like particle swarm optimization and ant colony optimization solve optimization problems by leveraging the collective intelligence of many agents.

T8 terms

Temperature (Sampling)

A parameter used during sampling from language models that controls randomness. Lower temperatures produce more deterministic outputs by sharpening probability distributions; higher temperatures encourage diversity by flattening distributions. Temperature complements top‑k or top‑p sampling.
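
The sharpening and flattening effect can be seen by dividing the logits by the temperature before applying softmax. This is a sketch with made-up logits; the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform
print(sharp, flat)
```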

Tensor

A multi‑dimensional array of numerical values. Tensors generalize scalars (0‑D), vectors (1‑D) and matrices (2‑D) to higher dimensions and are the primary data structure in deep learning frameworks.
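
To illustrate the idea of dimensionality, the sketch below reports the shape of tensors represented as regularly nested Python lists. Deep learning frameworks provide their own tensor types; the nested-list representation here is only an assumption made for the example.

```python
def shape(tensor):
    """Return the dimensions of a regularly nested list as a tuple."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]  # descend one level; assumes all siblings share this shape
    return tuple(dims)

scalar = 3.0                        # 0-D
vector = [1.0, 2.0, 3.0]            # 1-D
matrix = [[1, 2], [3, 4], [5, 6]]   # 2-D
cube = [[[0, 0], [0, 0]]]           # 3-D
print(shape(scalar), shape(vector), shape(matrix), shape(cube))
```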

TensorFlow

An open‑source machine‑learning framework developed by Google. TensorFlow provides tools for building, training and deploying models, including automatic differentiation, Keras integration and support for distributed training.

Token

A basic unit of input processed by a language model. Tokens can be words, subwords or characters depending on the tokenizer. Tokenization splits text into tokens and maps them to numeric IDs.
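
A toy word-level tokenizer shows the split-then-map step. Production tokenizers usually work on subwords; the whitespace splitting and the `<unk>` placeholder for unseen words are simplifying assumptions.

```python
def build_vocab(texts):
    """Assign each distinct lowercase word an integer ID, reserving 0 for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Split text into tokens and map each one to its numeric ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]

vocab = build_vocab(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)
print(vocab, ids)
```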

Training

The process of adjusting model parameters using labeled or unlabeled data to minimize a loss function. Training involves forward propagation, loss computation, backpropagation and parameter updates. Adequate training requires sufficient data, proper hyperparameters and regularization.

Transformer

A neural network architecture based solely on attention mechanisms, without recurrent or convolutional operations. Transformers process input sequences in parallel, enabling efficient modelling of long‑range dependencies. They form the basis of models like BERT, GPT and T5.

Transfer Learning

Leveraging knowledge learned from one task or domain to improve performance on another. Transfer learning often involves fine‑tuning a pre‑trained model on a related task, reducing data requirements and training time.

Turing Test

A test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Proposed by Alan Turing in 1950, it involves a human evaluator holding natural language conversations with both a human and a machine and trying to tell which is which.

U2 terms

Underfitting

When a model is too simple to capture underlying patterns in the data, resulting in poor performance on both training and test sets. Underfitting can be addressed by increasing model complexity or feature engineering.

Unsupervised Learning

Learning patterns from unlabeled data. Unsupervised learning includes clustering, dimensionality reduction, density estimation and representation learning. It uncovers structure without explicit target labels.

V4 terms

Variational Autoencoder (VAE)

A generative model that learns a probabilistic latent space. VAEs encode inputs into a distribution over latent variables and decode samples back to the data space. They enable controlled generation and representation learning.

Vector Database

A database optimized for storing and searching high‑dimensional vectors. Vector databases support similarity search by indexing embeddings, enabling fast retrieval for recommendation and retrieval‑augmented generation.
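
The core operation, similarity search over stored embeddings, can be sketched with a brute-force scan. Real vector databases use approximate indexes to avoid comparing against every vector; the three-dimensional embeddings below are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, top_k=1):
    """Return the keys of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        index,
        key=lambda key: cosine_similarity(query, index[key]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical embeddings keyed by item name.
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.1, 0.9],
}
result = nearest([1.0, 0.0, 0.0], index, top_k=2)
print(result)  # ['cat', 'dog']
```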

Vector Embedding

Another term for embedding: a vector representation that captures relationships between items. Vector embeddings underpin retrieval, clustering and nearest‑neighbour search.

Vision Transformer (ViT)

A transformer architecture adapted for image tasks. Patches of the image are treated as tokens, and self‑attention processes spatial relationships. ViT models achieve competitive results in image classification and segmentation.
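
The patch-to-token step can be sketched as follows, assuming a single-channel image stored as a list of rows whose sides are divisible by the patch size. A real ViT would then project each flattened patch with a learned linear layer and add position information, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Cut a 2-D image into non-overlapping square patches, each flattened to a list."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows, patch_size):
        for c in range(0, cols, patch_size):
            patch = [
                image[r + i][c + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 "image" whose pixel values are just their row-major positions.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
print(patches)
```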

W2 terms

Weight

A parameter in a neural network that scales the input signal. Adjusting weights during training allows the network to learn complex functions. Weights are updated through gradient‑based optimization.

Word Embedding

An embedding specifically for words, mapping each word to a dense vector. Word embeddings like word2vec, GloVe and fastText capture semantic relationships and improve performance in NLP tasks.

X1 term

XAI (Explainable AI)

A field concerned with making AI models transparent. XAI techniques seek to make model decisions understandable to humans through explanations, feature attributions and interpretable proxies.

Y1 term

YOLO (You Only Look Once)

An object‑detection algorithm that predicts bounding boxes and class probabilities in a single forward pass through a neural network. YOLO is known for real‑time performance and has evolved into versions like YOLOv7 and YOLOv8.

Z1 term

Zero‑Shot Learning

The ability of a model to perform tasks it was not explicitly trained on by leveraging general knowledge and representations. Zero‑shot capabilities are common in large language models and vision–language models.