Technology & Computing
AI & Machine Learning
Concepts, milestones, and key ideas in artificial intelligence.
History of AI
- Dartmouth Conference (1956) — the founding event of AI as a field; John McCarthy coined the term “artificial intelligence”; attendees included Marvin Minsky, Claude Shannon, and Nathaniel Rochester.
- Turing Test (1950) — proposed by Alan Turing in “Computing Machinery and Intelligence”; an interrogator communicates by text with a human and a machine; if the interrogator cannot reliably tell them apart, the machine is said to pass; Turing called it the “Imitation Game.”
- Symbolic AI / GOFAI — early paradigm (“Good Old-Fashioned AI”) representing knowledge as explicit rules and symbols; dominated through the 1970s.
- Logic Theorist (1956) — program by Newell, Shaw, and Simon that proved theorems from Principia Mathematica; often called the first AI program.
- General Problem Solver (1957) — Newell and Simon; used means-ends analysis; demonstrated general-purpose problem solving.
- Perceptron (Mark I, 1958) — Frank Rosenblatt built the Mark I Perceptron machine at Cornell; inspired by McCulloch-Pitts neurons; early optimism about machine intelligence.
- Minsky & Papert — Perceptrons (1969) — Marvin Minsky and Seymour Papert’s book proved the single-layer perceptron cannot solve the XOR problem (not linearly separable); widely credited with triggering the first AI winter by damping funding for neural network research.
- DENDRAL (1960s) — Stanford expert system by Edward Feigenbaum, Joshua Lederberg, and Bruce Buchanan; identified organic compounds from mass spectrometry; often cited as the first expert system.
- MYCIN (1972–74) — Stanford expert system by Ted Shortliffe; diagnosed bacterial blood infections and recommended antibiotics; demonstrated expert-level performance; influential model for later knowledge-based systems.
- First AI Winter (late 1970s) — funding cuts following overpromised results and the 1973 Lighthill Report (UK); DARPA scaled back investment.
- Expert systems (1980s) — knowledge-based programs encoding domain rules; examples include MYCIN (medical diagnosis) and XCON (computer configuration); commercial wave peaked before collapsing.
- Second AI Winter (late 1980s–early 1990s) — expert systems proved brittle and expensive to maintain; Lisp machine market collapsed.
- Backpropagation revival (1986) — Rumelhart, Hinton, and Williams published “Learning Representations by Back-propagating Errors” in Nature; popularized backpropagation as a practical training algorithm for multilayer networks and reignited connectionist research.
- Statistical ML renaissance (1990s) — shift from hand-crafted rules toward learning from data; support vector machines and probabilistic methods rose to prominence.
- Deep learning resurgence (2006–) — Hinton et al. showed deep belief networks could be pretrained layer by layer; reignited interest in neural networks.
- ImageNet (2009) — Fei-Fei Li led creation of the ImageNet dataset at Princeton/Stanford: over 14 million labeled images across 20,000+ categories; the annual ILSVRC competition (from 2010) became the benchmark that drove the deep learning revolution.
- ImageNet moment (2012) — AlexNet’s dramatic win on ILSVRC signaled deep learning’s practical superiority; triggered the modern deep learning era.
- Connectionism — the research tradition modeling cognition as emergent from networks of simple interconnected units (neurons); contrasted with symbolic AI; the lineage from which modern deep learning descends.
- Hopfield Network (1982) — John Hopfield introduced a recurrent network that functions as an associative memory, storing patterns as energy minima; foundational to modern work on energy-based models; contributed to Hopfield sharing the 2024 Nobel Prize in Physics.
- Boltzmann Machine — stochastic recurrent neural network with hidden units; introduced by Hinton and Sejnowski (1985); the Restricted Boltzmann Machine (RBM), typically trained by contrastive divergence (Hinton, 2002), became a building block of early deep networks.
- NETtalk (1987) — Sejnowski and Rosenberg; a neural network that learned to convert English text to phonemes (speech); widely publicized as a demonstration of connectionist learning.
- TD-Gammon (1992) — Gerald Tesauro’s backgammon program trained via temporal difference (TD) learning; reached expert human level; early proof of concept for RL on complex games.
Machine Learning Fundamentals
- Supervised learning — model trained on labeled input-output pairs; learns a mapping from inputs to targets. Examples: classification, regression.
- Unsupervised learning — no labels; model finds structure in data. Examples: clustering, dimensionality reduction, density estimation.
- Semi-supervised learning — combines a small labeled set with a large unlabeled set; common when labeling is expensive.
- Self-supervised learning — model generates its own supervision from the data (e.g., predicting masked tokens); underpins modern LLMs.
- Reinforcement learning (RL) — an agent takes actions in an environment to maximize cumulative reward; core concepts are state, action, reward, and policy.
- Training, validation, test split — training data fits the model; validation data tunes hyperparameters; test data gives a final unbiased estimate of performance.
- Overfitting / underfitting — overfitting: model memorizes training data, fails to generalize; underfitting: model too simple to capture patterns.
- Bias-variance tradeoff — high bias = underfitting; high variance = overfitting; regularization and ensemble methods balance the two.
- Cross-validation — technique (e.g., k-fold) for estimating generalization by rotating which subset is held out.
- Regularization — techniques that penalize model complexity to reduce overfitting: L1 (Lasso, promotes sparsity), L2 (Ridge, shrinks weights), dropout.
- Feature engineering — transforming raw data into representations that improve model performance; less central in deep learning but critical in classical ML.
- Hyperparameters vs. parameters — parameters are learned from data (weights, biases); hyperparameters are chosen before training (learning rate, number of layers, regularization strength); tuned by grid search, random search, or Bayesian optimization.
- Confusion matrix — table of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); foundation for computing precision, recall, specificity, and F1.
- No free lunch theorem — David Wolpert (1996): no single learning algorithm performs best across all possible problems; performance depends on the match between algorithm assumptions and the data distribution.
- Occam’s razor in ML — prefer simpler models when performance is comparable; simpler models tend to generalize better and are more interpretable.
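As an illustration of the cross-validation entry above, here is a minimal sketch of k-fold index splitting in plain Python. The function name `k_fold_indices` is hypothetical, and the split is left unshuffled for readability; in practice the indices are usually shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, holding out one fold at a time."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample appears in exactly one validation fold, so averaging the k validation scores uses every sample for evaluation once.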
Classic Algorithms
- Linear regression — fits a line (or hyperplane) minimizing squared residuals; assumes a linear relationship between features and a continuous target.
- Logistic regression — models the probability of a binary outcome using the logistic (sigmoid) function; output is a probability, decision at a threshold.
- Decision tree — hierarchical binary splits on features to predict a label; interpretable but prone to overfitting; depth and pruning control complexity.
- Random forest — ensemble of decision trees trained on bootstrap samples with random feature subsets; reduces variance via averaging (bagging).
- Gradient boosting — additive ensemble where each tree corrects errors of the previous; implementations include XGBoost, LightGBM, CatBoost; strong on tabular data.
- Support vector machine (SVM) — finds the maximum-margin hyperplane separating classes; the kernel trick maps data to high-dimensional space implicitly, enabling nonlinear boundaries.
- k-Nearest Neighbors (kNN) — classifies a point by majority vote of its k closest training examples; non-parametric; sensitive to feature scale.
- Naive Bayes — applies Bayes’ theorem assuming feature independence; fast and effective for text classification despite unrealistic independence assumption.
- k-Means clustering — partitions data into k clusters by iteratively assigning points to nearest centroid and recomputing centroids; requires specifying k.
- Principal Component Analysis (PCA) — linear dimensionality reduction; finds orthogonal axes of maximum variance; widely used for visualization and preprocessing.
- t-SNE / UMAP — nonlinear dimensionality reduction for visualization; t-SNE (van der Maaten & Hinton, 2008) and UMAP are common for high-dimensional biological and representation data.
- Bagging (Bootstrap Aggregating) — trains multiple models on bootstrap samples and averages predictions; reduces variance; random forests are the canonical bagging ensemble.
- Stacking — meta-learning ensemble that trains a second-level model on the outputs of base models; often outperforms any single base model.
- Expectation-Maximization (EM) algorithm — iterative method for maximum likelihood estimation with latent variables; alternates between E-step (compute expected log-likelihood) and M-step (maximize it); used in Gaussian mixture models and HMMs.
- Gaussian mixture model (GMM) — probabilistic model representing a distribution as a weighted sum of Gaussian components; fitted with EM; soft-assignment clustering.
- Gaussian process (GP) — a distribution over functions; used for Bayesian regression and optimization; provides uncertainty estimates; Bayesian optimization with GPs is standard for hyperparameter tuning.
- DBSCAN — density-based clustering that finds arbitrarily shaped clusters and marks noise points as outliers; requires two parameters (epsilon, minPts); does not require specifying k in advance.
- Anomaly detection — identifying rare observations that deviate markedly from the majority; methods include isolation forests, one-class SVMs, and autoencoders; central to fraud detection and network security.
- Class imbalance / SMOTE — when one class is rare, standard accuracy is misleading; remedies include oversampling (SMOTE synthesizes minority-class examples), undersampling, and class-weighted loss functions.
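The assign-then-recompute loop described in the k-Means entry above can be sketched as follows. This is a minimal illustration on 2-D points, assuming the caller supplies the initial centroids (real implementations usually pick them randomly or with k-means++); the function name `kmeans` and the toy points are hypothetical.

```python
def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(c)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, which is why k-means is commonly run several times with different initializations.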
Neural Networks and Deep Learning
- Perceptron (algorithm) — Frank Rosenblatt’s single-layer linear classifier; foundational unit; the XOR problem (not linearly separable) revealed its limits.
- Multilayer perceptron (MLP) — stacked layers of neurons with nonlinear activations; universal approximation theorem states a sufficiently wide MLP can approximate any continuous function.
- Activation functions — introduce nonlinearity: sigmoid (saturates, vanishing gradients), tanh, ReLU (max(0,x); most common in practice), leaky ReLU, GELU.
- Backpropagation — algorithm for computing gradients of a loss with respect to all weights via the chain rule; enables training of deep networks; popularized by Rumelhart, Hinton, and Williams (1986).
- Gradient descent — iteratively adjusts weights in the direction of the negative gradient. Variants: SGD (stochastic, single examples), mini-batch, Adam (adaptive learning rates; Kingma & Ba, 2014).
- Loss function — measures prediction error; examples: mean squared error (regression), cross-entropy (classification), hinge loss (SVM).
- Vanishing / exploding gradients — deep networks can have gradients shrink to near zero (vanishing) or grow uncontrollably; addressed by careful initialization, batch normalization, skip connections.
- Batch normalization (2015) — normalizes layer inputs within a mini-batch; stabilizes and accelerates training (Ioffe & Szegedy).
- Dropout (2014) — randomly zeroes a fraction of neurons during training; acts as ensemble regularization (Srivastava, Hinton et al.).
- Skip connections / residual connections — introduced in ResNet (He et al., 2015); add the input of a layer directly to its output, creating a shortcut; mitigate vanishing gradients and enable very deep networks.
- Universal approximation theorem — a multilayer perceptron with a single hidden layer of sufficient width can approximate any continuous function on a compact domain; established by Cybenko (1989) and Hornik (1991).
- Weight initialization — random initialization strategies (Xavier/Glorot, He/Kaiming) keep variance stable across layers and prevent vanishing/exploding gradients at the start of training.
- Learning rate schedulers — strategies for varying the learning rate during training (step decay, cosine annealing, warm-up); critical for convergence in large models.
- Softmax function — converts a vector of raw scores (logits) into a probability distribution summing to 1; standard output layer for multi-class classification; temperature scaling alters sharpness.
- Cross-entropy loss — standard loss for classification; measures the divergence between predicted probability distribution and the true (one-hot) label; minimizing it is equivalent to maximizing log-likelihood.
- Exploding gradients and gradient clipping — a counterpart to vanishing gradients common in RNNs; clipping norms of gradients to a maximum value stabilizes training.
- Data augmentation — artificially expanding the training set by applying transformations (flips, crops, color jitter for images; back-translation for text); reduces overfitting and improves generalization.
- Knowledge distillation — training a smaller “student” model to mimic the soft output probabilities of a larger “teacher” model; produces compact models with near-teacher accuracy (Hinton et al., 2015).
- Curriculum learning — training strategy where examples are presented in order from easy to hard; Bengio et al. (2009) showed this can improve convergence and generalization.
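To make the softmax and cross-entropy entries above concrete, here is a small sketch in plain Python. The function names and the example logits are illustrative assumptions; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature gives a sharper result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class (one-hot target)."""
    return -math.log(probs[target_index])


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)
sharp = softmax(logits, temperature=0.5)
print(probs, loss, sharp)
```

With a one-hot target the loss reduces to the negative log-probability of the correct class, which is why minimizing it is the same as maximizing log-likelihood.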
Convolutional Neural Networks (CNNs)
- Convolution — applies a learned filter (kernel) sliding across input; captures local patterns (edges, textures) with parameter sharing.
- Pooling — reduces spatial dimensions (max-pool, average-pool); provides a degree of translation invariance.
- LeNet-5 (1998) — Yann LeCun’s CNN for digit recognition on MNIST; demonstrated end-to-end learned features.
- VGG / GoogLeNet / ResNet — successive ImageNet-era architectures. ResNet (He et al., 2015) introduced residual (skip) connections, enabling networks of 100+ layers.
- Object detection networks — R-CNN family (Girshick et al.) and YOLO (Redmon et al.) extended CNNs to localize and classify multiple objects in an image simultaneously.
- Vision Transformer (ViT, 2020) — Dosovitskiy et al. (Google Brain); applied the transformer architecture directly to image patches rather than pixels; competitive with CNNs on ImageNet; showed transformers are not NLP-specific.
- AlexNet architecture details — eight learned layers (five convolutional, three fully connected); trained on two GPUs simultaneously; used ReLU and dropout; the architecture paper by Krizhevsky, Sutskever, and Hinton is among the most cited in computer science.
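The convolution and pooling entries above can be illustrated with a short sketch on nested lists. It assumes no padding and stride 1 for the convolution, and a hand-written (not learned) kernel; note that, as in most deep learning libraries, the "convolution" here is technically cross-correlation because the kernel is not flipped. The function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# A 4x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]  # responds where intensity increases left to right
feature_map = conv2d_valid(image, edge_kernel)
pooled = max_pool2d(feature_map)
print(feature_map)
print(pooled)
```

The same two kernel weights are reused at every position, which is the parameter sharing mentioned above.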
Recurrent Neural Networks (RNNs)
- RNN — processes sequential data by maintaining a hidden state updated at each time step; struggles with long-range dependencies.
- LSTM (Long Short-Term Memory, 1997) — Hochreiter & Schmidhuber; gates (input, forget, output) allow selective memory; dominant sequence model before transformers.
- GRU (Gated Recurrent Unit, 2014) — simplified LSTM variant; fewer parameters; comparable performance on many tasks.
- Sequence-to-sequence (seq2seq) — encoder-decoder architecture for tasks like translation; attention mechanism (Bahdanau, 2015) allowed the decoder to focus on relevant encoder states.
- Hidden Markov Model (HMM) — probabilistic sequence model with hidden states; widely used in speech recognition and bioinformatics before neural approaches dominated; trained with the Baum-Welch (EM) algorithm.
- Conditional Random Field (CRF) — discriminative probabilistic graphical model for structured prediction; widely used for sequence labeling (named entity recognition, part-of-speech tagging); often stacked atop RNNs or transformers.
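The hidden-state update described in the RNN entry above can be sketched as a vanilla (Elman-style) recurrence, h_t = tanh(W_x x_t + W_h h_{t-1} + b). The weights below are hand-picked toy values rather than trained parameters, and the function name is hypothetical.

```python
import math


def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden_size = len(b)
    h = [0.0] * hidden_size  # initial hidden state
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous hidden state.
        h = [math.tanh(sum(W_x[i][j] * x[j] for j in range(len(x)))
                       + sum(W_h[i][k] * h[k] for k in range(hidden_size))
                       + b[i])
             for i in range(hidden_size)]
        states.append(h)
    return states


# One input feature, one hidden unit; a single impulse followed by zeros.
states = rnn_forward(inputs=[[1.0], [0.0], [0.0]],
                     W_x=[[1.0]], W_h=[[0.5]], b=[0.0])
print(states)
```

In this toy setting the trace of the first input shrinks at every step, a small-scale picture of why plain RNNs struggle with long-range dependencies.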
Transformers and Attention
- Attention mechanism — computes a weighted sum of values, where weights reflect learned relevance between a query and a set of keys; allows modeling long-range dependencies.
- Self-attention — queries, keys, and values all come from the same sequence; each position attends to every other position.
- Multi-head attention — runs several attention operations in parallel, concatenates results; captures different relationship types.
- Transformer (2017) — Vaswani et al. (“Attention Is All You Need”); replaced recurrence entirely with self-attention; enabled massive parallelism during training; underpins nearly all modern LLMs.
- Positional encoding — since transformers lack recurrence, position information is injected via sinusoidal or learned embeddings.
- BERT (2018) — Google; bidirectional encoder pretrained on masked language modeling; fine-tuned for downstream NLP tasks.
- GPT series — OpenAI autoregressive (left-to-right) language models; GPT-2 (2019) raised concerns about misuse; GPT-3 (2020, 175B parameters) showed striking few-shot capability.
- Encoder vs decoder vs encoder-decoder — BERT-style encoders for understanding tasks; GPT-style decoders for generation; T5/BART-style encoder-decoder for seq2seq.
- Layer normalization — normalizes across the feature dimension (not the batch); preferred over batch normalization in transformers because it is batch-size independent.
- InstructGPT / instruction tuning (2022) — OpenAI fine-tuned GPT-3 with RLHF on human-written instruction-response pairs; the direct precursor to ChatGPT; showed that alignment dramatically improves usability.
- T5 (Text-to-Text Transfer Transformer, 2019) — Google (Raffel et al.); unified all NLP tasks as text-in, text-out; pretrained with a span-corruption objective; established a systematic framework for comparing transfer learning approaches.
- Rotary positional embeddings (RoPE) — encodes position by rotating query and key vectors; allows relative position information to be incorporated directly into the attention score; used by LLaMA and many open-source LLMs.
- Flash Attention — Dao et al. (2022); IO-aware exact attention algorithm that tiles computation to reduce GPU memory bandwidth bottlenecks; dramatically speeds transformer training without changing the mathematical result.
- Chain-of-thought prompting — Wei et al. (2022, Google); adding intermediate reasoning steps to few-shot prompts significantly improves multi-step reasoning in large models; sparked a wave of “reasoning trace” methods.
- LLaMA — Meta’s family of open-weight large language models (released 2023–); demonstrated that smaller, efficiently trained models can rival larger proprietary ones; widely used as the base for open-source fine-tuning.
- Constitutional AI / RLAIF — Anthropic approach in which an AI model critiques and revises its own outputs according to a set of principles, reducing the need for large amounts of human preference labels; a variant of alignment via AI feedback.
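The attention and self-attention entries above can be sketched as scaled dot-product attention in plain Python. This is a simplification under stated assumptions: a single head, no learned query/key/value projections, and no masking; the scaling by the square root of the key dimension follows Vaswani et al. The function names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """For each query, return a relevance-weighted sum of the value vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per key, summing to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# Self-attention: queries, keys, and values all come from the same sequence.
X = [[1.0, 0.0], [0.0, 1.0]]
out = scaled_dot_product_attention(X, X, X)
print(out)
```

Multi-head attention would run several such operations on different learned projections of the inputs and concatenate the results.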
Embeddings and Representations
- Word embeddings — dense vector representations of words encoding semantic relationships; Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are classic examples.
- Contextual embeddings — representations that change with context (unlike Word2Vec); produced by BERT, GPT, and similar models.
- Embedding space geometry — relationships encoded as vector arithmetic (e.g., king − man + woman ≈ queen in Word2Vec).
- Tokenization — text split into subword units (Byte Pair Encoding, WordPiece) before embedding; balances vocabulary size and coverage.
- Byte Pair Encoding (BPE) — subword tokenization algorithm that iteratively merges the most frequent adjacent byte pairs; used by GPT models; handles rare and out-of-vocabulary words gracefully.
- Latent space — the compressed internal representation learned by a model (e.g., the middle of an autoencoder or VAE); nearby points in latent space correspond to semantically similar inputs.
- FastText — Facebook AI’s extension of Word2Vec that represents each word as a bag of character n-grams; handles morphologically rich languages and out-of-vocabulary words better than whole-word embeddings.
- Sentence transformers / SBERT — fine-tuned BERT-style models that produce fixed-size sentence embeddings suitable for semantic similarity via cosine similarity; widely used in retrieval and clustering.
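The merge loop described in the Byte Pair Encoding entry above can be sketched as follows. This is a character-level toy, assuming a tiny hypothetical word-frequency table; production tokenizers such as those used by GPT models operate on bytes and add many details not shown here. Ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges, segmented = learn_bpe(corpus, num_merges=2)
print(merges)
print(segmented)
```

After two merges the frequent suffix "est" has become a single symbol, which is how subword vocabularies come to cover rare words with a few reusable pieces.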
Generative Models
- Generative Adversarial Network (GAN, 2014) — Ian Goodfellow et al.; a generator and discriminator trained adversarially; generator learns to produce realistic samples; prone to mode collapse and training instability.
- Variational Autoencoder (VAE, 2013) — Kingma & Welling; encodes input to a latent distribution, samples, decodes; enables smooth interpolation in latent space.
- Diffusion models — learn to reverse a gradual noising process; state-of-the-art for image generation (DDPM, Ho et al. 2020; Stable Diffusion, DALL-E 2, Midjourney are prominent systems).
- Autoencoder — neural network trained to encode inputs to a low-dimensional bottleneck and reconstruct them; learns compressed representations; variants include denoising autoencoders and sparse autoencoders.
- Score-based / flow-matching models — related generative frameworks using continuous normalizing flows or score functions; increasingly used in practice.
- DALL-E / CLIP (OpenAI) — CLIP (2021) trained a vision-language model by contrastive learning on image-text pairs; DALL-E used a transformer to generate images from text; DALL-E 2 (2022) switched to diffusion.
- Denoising Diffusion Probabilistic Models (DDPM, 2020) — Ho, Jain, and Abbeel; formalized the diffusion generative framework; forward process adds Gaussian noise; reverse process (a neural network) learns to denoise step by step.
- Normalizing flows — generative models that apply a sequence of invertible transformations to a simple distribution; exact likelihood computation; less scalable than diffusion models for images.
- U-Net — encoder-decoder CNN with skip connections between corresponding encoder and decoder layers; originally designed for biomedical image segmentation (Ronneberger et al., 2015); became the standard backbone for diffusion model image generation.
- Mode collapse (GANs) — a failure mode in which the GAN generator produces only a narrow subset of realistic outputs, ignoring much of the true data distribution; a major stability challenge for GAN training.
- Contrastive learning — self-supervised method that trains representations by pulling representations of similar (positive) pairs together and pushing dissimilar (negative) pairs apart; SimCLR and MoCo are prominent methods; the basis of CLIP.
- Text-to-image generation timeline — DALL-E (OpenAI, 2021) used a transformer; Stable Diffusion (Stability AI / CompVis, 2022) made high-quality diffusion generation open-source; Midjourney (2022) gained broad consumer adoption.
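The forward (noising) half of the DDPM entry above can be sketched in a few lines. The sketch assumes the linear beta schedule from Ho et al. (1e-4 to 0.02 over 1000 steps) and uses the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise; the function names and the toy "image" are hypothetical, and the learned reverse process is not shown.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t of the forward process by mixing x0 with Gaussian noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]            # stand-in for a flattened image
x_mid = q_sample(x0, 500, alpha_bars, rng)
x_end = q_sample(x0, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])
```

By the final step almost none of the original signal remains, so the reverse network effectively learns to turn near-pure noise back into data.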
Reinforcement Learning
- Markov Decision Process (MDP) — formal framework: states, actions, transition probabilities, rewards, discount factor.
- Q-learning — model-free RL that learns the value of (state, action) pairs (Q-values); DeepMind's Deep Q-Network (DQN, 2015) combined Q-learning with deep neural networks to play many Atari games at or above human level.
- Policy gradient / REINFORCE — directly optimizes the policy by estimating gradients of expected reward.
- Actor-critic — combines value estimation (critic) and policy optimization (actor); Proximal Policy Optimization (PPO) is a widely used stable variant.
- RLHF (Reinforcement Learning from Human Feedback) — human raters score outputs; reward model trained on ratings; policy fine-tuned with RL; used to align ChatGPT, Claude, and similar LLMs.
- Bellman equation — Richard Bellman (1950s); expresses the value of a state as the immediate reward plus the discounted value of successor states; the foundation of dynamic programming and Q-learning.
- Temporal difference (TD) learning — updates value estimates based on the difference between successive predictions rather than waiting for final outcomes; bridges Monte Carlo methods and dynamic programming; formalized by Sutton (1988).
- Exploration-exploitation tradeoff — the fundamental dilemma in RL: exploiting known good actions vs. exploring new actions that might yield higher reward; addressed by epsilon-greedy, UCB, and Thompson sampling strategies.
- Monte Carlo Tree Search (MCTS) — builds a search tree by random rollouts to estimate action values; combined with neural network value/policy functions in AlphaGo and successor systems.
- Deep Q-Network (DQN, 2013/2015) — DeepMind (Mnih et al.); used experience replay and a target network to stabilize Q-learning with deep neural networks; evaluated on 49 Atari games, reaching human-level or better performance on a majority of them.
- AlphaStar (2019) — DeepMind; reached Grandmaster level at StarCraft II and defeated top human professionals using a multi-agent RL approach over a very large combinatorial action space; published in Nature.
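The Q-learning, Bellman equation, and exploration-exploitation entries above can be tied together in a small tabular sketch. The environment is a hypothetical five-state corridor (start at the left end, reward 1 for reaching the right end), and the hyperparameters are arbitrary illustrative choices; deep variants such as DQN replace the table with a neural network.

```python
import random


def q_learning_chain(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on a corridor MDP."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                a = rng.randrange(2)                      # explore
            else:
                best = max(q[s])                          # exploit, random tie-break
                a = rng.choice([i for i in range(2) if q[s][i] == best])
            s_next = max(0, s - 1) if a == 0 else s + 1
            reward = 1.0 if s_next == goal else 0.0
            # Bellman target: immediate reward plus discounted best next value
            # (no bootstrapping from the terminal state).
            target = reward if s_next == goal else reward + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if s == goal:
                break
    return q


q = q_learning_chain()
policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(4)]
print(policy)
```

The learned values for moving right approach 1, 0.9, 0.81, ... as the distance to the goal grows, which is the discounting in the Bellman equation made visible.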
Landmark Systems
- Deep Blue (1997) — IBM chess computer; defeated world champion Garry Kasparov 3.5-2.5 in a match; used specialized hardware and deep search, not machine learning.
- IBM Watson (2011) — won Jeopardy! against Ken Jennings and Brad Rutter; used a pipeline of NLP and information retrieval, not a neural network.
- AlexNet (2012) — Krizhevsky, Sutskever, Hinton; pivotal for deep learning adoption; won ILSVRC 2012 by a large margin; used ReLU, dropout, GPU training; triggered the deep learning boom.
- AlphaGo (2016) — DeepMind; combined deep neural networks (policy and value networks) with Monte Carlo Tree Search; defeated world champion Lee Sedol 4-1; Go had been considered too complex for AI at the time.
- AlphaGo Zero (2017) — trained entirely from self-play without human game data; surpassed AlphaGo; demonstrated that self-play RL could exceed training guided by human games.
- AlphaFold 2 (2020) — DeepMind; predicted protein 3D structure from amino acid sequence at near-experimental accuracy; transformed structural biology.
- GPT-3 (2020) — 175B-parameter language model; demonstrated emergent few-shot and zero-shot capabilities; commercialized as the OpenAI API.
- ChatGPT (2022) — OpenAI; GPT-3.5 fine-tuned with RLHF into a conversational assistant; reached 100 million users in ~2 months.
- Stable Diffusion / DALL-E / Midjourney (2022) — accessible text-to-image generation; diffusion-based; triggered widespread discussion of generative AI and copyright.
- AlphaCode (2022) — DeepMind; transformer trained on code; achieved competitive-programmer-level performance on Codeforces problems; demonstrated LLM viability for competitive programming.
- Codex / GitHub Copilot (2021) — OpenAI Codex fine-tuned GPT-3 on public code; powered GitHub Copilot, a widely deployed AI pair-programming tool; sparked discussion of code generation and licensing.
- GPT-4 (2023) — OpenAI; multimodal (text and image inputs); significantly stronger reasoning and reduced hallucination rate compared to GPT-3.5; technical details including parameter count not officially released.
- Gemini (2023) — Google DeepMind’s multimodal large language model family; natively multimodal from training; Ultra, Pro, and Nano variants for different capability/cost tradeoffs.
- Mixtral / Mistral — Mistral AI (France); released strong open-weight models including Mixtral 8x7B, a sparse mixture-of-experts architecture that delivers GPT-3.5-level performance with lower inference cost.
Evaluation Metrics
| Task | Common Metrics |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression | MAE, RMSE, R² |
| Language generation | BLEU, ROUGE, Perplexity, BERTScore |
| Object detection | Mean Average Precision (mAP) |
| Generative images | FID (Fréchet Inception Distance) |
- Precision / Recall tradeoff — precision = TP/(TP+FP); recall = TP/(TP+FN); F1 is their harmonic mean.
- AUC-ROC — area under the receiver operating characteristic curve; measures discrimination across all thresholds; 0.5 = random, 1.0 = perfect.
- Benchmark caution — high benchmark scores can reflect data contamination (training data includes test examples) or benchmark-specific overfitting; a persistent methodological concern in LLM evaluation.
- GLUE / SuperGLUE — benchmark suites for evaluating NLP systems across diverse tasks (textual entailment, question answering, coreference); superseded by harder benchmarks as models saturated them.
- MMLU (Massive Multitask Language Understanding) — Hendrycks et al.; tests knowledge across 57 academic subjects; widely used to measure breadth of factual knowledge in LLMs.
- HumanEval — OpenAI benchmark for code generation; presents function signatures with docstrings and evaluates whether generated code passes unit tests; standard for measuring coding capability.
- Perplexity — measure of how well a language model predicts a text sample; lower perplexity = better prediction; exponential of the average negative log-likelihood; used to compare language models but not directly informative of downstream task quality.
- BLEU score — Bilingual Evaluation Understudy; measures n-gram overlap between machine-generated and reference translations; fast to compute but criticized for poor correlation with human judgment on modern high-quality outputs.
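The precision, recall, F1, and perplexity definitions above translate directly into code. The sketch below is a minimal illustration; the function names and the example counts are made up for demonstration.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
ppl = perplexity([math.log(0.25)] * 4)
print(precision, recall, f1, ppl)
```

A model that assigns probability 0.25 to every token is as uncertain as a uniform choice among four options, which is why its perplexity comes out at about 4.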
Interpretability and Explainability
- Black-box vs interpretable models — deep neural networks are hard to interpret; linear models and decision trees are inherently interpretable.
- LIME (Local Interpretable Model-agnostic Explanations, 2016) — approximates a black-box model locally with an interpretable surrogate.
- SHAP (SHapley Additive exPlanations, 2017) — assigns each feature a contribution to a prediction using Shapley values from cooperative game theory.
- Saliency maps / gradient-based attribution — highlight which input pixels or tokens most influenced a neural network’s output.
- Mechanistic interpretability — attempts to reverse-engineer the computations inside neural networks at the level of circuits and features; active research area.
- Attention as explanation — attention weights are sometimes used to interpret transformer decisions, but research shows they do not reliably reflect causal importance.
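The Shapley values behind the SHAP entry above can be computed exactly for a handful of features by averaging each feature's marginal contribution over all orderings. The sketch below assumes a hypothetical toy value function standing in for "model output given a subset of features"; it is not the SHAP library's API, which relies on approximations because exact enumeration grows factorially.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: a and b contribute alone and also interact; c is irrelevant."""
    v = 0.0
    if "a" in coalition:
        v += 10.0
    if "b" in coalition:
        v += 20.0
    if "a" in coalition and "b" in coalition:
        v += 6.0  # interaction term, split evenly between a and b
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The attributions sum to the value of the full feature set, the additivity property that gives SHAP its name.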
AI Safety, Bias, and Ethics
- Algorithmic bias — models can perpetuate or amplify biases present in training data; documented in facial recognition (higher error rates for darker skin tones), hiring tools, and recidivism scoring.
- Alignment problem — ensuring AI systems pursue intended goals; includes specification gaming (optimizing the measured proxy rather than the true objective) and goal misspecification.
- Fairness definitions — demographic parity, equalized odds, calibration, individual fairness; several group criteria (e.g., calibration and equal error rates) cannot all be satisfied at once when base rates differ across groups (Chouldechova 2017, Kleinberg et al. 2016).
- Hallucination — LLMs confidently produce plausible but false information; a central reliability concern.
- Differential privacy — a randomized algorithm M is epsilon-differentially private if adding or removing any single training record changes the probability of any set of outputs by at most a factor of e^epsilon; formalized by Dwork et al. (2006); often combined with federated learning to bound how much model outputs can reveal about individual training records.
- Adversarial examples — small, often imperceptible perturbations to inputs cause deep networks to misclassify; demonstrated by Szegedy et al. (2014).
- EU AI Act — major regulatory framework classifying AI systems by risk tier; adopted 2024; different provisions phase in over subsequent years.
- Deepfakes — AI-generated synthetic media (video, audio, images) of real people; generated via GANs or diffusion models; major concern for misinformation and non-consensual imagery.
- Specification gaming / Goodhart’s Law in AI — when a model is optimized to maximize a proxy measure, it learns to exploit the measure rather than achieving the true goal (e.g., an RL agent finding unintended shortcuts); a core alignment concern.
- Model cards — documentation framework proposed by Mitchell et al. (2019, Google) for reporting a model’s intended uses, performance across demographics, and limitations; an early step toward AI transparency standards.
- Membership inference attack — adversarial method that determines whether a specific data point was included in a model’s training set; a privacy risk for models trained on sensitive data.
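As a rough illustration of the differential privacy entry above, the sketch below applies the Laplace mechanism, a standard way to obtain epsilon-differential privacy for numeric queries, to a simple counting query. The function names (`laplace_noise`, `private_count`) and the toy records are assumptions made for this example. Naive floating-point noise sampling like this is known to be unsafe for real deployments, so treat it as a teaching sketch only.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes a rate; a rate of 1/scale gives a mean of `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of five people; three are 40 or older.
ages = [23, 35, 41, 52, 67]
rng = random.Random(0)  # fixed seed only so the demo is repeatable
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"true count = 3, noisy count = {noisy:.2f}")
```

Lowering epsilon widens the noise (scale 1/epsilon), which strengthens the privacy guarantee at the cost of accuracy.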
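To make the fairness definitions listed above more concrete, here is a minimal sketch that computes a demographic parity gap and an equalized odds gap on a tiny hand-made dataset. The helper names (`rate`, `group_rates`) and the data are hypothetical, and real audits would use far larger samples and confidence intervals.

```python
def rate(flags):
    """Fraction of 1s in a list of 0/1 flags (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def group_rates(y_true, y_pred, groups, g):
    """Selection rate, true-positive rate, and false-positive rate for group g."""
    idx = [i for i, gi in enumerate(groups) if gi == g]
    selection = rate([y_pred[i] for i in idx])
    tpr = rate([y_pred[i] for i in idx if y_true[i] == 1])
    fpr = rate([y_pred[i] for i in idx if y_true[i] == 0])
    return selection, tpr, fpr


# Hypothetical toy data: four people in group A, four in group B.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

sel_a, tpr_a, fpr_a = group_rates(y_true, y_pred, groups, "A")
sel_b, tpr_b, fpr_b = group_rates(y_true, y_pred, groups, "B")

# Demographic parity compares how often each group receives a positive prediction.
demographic_parity_gap = abs(sel_a - sel_b)
# Equalized odds compares error behavior: both TPR and FPR should match across groups.
equalized_odds_gap = max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

print(demographic_parity_gap, equalized_odds_gap)  # prints: 0.5 0.5
```

A gap of 0 on either metric would mean that criterion is satisfied exactly; in this toy data the classifier favors group A on both.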
Key Concepts and Terms
- Transfer learning — pretraining on a large dataset, then fine-tuning on a smaller target task; greatly reduces data and compute requirements.
- Few-shot / zero-shot learning — model generalizes to new tasks with very few (or no) labeled examples; emergent in large language models.
- Foundation models — large models pretrained on broad data that can be adapted to many downstream tasks (term coined by Stanford HAI, 2021).
- Emergent capabilities — abilities that appear in large models but not smaller ones, not predicted by simple scaling; contested whether they are truly discontinuous.
- Scaling laws — empirical power-law relationships between model loss and compute, data, and parameter count (Kaplan et al., 2020, OpenAI); motivate larger models but imply diminishing returns, since each constant improvement in loss requires a multiplicative increase in resources.
- Mixture of Experts (MoE) — a gating network routes each input token to a sparse subset of specialized expert sub-networks; allows total parameter count to scale without proportionally increasing per-token compute; used in GPT-4 (reportedly) and Mixtral.
- Prompt engineering — crafting inputs to elicit desired outputs from a language model without updating weights; chain-of-thought prompting (Wei et al., 2022) improves multi-step reasoning.
- RAG (Retrieval-Augmented Generation) — combines a retrieval system (typically vector search over a knowledge base) with generation (Lewis et al., 2020); grounding answers in retrieved passages can reduce hallucination and lets knowledge be updated without retraining the model.
- Fine-tuning — updating model weights on task-specific data; LoRA (Low-Rank Adaptation, Hu et al. 2021) reduces the number of trainable parameters by learning low-rank updates.
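- Tokenization / context window — text is split into tokens (typically subword units), and an LLM can attend to only a bounded number of tokens at once (its context window); expanding context windows is an active engineering challenge, in part because standard self-attention cost grows quadratically with sequence length.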
- In-context learning — LLMs can perform new tasks by conditioning on examples provided in the prompt, without any weight update; emergent behavior at large scale; underlying mechanism is debated.
- Instruction tuning — fine-tuning a pretrained LLM on a large collection of (instruction, response) pairs across many tasks; produces a model that follows natural language instructions more reliably; FLAN (Wei et al., 2022) was an early systematic study.
- Parameter-efficient fine-tuning (PEFT) — family of methods (including LoRA, prefix tuning, adapter layers) that update only a small fraction of model weights; makes fine-tuning large models computationally accessible.
- Quantization — reducing the numerical precision of model weights (e.g., from 16- or 32-bit floating point to 8-bit or 4-bit integers); cuts memory roughly in proportion to the bit width and lowers inference cost, usually with modest accuracy loss; essential for deploying large models on consumer hardware.
- Speculative decoding — inference speedup technique in which a smaller “draft” model generates candidate tokens that a larger model then verifies in parallel; reduces the number of full-model forward passes needed.
- Vector database — database optimized for storing and querying high-dimensional embedding vectors using approximate nearest-neighbor search; examples include Pinecone, Weaviate, Chroma, and pgvector; foundational infrastructure for RAG systems.
- Agent / agentic AI — an LLM augmented with tools (web search, code execution, API calls) and the ability to plan multi-step actions in a loop; examples include AutoGPT and the ReAct framework (Yao et al., 2022).
- Neural architecture search (NAS) — automated search over neural network architectures to find high-performing designs; computationally expensive but has produced architectures (EfficientNet, NASNet) that outperform hand-designed equivalents.
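The Mixture of Experts entry above describes sparse routing; the sketch below shows one common variant, top-k gating with a softmax over the selected experts. Everything here is a simplified assumption for illustration: the gate logits are passed in directly rather than computed by a learned gating network, and the "experts" are plain scalar functions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Pick the k experts with the highest gate logits and normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    out = 0.0
    for i, w in route_top_k(gate_logits, k):
        out += w * experts[i](x)  # the unselected experts are never evaluated
    return out


# Hypothetical experts: four scalar functions standing in for sub-networks.
experts = [lambda x: 1.0 * x, lambda x: 10.0 * x, lambda x: 3.0 * x, lambda x: 100.0 * x]
gate_logits = [2.0, 0.0, 2.0, -1.0]  # assumed output of a gating network for one token

print(moe_forward(4.0, gate_logits, experts, k=2))  # prints: 8.0
```

With four experts and k=2, only half the expert parameters are touched for this token, which is how total parameter count can grow without a matching growth in per-token compute.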
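For the LoRA and PEFT entries above, a small worked sketch may help: the frozen weight matrix W (d x k) is left untouched, and only two thin matrices B (d x r) and A (r x k) are trained, so the layer computes W x + (alpha / r) * B (A x). The helper names and the toy matrices below are assumptions for illustration, written with plain Python lists rather than a tensor library.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def param_counts(d, k, r):
    """(full fine-tuning parameters, LoRA trainable parameters) for one d x k layer."""
    return d * k, r * (d + k)


# Toy layer: d = 3 outputs, k = 2 inputs, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.5, -0.5]]
B_init = [[0.0], [0.0], [0.0]]      # B starts at zero, so the update is initially a no-op
B_trained = [[1.0], [0.0], [2.0]]   # hypothetical values after some training
x = [2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=2.0, r=1))     # prints: [2.0, 3.0, 5.0]
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))  # prints: [1.0, 3.0, 3.0]
print(param_counts(4096, 4096, 8))                       # prints: (16777216, 65536)
```

For a 4096 x 4096 layer at rank 8, the trainable parameter count drops by a factor of 256, which is the source of the savings the entry describes.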
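The quantization entry above can be illustrated with symmetric per-tensor int8 quantization, one of the simplest schemes: every weight is divided by a shared scale, rounded to an integer in [-127, 127], and multiplied back at inference time. The function names and the sample weights are assumptions for this sketch; production schemes typically use per-channel or per-group scales.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # prints: [50, -127, 0, 100]
# Rounding error per weight is at most half a quantization step.
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)  # prints: True
```

Storing one byte per weight instead of four (for 32-bit floats) is a 4x memory reduction, at the price of a rounding error bounded by half the scale.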
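The retrieval step behind the RAG and vector database entries above can be sketched as a nearest-neighbor lookup by cosine similarity. This version does an exact brute-force scan over a tiny in-memory dictionary. Real vector databases use approximate indexes to scale, and the three-dimensional "embeddings" and document names here are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, index, k):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; a real system would get these from an embedding model.
index = {
    "cats": [1.0, 0.0, 0.0],
    "dogs": [0.9, 0.1, 0.0],
    "stocks": [0.0, 0.0, 1.0],
}
query = [1.0, 0.05, 0.0]

print(top_k(query, index, k=2))  # prints: ['cats', 'dogs']
```

In a RAG pipeline the text of the returned documents would then be placed in the model's prompt before generation.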
Key Figures
- Alan Turing — proposed the Turing Test (1950, “Computing Machinery and Intelligence”) as a criterion for machine intelligence; foundational theoretical work on computation.
- John McCarthy — coined “artificial intelligence” in the 1955 proposal for the 1956 Dartmouth workshop; developed the Lisp programming language; pioneered formal logic-based AI.
- Marvin Minsky — co-founded MIT AI Lab; contributions to frames, knowledge representation, and early neural nets; co-authored Perceptrons (1969) with Papert, which temporarily set back neural network research.
- Claude Shannon — information theory (entropy, channel capacity); attended Dartmouth; foundational to all of computing.
- Frank Rosenblatt — invented the perceptron (1958).
- Geoffrey Hinton — “Godfather of Deep Learning”; backpropagation paper (1986); Boltzmann machines; deep belief nets (2006); Capsule Networks; left Google in 2023; co-awarded the 2024 Nobel Prize in Physics.
- Yann LeCun — convolutional neural networks; LeNet; Chief AI Scientist at Meta; 2018 Turing Award (with Hinton and Bengio).
- Yoshua Bengio — deep learning theory; generative models; attention mechanisms; 2018 Turing Award.
- Andrej Karpathy — founding member of OpenAI; former director of AI at Tesla (Autopilot vision); computer vision research including work on ImageNet; widely cited pedagogy on deep learning.
- Ian Goodfellow — invented GANs (2014); Deep Learning textbook (with Bengio and Courville).
- Ilya Sutskever — co-founded OpenAI; AlexNet co-author; seq2seq learning.
- Demis Hassabis — co-founded DeepMind; led AlphaGo, AlphaFold; shared half of the 2024 Nobel Prize in Chemistry with John Jumper for protein structure prediction (the other half went to David Baker for computational protein design).
- Andrew Ng — co-founded Google Brain; former chief scientist at Baidu; co-founded Coursera (with Daphne Koller) and founded deeplearning.ai; influential in democratizing ML education; Stanford professor.
- Fei-Fei Li — creator of ImageNet; Stanford professor; co-director of Stanford HAI (Human-Centered AI); former Chief Scientist of AI/ML at Google Cloud.
- Richard Sutton — co-author of the definitive Reinforcement Learning: An Introduction textbook (with Barto); formalized temporal difference learning; strong proponent of the “bitter lesson” that general-purpose scaling beats handcrafted domain knowledge.
- Seymour Papert — MIT; co-authored Perceptrons (1969) with Minsky; also pioneered constructionist learning theory and created the Logo programming language for children.
- Jürgen Schmidhuber — co-invented the LSTM with Hochreiter (1997); has extensively documented his priority claims on many deep learning ideas; directed IDSIA in Switzerland.
- Sam Altman — CEO of OpenAI; led the organization through GPT-3, ChatGPT, GPT-4; briefly removed and reinstated by the board in November 2023.
- Dario Amodei — co-founded Anthropic (with Daniela Amodei and others after leaving OpenAI); Anthropic is the company behind the Claude family of AI assistants; focus on AI safety.
- John Hopfield — physicist who introduced the Hopfield Network (1982) as a model of associative memory; received the 2024 Nobel Prize in Physics alongside Geoffrey Hinton.
- David Rumelhart — co-authored the landmark 1986 backpropagation paper with Hinton and Williams; a founding figure of the parallel distributed processing (PDP) connectionist movement.
- Judea Pearl — developed Bayesian networks and the do-calculus for causal inference; received the 2011 Turing Award; his work underpins principled reasoning about causality in AI.
- Vladimir Vapnik — co-developed the support vector machine and statistical learning theory (VC dimension); laid theoretical foundations for machine learning generalization guarantees.