mindmap
  Backprop-driven Neural Learning
    Loss & Objective
      Softmax / Hinge
      Regularization λ·R(W)
    Optimization
      SGD + mini-batches
      Momentum / Adam · LR schedules
    Architecture & Activations
      Layers · hidden units · dims
      ReLU family vs sigmoid · expressivity
    Computational Graphs & Efficient Backprop
      DAG + reverse-mode AD
      Matrix-backward identities · sparse Jacobians
    Implementation Practices
      Modular forward/backward gates
      Analytic grads · numerical check

Lecture scope: neural networks and backpropagation as foundational methods

This lecture introduces neural networks and backpropagation as the core mathematical and algorithmic mechanisms that enable layered models to learn from data by computing gradients of a loss with respect to parameters.

  • Backpropagation is presented as the general procedure used across modern learning algorithms to propagate error derivatives from outputs back to parameters in multi-layer architectures.
  • This propagation enables gradient-based optimization via variants of gradient descent.
  • Understanding these ideas is framed as essential for subsequent topics and practical model development.

Review of loss functions: softmax, hinge loss, and scoring

Loss functions convert model scores into scalar objectives to be minimized during training.

  • Common choices:
    • Softmax cross-entropy — normalizes scores into probabilities and optimizes negative log-likelihood; typically used for probabilistic multiclass classification.
    • Hinge (SVM) loss — enforces margin-based separation without producing probabilities; penalizes cases where the true class score is not larger than competing class scores by at least a margin (often 1). It yields zero loss when the margin constraint is satisfied and increases linearly otherwise.
  • Losses are commonly combined with parameter regularization to form the total objective:
    • L = data_loss + lambda * regularizer(W), where lambda scales the regularization penalty.
  • The choice of loss depends on task requirements and empirical performance.
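
A minimal NumPy sketch of the two losses and the combined objective above; the helper names and all numeric values (scores, true-class index, lambda) are illustrative choices, not values from the lecture.

  import numpy as np

  def softmax_cross_entropy(scores, y):
      # Shift scores for numerical stability, then take the negative log-probability
      # of the true class y.
      shifted = scores - scores.max()
      log_probs = shifted - np.log(np.exp(shifted).sum())
      return -log_probs[y]

  def multiclass_hinge(scores, y, margin=1.0):
      # Penalize every class whose score is not below the true-class score by at least `margin`.
      margins = np.maximum(0.0, scores - scores[y] + margin)
      margins[y] = 0.0
      return margins.sum()

  scores = np.array([3.2, 5.1, -1.7])   # raw class scores for one example
  y = 0                                  # index of the correct class
  W = np.random.randn(3, 4)              # toy weight matrix
  lam = 1e-3                             # regularization strength
  total = softmax_cross_entropy(scores, y) + lam * np.sum(W * W)   # data loss + lambda * R(W)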

Gradient-based optimization: gradients, gradient descent, mini-batches, and advanced optimizers

Model parameters are found by minimizing the loss, descending its landscape using the gradients ∂L/∂W.

  1. Compute gradient and update parameters in the negative gradient direction scaled by a learning rate (gradient descent): W ← W − η ∂L/∂W.
  2. For large datasets, computing gradients on the full dataset is often infeasible, so use stochastic gradient descent (SGD) with mini-batches (typical sizes: 32, 64, 128, 256) to approximate the true gradient and update iteratively.
  3. Practical optimizers extend SGD:
    • Momentum, RMSProp, Adam, and variants accelerate convergence and adapt step sizes.
  4. Learning rate scheduling (decay, step reductions) remains important, although some adaptive optimizers partially encode step-size adaptation internally.
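
A minimal sketch of how these pieces fit together: a mini-batch SGD loop with classical momentum and a crude step-decay schedule. The callable loss_and_grad (returning the mini-batch loss and ∂L/∂W) is a hypothetical placeholder, not an API from the lecture.

  import numpy as np

  def sgd_momentum(W, X, Y, loss_and_grad, lr=1e-2, rho=0.9,
                   batch_size=128, num_steps=1000):
      v = np.zeros_like(W)                                # momentum "velocity"
      for step in range(num_steps):
          idx = np.random.choice(X.shape[0], batch_size, replace=False)
          loss, dW = loss_and_grad(W, X[idx], Y[idx])     # mini-batch gradient estimate
          v = rho * v - lr * dW                           # accumulate a decaying gradient average
          W = W + v                                       # step along the velocity
          if (step + 1) % 200 == 0:
              lr *= 0.5                                   # crude step-decay schedule
      return W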

From linear classifiers to multi-layer networks: dimensions, layers, and hidden units

A single-layer linear classifier computes scores s = W x (plus bias), mapping D-dimensional inputs to C outputs — the simplest neural model.

  • Multi-layer networks add one or more hidden layers with H units and additional weight matrices (e.g., W1, W2) so intermediate representations can be learned.
  • Maintain dimensional consistency across layers (input dimension D, hidden H, output C).
  • Adding layers increases representational capacity by enabling successive linear transformations followed by nonlinearities, mapping inputs into spaces where classes can be linearly separated.
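
A shape-focused sketch of the forward pass of such a two-layer network; the sizes N, D, H, C below are arbitrary example values.

  import numpy as np

  N, D, H, C = 64, 3072, 100, 10        # batch size, input dim, hidden units, classes
  x = np.random.randn(N, D)
  W1 = 0.01 * np.random.randn(D, H)
  b1 = np.zeros(H)
  W2 = 0.01 * np.random.randn(H, C)
  b2 = np.zeros(C)

  h = np.maximum(0.0, x @ W1 + b1)      # hidden representation, shape (N, H)
  s = h @ W2 + b2                       # class scores, shape (N, C)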

Role of nonlinearities and activation functions in expressivity

Activation functions insert nonlinearity between linear layers so stacked layers cannot be collapsed into a single linear mapping — this nonlinearity is essential to learn complex, nonlinearly separable functions.

  • Common activations and tradeoffs:
    • ReLU (rectified linear unit, max(0, x)): computationally efficient and mitigates vanishing gradients in the positive regime, but can suffer from dead neurons whose output and gradient stay at zero.
    • Leaky ReLU / ELU: address zero-region issues by allowing a small gradient when x < 0.
    • Swish and variants: used in recent architectures with competitive empirical performance.
    • Sigmoid / Tanh: squash values and can cause vanishing gradients, so they are often reserved for final layers or specialized use.
  • Activation choice is a hyperparameter often selected empirically or by following architecture conventions, considering differentiability, zero-centeredness, gradient dynamics, and convergence speed.
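
For reference, elementwise NumPy versions of the activations listed above (a minimal sketch; Swish is shown with its β fixed to 1).

  import numpy as np

  def relu(x):
      return np.maximum(0.0, x)

  def leaky_relu(x, alpha=0.01):
      # A small slope for x < 0 keeps some gradient flowing and avoids dead units.
      return np.where(x > 0, x, alpha * x)

  def elu(x, alpha=1.0):
      return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def swish(x):
      return x * sigmoid(x)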

Activation selection and hyperparameter practice

Choosing activation functions and other architectural hyperparameters is largely empirical and problem-dependent.

  • Practitioners typically begin with well-tested defaults (e.g., ReLU for CNNs) and explore alternatives when necessary.
  • Key desiderata:
    • Preserve gradient flow (avoid vanishing gradients).
    • Suitable zero-centering.
    • Computational efficiency and training stability.
  • Research offers theoretical and empirical guidance, but practical selection often relies on prior work for similar tasks and iterative hyperparameter tuning.

Practical two-layer network implementation and the centrality of analytic gradients

Implementing a two-layer neural network in a high-level language typically involves these steps:

  1. Define input/output dimensions and hidden layer size.
  2. Initialize weight matrices and biases.
  3. Perform a forward pass to compute intermediate representations, scores, and the loss.
  4. Compute gradients for an optimization step (backward pass).
  5. Optionally perform numerical gradient checking to validate correctness.
    • In practice, analytical gradients (derived by hand or via automatic differentiation) are used for efficiency and numerical stability.
    • The bulk of engineering effort is often in correctly formulating and implementing the backward pass to compute ∂L/∂W for all trainable parameters.
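
A minimal sketch of the numerical check in step 5, assuming a scalar loss callable f(W) and a float-valued weight array W; central differences are slow, so in practice only a few entries are spot-checked against the analytic gradient.

  import numpy as np

  def numerical_gradient(f, W, h=1e-5):
      # Central differences: perturb one entry of W at a time and measure the change
      # in the scalar loss f(W). W must be a float array; it is modified in place
      # and restored after each perturbation.
      grad = np.zeros_like(W)
      it = np.nditer(W, flags=['multi_index'])
      while not it.finished:
          i = it.multi_index
          old = W[i]
          W[i] = old + h
          f_plus = f(W)
          W[i] = old - h
          f_minus = f(W)
          W[i] = old
          grad[i] = (f_plus - f_minus) / (2.0 * h)
          it.iternext()
      return grad

  # The analytic gradient from the backward pass is then compared entry-wise against
  # this numerical estimate; the relative error should be small (roughly 1e-7 or less
  # for well-behaved losses).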

Model capacity, templates, and regularization versus network size

Increasing the number of hidden units increases model capacity, enabling the network to learn more complex decision boundaries and part-based templates (e.g., shared object parts across classes).

  • Excessive capacity can lead to overfitting; regularization constrains parameters:
    • Common techniques: weight decay, dropout, etc.
  • The regularization coefficient lambda scales the penalty on weights:
    • Too large a lambda → underfitting (overly constrained weights).
    • Too small a lambda → overfitting (memorization).
  • The trade-off between capacity and regularization is typically tuned empirically.

Biological inspiration and limits of brain analogies

Artificial neural networks borrow high-level metaphors from biological neurons (inputs aggregated by a cell body and transmitted via axons), but these analogies are loose and should not be overinterpreted.

  • Computational neurons implement mathematical functions (weighted sums and activations) for efficiency and simplicity.
  • Biological neural dynamics are far more complex; architecture design prioritizes implementation efficiency, scalability, and empirical performance rather than biological fidelity.

Scoring functions, total loss and the need for derivatives for optimization

A model maps inputs through weights to produce scores, which are passed to a loss function (e.g., softmax cross-entropy or hinge) and combined with regularizers to form the total objective L(W).

  • Optimization requires computing partial derivatives ∂L/∂W for all parameters.
  • Deriving these derivatives algebraically for complex models by hand is error-prone and impractical for large architectures or custom losses, motivating systematic computational frameworks for automated derivative computation.

Computational graphs as the structural substrate for automatic differentiation

A computational graph represents a model as a directed acyclic graph (DAG) of elementary operations mapping inputs and parameters to the loss output.

  • Constructing this graph explicitly enables:
    • Local forward computations at each node.
    • Systematic backward propagation of gradients via the chain rule.
  • Each node exposes a local Jacobian (or local gradient rule), receives an upstream gradient from successors, multiplies them to yield downstream gradients for predecessors, and thus implements reverse-mode automatic differentiation efficiently and modularly.

Backpropagation on a scalar example (f = (x + y) * z): local gradients, chain rule, upstream/downstream

For the scalar function f(x,y,z) = (x + y) * z, forward evaluation and backpropagation proceed as follows:

  1. Forward pass:
    • Compute Q = x + y.
    • Compute f = Q * z.
  2. Local derivatives:
    • ∂Q/∂x = 1, ∂Q/∂y = 1.
    • ∂f/∂Q = z, ∂f/∂z = Q.
  3. Backpropagation:
    • Start with upstream gradient ∂f/∂f = 1.
    • Multiply the upstream gradient by each node’s local gradient to produce downstream gradients.
    • This modular multiply-and-propagate yields ∂f/∂x, ∂f/∂y, and ∂f/∂z without computing a global closed-form derivative.
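
The same recipe written out numerically, with arbitrary example values for x, y, and z:

  x, y, z = -2.0, 5.0, -4.0

  # Forward pass
  q = x + y                  # q = 3.0
  f = q * z                  # f = -12.0

  # Backward pass: multiply each local gradient by its upstream gradient
  df_df = 1.0                # seed: ∂f/∂f = 1
  df_dq = z * df_df          # ∂f/∂q = z  -> -4.0
  df_dz = q * df_df          # ∂f/∂z = q  ->  3.0
  df_dx = 1.0 * df_dq        # ∂q/∂x = 1  -> -4.0
  df_dy = 1.0 * df_dq        # ∂q/∂y = 1  -> -4.0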

Purpose of gradients: parameter updates and intuitive interpretation

Gradients of the loss with respect to parameters indicate sensitivity: ∂L/∂W quantifies how a small change in a parameter affects the loss.

  • Update rule (gradient descent): W ← W − η ∂L/∂W, where η is the learning rate.
  • Backpropagation efficiently computes these sensitivities for every parameter in a multi-layer network by propagating gradients from the loss to each parameter.
  • Intuition: the gradient points in the direction in parameter space of steepest loss increase, so the negative gradient provides the descent direction for optimization.

Backpropagation through a composite scalar function (sigmoid of linear combination): numerical example

A sigmoid applied to a linear combination implements σ(a) with a = w0 x0 + w1 x1 + ….

  • Forward propagation evaluates linear sums, exponentials, and division to produce the output σ(a).
  • Backward propagation uses the convenient local gradient:
    • σ’(a) = σ(a) (1 − σ(a)).
  • Backpropagating through a sigmoid requires only the sigmoid value and the upstream gradient, so chaining local gradients yields parameter gradients such as:
    • ∂L/∂w_i = x_i * upstream_grad * σ’(a), enabling parameter updates.
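
A numerical sketch of one such sigmoid neuron; the weights, inputs, bias, and the upstream gradient of 1.0 are arbitrary placeholder values.

  import numpy as np

  w = np.array([2.0, -3.0])          # example weights
  x = np.array([-1.0, -2.0])         # example inputs
  b = -3.0                           # example bias

  # Forward pass
  a = np.dot(w, x) + b               # linear combination
  out = 1.0 / (1.0 + np.exp(-a))     # sigmoid(a)

  # Backward pass, with an assumed upstream gradient dL/d(out) = 1.0
  upstream = 1.0
  da = upstream * out * (1.0 - out)  # sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
  dw = x * da                        # dL/dw_i = x_i * da
  dx = w * da                        # gradient flowing back to the inputs
  db = 1.0 * da                      # gradient for the bias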

Elementary gate patterns and modular forward/backward APIs

Common computation gates exhibit simple, reusable backward rules that make modular automatic differentiation practical:

  • Add gate: distributes upstream gradients unchanged to inputs.
  • Multiply gate: swaps variables in the gradient (∂(xy)/∂x = y).
  • Copy gate: duplicates gradients to multiple consumers.
  • Max / ReLU gate: routes gradient only to the winning input (the input that achieved the max).
  • These local rules support implementing modular forward() and backward() APIs for each operator, storing necessary inputs during the forward pass and encapsulating local backward computations — the design pattern used by deep learning libraries.
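
A minimal sketch of that forward()/backward() pattern for a multiply gate and a ReLU gate; real frameworks differ in detail, but the cache-then-local-backward structure is the same idea.

  class MultiplyGate:
      def forward(self, x, y):
          self.x, self.y = x, y        # cache the inputs needed later by backward()
          return x * y

      def backward(self, upstream):
          dx = self.y * upstream       # d(xy)/dx = y  (the "swap" rule)
          dy = self.x * upstream       # d(xy)/dy = x
          return dx, dy

  class ReLUGate:
      def forward(self, x):
          self.mask = x > 0            # remember which inputs "won" against zero
          return x * self.mask

      def backward(self, upstream):
          return upstream * self.mask  # route gradient only through positive inputs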

Vector, matrix and tensor backpropagation: Jacobians, sparsity and memory considerations

Extending backpropagation to vectors and tensors generalizes scalar derivatives to Jacobians and higher-order derivative tensors.

  • When outputs are vectors, local gradients with respect to inputs are matrices whose multiplication with upstream gradient vectors yields downstream gradients with matching input shape.
  • Naive Jacobian construction is often infeasible (Jacobian sizes can exceed memory), but many operations are structured so backward rules can be implemented with efficient matrix operations and broadcasting rather than materializing full Jacobians:
    • For elementwise ops like ReLU the Jacobian is diagonal (sparse).
    • For matrix multiply the backward pass can be implemented via other matrix multiplies (avoiding explicit large Jacobians).
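
As a concrete instance of the elementwise case above, a ReLU backward pass never materializes its diagonal Jacobian; applying it reduces to an elementwise mask (the shapes below are arbitrary):

  import numpy as np

  x = np.random.randn(128, 512)          # input to an elementwise ReLU (example shape)
  upstream = np.random.randn(128, 512)   # dL/dy, same shape as the output

  # The full Jacobian of this ReLU would be (128*512) x (128*512) and diagonal;
  # multiplying by it reduces to an elementwise mask, so it is never built.
  dx = upstream * (x > 0)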

Efficient matrix-multiply backward formulas and practical implementation patterns

For Y = X W (batch X of shape [N, D], W of shape [D, M], Y of shape [N, M]), the backward pass uses matrix identities:

  • ∂L/∂X = ∂L/∂Y · W^T
  • ∂L/∂W = X^T · ∂L/∂Y

  • These formulas implement the ‘swap’ property of multiplication and avoid forming the full Jacobian of size (NM) × (ND).
  • Leveraging these matrix identities enables scalable training with common mini-batch sizes and feature dimensions, and forms the basis for efficient implementations in numerical libraries and deep learning frameworks.
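
A NumPy sketch of these two identities with arbitrary shapes and a random stand-in for the upstream gradient ∂L/∂Y:

  import numpy as np

  N, D, M = 64, 256, 10
  X = np.random.randn(N, D)
  W = np.random.randn(D, M)

  Y = X @ W                    # forward pass, shape (N, M)
  dY = np.random.randn(N, M)   # stand-in for the upstream gradient dL/dY

  dX = dY @ W.T                # dL/dX, shape (N, D)
  dW = X.T @ dY                # dL/dW, shape (D, M)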