Lecture scope and objectives

The lecture presents a systematic examination of transformer architectures and training practices, emphasizing practical choices over introductory theory.

It frames the investigation as a data-driven analysis of many recent large language model releases to extract convergent design patterns and stability interventions that practitioners use when training very large models.

The intended outcome is concrete guidance on which components and hyperparameters consistently work well in production-scale training and inference contexts.

  • Covers three major focus areas:
    • Architecture variants — how model building blocks are arranged and modified
    • Hyperparameter selection — which numeric settings matter for stability and performance
    • Attention / implementation variants — inference-oriented choices that affect latency and memory

This framing prioritizes actionable takeaways for practitioners building or reproducing large models.


Canonical transformer components and modern variation points

A standard transformer contains several core components: token embeddings (with positional information), multi-head self-attention, layer normalization, residual connections, feed-forward MLP blocks, and a final softmax output.
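
To make the component inventory concrete, here is a compact decoder-style skeleton written against PyTorch; the class names, sizes, and the normalization placement inside the block are illustrative assumptions rather than a reference implementation from the lecture.

  # Compact decoder-style skeleton of the components listed above, written
  # against PyTorch. Class names, sizes, and the normalization placement are
  # illustrative assumptions, not a reference implementation from the lecture.
  import torch
  import torch.nn as nn

  class TinyBlock(nn.Module):
      def __init__(self, d_model: int, n_heads: int):
          super().__init__()
          self.norm1 = nn.LayerNorm(d_model)
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm2 = nn.LayerNorm(d_model)
          self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))

      def forward(self, x, causal_mask):
          h = self.norm1(x)                                   # layer normalization
          a, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
          x = x + a                                           # residual connection
          x = x + self.mlp(self.norm2(x))                     # feed-forward MLP block
          return x

  class TinyLM(nn.Module):
      def __init__(self, vocab=32000, d_model=512, n_heads=8, n_layers=4, max_len=2048):
          super().__init__()
          self.tok_emb = nn.Embedding(vocab, d_model)         # token embeddings
          self.pos_emb = nn.Embedding(max_len, d_model)       # learned absolute positions
          self.blocks = nn.ModuleList([TinyBlock(d_model, n_heads) for _ in range(n_layers)])
          self.final_norm = nn.LayerNorm(d_model)
          self.lm_head = nn.Linear(d_model, vocab)

      def forward(self, ids):
          T = ids.shape[1]
          mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=ids.device), 1)
          x = self.tok_emb(ids) + self.pos_emb(torch.arange(T, device=ids.device))
          for block in self.blocks:
              x = block(x, mask)
          return self.lm_head(self.final_norm(x)).softmax(dim=-1)   # final softmax output

Each of the variation points discussed next changes one of these pieces (the norms, the position information, the MLP internals, or the bias terms) while keeping this overall structure.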

Modern variants preserve the core attention + MLP structure while altering many pieces. Key variation points include:

  • Layer normalization location/type — pre-norm vs post-norm placement, and alternatives like RMSNorm
  • Positional encodings — learned absolute, relative, or methods like Rotary Positional Embeddings (RoPE)
  • MLP internals — activation and gating choices (e.g., GeLU, SwiGLU) and expansion ratios
  • Bias usage — presence or absence of bias terms in linear projections
  • Sublayer arrangement — serial versus parallel composition of the attention and MLP blocks

Understanding these variation points lets you map specific model instantiations onto a common design space and reason about trade-offs in expressiveness, optimization behavior, and runtime/memory characteristics that affect both training and deployment.
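
For the MLP-internals variation point above, here is a minimal SwiGLU feed-forward sketch in PyTorch, with bias terms removed as many recent implementations do; the roughly 8/3 expansion shown in the usage line is a common convention and an assumption here, not a rule.

  # Minimal SwiGLU feed-forward sketch (assumed PyTorch). Bias terms are omitted
  # and the layer names are illustrative.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SwiGLUMLP(nn.Module):
      def __init__(self, d_model: int, d_ff: int):
          super().__init__()
          self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gating projection
          self.w_up = nn.Linear(d_model, d_ff, bias=False)     # linear (value) projection
          self.w_down = nn.Linear(d_ff, d_model, bias=False)   # output projection

      def forward(self, x):
          # Elementwise gate: SiLU(W_gate x) multiplies the linear path W_up x.
          return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

  # d_ff near (8/3) * d_model keeps the parameter count of the three-matrix gated
  # block close to a conventional 4x two-matrix MLP (a convention, not a rule).
  mlp = SwiGLUMLP(d_model=512, d_ff=1408)   # 1408 ≈ (8/3) * 512, rounded up to a multiple of 64
  y = mlp(torch.randn(2, 16, 512))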


Course implementation choices vs original transformer

Contemporary training codebases typically adopt pragmatic modifications relative to the original Vaswani et al. transformer. These changes are motivated by empirical stability and performance gains at scale.

Common engineering choices include:

  • Pre-layer normalization (pre-norm) — improves optimization stability and reduces the need for complex warm-ups
  • Rotary Position Embeddings (RoPE) — supplies relative position information that often extrapolates to longer contexts
  • Gated MLP variants (e.g., SwiGLU) — gated activations that tend to lower loss and speed convergence
  • Removal of bias terms in many linear layers — reduces memory movement and can improve training stability

Taken together, these practices form a de facto modern transformer variant encountered in many production implementations. Treating them as canonical implementation defaults helps align training recipes with current large-scale practice and supports reproducibility.
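
For the RoPE bullet above, a compact sketch of the rotation applied to query and key vectors (assumed PyTorch); the half-split pairing of dimensions and the base of 10000 follow a common formulation and should be read as assumptions rather than this course's exact code.

  # Compact Rotary Positional Embedding (RoPE) sketch in PyTorch. Applied to
  # queries and keys before attention scores are computed, so relative offsets
  # fall out of the rotated dot products.
  import torch

  def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
      # x: (batch, seq_len, n_heads, head_dim) with an even head_dim.
      seq_len, head_dim = x.shape[1], x.shape[-1]
      half = head_dim // 2
      # Per-pair frequencies and per-position rotation angles.
      freqs = base ** (-torch.arange(half, device=x.device).float() / half)
      angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs[None, :]
      cos = angles.cos()[None, :, None, :]    # (1, seq_len, 1, half)
      sin = angles.sin()[None, :, None, :]
      x1, x2 = x[..., :half], x[..., half:]   # rotate (x1_i, x2_i) as 2-D pairs
      return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

  q = apply_rope(torch.randn(1, 128, 8, 64))   # same call for keys: k = apply_rope(k)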


Survey of recent model evolution and convergent design trends

An empirical survey of dense LLM releases (2017–2025) shows many small architectural tweaks, but also a convergent evolution toward a subset of design decisions across independent teams.

Observed convergences include:

  • Broad adoption of Rotary Positional Embeddings (RoPE) since ~2023
  • Widespread use of pre-norm layer normalization for stability
  • Increasing adoption of gated linear units for MLPs (e.g., SwiGLU-style blocks)
  • A trend toward removing bias parameters and substituting RMSNorm for LayerNorm in some families

Cataloguing these patterns across many models yields a valuable dataset for extracting robust heuristics and engineering defaults. The survey approach helps practitioners prioritize changes that repeatedly correlate with stable training and strong downstream performance.
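
For the RMSNorm substitution noted above, a minimal sketch (assumed PyTorch): relative to LayerNorm it drops the mean subtraction and the bias, keeping only a learned per-dimension scale.

  # Minimal RMSNorm sketch (assumed PyTorch); no mean subtraction and no bias.
  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      def __init__(self, dim: int, eps: float = 1e-6):
          super().__init__()
          self.eps = eps
          self.weight = nn.Parameter(torch.ones(dim))   # learned per-dimension scale

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # Rescale each vector by the reciprocal of its root mean square.
          rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
          return x * rms * self.weight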


Lecture structure: architecture, hyperparameters, attention variants

The material is organized into three major sections that follow the lifecycle from model design through inference optimization:

  1. Architecture variations
    • Activations, feed-forward designs, attention mechanisms, and positional encodings
  2. Hyperparameter selection
    • Hidden sizes, MLP expansion ratios, head dimensions, vocabulary sizes, and model aspect ratios
  3. Attention / inference variants
    • KV caching, multi-query / group-query attention, and long-context strategies

This structure separates design concerns by stage (construction → training → inference) and enables targeted discussion of stability and systems interactions at each step.

The emphasis is practical: highlight empirical evidence for defaults that are robust across multiple model families and scales, while calling out areas where research is still active and choices remain heterogeneous.
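
As a shape-level preview of the inference topics in the third section, here is a small KV-cache sketch with grouped-query attention in PyTorch; the head counts, grouping factor, and function names are illustrative assumptions.

  # Shape-level sketch of a KV cache with grouped-query attention (GQA), written
  # in PyTorch. Head counts, the grouping factor, and all names are assumptions.
  import torch

  N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 8, 2, 64    # 4 query heads share each KV head

  def decode_step(q, new_k, new_v, cache_k, cache_v):
      # Append this step's key/value to the cache so earlier tokens are not recomputed.
      cache_k = torch.cat([cache_k, new_k], dim=1)   # (batch, t, n_kv_heads, head_dim)
      cache_v = torch.cat([cache_v, new_v], dim=1)
      # Expand KV heads so each group of query heads reads the same cached entries.
      k = cache_k.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=2)
      v = cache_v.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=2)
      scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / HEAD_DIM ** 0.5
      out = torch.einsum("bhqk,bkhd->bqhd", scores.softmax(dim=-1), v)
      return out, cache_k, cache_v

  # One decoding step: a single new token's query plus its (smaller) key/value heads.
  cache_k = torch.empty(1, 0, N_KV_HEADS, HEAD_DIM)
  cache_v = torch.empty(1, 0, N_KV_HEADS, HEAD_DIM)
  q = torch.randn(1, 1, N_Q_HEADS, HEAD_DIM)
  out, cache_k, cache_v = decode_step(q, torch.randn(1, 1, N_KV_HEADS, HEAD_DIM),
                                      torch.randn(1, 1, N_KV_HEADS, HEAD_DIM),
                                      cache_k, cache_v)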


Pre-norm versus post-norm layer normalization

Pre-layer normalization (pre-norm) places normalization before each residual sub-block and is the dominant pattern in modern transformer training.

Why it’s preferred:

  • Improves optimization stability, reducing large loss spikes and extreme gradient norms
  • Lessens the need for complex learning-rate warm-up schedules that post‑norm variants historically required
  • Enables easier scaling to many layers and larger parameter counts, which supports training of contemporary LLMs

Empirical comparisons indicate steadier gradients and fewer divergence incidents with pre-norm versus post-norm in deep stacks. While exceptions exist for specialized architectures or research probes, pre-norm should be the default for large transformer training unless a targeted study justifies an alternative.
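
A minimal contrast of the two placements, assuming PyTorch, with sublayer standing in for either the attention or the MLP block:

  # Post-norm vs pre-norm residual steps (assumed PyTorch helper functions).
  import torch
  import torch.nn as nn

  def post_norm_step(x, sublayer, norm):
      # Original placement: normalize after the residual addition, so gradients
      # must pass through every normalization on the residual path.
      return norm(x + sublayer(x))

  def pre_norm_step(x, sublayer, norm):
      # Modern placement: normalize the sub-block input and keep the residual
      # path as a clean identity, which tends to keep gradient norms well behaved.
      return x + sublayer(norm(x))

  d_model = 64
  x = torch.randn(2, 16, d_model)
  mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                      nn.Linear(4 * d_model, d_model))
  y = pre_norm_step(x, mlp, nn.LayerNorm(d_model))

A full pre-norm block applies this step twice (attention, then MLP), and many implementations add one final normalization before the output projection.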