CS336 Lecture 15 - Alignment - SFT/RLHF
- Lecture overview and objectives
- Motivation for post-training and the ChatGPT transition
- InstructGPT three-step pipeline and lecture structure
- Importance of post-training data quality and collection considerations
- FLAN dataset paradigm: aggregating existing NLP datasets
- OpenAssistant and Alpaca paradigms for instruction data
- Concrete examples from FLAN and implications for instruction tuning
- Alpaca generation pipeline and response style differences
- Crowdsourcing interactive exercise and annotator response patterns
- Annotator difficulties and the incentives for AI-generated feedback
- Evaluation trade-offs: chat-style judgments versus benchmark metrics
- Risk of teaching hallucination via rich SFT data and the citation failure mode
- Mitigation perspective: abstention, tools, and reinforcement approaches
- Counterintuitive nature of ‘high-quality’ instruction data
- Safety tuning and refusal trade-offs
- Scaling instruction tuning via mid-training data mixing
- Mini-CPM example of two-stage training and data mix composition
- Practical implication: the ambiguous meaning of ‘base models’
- Concept: thinking tokens and adaptive pre-training limitations
- STAR and Quiet-STAR-style approaches for training internal thought processes
- Style imprinting and token-pattern learning in SFT (emoji example)
- Instruction tuning and knowledge acquisition limits
- Transition to reinforcement learning from human feedback (RLHF) conceptual framing
- Two primary rationales for RLHF: data cost and generator-validator asymmetry
- RLHF pipeline: pairwise feedback, reward modeling, and policy optimization
- Annotation guidelines and practicalities for pairwise preference collection
- Interactive pairwise evaluation exercise outcomes and empirical difficulty
- Ethical, demographic, and bias considerations in RLHF annotation
- AI-generated feedback adoption, datasets, and length confounds
- On-policy versus off-policy preference data and prompt sourcing
- Capabilities of self-improvement loops and practical limits
- Formal RLHF objective and reward modeling (InstructGPT equation)
- Policy optimization via policy gradients and PPO overview
- Direct Preference Optimization (DPO) derivation and practical advantages
Lecture overview and objectives
This segment outlines the scope and schedule of the lecture series, highlighting a transition from pre-training to post-training techniques with a primary focus on reinforcement learning from human feedback (RLHF) and safety/alignment.
The immediate goals are defined as:
- Convert large pre-trained language models into useful, controllable, and safe instruction-following systems.
- Preview follow-up material on verifiable reward-based training.
This lecture is positioned within a two-part post-training module and frames the central problem statement that subsequent sections address:
- How to transform a general pre-trained model into a deployable instruction-following system.
- How to add guard rails and achieve product-quality behavior.
Motivation for post-training and the ChatGPT transition
This segment motivates post-training by contrasting large-scale pre-training (e.g., GPT-3) with product-ready instruction-following systems (e.g., ChatGPT).
Key points:
- Pre-training encodes many capabilities distributed across parameters, but it does not reliably produce instruction-following behavior without targeted post-training.
- Product constraints like usability, toxicity control, and guard rails make post-training practically essential.
The section highlights broader considerations that drive post-training work:
- Societal impact of deployed models.
- Research questions on data collection, algorithms, and scalability required to make pre-trained models safe and useful in real products.
InstructGPT three-step pipeline and lecture structure
This segment introduces the InstructGPT three-step post-training pipeline as the organizing framework:
- Supervised fine-tuning (SFT) on demonstrations.
- Reward modeling from pairwise feedback.
- Reinforcement learning for policy optimization.
The lecture will follow this structure, covering:
- SFT (leftmost block) first.
- Then pairwise feedback and RLHF.
- Finally practical and algorithmic details.
It sets up two axes of concern for SFT:
- The nature and cost of training data.
- The adaptation method (e.g., gradient-based fine-tuning vs. alternative approaches).
Importance of post-training data quality and collection considerations
This segment argues that post-training data matters more than ever due to small-data, high-leverage effects: noisy instruction-tuning data can demonstrably harm model behavior, and desirable behaviors must be encoded precisely.
Practical concerns for data collection include:
- Annotator expertise and variability.
- Annotation cost and incentives.
- Style and length variability across examples.
- Safety coverage and handling edge cases.
- Trade-offs between quantity and per-example quality.
The segment positions these considerations as prerequisites for selecting downstream algorithmic approaches, noting that some data types (e.g., expert demonstrations, pairwise feedback) are more directly usable than others.
FLAN dataset paradigm: aggregating existing NLP datasets
This segment documents the FLAN paradigm: many standard supervised NLP datasets (QA, classification, NLI, etc.) are reformatted and aggregated into a single instruction-style corpus for instruction tuning.
How FLAN works and why it’s used:
- It leverages existing labeled benchmarks (e.g., Natural Instructions, adversarial QA) to provide broad task coverage at scale.
- Requires minimal human authoring of new chat-style examples since it reformats existing datasets.
Strengths and weaknesses:
- Strengths: scale, task diversity.
- Weaknesses: mechanical reformatting can produce unnatural prompt-response distributions that differ from real chat interactions and introduce distributional artifacts.
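For concreteness, here is a minimal sketch of this kind of reformatting, assuming an NLI-style labeled example; the template wording, field names, and label mapping are illustrative rather than FLAN's actual templates.

```python
# Minimal sketch: turning a labeled NLI example into an instruction-style record.
# Template wording, field names, and the label mapping are illustrative only.

def to_instruction_example(premise: str, hypothesis: str, label: str) -> dict:
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    label_map = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}
    return {"instruction": prompt, "response": label_map[label]}

example = to_instruction_example(
    premise="A dog is running through a field.",
    hypothesis="An animal is outdoors.",
    label="entailment",
)
print(example["response"])  # "yes" -- the short, task-centric target typical of reformatted benchmarks
```

The one-word target illustrates the artifact discussed above: broad task coverage at scale, but responses that look nothing like what a chat assistant would produce.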
OpenAssistant and Alpaca paradigms for instruction data
This segment contrasts two other paradigms: OpenAssistant and Alpaca.
OpenAssistant:
- A crowdsourced, human-authored instruction dataset.
- Produces long, detailed responses and often includes citations.
- Emphasizes human quality and richness.
Alpaca:
- Starts from a small seed of human-written instructions that are expanded by a language model, with responses then filled in by a stronger model (a distillation-style process).
- Emphasizes cheap synthetic instruction data that closely matches conversational inputs.
Trade-offs and differences:
- Diversity and authoring cost differ dramatically.
- Artifact profiles vary by paradigm: response length, citation behavior, and formatting style are distinct across them.
Concrete examples from FLAN and implications for instruction tuning
This segment inspects representative FLAN examples (article highlights, multiple-choice classification, email subject generation, E2E text generation) to show how reformatted benchmark data appears when repurposed for instruction tuning.
Key observations:
- Much of FLAN-style data is task-centric, often brief, and visibly constructed.
- This brevity and artificial construction can create a mismatch with real chat-style user interactions.
Conclusion: aggregated benchmark data provides scale but introduces distributional artifacts that can bias instruction-following behavior and style.
Alpaca generation pipeline and response style differences
This segment describes the Alpaca pipeline in detail:
- Start with a set of human-written instructions.
- Use a language model to generate augmented instructions.
- Use an instruction-capable model (e.g., InstructGPT) to produce long-form completions.
Characteristics and trade-offs:
- Alpaca-style data tends to produce conversational, long-form, chat-like inputs and outputs compared to terse FLAN outputs.
- Generated data can be less diverse in instruction forms and may preserve biases or limitations of the generator.
Emphasis: there is a trade-off between human-authored richness and model-generated scale and uniformity.
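A minimal sketch of this kind of pipeline follows; `generate` is a hypothetical placeholder for a call to a strong instruction-following model, and the seed instructions and prompt wording are made up for illustration.

```python
# Sketch of a self-instruct / Alpaca-style data generation loop.
# `generate(prompt)` is a hypothetical stand-in for a call to a strong LM, not a real API.
import random

def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LM of choice")

seed_instructions = [
    "Explain why the sky is blue.",
    "Write a short poem about autumn.",
    "Summarize the causes of World War I.",
]

def expand_instructions(seeds, n_new=5):
    # Prompt the model with a few seed instructions and ask it for novel ones.
    few_shot = "\n".join(f"- {s}" for s in random.sample(seeds, k=min(3, len(seeds))))
    prompt = f"Here are some task instructions:\n{few_shot}\nWrite {n_new} new, diverse instructions."
    return generate(prompt).splitlines()

def fill_responses(instructions):
    # Have the stronger model produce long-form completions (the distillation step).
    return [{"instruction": ins, "response": generate(ins)} for ins in instructions]

# dataset = fill_responses(expand_instructions(seed_instructions))
```

The design point is that both the new instructions and all responses come from the generator model, which yields uniform, chat-like style cheaply while inheriting that model's biases and blind spots.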
Crowdsourcing interactive exercise and annotator response patterns
This segment reports on a crowdsourcing-style interactive exercise where participants author instruction-following responses, illustrating typical annotator behaviors and quality differences.
Empirical findings:
- Many annotators produce short, low-effort responses under time pressure.
- Obtaining high-quality, long-form human responses is costly and difficult to scale.
Common artifacts in crowd data:
- Emoji insertion, templated answers, shallow factuality checking.
- The practical challenge of eliciting consistent, detailed demonstrations at scale is substantial.
Annotator difficulties and the incentives for AI-generated feedback
This segment analyzes why annotators often produce short or low-quality responses: time constraints, lack of motivation, and task unfamiliarity.
Empirical observation: human and AI evaluators tend to prefer longer, more detailed outputs ~60–70% of the time, which introduces bias toward verbosity if evaluation or reward correlates with length.
This partly explains the rise of AI-generated feedback: models like GPT-4 produce long, polished outputs at low marginal cost, making synthetic supervision attractive to practitioners.
Evaluation trade-offs: chat-style judgments versus benchmark metrics
This segment compares open-ended chat-style evaluation methods (user engagement, chatbot arena, AlpacaEval) with traditional benchmark evaluations (e.g., MMLU).
Trade-offs:
- Chat-style evaluations capture user-facing quality and preference, but introduce confounders such as output length and style.
- Benchmarks provide stable, task-specific fidelity metrics that are less sensitive to surface stylistic changes.
Recommendation: use a diverse evaluation suite combining interactive preference tests and standardized benchmarks to mitigate evaluation biases during post-training.
Risk of teaching hallucination via rich SFT data and the citation failure mode
This segment articulates a key failure mode: when SFT presents high-quality, citation-rich targets that the base model does not already know, the model can learn superficial structural patterns (e.g., appending a citation) rather than correct factual retrieval.
Two simultaneous effects of such SFT examples:
- When the model already encodes the fact, SFT reinforces correct, grounded completions.
- When the model lacks the underlying knowledge, SFT encourages hallucinatory fabrications of supporting details.
This is a distributional mismatch: SFT can incentivize inappropriate surface-level completions if the model cannot reliably ground answers.
Mitigation perspective: abstention, tools, and reinforcement approaches
This segment explores mitigation strategies for hallucination and inappropriate surface behavior:
- Train models to abstain when uncertain.
- Augment models with retrieval/tooling to ground claims.
- Use reinforcement learning methods to selectively teach behaviors the model can reliably perform.
Practical notes:
- Abstention and tool-use require explicit dataset design or tool-augmented training.
- RLHF and reward-based methods can in principle shape models to prefer abstention or verification when factuality is not assured.
Empirical point: models reproduce known facts more readily than they learn new unknown facts through small SFT datasets.
Counterintuitive nature of ‘high-quality’ instruction data
This segment summarizes a counterintuitive conclusion: highly correct and knowledge-rich SFT datasets can nevertheless degrade downstream behavior because they encourage outputs that mimic high-level structures without true grounding.
Why this paradox matters:
- Exposing a model to expert outputs is insufficient if the model lacks the internal representation to support those outputs truthfully.
- Motivates cautious dataset engineering, careful distillation practices, and alignment-aware training.
Recommendation: prefer techniques that allow the model to abstain or request tools rather than forcing plausible-sounding but potentially false completions.
Safety tuning and refusal trade-offs
This segment addresses safety-specific post-training considerations, describing safety tuning as targeted instruction-tuning that mixes refusal, content moderation, and safer completions into training data.
Central trade-off:
- Excessive refusal reduces utility by blocking legitimate, nuanced queries.
- Insufficient refusal allows harmful or toxic outputs to surface.
Evidence and practice:
- Even small curated safety datasets (on the order of hundreds of examples) can meaningfully alter behavior.
- Mixing safety data into broader instruction tuning is a pragmatic mitigation strategy.
Scaling instruction tuning via mid-training data mixing
This segment explains the practice of mixing instruction-tuning or high-quality supervision datasets into later stages of large-scale pre-training (a mid-training or decay-stage mix).
Operational recipe:
- Conduct standard pre-training.
- Transition into a decay stage where the learning rate is annealed.
- Progressively add higher-quality or instructional data into the training mix to integrate instruction-following behavior without catastrophic forgetting.
Discussion: mid-training blending blurs the boundary between a base pre-trained model and a post-trained instruction-tuned model and provides empirical leverage by integrating behavioral data at scale.
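A toy sketch of such a two-stage schedule is below, with made-up dataset names and mixture weights (not the recipe of any particular model):

```python
# Illustrative two-stage data mix; dataset names and weights are invented.
import random

STABLE_STAGE = {
    "web_crawl": 0.70,
    "code":      0.20,
    "papers":    0.10,
}

DECAY_STAGE = {            # learning rate is annealed over this stage
    "web_crawl": 0.40,     # keep some pre-training data to limit forgetting
    "code":      0.15,
    "wikipedia": 0.15,
    "code_sft":  0.10,
    "chat_sft":  0.20,     # instruction/chat data blended in late
}

def sample_source(mix: dict, rng: random.Random) -> str:
    """Pick the dataset to draw the next batch from, proportional to mix weights."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_source(DECAY_STAGE, rng))
```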
Mini-CPM example of two-stage training and data mix composition
This segment uses the mini-CPM example to illustrate concrete data-mix engineering: a primary pre-training phase of common-crawl, code, and large corpora followed by a decay/second stage that injects Wikipedia, code-SFT, chat data, and instruction-adjacent corpora.
Design rationale:
- The decay stage intentionally contains a non-pure mixture (retaining some pre-training data) to avoid catastrophic forgetting.
- This smoothly shifts model behavior toward instruction-following while preserving core capabilities.
Empirical note: this two-stage approach is widely used in practice and is effective at integrating instruction-style behaviors into large transformers.
Practical implication: the ambiguous meaning of ‘base models’
This segment highlights that widespread mid-training and data mixing complicate the notion of a canonical “base model”: many labeled base checkpoints are already influenced by instruction-like data.
Consequences for evaluation and comparison:
- Claims about model capabilities should account for whether the claimed base model includes mid-training instruction mixtures or represents pure pre-training.
- Recommend careful documentation of data mixtures and training stages when reporting model baselines and public releases.
Concept: thinking tokens and adaptive pre-training limitations
This segment discusses an idea to embed ‘thought’ or verification tokens in pre-training to make the model check its knowledge during training, then assesses practical limitations.
Key observations:
- Adaptive schemes that check whether the model knows a fact before applying supervision begin to resemble reinforcement learning because they require conditional updates based on current model state.
- Engineering constraints: static pre-training datasets and batched computation make adaptive per-example decisions difficult at pre-training scale without an RL-style training loop and dynamic data generation.
STAR and Quiet-STAR-style approaches for training internal thought processes
This segment connects internal verification ideas to existing methods such as STAR and Quiet-STAR, which train models to generate intermediate ‘thought’ traces and then reinforce those traces conditionally on final correctness.
Operational pattern:
- Predict a chain-of-thought or internal reasoning sequence.
- Evaluate the outcome.
- Reinforce thought traces that lead to correct outputs.
Interpretation: these methods effectively implement a form of reward-guided supervised learning, akin to early RL-style interventions that shape internal reasoning behaviors.
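A minimal sketch of one such iteration under these assumptions; `sample_with_rationale` and `finetune` are hypothetical stand-ins for model calls:

```python
# STaR-style loop: sample a reasoning trace, keep it only if the final answer is
# correct, then fine-tune on the kept traces.

def sample_with_rationale(model, question):
    """Return (rationale, answer) sampled from the model for the question."""
    raise NotImplementedError

def finetune(model, examples):
    raise NotImplementedError

def star_iteration(model, problems):
    kept = []
    for question, gold_answer in problems:
        rationale, answer = sample_with_rationale(model, question)
        if answer == gold_answer:          # reinforce only traces that reach the right answer
            kept.append({"question": question, "rationale": rationale, "answer": answer})
    return finetune(model, kept)           # supervised update on the filtered traces
```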
Style imprinting and token-pattern learning in SFT (emoji example)
This segment analyzes how instruction tuning reliably transfers output style and token-level signatures—using the emoji example to illustrate that routine post-training tokens get reproduced.
Main takeaways:
- Instruction tuning most reliably teaches the ‘type signature’ or surface style of outputs (formatting, length, token presence).
- Deeper grounded behavior requires sufficiently diverse and content-rich supervision relative to pre-trained knowledge.
The segment emphasizes: surface-style learning is a predictable outcome of SFT, while substantive content learning is scale- and distribution-dependent.
Instruction tuning and knowledge acquisition limits
This segment clarifies that instruction tuning can teach new factual knowledge if it is scaled and diversified sufficiently—especially when integrated via mid-training—but smaller SFT-only routines typically have limited capacity to reliably instill novel world knowledge.
Key distinctions:
- Differences hinge on data scale, diversity, and whether instruction examples are exposed in a pre-training-like mixing regime.
- Mid-training hybrids bridge pre-training knowledge acquisition and SFT behavioral shaping, enabling both content learning and style transfer.
Transition to reinforcement learning from human feedback (RLHF) conceptual framing
This segment sets up the conceptual transition from probabilistic generative modeling to a reward-maximization perspective: instead of estimating a generative reference distribution p*, RLHF treats the LM as a policy π(y|x) to be optimized for expected reward R(x,y).
Formal objective shift:
- Training target becomes any policy that attains high expected reward under a human-aligned reward function.
- This decouples training from the goal of matching a specific data distribution and reframes the problem as policy optimization.
The segment motivates RLHF on conceptual grounds and contrasts distributional imitation with policy optimization.
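In the section's notation, the training target shifts from matching a reference distribution p* to finding any policy π with high expected reward:

```latex
\max_{\pi} \;\; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)} \big[ R(x, y) \big]
```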
Two primary rationales for RLHF: data cost and generator-validator asymmetry
This segment articulates two practical motivations for RLHF:
- Expert demonstrations are expensive to collect at scale.
- Human validators can often judge or compare outputs more cheaply and sometimes more reliably than producing full demonstrations, creating a generator–validator gap.
Empirical evidence: validators sometimes prefer model outputs to their own generations, making preference data an efficient supervisory signal.
The segment frames pairwise preference collection and reward modeling as approaches that leverage cheaper validator comparisons to guide policy optimization.
RLHF pipeline: pairwise feedback, reward modeling, and policy optimization
This segment outlines the standard RLHF pipeline:
- Collect model outputs (rollouts).
- Collect pairwise human preferences comparing outputs.
- Train a reward model to predict scalar scores that explain pairwise judgments (e.g., Bradley–Terry/logistic differences).
- Optimize the policy to increase expected reward, often with a KL penalty to limit divergence.
Practical details:
- Data collection interfaces include side-by-side comparisons and guidelines emphasizing helpfulness/truthfulness/harmlessness.
- Annotation guidelines shape reward model semantics; the reward model operationalizes human preferences into a differentiable scalar objective for RL.
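A minimal sketch of the Bradley–Terry reward-model loss implied by this pipeline, assuming the reward model has already produced scalar scores for each preferred ("chosen") and rejected output in a batch:

```python
# Bradley–Terry / logistic loss for pairwise preferences: the reward model should
# assign the chosen response a higher score than the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen, r_rejected: scalar rewards per preference pair, shape (batch,)."""
    # P(chosen beats rejected) = sigmoid(r_chosen - r_rejected); maximize its log-likelihood.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy check: the loss shrinks as the margin between chosen and rejected grows.
print(reward_model_loss(torch.tensor([2.0]), torch.tensor([0.0])))  # small loss
print(reward_model_loss(torch.tensor([0.0]), torch.tensor([2.0])))  # large loss
```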
Annotation guidelines and practicalities for pairwise preference collection
This segment reviews concrete annotation guidelines used in practice (e.g., InstructGPT, leaked Bard guidelines) that operationalize evaluation into pillars such as helpfulness, truthfulness, and harmlessness, plus style considerations and rating scales.
Practical constraints and sensitivities:
- Annotator time budgets are often around one minute per example.
- Early RLHF efforts used small annotator pools, increasing susceptibility to bias.
- Clear instructions, rater training, and vendor management materially affect the reliability and systematic biases of preference data.
Interactive pairwise evaluation exercise outcomes and empirical difficulty
This segment reports on an in-class pairwise evaluation exercise, documenting empirical challenges:
- Limited time drastically reduces factual checking.
- Annotators often choose longer but hallucinated responses.
- Distinguishing correctness requires decomposing outputs into verifiable claims, which is labor-intensive.
Conclusion: large-scale, high-quality pairwise annotation is hard, expensive, and susceptible to annotator shortcuts (e.g., copying model outputs into comparator tools).
Ethical, demographic, and bias considerations in RLHF annotation
This segment discusses ethical and bias considerations in RLHF data collection, including annotator compensation, demographic composition effects, and cultural/regional value alignment shifts induced by annotator pools.
Empirical evidence: annotator nationality and background can systematically alter model preferences (alignment shifts correlated with annotator locations), and different annotator cohorts sometimes prioritize format over factuality.
Recommendations: careful annotator selection, fair payment practices, and transparency about annotator demographics to mitigate unwanted alignment shifts.
AI-generated feedback adoption, datasets, and length confounds
This segment surveys the adoption of AI-generated feedback for preference labeling (e.g., using GPT-4 or other LMs as cheaper raters), noting strong empirical agreement between AI judges and humans and cost advantages of synthetic raters.
Examples and strategies:
- Projects that heavily use LM-generated preference labels include UltraFeedback, Tulu3.
- Constitutional AI is cited as a rule-based synthetic supervision strategy.
Warning: a dominant confound is that both human preferences and automated judges correlate with output length, causing reward signals to favor verbosity unless explicitly controlled.
On-policy versus off-policy preference data and prompt sourcing
This segment defines on-policy preference data as comparisons sampled from the model being trained and off-policy data as comparisons drawn from other models or pre-existing corpora.
Roles and trade-offs:
- Off-policy data help characterize the broader landscape of possible outputs and can bootstrap reward models.
- On-policy data provide targeted feedback for local policy improvement.
Prompt sourcing note: using prompts with known answers or existing ground truth helps in some domains (e.g., math) but does not generalize to many open-ended tasks where correctness is subjective or hard to adjudicate.
Capabilities of self-improvement loops and practical limits
This segment examines self-improvement strategies where models generate outputs and then judge or refine them iteratively, exploiting latent pre-training information via clever prompting and self-refinement.
Key caveats:
- Models theoretically contain vast pre-training information, but empirical upper bounds and scaling behavior are uncertain.
- Effectiveness depends heavily on prompt engineering, model size, and the nature of the self-critique loop.
Framing: self-improvement is both an information-theoretic and empirical question; practical effectiveness varies across tasks.
Formal RLHF objective and reward modeling (InstructGPT equation)
This segment presents the formal RLHF optimization objective popularized by InstructGPT: maximize expected reward under a learned reward model while constraining the policy with a KL divergence term relative to a reference policy (often the SFT policy).
Objective components:
- Expected reward term: E_{x∼D, y∼π}[R(x,y)].
- KL penalty: β · KL(π ‖ π_ref), which prevents excessive deviation from the reference policy and catastrophic forgetting.
- The reward model is learned from pairwise comparisons via a Bradley–Terry/logistic observation model mapping comparative judgments to scalar utilities.
Additional regularization may be added to maintain desirable pre-training behaviors.
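Written out in the section's notation (R the learned reward model, π_ref the SFT/reference policy), the KL-penalized objective with the optional pre-training regularizer weighted by γ takes roughly the form:

```latex
\max_{\pi} \;\;
\mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)}
\Big[ R(x, y) \;-\; \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \Big]
\;+\; \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}} \big[ \log \pi(x) \big]
```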
Policy optimization via policy gradients and PPO overview
This segment introduces policy-gradient-based methods used in RLHF (e.g., the PPO family), deriving gradients for expected reward and describing variance-reduction via advantage estimates and baseline subtraction.
Practical algorithmic elements:
- Collect rollouts from the current policy.
- Compute advantages (reward minus baseline).
- Apply importance-weighted gradient updates for off-policy corrections if needed.
- Constrain updates using ratio clipping or KL penalties to stabilize training.
Note: PPO-style clipping replaces explicit constrained optimization with a practical surrogate that limits policy update magnitudes.
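A minimal sketch of that clipped surrogate, assuming per-sample log-probabilities and advantage estimates have already been computed; the names and the default clip range are illustrative:

```python
# PPO-style clipped surrogate loss for one batch of samples.
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                         # importance weight pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize surrogate = minimize its negative
```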
Direct Preference Optimization (DPO) derivation and practical advantages
This segment derives Direct Preference Optimization (DPO) by making a nonparametric mapping between policy and implied reward, using the Bradley–Terry model for pairwise comparisons, and converting the RLHF objective into a supervised, maximum-likelihood-like loss over preferences.
Core steps and rationale:
- Assume the optimal policy satisfies π*(y|x) ∝ π_ref(y|x) exp(α R(x,y)).
- Invert this relationship to express R in terms of π and π_ref.
- Optimize π directly by minimizing a preference-based likelihood without an explicit learned reward model or on-policy importance correction.
Pragmatic benefits of DPO:
- Removes the explicit reward model.
- Offers simpler training dynamics.
- Empirically effective as a less complex alternative to PPO-style RLHF.
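A minimal sketch of the resulting preference loss, assuming per-sequence log-probabilities under the trained policy and the frozen reference policy are already available; the coefficient `beta` plays the role of the α/KL coefficient in the derivation above:

```python
# DPO loss over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratio = logp_chosen - ref_logp_chosen          # implied reward of the chosen response
    rejected_logratio = logp_rejected - ref_logp_rejected    # implied reward of the rejected response
    margin = beta * (chosen_logratio - rejected_logratio)    # Bradley–Terry logit
    return -F.logsigmoid(margin).mean()
```

Because everything reduces to log-probabilities of fixed preference pairs, no rollouts or separate reward model are needed, which is the practical advantage listed above.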