Karpathy Series - Intro to Large Language Models
- Large language models are distributed as a parameter file plus executable code and can run locally for inference
- Obtaining model parameters requires a large-scale training run that compresses training corpora into the model weights
- Large language models are trained as next-token predictors and thereby internalize knowledge that enables diverse behaviors
- Transformer architectures implement the computations of modern large language models but the internal representations remain difficult to fully interpret
- Production-quality conversational assistants are produced by fine-tuning pretrained base models on curated Q&A datasets
- Human comparison labels and reinforcement learning from human feedback (RLHF) provide an additional alignment stage
- Benchmark leaderboards reveal a two-tier ecosystem of closed proprietary models and open-weight models
- Scaling laws describe predictable improvements from increasing model size and training data
- Tool use augments language models by giving them access to external computation, browsing, and visualization
- Multimodality—image and audio understanding/generation—extends model capabilities beyond text and enables new applications
- System 1 versus system 2 framing captures the goal of trading inference time for more deliberate, reliable reasoning
- Self-improvement via reinforcement and sandboxed evaluation is possible in closed environments but remains challenging for general language tasks
- Customization enables specialized expert models via retrieval, custom instructions, and fine-tuning
- Large language models can be conceived as kernel processes of an emerging ‘LLM operating system’ that orchestrates memory, tools, and agents
- LLM deployments introduce a novel security attack surface including jailbreaks, adversarial encodings, and adversarial artifacts in images
- Prompt injection attacks hijack model behavior by embedding instructions within retrieved documents or media and can enable phishing and data exfiltration
- Data poisoning and backdoor trigger phrases can corrupt models during training or fine-tuning, producing conditional failures
Large language models are distributed as a parameter file plus executable code and can run locally for inference
A large language model typically consists of two artifacts: a parameters file containing the trained weights and a runtime implementation that performs the forward pass.
- Parameters file
- Stores the numeric coefficients of the neural network (commonly float16).
- Can be very large — for example, a 70B-parameter model stored as 2 bytes per parameter yields on the order of 140 GB.
- Runtime implementation
- Can be a small, dependency-free program (for example a compact C binary) that loads the parameters, implements the model architecture, and executes autoregressive sampling to produce text.
Because the architecture and weights are self-contained, inference can be performed offline on a capable workstation without network connectivity.
However, larger models impose heavier CPU/GPU and memory requirements — practical consumer usage commonly employs smaller variants of the model to achieve interactive latency.
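To make the two-artifact picture concrete, here is a minimal, illustrative sketch of what a runtime does: load a set of parameters and sample tokens autoregressively. The "model" here is a toy bigram logit table rather than a real Transformer, and the vocabulary and parameter values are invented for the example.

```python
import numpy as np

# Toy stand-in for a parameters file: a bigram logit table over a tiny vocabulary.
# A real runtime would load billions of float16 weights and run a Transformer forward pass.
vocab = ["<s>", "the", "cat", "sat", "on", "mat", "."]
rng = np.random.default_rng(0)
params = rng.normal(size=(len(vocab), len(vocab)))  # params[i, j] ~ logit of token j following token i

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def generate(num_tokens=10):
    """Autoregressive sampling loop: feed each sampled token back in as context."""
    token = 0  # start-token index
    out = []
    for _ in range(num_tokens):
        probs = softmax(params[token])           # forward pass (here: a table lookup)
        token = rng.choice(len(vocab), p=probs)  # sample the next token
        out.append(vocab[token])
    return " ".join(out)

print(generate())
```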
Obtaining model parameters requires a large-scale training run that compresses training corpora into the model weights
Model training is an expensive, large-scale optimization process that ingests massive text corpora and produces the parameter file used for inference.
- Data and compute scale
- Training state-of-the-art models involves collecting massive raw web crawls and filtering them down to pretraining datasets on the order of multiple terabytes of text.
- It requires provisioning very large GPU clusters (often tens of thousands of accelerators at the highest scale) and running distributed optimization for days or weeks at multimillion-dollar cost.
- Nature of trained parameters
- The trained parameters function as a lossy compression of the training distribution: they retain statistical structure and knowledge useful for generation but do not store exact copies of source documents.
- Industry practice
- Current practice often scales both model size (parameters) and dataset size according to predictable scaling laws to improve next-token prediction accuracy, driving large compute investments in addition to algorithmic work.
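As a rough back-of-envelope illustration of why such runs are expensive, the sketch below uses the common heuristic that training a dense Transformer costs roughly 6 FLOPs per parameter per training token. The specific parameter count, token count, cluster size, and per-GPU throughput are illustrative assumptions, not figures from the source.

```python
# Back-of-envelope training-compute estimate (all numbers are illustrative assumptions).
params = 70e9                 # model parameters (assumed)
tokens = 2e12                 # training tokens (assumed)
flops = 6 * params * tokens   # ~6 FLOPs per parameter per token (common heuristic)

gpu_flops = 300e12            # assumed sustained throughput per accelerator, FLOP/s
num_gpus = 6000               # assumed cluster size
seconds = flops / (gpu_flops * num_gpus)

print(f"total compute: {flops:.2e} FLOPs")
print(f"wall-clock at assumed throughput: {seconds / 86400:.1f} days")
```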
Large language models are trained as next-token predictors and thereby internalize knowledge that enables diverse behaviors
The fundamental pretraining objective is next-token prediction: given a sequence of tokens the model estimates a probability distribution over the next token and is optimized to minimize prediction error.
- Consequences of the objective
- This objective forces the model to capture syntactic and semantic regularities across vast corpora, so the resulting parameters encode substantial factual and procedural knowledge even though the task is purely predictive.
- Inference behavior and risks
- At inference time the model samples tokens autoregressively, producing fluent text that can resemble training documents but may hallucinate fabricated facts (for example invented ISBNs or plausible-sounding product metadata).
- Why models can perform many tasks
- The predictive/compression equivalence explains why next-token training yields models that can answer questions, provide factual descriptions, and perform many downstream tasks after appropriate adaptation or prompting.
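A minimal sketch of the pretraining objective described above: at each position, the model's predicted distribution over the next token is scored with cross-entropy against the token that actually follows. The toy logits and targets below are invented for the example; only the loss formula is the point.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, targets):
    """Average cross-entropy of predicted next-token distributions vs. actual next tokens."""
    probs = softmax(logits)                              # (positions, vocab)
    picked = probs[np.arange(len(targets)), targets]     # probability assigned to the true next token
    return -np.log(picked).mean()

# Toy example: vocabulary of 5 tokens, sequence of 4 positions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))   # model outputs at each position (assumed)
targets = np.array([2, 0, 4, 1])   # the token that actually came next at each position
print(f"next-token cross-entropy: {next_token_loss(logits, targets):.3f}")
```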
Transformer architectures implement the computations of modern large language models but the internal representations remain difficult to fully interpret
Modern large language models are implemented with the Transformer neural network architecture, whose mathematical operations and forward-pass computations are well specified.
- Parameter distribution and structure
- Billions to hundreds of billions of parameters are distributed across attention layers, feedforward networks, and embeddings (a minimal attention sketch appears after this list).
- Optimization adjusts these parameters to reduce next-token loss but does not yield a human-interpretable decomposition of function.
- Empirical nature and interpretability
- Consequently, models are largely empirical artifacts: their behavior is best characterized by input/output evaluation rather than full mechanistic understanding.
- This motivates the subfield of mechanistic interpretability, which attempts to map parameter substructures to functional capabilities.
- Deployment implication
- The relative inscrutability means rigorous, scenario-specific evaluation suites and empirical safety testing are essential when deploying these systems.
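As referenced above, here is a minimal, illustrative sketch of the core computation inside a Transformer layer: single-head scaled dot-product self-attention with a causal mask, in NumPy. Real models add multiple heads, feedforward blocks, normalization, and residual connections; the shapes and random matrices here are assumptions for the example.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (seq_len, d_model) token representations
    Wq, Wk, Wv: (d_model, d_head) projection matrices (learned parameters in a real model)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])              # pairwise attention logits
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -1e9                                  # causal mask: no attending to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ v                                   # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
out = causal_self_attention(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)  # (5, 4)
```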
Production-quality conversational assistants are produced by fine-tuning pretrained base models on curated Q&A datasets
A typical two-stage training paradigm separates broad knowledge acquisition from alignment to desired behaviors.
- Pretraining
- Performed on broad, noisy web-scale corpora to build a knowledge-rich base model.
- Fine-tuning
- Uses smaller, much higher-quality datasets consisting of labeled examples (e.g., user prompts paired with ideal assistant responses).
- Examples are generated following detailed labeling instructions and often rely on trained annotators.
- Fine-tuning preserves factual knowledge from pretraining while shifting the model’s output distribution toward helpful, formatted, and policy-compliant assistant responses.
- It is computationally far cheaper and faster than pretraining.
- Iterative deployment cycle
- Includes monitoring, collecting failure cases, and incorporating corrected responses into subsequent fine-tuning cycles to iteratively improve behavior.
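As an illustrative sketch of what the fine-tuning data described above can look like, below is a hypothetical prompt/response record and the way it might be flattened into a single training sequence. The field names and chat template are assumptions for the example, not a specific vendor's format.

```python
import json

# Hypothetical fine-tuning record: a user prompt paired with an ideal assistant response.
record = {
    "prompt": "Explain what a parameter file is in one sentence.",
    "response": "It is the file of trained numeric weights that the inference code loads to run the model.",
}

# A typical approach flattens each record into one token sequence using a chat template and
# trains with the same next-token objective, often masking the loss on the prompt portion.
TEMPLATE = "<|user|>\n{prompt}\n<|assistant|>\n{response}<|end|>"

def to_training_text(rec):
    return TEMPLATE.format(**rec)

print(json.dumps(record, indent=2))
print(to_training_text(record))
```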
Human comparison labels and reinforcement learning from human feedback (RLHF) provide an additional alignment stage
A further alignment stage uses pairwise comparison labels, where human raters choose the better output among model candidates rather than authoring full responses.
- Reinforcement Learning from Human Feedback (RLHF)
- These comparisons are used to train a reward model that operationalizes human preferences.
- The reward model guides policy optimization so the model prefers outputs judged superior by humans, often improving alignment beyond direct supervised fine-tuning.
- Labeling practices and tooling
- Labeling instructions encode desiderata such as helpfulness, truthfulness, and harmlessness, and are typically extensive and nuanced to capture real-world safety requirements.
- Human–machine workflows improve labeling throughput and quality by having models propose candidates that humans curate or combine, reducing manual effort while scaling comparison collection.
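A minimal sketch of how the pairwise comparisons described above can be turned into a reward-model training signal, using a standard Bradley-Terry-style objective: the loss pushes the reward of the preferred response above the reward of the rejected one. The tiny linear "reward model" and the feature vectors are illustrative assumptions.

```python
import numpy as np

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry-style preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Toy linear reward model over hypothetical response features.
rng = np.random.default_rng(0)
w = rng.normal(size=3)                       # reward-model parameters (assumed)
chosen_feats = np.array([0.9, 0.2, 0.1])     # features of the response humans preferred (assumed)
rejected_feats = np.array([0.1, 0.8, 0.5])   # features of the response humans rejected (assumed)

loss = pairwise_loss(w @ chosen_feats, w @ rejected_feats)
print(f"pairwise preference loss: {loss:.3f}")
# Gradient steps on this loss push the reward model to score preferred outputs higher;
# the trained reward model then guides policy optimization (e.g., with PPO).
```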
Benchmark leaderboards reveal a two-tier ecosystem of closed proprietary models and open-weight models
Public comparison platforms rank chat models using pairwise human judgments and compute Elo-style ratings to compare performance empirically; a minimal sketch of the rating update appears after the list below.
- Current landscape
- The highest-rated models on many leaderboards are closed-source, commercially hosted systems whose weights are not publicly available; these systems often lead in aggregate capability metrics.
- Open-weight ecosystem
- Beneath them sits a growing open-weight ecosystem (model families released with downloadable checkpoints and papers) that trades some performance for greater transparency, modifiability, and on-premise deployment.
- Ecosystem dynamics
- The dynamic is one of competition and convergence: open-source projects iterate to close the gap while proprietary systems retain performance advantages and controlled deployment environments.
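As referenced above, here is a minimal sketch of the Elo-style update such leaderboards rely on: after each pairwise human vote, the winner's rating rises and the loser's falls in proportion to how surprising the outcome was. The K-factor and starting ratings are illustrative.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one pairwise comparison (a_won: 1 if A preferred, else 0)."""
    e_a = expected_score(r_a, r_b)
    r_a += k * (a_won - e_a)
    r_b += k * ((1 - a_won) - (1 - e_a))
    return r_a, r_b

# Two models start at 1000; model A wins a head-to-head human comparison.
print(elo_update(1000.0, 1000.0, a_won=1))  # A gains ~16 points, B loses ~16
```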
Scaling laws describe predictable improvements from increasing model size and training data
Scaling laws empirically relate next-token prediction loss to two primary variables: model parameter count and amount of training data.
- Properties and utility
- These relationships are smooth and extrapolatable over wide ranges, allowing practitioners to forecast expected predictive accuracy for a given compute and data budget.
- Because many downstream evaluation metrics correlate with next-token accuracy, scaling up model size and dataset volume has been a reliable path to capability improvements.
- Implication for investment
- This predictable scaling behavior helps explain industry incentives to acquire large GPU clusters and massive corpora; algorithmic progress can accelerate gains, but the scaling relationship provides a robust baseline strategy relatively independent of specific architectural innovations.
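As an illustration of the functional form such scaling laws typically take, the snippet below evaluates a Chinchilla-style parametric curve L(N, D) = E + A/N^alpha + B/D^beta. The constants are invented for illustration, not fitted values from any paper.

```python
def scaling_law_loss(n_params, n_tokens,
                     E=1.8, A=300.0, alpha=0.3, B=300.0, beta=0.3):
    """Chinchilla-style parametric loss curve with illustrative (not fitted) constants."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Increasing parameters and data both reduce the predicted next-token loss smoothly.
for n, d in [(7e9, 1e12), (70e9, 2e12), (400e9, 10e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {scaling_law_loss(n, d):.3f}")
```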
Tool use augments language models by giving them access to external computation, browsing, and visualization
Modern assistant systems are designed to orchestrate external tools rather than rely solely on internal token-based computation.
- Common tool types
- Web browsers for information retrieval
- Calculator or Python interpreters for precise numeric computation
- Plotting libraries for visualization
- Image-generation APIs for multimodal outputs
- Interaction pattern
- The assistant issues structured tool calls or emits special tokens that indicate desired tool invocation, consumes the tool outputs as additional context, and continues generation to produce a final answer, mirroring human workflows (a minimal sketch of this loop appears after this list).
- Benefits and implementation considerations
- Tool use expands reliability and capability because deterministic external systems (e.g., a calculator or web search) provide accurate inputs and let the language model act as an orchestrator coordinating heterogeneous resources.
- Implementations require careful interface design, grounding of tool outputs, and citation/provenance tracking to maintain trustworthiness.
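A minimal sketch of the interaction pattern referenced above: the model emits a structured tool call, the orchestrating code executes it with a deterministic tool, and the result is appended to the context before generation continues. The JSON call format and the `fake_model_step` stub are assumptions for illustration, not a specific vendor's protocol.

```python
import json

# Deterministic tools the orchestrator exposes to the model.
TOOLS = {
    # Toy calculator; never eval untrusted input in a real system.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def fake_model_step(context):
    """Stand-in for the LLM: decides to call a tool once, then answers using the tool result."""
    if "TOOL_RESULT" not in context:
        return json.dumps({"tool": "calculator", "input": "1234 * 5678"})
    return "The product is " + context.split("TOOL_RESULT: ")[-1]

context = "User: What is 1234 * 5678?"
call = json.loads(fake_model_step(context))
result = TOOLS[call["tool"]](call["input"])   # execute the tool call deterministically
context += f"\nTOOL_RESULT: {result}"         # feed the output back as additional context
print(fake_model_step(context))               # model continues and produces the final answer
```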
Multimodality—image and audio understanding/generation—extends model capabilities beyond text and enables new applications
Multimodal models ingest and produce modalities such as images, audio, and video alongside text, enabling a wide range of tasks.
- Example capabilities
- Code generation from diagram images, fine-grained image captioning, speech-to-text and text-to-speech conversational interfaces, and multimodal content creation.
- Integration approach
- Integration requires joint training and modality-specific encoders that project diverse inputs into a shared representation space for cross-modal reasoning (a minimal projection sketch appears after this list).
- Trade-offs and deployment
- Multimodality broadens usability for real-world workflows (for example generating a functioning website from a hand-drawn sketch) but increases system complexity by adding new tool and data pipelines and new evaluation metrics for correctness across modalities.
- Practical deployment requires modality-specific safety checks and preprocessing to mitigate risks unique to non-text inputs.
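As the minimal projection sketch referenced above, the snippet below maps a hypothetical image-encoder embedding into the text model's embedding space with a learned linear projection, so the projected vector can be consumed alongside ordinary token embeddings. The dimensions and matrices are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_image, d_model, n_text_tokens = 512, 768, 4

image_embedding = rng.normal(size=d_image)                    # output of a hypothetical image encoder
W_proj = rng.normal(size=(d_image, d_model))                  # learned projection into the text embedding space
text_embeddings = rng.normal(size=(n_text_tokens, d_model))   # embeddings of the text prompt tokens

# Project the image into the shared representation space and prepend it to the text sequence.
image_token = image_embedding @ W_proj                        # (d_model,)
sequence = np.vstack([image_token, text_embeddings])
print(sequence.shape)  # (5, 768): one image "token" followed by four text tokens
```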
System 1 versus system 2 framing captures the goal of trading inference time for more deliberate, reliable reasoning
The system 1 / system 2 distinction frames a capability spectrum:
- System 1: fast, intuitive token-level prediction (current autoregressive models operate here).
- System 2: slower, deliberative reasoning that trades time for accuracy and supports multi-step planning and evaluation.
- Current limitations and research directions
- Autoregressive models produce each token with roughly uniform compute and lack native mechanisms for multi-step internal deliberation or explicit tree-search over reasoning trajectories.
- Research aims to enable models to spend more time deliberating and to organize intermediate reasoning (for example chain-of-thought or tree-of-thoughts approaches) so that accuracy improves as more computation is invested; a minimal sampling-and-voting sketch appears after this list.
- Achieving practical system 2 behavior requires mechanisms for maintaining and manipulating intermediate state, planning, and multi-step evaluation, often leveraging tool use and external deterministic modules to improve reliability.
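One simple, published way to trade extra inference compute for accuracy in this spirit is self-consistency: sample several reasoning chains and keep the majority answer. The sketch below is a toy illustration; `sample_answer` is a stand-in for sampling a full chain-of-thought from a model and extracting its final answer, and its 60% per-sample accuracy is an assumption.

```python
import random
from collections import Counter

def sample_answer(question, rng):
    """Stand-in for sampling one reasoning chain and extracting its final answer.

    Assumed behavior: any single sample is correct ("42") about 60% of the time.
    """
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))

def self_consistency(question, n_samples=15, seed=0):
    rng = random.Random(seed)
    answers = [sample_answer(question, rng) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes

# Majority voting over many sampled chains is far more reliable than a single sample.
print(self_consistency("What is the answer?"))
```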
Self-improvement via reinforcement and sandboxed evaluation is possible in closed environments but remains challenging for general language tasks
Self-improvement in closed domains often follows the AlphaGo pattern: imitate experts initially, then surpass them by optimizing directly against an automated reward function that is cheap to evaluate (for example win/loss in a game).
- Limits for general language tasks
- For general language tasks there is no single universally available, cheap reward signal analogous to game outcomes, which complicates automated self-improvement at scale.
- Narrow domains with reliable, automatically computable reward functions may permit iterative self-play or objective-driven optimization to exceed human-level performance, but generalizing this approach to open-ended language tasks remains an active research question.
- Current approaches
- Progress toward self-improving LLMs focuses on constrained tasks, surrogate reward models, or human-in-the-loop reinforcement signals rather than fully autonomous universal improvement.
Customization enables specialized expert models via retrieval, custom instructions, and fine-tuning
Customization lets teams tailor a general base model to specific tasks, domains, or organizational knowledge through several mechanisms:
- Mechanisms for customization
- Custom instructions that alter behavior.
- Retrieval-augmented generation (RAG) that surfaces user-provided documents at runtime and keeps the base weights unchanged (a minimal retrieval sketch appears after this list).
- Supervised fine-tuning on domain-specific example pairs to produce task experts.
- Product patterns
- Product efforts like user-configurable GPTs or app-store paradigms encapsulate these levers so developers can compose assistants with specified behavior and knowledge.
- Deeper customization (targeted fine-tuning or prompt engineering) can create high-performing, task-specific experts that outperform generic assistants on narrow workloads.
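The minimal retrieval sketch referenced above: embed the user's documents, retrieve the chunk most similar to the query by cosine similarity, and paste it into the prompt while the base weights stay frozen. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the documents are invented for the example.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding; a real system would call a learned embedding model."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

docs = [
    "The parameters file stores the trained weights of the network.",
    "Fine-tuning adapts a base model with curated question answer pairs.",
    "The context window acts as the working memory of the assistant.",
]
vocab = sorted({w for d in docs for w in d.lower().split()})

query = "where are the trained weights stored?"
doc_vecs = [embed(d, vocab) for d in docs]
q_vec = embed(query, vocab)

# Retrieve the most similar chunk and stuff it into the prompt; base weights stay unchanged.
best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))
prompt = f"Context:\n{docs[best]}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)
```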
Large language models can be conceived as kernel processes of an emerging ‘LLM operating system’ that orchestrates memory, tools, and agents
An LLM-centered system functions analogously to an operating system kernel that coordinates storage, working memory, and peripheral services.
- Analogies
- The Internet and document stores act as mass storage.
- The context window acts as working memory or RAM.
- External tools are analogous to system calls or device drivers.
- Management patterns
- The context window is a finite, high-value resource; the orchestrating LLM must page relevant content in and out of it from larger stores via retrieval and summarization to solve tasks efficiently (a minimal packing sketch appears after this list).
- Multithreading equivalents appear in parallel tool invocation and speculative generation; user-space/kernel-space analogies map to user prompts versus system-level instruction contexts.
- Design implication
- Viewing LLMs as orchestration layers clarifies patterns for tool integration, memory management, and extensibility in complex application stacks built on language models.
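The minimal packing sketch referenced above treats the context window as scarce working memory: the highest-priority chunks are paged into a fixed token budget and the rest stay in "mass storage" to be retrieved later. The whitespace-based token count, the budget, and the example chunks are simplifying assumptions.

```python
def pack_context(chunks, budget_tokens=25):
    """Greedily page the highest-priority chunks into a fixed context budget.

    chunks: list of (priority, text); token counts are approximated by whitespace splitting.
    """
    packed, used = [], 0
    for priority, text in sorted(chunks, key=lambda c: -c[0]):
        cost = len(text.split())
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
    return "\n".join(packed), used

chunks = [
    (0.9, "Summary of the user's open support ticket and the error message they reported."),
    (0.7, "Relevant excerpt from the product documentation about authentication tokens."),
    (0.2, "Full changelog for the last twelve releases of the product."),
]
context, used = pack_context(chunks)
print(f"packed {used} tokens into the window:\n{context}")  # the low-priority chunk is left out
```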
LLM deployments introduce a novel security attack surface including jailbreaks, adversarial encodings, and adversarial artifacts in images
Language models open new vectors of compromise because they accept rich, flexible inputs and execute semantics learned from heterogeneous web data; attackers exploit this by crafting inputs or encodings that bypass refusal logic.
- Attack types and examples
- Jailbreaks achieved via roleplay or instruction framing that cause a model to violate safety policies.
- Adversarial encodings (for example base64 or optimized token suffixes) that hide malicious intent from detectors trained on natural-language refusals.
- Visual adversarial patterns embedded in images can act as conditioned triggers when models process multimodal inputs; researchers can optimize such patterns to reliably elicit undesired behavior.
- Defensive needs
- Defending requires comprehensive, multimodal adversarial training, sanitization pipelines, and robust multi-language and multi-encoding safety datasets rather than English-only refusal examples.
Prompt injection attacks hijack model behavior by embedding instructions within retrieved documents or media and can enable phishing and data exfiltration
Prompt injection occurs when content that the model retrieves or receives contains instructions that override or influence the intended prompt context, causing the model to reveal sensitive information, follow malicious directives, or publish attacker-controlled links.
- Attack vectors and examples
- Invisible or faint text in images that instructs the model to produce spam.
- Web pages that contain malicious instructions which the model encounters while browsing.
- Shared documents that embed injected instructions designed to exfiltrate data.
- Exploited behaviors
- Attackers exploit downstream processing behaviors (e.g., rendering external images, embedding URLs, or converting content that executes within trusted domains) to bypass content security policies and cause the model to leak data to attacker-controlled endpoints.
- Mitigations
- Combine strict input sanitization, provenance-aware retrieval filters, conservative rendering policies, and runtime behavior constraints that separate untrusted content from system-level instructions.
Data poisoning and backdoor trigger phrases can corrupt models during training or fine-tuning, producing conditional failures
A data poisoning attack injects adversarially crafted examples into training or fine-tuning data so the trained model behaves maliciously or incorrectly when presented with specific trigger phrases or contexts.
- Behavior of backdoors
- These backdoors act like sleeper agents: the model behaves normally most of the time, but the presence of a trigger token or phrase (for example an injected phrase used during fine-tuning) causes the model to output nonsensical or attacker-chosen content and to bypass safety checks.
- Attack surface and ease
- Poisoning is easier to mount when training corpora are scraped from uncontrolled web sources or when fine-tuning allows external contributors.
- Defenses
- Dataset provenance tracking, robust data filtering, differential validation on held-out clean sets, and forensic auditing to detect anomalous correlations.
- While documented attacks are more common in fine-tuning scenarios, studying pretraining-level risks and developing secure data curation pipelines is an active area of research.