Course introduction and objectives

The course introduces an open-source project for building an AI agent simulation engine that brings historical figures to life inside an interactive game environment.

It emphasizes end-to-end engineering practices beyond pure model development, including:

  • Robust memory systems with MongoDB for short- and long-term state
  • Agentic workflow orchestration using LangGraph
  • LLM inference via Groq (with Llama 3.3 70B used for dialogues)
  • Deployment with FastAPI and WebSockets for real-time communication
  • Observability and LLMOps tooling for tracing, evaluation, and monitoring

The curriculum targets production-ready concerns such as:

  • API / UX integration and prompt/version management
  • Containerization with Docker and local/cloud deployment practices
  • Monitoring and reliability for real-world usage

Participants gain a complete stack demonstration (deployable agentic applications) rather than isolated toy examples.


Course lesson plan and structure

The course is organized into a sequence of lessons that each focus on a specific system layer:

  • Architecture & UI / API design — overall system separation and responsibilities
  • Agent workflow construction with LangGraph — graph-based agent orchestration
  • Short-term & long-term memory design (MongoDB) — persistence and retrieval strategies
  • Real-time API integration (FastAPI + WebSockets) — streaming and low-latency interaction
  • LLMOps evaluation and monitoring (Opik) — tracing, prompt versioning, and metrics

Each lesson includes practical artifacts to support hands-on learning:

  • Code, Jupyter notebooks, and guided exercises
  • Local-first replication steps and cloud deployment pointers
  • A modular structure that supports incremental validation of each component

Interactive simulation demo and learning motivation

An interactive demo motivates the engineering concepts by showing AI agents impersonating philosophers in a browser-based game:

  • Players interact with NPC philosophers (e.g., Plato, Aristotle, Turing) in a village scene
  • The demo highlights core techniques: memory, retrieval-augmented generation (RAG), workflow orchestration, and real-time streaming
  • Agents are grounded in authoritative sources to produce richer, historically coherent dialogues
  • The demo sets expectations for the end-to-end learning outcome: a simulation that is fun, interactive, and technically realistic

Lesson 1 introduction and architecture overview

Lesson 1 gives a high-level overview of the PhiloAgents architecture and the full tech stack used across the course:

  • Architectural separation:
    • Online phase — real-time gameplay and agent inference
    • Offline phase — data ingestion, feature pipeline, and evaluation dataset generation
  • Key runtime components:
    • Phaser game UI for in-browser interaction
    • FastAPI server for agent serving and WebSocket streaming
    • LangGraph workflows for agent behavior orchestration
    • MongoDB for short-term checkpoints and long-term vector memory
  • The overview maps each engineering decision to a concrete system responsibility and orients subsequent lessons

Offline pipeline and long-term memory population

The offline phase implements a RAG feature pipeline that prepares grounded context for each philosopher:

  1. Extract contextual data from authoritative sources (Wikipedia, Stanford Encyclopedia of Philosophy)
  2. Chunk the text (overlapping pieces) and apply deduplication heuristics
  3. Produce embeddings for each chunk
  4. Store vectors and metadata in MongoDB as long-term memory (vector index / hybrid search)

These offline artifacts are reused to:

  • Assemble evaluation datasets
  • Ensure agent responses can be grounded in verifiable historical context

Details such as embedding model choice, chunking strategy, and storage schema are central to RAG effectiveness and grounding.
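
As a minimal sketch of the chunking and deduplication steps, the snippet below uses LangChain's RecursiveCharacterTextSplitter with assumed chunk sizes and a naive hash-based duplicate filter; the course's exact splitter settings and similarity heuristics may differ.

```python
# Sketch of the chunk + deduplicate steps of the offline feature pipeline.
# Chunk sizes and the exact-hash dedup heuristic are illustrative assumptions.
from hashlib import md5

from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_and_deduplicate(raw_text: str) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # characters per chunk (assumed value)
        chunk_overlap=100,  # overlap preserves context across chunk boundaries
    )
    chunks = splitter.split_text(raw_text)

    # Naive exact-duplicate removal; a MinHash-style similarity heuristic
    # would additionally catch near-duplicates.
    seen: set[str] = set()
    unique_chunks: list[str] = []
    for chunk in chunks:
        digest = md5(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_chunks.append(chunk)
    return unique_chunks
```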


Evaluation dataset generation and Opik integration

The evaluation-dataset generation component produces question-and-answer datasets per philosopher to enable objective evaluation of RAG behavior:

  • Generated datasets exercise the retrieval pipeline and surface regressions or hallucinations
  • Opik (observability/evaluation tool) is integrated to:
    • Host datasets and traces
    • Version prompts and evaluation configs
    • Run automated evaluations comparing agent responses to gold outputs
  • This setup enables iterative improvement via metrics-driven validation of the RAG pipeline and agent workflows

Runtime components: UI, API, agentic layer and LLM gateway

The online phase orchestrates interaction between three main runtime components:

  • Game UI (Phaser) — user actions map to API calls
  • FastAPI server — receives UI calls and invokes LangGraph agent workflows
  • Memory / agent stack — short-term state + long-term retrieval tools in MongoDB

Runtime behavior:

  1. FastAPI invokes a LangGraph-defined workflow that binds prompts, tools, and an LLM gateway (see the sketch below)
  2. The workflow consults short-term state and conditionally calls long-term retrieval tools (RAG)
  3. Groq serves as the LLM provider (Llama 3.3 70B for dialogues) and streams responses

Key production concerns: prompt management, retrieval tool binding, state persistence, and streaming to the UI.
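
As a rough illustration of that binding: the model id, prompt text, and the stubbed tool below are assumptions, and a GROQ_API_KEY is expected in the environment.

```python
# Sketch: binding the Groq-hosted LLM, a character prompt, and a retrieval tool
# into a conversation chain. All names and values here are illustrative.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_groq import ChatGroq


@tool
def retrieve_philosopher_context(query: str) -> str:
    """Look up long-term memory for the active philosopher (stubbed here)."""
    return "...retrieved chunks would be returned here..."


prompt = ChatPromptTemplate.from_messages([
    ("system", "You are {philosopher_name}. Stay in character."),
    ("placeholder", "{messages}"),
])

llm = ChatGroq(model="llama-3.3-70b-versatile")  # assumes GROQ_API_KEY is set
conversation_chain = prompt | llm.bind_tools([retrieve_philosopher_context])
```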


Three-component flow and tool-enabled response example

A simplified three-component flow highlights conditional tool usage and streaming:

  1. The UI sends a message to FastAPI
  2. FastAPI invokes the LangGraph workflow (agent graph)
  3. The agent evaluates whether to use a retrieval tool (conditional decision)
  4. If needed, the tool queries MongoDB long-term memory and returns ranked chunks
  5. The LLM (Llama 3.3 70B served via Groq) generates a response, which streams back to the UI in partial chunks

This flow demonstrates how conditional retrieval, streaming responses, and tool orchestration enable grounded, context-rich replies in a real-time game.


Repository layout, cloning and development environment

The project repository contains two core components:

  • philoagents-api (Python)
    • Implements the agentic backend with a clean architecture (application, domain, infrastructure layers)
    • Includes Docker files, notebooks, and evaluation data
  • philoagents-ui (Phaser JavaScript)
    • Phaser 3 project with scenes, dialog management, and HTTP / WebSocket services

Developer onboarding checklist:

  • Clone the repo and open in an IDE
  • Create a Python virtual environment for the API
  • Inspect distinct modules and follow installation/run instructions provided in the repo

Installing dependencies, environment variables and local infra

Local setup and infrastructure:

  • Prerequisites: Python 3.11, Git, Docker, plus project-specific packages
  • Create and activate a virtual environment, then install dependencies from requirements
  • Configure environment variables:
    • Copy the example .env -> .env and set API keys for Groq, OpenAI (used by Opik's evaluation judge), and Comet
  • Start local infrastructure via Make (make infrastructure-up), which launches three Docker services:
    • A local MongoDB instance (emulating Atlas for development), the FastAPI backend, and the Phaser UI

This composition supports local development and testing without requiring external managed services.


Game UI walkthrough and interactive agent demo

Phaser-based UI mechanics and demo features:

  • Player controls: movement with arrow keys, speak via spacebar + input, close dialogs with Escape
  • Multiple philosopher NPCs implemented as LangGraph-driven agents, each with distinct personalities and topics (ethics, computation, AI)
  • Interacting with a philosopher triggers the agent backend and shows streamed responses in the dialog box
  • Demo includes both comedic easter eggs and realistic philosophical Q&A to verify the end-to-end pipeline from user input to agent response

UI internals: dialogue manager, WebSocket service and API binding

Client-side communication and dialog orchestration are organized as follows:

  • Dialogue manager — orchestrates dialog boxes, tracks the active philosopher, and routes incoming WebSocket messages
  • WebSocket API service — manages connection lifecycle, send/receive semantics, and callback registration; connects to ws://localhost:8000 for streaming
  • The client assembles streamed chunks into full responses and integrates with Phaser scenes for rendering
  • Architecture decouples UI rendering from networking logic to simplify testing and extension

Philosopher domain model, prompt templates and state checkpointing

Philosopher identity and persistence model:

  • Philosophers modeled as domain objects (Pydantic models) with fields:
    • id, name, perspective, style, and character prompts
  • Character prompts are assembled from domain fields to produce a system prompt that conditions personality and voice
  • LangGraph graph state persists conversation history and philosopher-specific attributes (context, summary, etc.)
  • The FastAPI backend configures a LangGraph checkpointer that writes state snapshots into MongoDB collections (checkpoints, writes)
  • Persisted state enables short-term continuity (recalling user facts) and per-agent thread isolation across interactions
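
A minimal sketch of this domain model and character-card assembly follows; the Pydantic field descriptions and prompt wording are illustrative rather than the course's exact definitions.

```python
# Sketch of the philosopher domain object and system-prompt assembly.
# Field names beyond id/name/perspective/style and the prompt text are assumptions.
from pydantic import BaseModel, Field


class Philosopher(BaseModel):
    id: str
    name: str
    perspective: str = Field(description="Core philosophical outlook")
    style: str = Field(description="Tone and manner of speaking")


def build_character_card(philosopher: Philosopher) -> str:
    """Assemble the system prompt that conditions personality and voice."""
    return (
        f"You are {philosopher.name}.\n"
        f"Perspective: {philosopher.perspective}\n"
        f"Speaking style: {philosopher.style}\n"
        "Always answer in character and ground claims in your documented ideas."
    )


plato = Philosopher(
    id="plato",
    name="Plato",
    perspective="Reality is a shadow of the world of ideal Forms.",
    style="Socratic, probing, fond of dialogue and analogy.",
)
system_prompt = build_character_card(plato)
```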

LangGraph Studio visualization and conversation node behavior

LangGraph Studio visualizes the agent workflow as a directed graph:

  • Start node → Conversation node, where a tool condition decides whether to call the retriever (conditional/dotted edges)
  • When retrieval is triggered:
    • Returned context is summarized and injected back into the conversation loop
  • Connector and summarization nodes implement architecture-level concerns:
    • Token compression, flow control, and context summarization
  • Visual graphs clarify the runtime decision-making and iterative loops present in agentic workflows

Implementation of nodes, chains and RAG loop in code

Graph composition and node responsibilities:

  • Nodes created include:
    • Conversation (conversation chain binding LLM, prompts, tools)
    • Retriever (MongoDB hybrid retriever wrapped as a LangGraph tool node)
    • Context summarizer, conversation summarizer, and a transparent connector node
  • Edges implement conditional RAG loops:
    • conversation → retriever → summarize context → conversation
    • An additional conditional edge triggers conversation summarization when the message count exceeds a threshold (e.g., 30 messages)
  • The conversation node binds the Groq-hosted Llama 3.3 70B model, prompts, and tools to enable streaming and tool orchestration, as sketched below
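
A minimal wiring sketch of this graph follows; the stub node bodies, the tool implementation, and the routing map are assumptions that stand in for the course's fuller implementations.

```python
# Sketch of the conditional RAG loop: conversation -> retriever -> summarize
# context -> conversation. Node names mirror the description above; bodies are stubs.
from langchain_core.tools import tool
from langgraph.graph import END, START, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition


@tool
def retrieve_philosopher_context(query: str) -> str:
    """Stub for the MongoDB hybrid retriever exposed as a tool."""
    return "...ranked, source-attributed chunks..."


def conversation_node(state: MessagesState):
    # Would invoke the Groq-bound conversation chain (prompt | llm.bind_tools(...)).
    return {"messages": []}


def summarize_context_node(state: MessagesState):
    # Compresses retrieved chunks before re-entering the conversation loop.
    return {"messages": []}


builder = StateGraph(MessagesState)
builder.add_node("conversation", conversation_node)
builder.add_node("retrieve_context", ToolNode([retrieve_philosopher_context]))
builder.add_node("summarize_context", summarize_context_node)

builder.add_edge(START, "conversation")
# Conditional edge: route to the retriever only when the LLM emitted a tool call.
builder.add_conditional_edges(
    "conversation",
    tools_condition,
    {"tools": "retrieve_context", END: END},
)
builder.add_edge("retrieve_context", "summarize_context")
builder.add_edge("summarize_context", "conversation")

graph = builder.compile()
```

The conversation-summarization branch (triggered past the message-count threshold) would be added as another conditional edge on the conversation node.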

Short-term memory concept and storage model

Short-term memory design (conversation checkpointing):

  • Conversation history is stored in the LangGraph graph state as a messages list representing chat history
  • The messages state is extended with philosopher-specific attributes (context, name, perspective, style, summary), as sketched after this list
  • An async MongoDB saver acts as a LangGraph checkpointer to persist state snapshots to MongoDB collections
  • Persisted state enables agents to recall user-provided facts across turns (e.g., the user’s name) and maintain coherent multi-turn dialogues
  • Per-agent thread IDs ensure multiple philosopher states remain isolated
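
A sketch of that state schema under assumed field names (LangGraph's MessagesState already carries the chat-history messages list):

```python
# Sketch of the graph state: MessagesState (which holds the messages list)
# extended with philosopher-specific attributes. Field names are assumptions.
from langgraph.graph import MessagesState, StateGraph


class PhilosopherState(MessagesState):
    philosopher_context: str
    philosopher_name: str
    philosopher_perspective: str
    philosopher_style: str
    summary: str


# The state schema is passed to the graph builder so every node reads and writes it.
builder = StateGraph(PhilosopherState)
```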

Notebook demo comparing no-memory vs persisted memory

Notebook examples show persistence vs. stateless invocation:

  1. generate_response_without_memory — runs the graph without a checkpointer (stateless); the agent forgets prior user turns
  2. generate_response_with_memory — attaches an async MongoDB checkpointer using philosopher ID as the thread ID; the agent recalls earlier facts across invocations

The notebook reproduces the same graph invocation logic and highlights how simple database-backed checkpoints restore chat continuity per philosopher thread.
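
A hedged sketch of the "with memory" path, assuming the langgraph-checkpoint-mongodb package, a local connection string, and a graph builder like the one assembled earlier; the database name and argument values are illustrative.

```python
# Sketch: compile the graph with an async MongoDB checkpointer and key the
# conversation thread by philosopher id. URI and db name are assumed values.
from langgraph.checkpoint.mongodb.aio import AsyncMongoDBSaver


async def generate_response_with_memory(builder, philosopher_id: str, user_message: str) -> str:
    async with AsyncMongoDBSaver.from_conn_string(
        "mongodb://localhost:27017",
        db_name="philoagents",
    ) as checkpointer:
        graph = builder.compile(checkpointer=checkpointer)
        config = {"configurable": {"thread_id": philosopher_id}}
        output = await graph.ainvoke(
            {"messages": [("user", user_message)]},
            config=config,
        )
        return output["messages"][-1].content
```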


Long-term memory purpose and ingestion pipeline

Long-term memory and ingestion pipeline for grounded context:

  • Long-term memory stores biographies, philosophical ideas, and domain facts per philosopher
  • Ingestion pipeline steps:
    1. Download documents from Wikipedia and Stanford Encyclopedia of Philosophy
    2. Apply a recursive character splitter to produce overlapping chunks
    3. Deduplicate chunks using content-similarity heuristics (MinHash-style or others)
    4. Produce embeddings per chunk
    5. Store vectors + metadata into MongoDB’s vector index

This approach supports retrieval of source-attributed context during online queries, enabling historically accurate agent responses.
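
Complementing the chunking sketch earlier, the snippet below illustrates steps 4 and 5 (embedding and storage); the embedding model, connection string, namespace, and index name are assumed values rather than the course's exact configuration.

```python
# Sketch: embed deduplicated chunks and store them, with source metadata,
# in a MongoDB vector collection. All names and values here are illustrative.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string="mongodb://localhost:27017",
    namespace="philoagents.philosopher_long_term_memory",
    embedding=embeddings,
    index_name="vector_index",
)

chunks = ["Turing proposed the imitation game in his 1950 paper..."]
metadata = [{"philosopher_id": "turing", "source": "Wikipedia"} for _ in chunks]
vector_store.add_texts(texts=chunks, metadatas=metadata)
```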


Building the long-term memory toolchain and persistent index

CLI and retriever integration:

  • Repository includes a CLI (create_long_term_memory) that orchestrates extraction, chunking, deduplication, embedding generation, and insertion into MongoDB
  • A hybrid MongoDB retriever (LangChain / LangGraph integration) is constructed using a chosen embedding model and MongoDB Atlas hybrid search or local vector features
  • After ingestion, the philosopher_long_term_memory collection contains chunked documents with source attribution suitable for retrieval tools
  • The retriever is exposed as a LangGraph tool node for conditional invocation in the agent workflow, as sketched below
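
A simplified sketch of that last step; it uses a plain vector-search retriever rather than the hybrid retriever and reuses the assumed connection details from the ingestion sketch.

```python
# Sketch: wrap the MongoDB retriever as a LangChain tool so the LangGraph
# workflow can call it conditionally. Names and descriptions are illustrative.
from langchain.tools.retriever import create_retriever_tool
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string="mongodb://localhost:27017",
    namespace="philoagents.philosopher_long_term_memory",
    embedding=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
    index_name="vector_index",
)

retriever_tool = create_retriever_tool(
    vector_store.as_retriever(search_kwargs={"k": 3}),
    name="retrieve_philosopher_context",
    description="Fetch grounded, source-attributed facts about the active philosopher.",
)
```

The resulting tool is what gets handed to a ToolNode (and bound to the conversation chain) so the graph can decide at runtime whether to invoke it.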

Runtime retrieval behavior and example queries

Runtime retrieval and context injection:

  • The retrieve_philosopher_context node queries MongoDB’s vector index with the current user question
  • Returned chunks are summarized to reduce token usage and then injected back into the conversation chain
  • Retriever returns ranked chunks from multiple sources (Wikipedia, Stanford Encyclopedia of Philosophy) and the conversation node may trigger additional retrieval iterations (retrieval loop)
  • Notebook and UI examples demonstrate queries (e.g., “Turing machine”, “Chinese room argument”) and show retrieved chunks with source metadata to confirm grounding

WebSockets rationale for real-time agentic systems

Why WebSockets are used for UI ↔ backend communication:

  • Persistent, bidirectional connections enable low-latency interaction and streaming partial responses
  • Advantages over HTTP:
    • No per-interaction handshake overhead
    • Support for server-to-client pushes and true streaming of partial LLM outputs
    • Better fit for interactive game experiences and scalable multiplayer scenarios
  • WebSockets are therefore the preferred protocol for streaming LangGraph response chunks to the UI as they are produced

FastAPI WebSocket implementation and client integration

FastAPI backend WebSocket behavior and client-side handling:

  • FastAPI exposes both HTTP and WebSocket endpoints
  • WebSocket endpoint workflow (sketched after this list):
    1. Accept connection and receive JSON payloads from the client
    2. Invoke the LangGraph streaming graph (graph.stream) for partial outputs
    3. Send an initial “streaming started” message
    4. Stream partial chunks as JSON messages while graph produces them
    5. Send a final message with streaming=false and the assembled full response
  • The Phaser client implements a WebSocket service that manages handshake, chunk assembly, callbacks, and disconnect logic to enable real-time rendering of streaming agent responses
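
A condensed sketch of that endpoint, assuming `graph` is the compiled LangGraph workflow from the earlier sketches; the route path, payload shape, and message schema are illustrative.

```python
# Sketch of the WebSocket endpoint: accept, receive JSON, stream partial chunks,
# then send the assembled full response. Payload/response fields are assumptions.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


@app.websocket("/ws/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            payload = await websocket.receive_json()  # e.g. {"philosopher_id": ..., "message": ...}
            await websocket.send_json({"streaming": True})  # "streaming started" signal

            full_response = ""
            config = {"configurable": {"thread_id": payload["philosopher_id"]}}
            # stream_mode="messages" yields partial LLM chunks as they are produced
            async for chunk, _ in graph.astream(
                {"messages": [("user", payload["message"])]},
                config=config,
                stream_mode="messages",
            ):
                full_response += chunk.content
                await websocket.send_json({"streaming": True, "chunk": chunk.content})

            await websocket.send_json({"streaming": False, "response": full_response})
    except WebSocketDisconnect:
        pass
```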

LLMOps definition and major components

LLMOps fundamentals for production LLM systems:

LLMOps is the set of practices, tools, and techniques for optimizing the production lifecycle of LLM-based systems. Core components include:

  1. Model deployment — packaging and serving model binaries and inference endpoints
  2. Data management — datasets for training, evaluation, and reproducibility
  3. Prompt versioning — tracking prompt edits like code/version control
  4. Monitoring & observability — traces, token usage, latency, tool-call telemetry
  5. Security — privacy, guardrails, and access control
  6. Evaluation — metrics, benchmarking, and automated tests

A production agentic system requires processes in each area to ensure safety, reliability, and continuous improvement.


Prompt versioning workflow with Opik

Prompt versioning and Opik integration:

  • Treat prompts as versioned artifacts analogous to code and models
  • Use Opik to store, name, and version prompts centrally
  • Code maps Opik prompt objects to domain prompt templates (e.g., philosopher_character_card) and writes prompt updates to Opik on deployment/run
  • Opik's prompt library provides a history of versions so teams can:
    • Track changes and attribute behavioral shifts to prompt edits
    • Roll back to prior prompt states when needed
  • LangGraph chains fetch prompt content or version metadata as part of the agent configuration (see the sketch below)
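
A minimal sketch of registering that template in Opik's prompt library; the prompt text is illustrative, and Opik/Comet credentials are assumed to be configured via the environment.

```python
# Sketch: version the character-card template in Opik's prompt library.
# Opik records a new version only when the prompt text changes.
import opik

philosopher_character_card = opik.Prompt(
    name="philosopher_character_card",
    prompt=(
        "You are {{philosopher_name}}, speaking in a {{philosopher_style}} voice. "
        "Ground every claim in your documented ideas."
    ),
)

# Downstream LangGraph chains can read the current text (and version metadata).
print(philosopher_character_card.prompt)
```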

Monitoring and observability via Opik traces

Tracing LangGraph executions with Opik:

  • Attach an Opik tracer as a callback to the compiled LangGraph graph (sketched after this list) so each execution emits trace spans and metadata
  • Traces capture:
    • Node-level runtimes (start node, conversation node, retriever usage)
    • Prompt inputs and model selections
    • Tool invocations, durations, and latency metrics
  • Instrumentation enables:
    • Per-step performance analysis and error tracing
    • Correlation of prompt/retriever changes with downstream metrics
    • Diagnostics for regressions and optimization opportunities
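
A hedged sketch of that instrumentation, assuming `graph` is the compiled LangGraph workflow and Opik credentials are configured; the tracer is attached as a LangChain-style callback.

```python
# Sketch: emit Opik trace spans for every graph execution by passing the tracer
# as a callback. The example question is illustrative.
from opik.integrations.langchain import OpikTracer

tracer = OpikTracer(graph=graph.get_graph(xray=True))

result = graph.invoke(
    {"messages": [("user", "Explain the Chinese room argument.")]},
    config={"callbacks": [tracer]},
)
```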

Evaluation dataset generation pipeline using a large LLM

Generating evaluation datasets via synthetic grounded conversations:

Pipeline to create evaluation corpus:

  1. Select chunk subsets from the philosopher knowledge corpus
  2. Use a large LLM (Llama 3.3 70B via Groq) to synthesize multi-turn, grounded conversations given sampled chunks
  3. Validate generated conversations for structure and fidelity
  4. Save synthesized conversations as JSON to serve as the automated evaluation corpus

This synthetic-but-grounded dataset exercises retrieval quality and downstream agent behavior in automated tests.
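
A rough sketch of that pipeline, with an assumed generation prompt, sample count, and output path, using the Groq-hosted model for generation.

```python
# Sketch: sample chunks, ask the LLM for a grounded Q&A exchange per chunk,
# and save the result as the JSON evaluation corpus. Values are illustrative.
import json
import random

from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0.9)


def generate_eval_samples(chunks: list[str], n_samples: int = 20) -> list[dict]:
    samples = []
    for _ in range(n_samples):
        context = random.choice(chunks)
        reply = llm.invoke(
            "Using ONLY the context below, write one user question and the "
            f"philosopher's grounded answer.\n\nContext:\n{context}"
        )
        samples.append({"context": context, "conversation": reply.content})
    return samples


with open("evaluation_dataset.json", "w") as f:
    json.dump(generate_eval_samples(chunks=["...philosopher chunks..."]), f, indent=2)
```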


Automated evaluation metrics and Opik-driven scoring

Automated evaluation workflow and metrics in Opik:

  • Opik runs automated evaluations using an external judge model (OpenAI) to score five metrics:
    • Hallucination — scored 0.0–1.0; measures whether the response is supported by the retrieved sources
    • Answer relevance — relevance of the response to the question and context
    • Moderation — toxicity / safety scoring
    • Context precision — proportion of retrieved context that is relevant
    • Context recall — proportion of relevant context that was retrieved
  • Evaluation process:
    1. Upload the dataset to Opik
    2. Invoke an evaluation job that executes agent responses for each sample
    3. Use prompt-based LLM judgment to compute metrics and per-sample traces
  • Results surface as experiments in Opik with aggregate metrics, timelines, and per-sample traces to guide iterative improvement of prompts, retrievers, and workflows
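
A hedged sketch of that evaluation run with the Opik SDK follows; the dataset name, item keys, and the stubbed task are assumptions, and the judge model defaults to an OpenAI model unless overridden.

```python
# Sketch: upload/fetch the dataset, run the agent per sample, and score the
# five metrics listed above. Dataset and item field names are illustrative.
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import (
    AnswerRelevance,
    ContextPrecision,
    ContextRecall,
    Hallucination,
    Moderation,
)

client = Opik()
dataset = client.get_or_create_dataset(name="philosopher-eval")


def evaluation_task(item: dict) -> dict:
    # Would invoke the agent graph for each dataset item; stubbed here.
    return {
        "input": item["question"],
        "output": "...agent response...",
        "expected_output": item["answer"],
        "context": item["context"],
    }


evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
        Moderation(),
        ContextPrecision(),
        ContextRecall(),
    ],
)
```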