LLM Series - Part 5 - System Design for LLMs
In this blog post, I will discuss the system design for applications that use LLMs as a core component. However, the goal is to prepare for a technical interview rather than to build a real product :D.
Fundamental Concepts
Retrieval-augmented generation (RAG)
RAG is a technique that allows LLMs to access external knowledge sources to generate more accurate and relevant responses. It is a two-step process:
- Retrieve relevant documents from a knowledge base.
- Generate a response grounded in the retrieved documents, which are passed to the LLM as additional context.
In some LLM APIs (like Cohere’s), the chat endpoint accepts a documents argument for passing in the documents that are relevant to the conversation.
# `search_api` and `llm` are placeholders for a retrieval client and a Cohere-style chat client
query = "What is the capital of France?"
results = search_api.search(query)                        # step 1: retrieve relevant documents
documents = [doc.page_content for doc in results]
response = llm.chat(message=query, documents=documents)   # step 2: generate a grounded response
For local LLMs, we can use LangChain to inject relevant documents into the LLM’s context.
from langchain import PromptTemplate
# Create a prompt template
template = """<|user|>
Relevant documents: {documents}
Provide a concise answer to the following question using the relevant documents provided above.
{question}
<|end|>
<|assistant|>"""
prompt = PromptTemplate(template=template, input_variables=["documents", "question"])
from langchain.chains import RetrievalQA
# RAG pipeline: the "stuff" chain stuffs all retrieved documents into the prompt
rag = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
Advanced RAG
Query Rewriting: use an LLM to rewrite the query to improve retrieval results (a small sketch is given below).
Multi-hop Retrieval: use an LLM to break the query down into multiple sub-queries and retrieve relevant documents for each sub-query. For example, if the query is “What are the biggest mining companies in Australia? Are they profitable in 2025?”, we might first retrieve the list of the biggest mining companies in Australia and then retrieve the financial data for each company to check whether it is profitable in 2025.
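As a minimal sketch of query rewriting, reusing the hypothetical `search_api` and `llm` clients from the earlier snippet (and assuming `llm.chat` returns plain text here):
# Ask the LLM to turn the raw user question into a better search query
rewrite_prompt = (
    "Rewrite the following question into a short, self-contained search query. "
    "Return only the query.\n\nQuestion: {question}"
)

def retrieve_with_rewriting(question):
    rewritten = llm.chat(message=rewrite_prompt.format(question=question)).strip()
    results = search_api.search(rewritten)    # retrieve with the rewritten query
    return [doc.page_content for doc in results]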
Semantic Search and Vector Database
The core component of RAG is the vector database, which stores embeddings of the documents and retrieves relevant documents based on their semantic similarity to the query. Unlike a traditional search engine that relies on keyword matching or hand-crafted rules, semantic search is dense retrieval: documents and queries are mapped into a semantic space in which similar documents lie close to each other. Contrastive learning is therefore a common technique for training the embedding model to improve semantic search performance.
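To make the idea of a semantic space concrete, here is a small sketch using the sentence-transformers library (the model name is just one common choice, not one the post prescribes):
from sentence_transformers import SentenceTransformer, util

# A small, widely used sentence-embedding model (an assumption; any encoder works)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Paris is the capital of France.",
    "The Great Barrier Reef lies off the coast of Australia.",
]
query = "Which city is the capital of France?"

doc_embeddings = model.encode(docs)      # one vector per document
query_embedding = model.encode(query)    # one vector for the query

# Cosine similarity: semantically similar texts end up close in the vector space
print(util.cos_sim(query_embedding, doc_embeddings))  # the first document should score highest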
In terms of system design, I care more about how to build a vector database in practice than about the theoretical details. More specifically, as discussed in [2], here are some considerations:
- What if the vector database does not have the answer for the query? In this case, showing the distance between the query and the most similar documents in the vector database can help to make a more informed decision. A lot of search systems present the user with the best info they can get and leave the decision to the user.
- What if the query requires an exact match rather than a semantically relevant one? In this case, hybrid search, which combines semantic search with keyword search, is a better approach.
- Domain shift: when the vector database is built on a specific domain that differs from the query domain, semantic search performance degrades. Similar to the first challenge, showing metadata might be helpful.
- Chunking long documents: the best way to chunk long documents depends on the application. Possible approaches include sentence chunking, several sentences per chunk, or paragraph chunking. Adding the document title to each chunk or overlapping sentences between chunks can also help (a small sketch follows this list).
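A minimal sketch of the “several sentences per chunk, with overlap and an optional title” option (splitting on “. ” is a deliberate simplification; a real pipeline would use a proper sentence splitter):
def chunk_sentences(text, sentences_per_chunk=3, overlap=1, title=None):
    """Split a document into overlapping chunks of a few sentences each."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    step = max(1, sentences_per_chunk - overlap)
    chunks = []
    for start in range(0, len(sentences), step):
        chunk = ". ".join(sentences[start:start + sentences_per_chunk])
        if title:
            chunk = f"{title}\n{chunk}"   # prepend the document title to every chunk
        chunks.append(chunk)
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks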
FAISS
Indexing Vectors
FAISS creates an index to store and organize vectors efficiently. The indexing method affects performance:
- Flat Index (IndexFlatL2) → Exact k-NN search (slow but accurate).
- IVF (Inverted File Index) → Faster search with approximate results.
- HNSW (Hierarchical Navigable Small World) → Graph-based ANN search (fast & accurate).
import faiss
import numpy as np
# Generate random 512-dimension vectors for products
dimension = 512
num_vectors = 10000
vectors = np.random.random((num_vectors, dimension)).astype('float32')
# Create a FAISS index
index = faiss.IndexFlatL2(dimension) # L2 distance (Euclidean)
index.add(vectors) # Add vectors to the index
Searching for Similar Items
# Generate a random query vector
query_vector = np.random.random((1, dimension)).astype('float32')
# Find the top 5 nearest neighbors
k = 5
distances, indices = index.search(query_vector, k)
print("Nearest Neighbors:", indices)
print("Distances:", distances)
Prompt Engineering
Core components of an LLM system
Input handling and processing
Because each application has different input types (for example, code differs from clinical notes), we need a module that handles the input types specific to the application.
Knowledge Base and Data Resources
The purpose of this module is to:
- Provide the necessary related knowledge to the LLM.
- To store user-specific data for personalized output/decision.
This can be done by storing documents and user data as embeddings in a vector database and retrieving them with RAG, as described above.
The core LLM-powered module
This is the main module that uses the LLM to generate the output/decision.
Prompting module
This can be integrated into the core LLM-powered module to improve the quality of the output/decision. We can have a sub-module to classify the input into different categories and use different prompt templates for each category.
We can also leverage the response from the LLM as well as the user feedback to improve the prompt.
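A minimal sketch of such a routing sub-module, again assuming the hypothetical `llm.chat` client (here used as a zero-shot classifier) and illustrative template names:
# Hypothetical prompt templates, one per input category
PROMPT_TEMPLATES = {
    "question": "Answer the following question concisely:\n{input}",
    "summarization": "Summarize the following text in a few sentences:\n{input}",
    "other": "Respond helpfully to the following input:\n{input}",
}

def route_prompt(user_input):
    # Use the LLM itself to classify the input (assumes `llm.chat` returns plain text)
    category = llm.chat(
        message="Classify the input as one of: question, summarization, other. "
                "Return only the label.\n\nInput: " + user_input
    ).strip().lower()
    template = PROMPT_TEMPLATES.get(category, PROMPT_TEMPLATES["other"])
    return llm.chat(message=template.format(input=user_input))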
Filtering and Validation
- Validation by rule-based logic checks, to make sure the output/input is correct and valid. However, because of their rule-based nature, they are not always flexible enough to handle all cases (a small sketch follows this list).
- Validation by another machine learning model, for example, another LLM or an uncertainty estimation model.
- Optional human-in-the-loop (HITL) validation by a human expert, especially in critical applications like medical diagnosis.
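A minimal sketch of the rule-based option, assuming (purely for illustration) that the LLM is asked to return JSON with an answer and a confidence field:
import json

REQUIRED_KEYS = {"answer", "confidence"}   # hypothetical schema expected from the LLM

def validate_output(raw_output):
    """Rule-based check: the output must be valid JSON with the expected keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "Output is not valid JSON."
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        return False, f"Missing keys: {missing}"
    if not 0.0 <= float(parsed["confidence"]) <= 1.0:
        return False, "Confidence must be between 0 and 1."
    return True, parsed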
Safeguarding
Agentic Framework
Agent Tools
These are external tools or resources that the LLM can access to perform specific actions or gather information. This could be a calculator, a search API, or an external database.
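A minimal sketch of tool use, with a made-up convention in which the LLM replies with a JSON tool call (agent frameworks such as LangChain provide this plumbing in practice); `llm.chat` and `search_api` are the same hypothetical clients as before:
import json

# Hypothetical tools the agent can call
def calculator(expression):
    return eval(expression)             # illustration only; never eval untrusted input

def web_search(query):
    return search_api.search(query)     # reuses the hypothetical search client

TOOLS = {"calculator": calculator, "web_search": web_search}

def run_agent_step(user_input):
    # Ask the LLM to pick a tool and an argument (assumes `llm.chat` returns plain text)
    reply = llm.chat(
        message="Available tools: calculator(expression), web_search(query). "
                'Reply with JSON like {"tool": "...", "argument": "..."}.\n\n' + user_input
    )
    call = json.loads(reply)
    return TOOLS[call["tool"]](call["argument"])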
Multi-agent system
Data Distribution Shifts and Monitoring
Continual Learning
Evaluation
Offline Evaluation
Test in Production
Deployment and Scaling
Case Study
Discuss on User Journey
The user journey describes how a user interacts with a product, from the first touchpoint (the user’s input) to the final interaction (displaying the result to the user).
Discussing the user journey helps us understand the flow of interaction between the user and the product and identify the core components involved in that interaction.
Recommender System
User Interaction Layer
- Chat Interface: Users can describe their interests, ask questions, and discuss product features.
- Input Handling: Supports text-based and voice-based interactions.
NLP Module
- Intent Recognition: Extracts user intent (e.g., “I want a lightweight laptop for travel”).
- Entity Extraction: Identifies key product attributes (e.g., “lightweight,” “laptop,” “travel”).
- Sentiment Analysis: Understands user sentiment to refine recommendations.
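A minimal sketch of this module, again assuming the hypothetical `llm.chat` client and that its reply contains only a JSON object:
import json

def analyze_message(message):
    prompt = (
        "Extract the intent, product-related entities, and sentiment "
        "(positive/neutral/negative) from the user message. Reply with JSON like "
        '{"intent": "...", "entities": ["..."], "sentiment": "..."}.\n\n'
        "Message: " + message
    )
    reply = llm.chat(message=prompt)    # assumes the reply is plain-text JSON
    return json.loads(reply)

# e.g. analyze_message("I want a lightweight laptop for travel") might return
# {"intent": "find_product", "entities": ["lightweight", "laptop", "travel"], "sentiment": "neutral"}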
Product Database
- Structure: Contains product details, including:
- Name, Category, Price
- Features & Specifications
- User Reviews & Ratings
The most important component of this module is the embedding model to convert all the data into vectors so that we can use the vector database to store and search for similar items. The ideal vector space should be able to capture the semantic meaning of the data, i.e., the more similar the data is, the closer the vectors are in the vector space.
The most commonly used embedding models fall into three categories:
1️⃣ General-Purpose Text Embeddings (Best for Q&A, knowledge retrieval)
2️⃣ Domain-Specific Embeddings (Optimized for medical, legal, code, etc.)
3️⃣ Multimodal Embeddings (For text + images)
Recommendation Engine
- Content-Based Filtering: Matches user preferences with product attributes.
- Collaborative Filtering: Uses customer behavior data to suggest items others with similar preferences liked.
- Hybrid Approach: Combines content-based and collaborative filtering.
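A minimal sketch of the hybrid idea: combine a content-based score (cosine similarity between the user’s query embedding and a product embedding) with a collaborative score derived from behaviour data, using a weighted sum whose weight is an assumption to be tuned offline:
import numpy as np

def hybrid_score(query_embedding, product_embedding, collaborative_score, alpha=0.7):
    """Weighted combination of content-based and collaborative signals."""
    # Content-based score: cosine similarity between query and product embeddings
    content_score = np.dot(query_embedding, product_embedding) / (
        np.linalg.norm(query_embedding) * np.linalg.norm(product_embedding)
    )
    return alpha * content_score + (1 - alpha) * collaborative_score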
RAG-based recommendation
RAG is built on two main components:
1️⃣ Retriever (Information Fetching)
- Dense Vector Search (FAISS, Annoy, Pinecone, Weaviate, ChromaDB) that uses embeddings (e.g., BERT, SBERT, DPR) to find semantically similar documents.
- Traditional Search (BM25, ElasticSearch, Google Search API) that retrieves documents using keyword-based matching.
2️⃣ Generator (Text Generation)
- Pre-trained LLMs (GPT, BART, T5, LLaMA) generate responses using the retrieved documents as additional context.
- Can use fine-tuned models for domain-specific responses (e.g., finance, medical).
Key Steps in RAG:
1️⃣ User Input → A query is given (e.g., “What are the latest gaming laptops?”).
2️⃣ Retrieval Module → Finds the most relevant documents using vector search (FAISS, BM25, ElasticSearch, etc.).
3️⃣ Context Injection → The retrieved documents are passed to the generation model.
4️⃣ Response Generation → The model generates a final, coherent answer using both the query and retrieved documents.
References
[1] Build your first LLM agent application: https://developer.nvidia.com/blog/building-your-first-llm-agent-application/
[2] Jay Alammar and Maarten Grootendorst, Hands-On Large Language Models: Language Understanding and Generation, O’Reilly Media, 2024.