LLM Series - Part 2 - Common Implementations in LLMs
Transformer in PyTorch
Attention Mechanism and Position Embedding
Cross-Attention and Self-Attention
Aspect | Self-Attention | Cross-Attention |
---|---|---|
Definition | Focuses on relationships between elements of the same sequence (e.g., within a sentence). | Focuses on relationships between elements of one sequence and another sequence (e.g., query and context). |
Inputs | Single sequence (e.g., the same sequence is used for queries, keys, and values). | Two sequences: one provides queries, and the other provides keys and values. |
Purpose | Captures intra-sequence dependencies, helping the model understand context within the same sequence. | Captures inter-sequence dependencies, aligning information between different sequences. |
Key Benefit | Helps the model understand contextual relationships within a sequence. | Enables the model to incorporate external information from another sequence. Very important for multi-modal tasks. |
Note that the encoder uses only self-attention. The decoder uses masked self-attention followed by cross-attention to attend to the encoder's output.
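As a minimal sketch (using PyTorch's built-in nn.MultiheadAttention rather than the custom module implemented later), the only difference between the two is where the keys and values come from:
import torch
import torch.nn as nn
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)    # query sequence: (batch, seq_len, embed_dim)
ctx = torch.randn(2, 20, 64)  # context sequence (e.g., encoder output)
self_out, _ = mha(x, x, x)        # self-attention: Q, K, V all come from the same sequence
cross_out, _ = mha(x, ctx, ctx)   # cross-attention: Q from x, K and V from the context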
Details of Attention Mechanism
Below is a table of the operations and dimensions of the attention mechanism, including the original formulation and the two additional variants proposed in our paper (KPOP).
Operation | Original Operation | Original Dim | Concatenative Operation | Concatenative Dim | Additive Operation | Additive Dim |
---|---|---|---|---|---|---|
Q | W_qZ | b×m_z×d | W_qZ | b×m_z×d | W_qZ | b×m_z×d |
K | W_kC | b×m_c×d | W_k cat(C, repeat(p,b)) | b×(m_c+m_p)×d | W_k(C + repeat(p,b)) | b×m_c×d |
V | W_vC | b×m_c×d | W_v cat(C, repeat(p,b)) | b×(m_c+m_p)×d | W_v(C + repeat(p,b)) | b×m_c×d |
A | σ(QK^T/√d) | b×m_z×m_c | σ(QK^T/√d) | b×m_z×(m_c+m_p) | σ(QK^T/√d) | b×m_z×m_c |
O | AV | b×m_z×d | AV | b×m_z×d | AV | b×m_z×d |
Note: cat(·,·), repeat(·,b), and σ(·) denote concatenation (along dim=1), repeating an input b times (along dim=0), and the softmax operation (along dim=2), respectively.
In the table, Z is the input sequence (i.e., the query) and C is the context sequence to be attended to. In self-attention, Z and C are the same sequence.
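Here is a minimal sketch of the concatenative and additive variants described in the table. It assumes p is a learned prompt of shape m_p×d (and, for the additive case, that m_p equals m_c so the tensors can be added); shapes and names are illustrative only, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

b, m_z, m_c, m_p, d = 2, 4, 6, 3, 16            # batch, query/context/prompt lengths, model dim
Z = torch.randn(b, m_z, d)                       # query sequence
C = torch.randn(b, m_c, d)                       # context sequence
p = nn.Parameter(torch.randn(m_p, d))            # learned prompt (assumed learnable)
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

Q = W_q(Z)                                       # b × m_z × d
p_rep = p.unsqueeze(0).repeat(b, 1, 1)           # repeat(p, b): b × m_p × d

# Concatenative variant: K/V sequence length grows from m_c to m_c + m_p
K = W_k(torch.cat([C, p_rep], dim=1))            # b × (m_c + m_p) × d
V = W_v(torch.cat([C, p_rep], dim=1))
# Additive variant (needs m_p == m_c): K = W_k(C + p_rep), V = W_v(C + p_rep); dims stay b × m_c × d

A = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)   # b × m_z × (m_c + m_p)
O = A @ V                                                    # b × m_z × d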
Why Multi-Head Attention?
Multi-head attention is a key component in Transformer models and is used to enhance the model’s ability to capture different types of relationships and patterns in the input data.
- Learning Different Representations: Each “head” in multi-head attention operates independently, using its own set of learned weight matrices. This allows each head to focus on different parts of the input sequence or on different types of relationships within the sequence.
- Dimensionality Flexibility: Each attention head operates on a reduced-dimensional subspace of the input embeddings, as the total embedding dimension is split across all heads. This division reduces the computational cost of individual attention heads, while the aggregation of all heads retains the full expressiveness of the original dimensionality.
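As a quick illustrative sketch of the dimensionality split (the numbers are arbitrary): with d_model = 512 and 8 heads, each head attends in a 64-dimensional subspace, and concatenating the heads restores the original dimension.
import torch
batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads                                          # 64 dimensions per head
x = torch.randn(batch, seq_len, d_model)
heads = x.reshape(batch, seq_len, num_heads, d_k).transpose(1, 2)   # (batch, num_heads, seq_len, d_k)
merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)     # concatenating heads restores d_model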
Position Embedding
For position \(pos\) and dimension \(i\) in the embedding:
\[PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
Where:
- \(pos\) is the position in the sequence (0 to max_len-1)
- \(i\) is the index of the sine/cosine pair (0 to \(d_{model}/2 - 1\))
- \(d_{model}\) is the embedding dimension
This creates a unique position encoding for each position in the sequence using alternating sine and cosine functions at different frequencies.
How to implement a Transformer model in PyTorch?
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEncoding, self).__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.pe = pe.unsqueeze(0)
def forward(self, x):
return x + self.pe[:, :x.size(1), :].to(x.device)
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
assert d_model % num_heads == 0
self.d_k = d_model // num_heads
self.num_heads = num_heads
self.qkv_proj = nn.Linear(d_model, d_model * 3)
self.fc = nn.Linear(d_model, d_model)
def forward(self, x, mask=None, encoder_output=None):
batch_size, seq_length, d_model = x.size()
qkv = self.qkv_proj(x).reshape(batch_size, seq_length, 3, self.num_heads, self.d_k)
q, k, v = qkv.permute(2, 0, 3, 1, 4) # (batch, num_heads, seq_length, d_k)
if encoder_output is not None:
    # Cross-attention: keys and values come from the encoder output,
    # projected with the same qkv projection and split into heads; only the K/V parts are used.
    enc_len = encoder_output.size(1)
    enc_kv = self.qkv_proj(encoder_output).reshape(batch_size, enc_len, 3, self.num_heads, self.d_k)
    _, k, v = enc_kv.permute(2, 0, 3, 1, 4)
attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=-1)
attn_output = torch.matmul(attn_weights, v)
attn_output = attn_output.transpose(1, 2).reshape(batch_size, seq_length, d_model)
return self.fc(attn_output)
class FeedForward(nn.Module):
def __init__(self, d_model, hidden_dim):
super(FeedForward, self).__init__()
self.fc1 = nn.Linear(d_model, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, d_model)
def forward(self, x):
return self.fc2(F.relu(self.fc1(x)))
class TransformerBlock(nn.Module):
def __init__(self, d_model, num_heads, hidden_dim, is_decoder=False):
super(TransformerBlock, self).__init__()
self.attn = MultiHeadAttention(d_model, num_heads)
self.norm1 = nn.LayerNorm(d_model)
self.ff = FeedForward(d_model, hidden_dim)
self.norm2 = nn.LayerNorm(d_model)
self.is_decoder = is_decoder
if is_decoder:
self.cross_attn = MultiHeadAttention(d_model, num_heads)
self.norm3 = nn.LayerNorm(d_model)
def forward(self, x, mask=None, encoder_output=None):
x = self.norm1(x + self.attn(x, mask))
if self.is_decoder and encoder_output is not None:
x = self.norm3(x + self.cross_attn(x, encoder_output=encoder_output))
x = self.norm2(x + self.ff(x))
return x
class Transformer(nn.Module):
def __init__(self, vocab_size, d_model, num_heads, num_layers, hidden_dim, max_len=5000):
super(Transformer, self).__init__()
self.embedding = nn.Embedding(vocab_size, d_model)
self.pos_encoding = PositionalEncoding(d_model, max_len)
self.encoder_layers = nn.ModuleList([TransformerBlock(d_model, num_heads, hidden_dim) for _ in range(num_layers)])
self.decoder_layers = nn.ModuleList([TransformerBlock(d_model, num_heads, hidden_dim, is_decoder=True) for _ in range(num_layers)])
self.fc_out = nn.Linear(d_model, vocab_size)
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
src = self.embedding(src)
src = self.pos_encoding(src)
for layer in self.encoder_layers:
src = layer(src, src_mask)
tgt = self.embedding(tgt)
tgt = self.pos_encoding(tgt)
for layer in self.decoder_layers:
tgt = layer(tgt, tgt_mask, encoder_output=src)
return self.fc_out(tgt)
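A quick smoke test of the model above (the hyperparameters are arbitrary):
vocab_size, d_model, num_heads, num_layers, hidden_dim = 1000, 128, 8, 2, 256
model = Transformer(vocab_size, d_model, num_heads, num_layers, hidden_dim)
src = torch.randint(0, vocab_size, (2, 12))  # (batch, src_len)
tgt = torch.randint(0, vocab_size, (2, 10))  # (batch, tgt_len)
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 10, 1000])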
Fine-tune an LLM with LangChain
Workflow:
- Load the model and tokenizer
- Load the dataset
- Create a pipeline
- Define the training arguments
- Train the model
# Load the model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Load the dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="data.json")
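# NOTE (added sketch): Trainer expects tokenized examples with labels; the "text"
# field name below is an assumption about data.json's schema.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama-style tokenizers have no pad token by default
def tokenize_fn(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels are the input ids
    return tokens
dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset["train"].column_names)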
# Create a pipeline
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# Define the training arguments
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./finetuned_model",
per_device_train_batch_size=4,
num_train_epochs=3,
logging_dir="./logs"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
)
trainer.train()
# Save the model
model.save_pretrained("./finetuned_llm")
tokenizer.save_pretrained("./finetuned_llm")
# Load the fine-tuned model for inference
from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(model_id="./finetuned_llm", task="text-generation")
response = llm("Explain transformers in NLP")
print(response)
Langchain Agents for Chatbots
# load the model
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
# Load fine-tuned model and tokenizer
model_path = "./finetuned_llm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
# Create a HuggingFace pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# create a langchain agent
from langchain.agents import initialize_agent
from langchain.tools import Tool
from langchain.memory import ConversationBufferMemory
# Define a simple tool (e.g., search or custom function)
def custom_tool(input_text):
return f"Processing input: {input_text}"
tools = [Tool(name="CustomTool", func=custom_tool, description="A simple tool for processing text")]
# Initialize memory for conversation history
memory = ConversationBufferMemory(memory_key="chat_history")
# Create an agent using the fine-tuned model
agent = initialize_agent(
tools=tools,
llm=llm,
agent="zero-shot-react-description",
verbose=True,
memory=memory,
)
# Test the agent
response = agent.run("What is the difference between GPT and BERT?")
print(response)
Integrating RAG with LangChain
Workflow:
- Load and split documents and store embeddings in FAISS
- Create a Retrieval-Augmented Chain
- (Optional) Deploy as a Chatbot using FastAPI
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load and split documents
loader = TextLoader("knowledge.txt") # Your text data
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
# Use HuggingFace embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Store embeddings in FAISS
vectorstore = FAISS.from_documents(docs, embedding_model)
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import FAISS
# Load your fine-tuned model
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_path = "./finetuned_llm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# Set up the retriever
retriever = vectorstore.as_retriever()
# Create a retrieval-augmented Q&A pipeline
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
# Ask a question
query = "What is LangChain?"
response = qa_chain.run(query)
print(response)
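The workflow's optional last step (deploying as a chatbot with FastAPI) is not shown above. Here is a minimal sketch, assuming it lives in a file app.py alongside the qa_chain defined above:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    answer = qa_chain.run(query.question)  # reuse the retrieval-augmented chain from above
    return {"answer": answer}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000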
Train an LLM with LoRA & QLoRA
Workflow:
- Load the model (with 4-bit quantization if QLoRA is used)
- Apply LoRA or QLoRA with the `peft` library
- Train the model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "meta-llama/Llama-2-7b-hf" # Example: LLaMA-2 7B
# Enable 4-bit quantization (for QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16"
)
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
Apply LoRA or QLoRA with the `peft` library
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16, # Rank of LoRA matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Apply LoRA to attention layers
lora_dropout=0.1, # Dropout rate
bias="none",
task_type="CAUSAL_LM" # Language modeling task
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
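Note that the training step below references a `dataset` variable that is not defined in this snippet. A minimal sketch of preparing one, assuming a `data.json` file with a "text" field as in the earlier fine-tuning section:
from datasets import load_dataset

raw = load_dataset("json", data_files="data.json")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = raw["train"].map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)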
Train the model
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./finetuned_llm_lora",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
logging_dir="./logs",
fp16=True,
save_steps=500,
save_total_limit=2
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset
)
trainer.train()
Implement a Custom Tokenizer
What is a tokenizer?
A tokenizer is an important component of any NLP model. It is responsible for converting text into tokens (e.g., words, characters, or subwords) that LLMs can work with instead of raw text. The choice of tokenizer significantly impacts the performance of the model and the ability to generalize across different tasks.
- Shorter token sequences = faster inference and training. A good tokenizer avoids unnecessary tokens and keeps the vocabulary size manageable.
- Better tokenization improves generalization. A good tokenizer handles out-of-vocabulary (OOV) words better. The words "play", "playing", and "plays" should be tokenized as "play", "play" + "ing", and "play" + "s" rather than as three unrelated tokens, so the model can learn the relationship between them.
- Smaller vocabularies can work for low-resource languages.
- Custom tokenizers may improve domain-specific tasks, e.g., legal, medical, or coding, where domain-specific tokens are not found in a general-purpose tokenizer.
How to implement a custom tokenizer?
Collect domain-specific text data for training the tokenizer.
import os
# Example: Load text files from a directory
data_path = "custom_texts/"
files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith(".txt")]
# Read all files into a single text corpus
corpus = []
for file in files:
with open(file, "r", encoding="utf-8") as f:
corpus.append(f.read())
# Convert into a list of lines
corpus = "\n".join(corpus).split("\n")
Train the tokenizer. We use Byte Pair Encoding (BPE) for this example. The trained vocabulary will be saved as `custom_tokenizer.json`.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize a tokenizer with a BPE model (set the unknown token so OOV symbols map to [UNK])
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Define a trainer
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
# Use whitespace pre-tokenization
tokenizer.pre_tokenizer = Whitespace()
# Train the tokenizer on the custom dataset
tokenizer.train_from_iterator(corpus, trainer)
# Save the tokenizer
tokenizer.save("custom_tokenizer.json")
Load the custom tokenizer. Note that tokenization includes both encoding (tokenizing: converting text to token IDs) and decoding (detokenizing: converting token IDs back to text).
from transformers import PreTrainedTokenizerFast
# Load the tokenizer
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom_tokenizer.json", unk_token="[UNK]", pad_token="[PAD]")
# Test encoding
text = "Hello, how are you?"
tokens = hf_tokenizer.encode(text)
print("Tokens:", tokens)
# Decode back
decoded_text = hf_tokenizer.decode(tokens)
print("Decoded:", decoded_text)
Build a Chatbot with Ollama
Ollama
What is Ollama?
Ollama is an application designed to make running LLMs locally easy. It is a lightweight, fast alternative to cloud-based LLM services.
Advantages of Ollama:
- Cross-platform: Ollama is available on Windows, Linux, and macOS.
- Multiple LLMs: Ollama supports a wide range of open-source LLMs, including LLaMA, Mistral, and DeepSeek.
What can Ollama do?
- Run LLMs Locally
- Supports various open-source LLMs (e.g., LLaMA, Mistral, Gemma, Phi-2).
- No need for an internet connection once the model is downloaded.
- Efficient memory management for running LLMs on laptops and desktops.
- Easy Model Management
  - Install models with simple commands (`ollama pull <model-name>`)
  - Supports custom model creation with fine-tuned weights and configurations
- Flexible API for Developers
- Provides a CLI (Command Line Interface) and a Python API.
- Can be integrated into applications for chatbots, text generation, and NLP tasks.
- Prompt Engineering & Fine-Tuning
- Allows users to customize system prompts for better responses.
- Supports parameter tuning to control model behavior.
Use Cases:
- Chatbots & Assistants – Build local AI-powered assistants.
- Text Generation – Summarization, paraphrasing, creative writing.
- Code Generation – AI-assisted coding with models like CodeLLaMA.
- Privacy-Sensitive Applications – Run LLMs without sending data to the cloud.
Create a Python Chatbot with Ollama
import requests
from requests.exceptions import ConnectionError
import json
def chat_with_ollama(prompt, model="mistral"):
url = "http://localhost:11434/api/generate" # Changed back to default Ollama port
data = {
"model": model,
"prompt": prompt
}
try:
response = requests.post(url, json=data, stream=True)
if not response.ok:
return f"Error: API returned status code {response.status_code}"
# Handle streaming response
full_response = ""
for line in response.iter_lines():
if line:
# Decode the line and parse it as JSON
json_response = json.loads(line)
if "response" in json_response:
full_response += json_response["response"]
return full_response
except ConnectionError:
return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
except json.JSONDecodeError as e:
return f"Error: Invalid JSON response from server. Details: {str(e)}"
except Exception as e:
return f"Error: {str(e)}"
# Chat loop
print("Chatbot (type 'exit' to quit):")
while True:
user_input = input("You: ")
if user_input.lower() == "exit":
break
response = chat_with_ollama(user_input)
print(f"Bot: {response}")
How it works:
- The script sends user input to Ollama’s local API.
- The model generates a response and returns it.
- The chatbot runs in a loop until the user types “exit”.
Running the Chatbot:
ollama pull mistral
ollama serve
python chatbot.py
Useful Commands:
- `ollama pull mistral` - Pull the mistral model
- `ollama serve` - Start the Ollama server
- `ollama ps` - See which models are currently loaded into memory
- `ollama stop <model-name>` - Stop a running model
- `ollama rm <model-name>` - Remove a model
- `ollama list` - List all models. This is useful because when you pull a model, e.g., `ollama pull mistral`, it shows up as `mistral:latest`.
Issues:
- Error: listen tcp 127.0.0.1:11434: bind: address already in use
  - This means an Ollama server is already running on that port.
  - To fix this, either stop what is already running or start the server on a different port:
    - `ollama stop <model-name>` - Stop a running model
    - `OLLAMA_HOST=127.0.0.1:11500 ollama serve` - Run the server on a different port
- Setting environment variables on Linux: https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux