LLM Series - Part 2 - Common Implementations in LLMs
Transformer in PyTorch
Attention Mechanism and Position Embedding
Cross-Attention and Self-Attention
Aspect | Self-Attention | Cross-Attention |
---|---|---|
Definition | Focuses on relationships between elements of the same sequence (e.g., within a sentence). | Focuses on relationships between elements of one sequence and another sequence (e.g., query and context). |
Inputs | Single sequence (e.g., the same sequence is used for queries, keys, and values). | Two sequences: one provides queries, and the other provides keys and values. |
Purpose | Captures intra-sequence dependencies, helping the model understand context within the same sequence. | Captures inter-sequence dependencies, aligning information between different sequences. |
Key Benefit | Helps the model understand contextual relationships within a sequence. | Enables the model to incorporate external information from another sequence. Very important for multi-modal tasks. |
Note that the encoder uses only self-attention. The decoder uses masked self-attention followed by cross-attention to attend to the encoder's output.
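As a minimal sketch (using PyTorch's built-in nn.MultiheadAttention rather than the custom module implemented later), the only difference between the two is where the keys and values come from:
import torch
import torch.nn as nn
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)    # query sequence: (batch, seq_len, embed_dim)
ctx = torch.randn(2, 20, 64)  # context sequence (e.g., encoder output)
self_out, _ = mha(x, x, x)        # self-attention: Q, K, V all come from the same sequence
cross_out, _ = mha(x, ctx, ctx)   # cross-attention: Q from x, K and V from the context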
Details of Attention Mechanism
Below is a table of the operations and dimensions of the attention mechanism, including the original formulation and the two additional variants proposed in our paper (KPOP).
Operation | Original Operation | Original Dim | Concatenative Operation | Concatenative Dim | Additive Operation | Additive Dim |
---|---|---|---|---|---|---|
Q | W_qZ | b×m_z×d | W_qZ | b×m_z×d | W_qZ | b×m_z×d |
K | W_kC | b×m_c×d | W_k cat(C, repeat(p,b)) | b×(m_c+m_p)×d | W_k(C + repeat(p,b)) | b×m_c×d |
V | W_vC | b×m_c×d | W_v cat(C, repeat(p,b)) | b×(m_c+m_p)×d | W_v(C + repeat(p,b)) | b×m_c×d |
A | σ(QK^T/√d) | b×m_z×m_c | σ(QK^T/√d) | b×m_z×(m_c+m_p) | σ(QK^T/√d) | b×m_z×m_c |
O | AV | b×m_z×d | AV | b×m_z×d | AV | b×m_z×d |
Note: cat(·,·), repeat(·,b), and σ(·) denote concatenation (along dim=1), repeating an input b times (along dim=0), and the softmax operation (along dim=2), respectively.
In the table, Z is the input sequence (i.e., the query) and C is the context sequence to be attended to. In self-attention, Z and C are the same sequence.
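Here is a minimal sketch of the concatenative and additive variants described in the table. It assumes p is a learned prompt of shape m_p×d (and, for the additive case, that m_p equals m_c so the tensors can be added); shapes and names are illustrative only, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

b, m_z, m_c, m_p, d = 2, 4, 6, 3, 16            # batch, query/context/prompt lengths, model dim
Z = torch.randn(b, m_z, d)                       # query sequence
C = torch.randn(b, m_c, d)                       # context sequence
p = nn.Parameter(torch.randn(m_p, d))            # learned prompt (assumed learnable)
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

Q = W_q(Z)                                       # b × m_z × d
p_rep = p.unsqueeze(0).repeat(b, 1, 1)           # repeat(p, b): b × m_p × d

# Concatenative variant: K/V sequence length grows from m_c to m_c + m_p
K = W_k(torch.cat([C, p_rep], dim=1))            # b × (m_c + m_p) × d
V = W_v(torch.cat([C, p_rep], dim=1))
# Additive variant (needs m_p == m_c): K = W_k(C + p_rep), V = W_v(C + p_rep); dims stay b × m_c × d

A = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)   # b × m_z × (m_c + m_p)
O = A @ V                                                    # b × m_z × d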
Why Multi-Head Attention?
Multi-head attention is a key component in Transformer models and is used to enhance the model’s ability to capture different types of relationships and patterns in the input data.
- Learning Different Representations: Each “head” in multi-head attention operates independently, using its own set of learned weight matrices. This allows each head to focus on different parts of the input sequence or on different types of relationships within the sequence.
- Dimensionality Flexibility: Each attention head operates on a reduced-dimensional subspace of the input embeddings, as the total embedding dimension is split across all heads. This division reduces the computational cost of individual attention heads, while the aggregation of all heads retains the full expressiveness of the original dimensionality.
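As a quick illustrative sketch of the dimensionality split (the numbers are arbitrary): with d_model = 512 and 8 heads, each head attends in a 64-dimensional subspace, and concatenating the heads restores the original dimension.
import torch
batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads                                          # 64 dimensions per head
x = torch.randn(batch, seq_len, d_model)
heads = x.reshape(batch, seq_len, num_heads, d_k).transpose(1, 2)   # (batch, num_heads, seq_len, d_k)
merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)     # concatenating heads restores d_model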
Position Embedding
For position \(pos\) and dimension \(i\) in the embedding:
\[PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
Where:
- \(pos\) is the position in the sequence (0 to max_len-1)
- \(i\) is the index of the sine/cosine pair (0 to \(d_{model}/2 - 1\))
- \(d_{model}\) is the embedding dimension
This creates a unique position encoding for each position in the sequence using alternating sine and cosine functions at different frequencies.
How to implement a Transformer model in PyTorch?
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super(PositionalEncoding, self).__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.pe = pe.unsqueeze(0)
def forward(self, x):
return x + self.pe[:, :x.size(1), :].to(x.device)
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
assert d_model % num_heads == 0
self.d_k = d_model // num_heads
self.num_heads = num_heads
self.qkv_proj = nn.Linear(d_model, d_model * 3)
self.fc = nn.Linear(d_model, d_model)
def forward(self, x, mask=None, encoder_output=None):
batch_size, seq_length, d_model = x.size()
qkv = self.qkv_proj(x).reshape(batch_size, seq_length, 3, self.num_heads, self.d_k)
q, k, v = qkv.permute(2, 0, 3, 1, 4) # (batch, num_heads, seq_length, d_k)
if encoder_output is not None:
    # Cross-attention: keys and values come from the encoder output,
    # projected with the same qkv projection and split into heads; only the K/V parts are used.
    enc_len = encoder_output.size(1)
    enc_kv = self.qkv_proj(encoder_output).reshape(batch_size, enc_len, 3, self.num_heads, self.d_k)
    _, k, v = enc_kv.permute(2, 0, 3, 1, 4)
attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=-1)
attn_output = torch.matmul(attn_weights, v)
attn_output = attn_output.transpose(1, 2).reshape(batch_size, seq_length, d_model)
return self.fc(attn_output)
class FeedForward(nn.Module):
def __init__(self, d_model, hidden_dim):
super(FeedForward, self).__init__()
self.fc1 = nn.Linear(d_model, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, d_model)
def forward(self, x):
return self.fc2(F.relu(self.fc1(x)))
class TransformerBlock(nn.Module):
def __init__(self, d_model, num_heads, hidden_dim, is_decoder=False):
super(TransformerBlock, self).__init__()
self.attn = MultiHeadAttention(d_model, num_heads)
self.norm1 = nn.LayerNorm(d_model)
self.ff = FeedForward(d_model, hidden_dim)
self.norm2 = nn.LayerNorm(d_model)
self.is_decoder = is_decoder
if is_decoder:
self.cross_attn = MultiHeadAttention(d_model, num_heads)
self.norm3 = nn.LayerNorm(d_model)
def forward(self, x, mask=None, encoder_output=None):
x = self.norm1(x + self.attn(x, mask))
if self.is_decoder and encoder_output is not None:
x = self.norm3(x + self.cross_attn(x, encoder_output=encoder_output))
x = self.norm2(x + self.ff(x))
return x
class Transformer(nn.Module):
def __init__(self, vocab_size, d_model, num_heads, num_layers, hidden_dim, max_len=5000):
super(Transformer, self).__init__()
self.embedding = nn.Embedding(vocab_size, d_model)
self.pos_encoding = PositionalEncoding(d_model, max_len)
self.encoder_layers = nn.ModuleList([TransformerBlock(d_model, num_heads, hidden_dim) for _ in range(num_layers)])
self.decoder_layers = nn.ModuleList([TransformerBlock(d_model, num_heads, hidden_dim, is_decoder=True) for _ in range(num_layers)])
self.fc_out = nn.Linear(d_model, vocab_size)
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
src = self.embedding(src)
src = self.pos_encoding(src)
for layer in self.encoder_layers:
src = layer(src, src_mask)
tgt = self.embedding(tgt)
tgt = self.pos_encoding(tgt)
for layer in self.decoder_layers:
tgt = layer(tgt, tgt_mask, encoder_output=src)
return self.fc_out(tgt)
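A quick smoke test of the model above (the hyperparameters are arbitrary):
vocab_size, d_model, num_heads, num_layers, hidden_dim = 1000, 128, 8, 2, 256
model = Transformer(vocab_size, d_model, num_heads, num_layers, hidden_dim)
src = torch.randint(0, vocab_size, (2, 12))  # (batch, src_len)
tgt = torch.randint(0, vocab_size, (2, 10))  # (batch, tgt_len)
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 10, 1000])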
Fine-tune an LLM with LangChain
Workflow:
- Load the model and tokenizer
- Load the dataset
- Create a pipeline
- Define the training arguments
- Train the model
# Load the model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf" # Example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Load the dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="data.json")
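# NOTE (added sketch): Trainer expects tokenized examples with labels; the "text"
# field name below is an assumption about data.json's schema.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama-style tokenizers have no pad token by default
def tokenize_fn(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels are the input ids
    return tokens
dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset["train"].column_names)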
# Create a pipeline
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# Define the training arguments
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./finetuned_model",
per_device_train_batch_size=4,
num_train_epochs=3,
logging_dir="./logs"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
)
trainer.train()
# Save the model
model.save_pretrained("./finetuned_llm")
tokenizer.save_pretrained("./finetuned_llm")
# Load the fine-tuned model for inference
from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(model_id="./finetuned_llm", task="text-generation")
response = llm("Explain transformers in NLP")
print(response)
Langchain Agents for Chatbots
# load the model
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
# Load fine-tuned model and tokenizer
model_path = "./finetuned_llm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
# Create a HuggingFace pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# create a langchain agent
from langchain.agents import initialize_agent
from langchain.tools import Tool
from langchain.memory import ConversationBufferMemory
# Define a simple tool (e.g., search or custom function)
def custom_tool(input_text):
return f"Processing input: {input_text}"
tools = [Tool(name="CustomTool", func=custom_tool, description="A simple tool for processing text")]
# Initialize memory for conversation history
memory = ConversationBufferMemory(memory_key="chat_history")
# Create an agent using the fine-tuned model
agent = initialize_agent(
tools=tools,
llm=llm,
agent="zero-shot-react-description",
verbose=True,
memory=memory,
)
# Test the agent
response = agent.run("What is the difference between GPT and BERT?")
print(response)
Integrating RAG with LangChain
Workflow:
- Load and split documents and store embeddings in FAISS
- Create a Retrieval-Augmented Chain
- (Optional) Deploy as a Chatbot using FastAPI
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load and split documents
loader = TextLoader("knowledge.txt") # Your text data
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
# Use HuggingFace embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Store embeddings in FAISS
vectorstore = FAISS.from_documents(docs, embedding_model)
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import FAISS
# Load your fine-tuned model
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_path = "./finetuned_llm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
# Set up the retriever
retriever = vectorstore.as_retriever()
# Create a retrieval-augmented Q&A pipeline
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
# Ask a question
query = "What is LangChain?"
response = qa_chain.run(query)
print(response)
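The workflow's optional last step (deploying as a chatbot with FastAPI) is not shown above. Here is a minimal sketch, assuming it lives in a file app.py alongside the qa_chain defined above:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    answer = qa_chain.run(query.question)  # reuse the retrieval-augmented chain from above
    return {"answer": answer}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000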
Train an LLM with LoRA & QLoRA
Workflow:
- Load the model (with 4-bit quantization if QLoRA is used)
- Apply LoRA or QLoRA with the `peft` library
- Train the model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "meta-llama/Llama-2-7b-hf" # Example: LLaMA-2 7B
# Enable 4-bit quantization (for QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16"
)
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
Apply LoRA or QLoRA with the `peft` library
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16, # Rank of LoRA matrices
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj"], # Apply LoRA to attention layers
lora_dropout=0.1, # Dropout rate
bias="none",
task_type="CAUSAL_LM" # Language modeling task
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
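Note that the training step below references a `dataset` variable that is not defined in this snippet. A minimal sketch of preparing one, assuming a `data.json` file with a "text" field as in the earlier fine-tuning section:
from datasets import load_dataset

raw = load_dataset("json", data_files="data.json")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(examples):
    tokens = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = raw["train"].map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)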
Train the model
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./finetuned_llm_lora",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
logging_dir="./logs",
fp16=True,
save_steps=500,
save_total_limit=2
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset
)
trainer.train()
Implement a Custom Tokenizer
What is a tokenizer?
A tokenizer is an important component of any NLP model. It is responsible for converting text into tokens (e.g., words, characters, or subwords) that LLMs can work with instead of raw text. The choice of tokenizer significantly impacts the performance of the model and the ability to generalize across different tasks.
- Shorter token sequences = faster inference and training. A good tokenizer avoids unnecessary tokens and keeps the vocabulary size manageable.
- Better tokenization improves generalization. A good tokenizer handles out-of-vocabulary (OOV) words better. The words "play", "playing", and "plays" should be tokenized as "play", "play" + "ing", and "play" + "s" rather than as three unrelated tokens, so the model can learn the relationship between them.
- Smaller vocabularies can work for low-resource languages.
- Custom tokenizers may improve domain-specific tasks, e.g., legal, medical, or coding, where domain-specific tokens are not found in a general-purpose tokenizer.
How to implement a custom tokenizer?
Collect domain-specific text data for training the tokenizer.
import os
# Example: Load text files from a directory
data_path = "custom_texts/"
files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith(".txt")]
# Read all files into a single text corpus
corpus = []
for file in files:
with open(file, "r", encoding="utf-8") as f:
corpus.append(f.read())
# Convert into a list of lines
corpus = "\n".join(corpus).split("\n")
Train the tokenizer. We use Byte Pair Encoding (BPE) for this example. The trained vocabulary will be saved as `custom_tokenizer.json`.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize a tokenizer with a BPE model (set the unknown token so OOV symbols map to [UNK])
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Define a trainer
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
# Use whitespace pre-tokenization
tokenizer.pre_tokenizer = Whitespace()
# Train the tokenizer on the custom dataset
tokenizer.train_from_iterator(corpus, trainer)
# Save the tokenizer
tokenizer.save("custom_tokenizer.json")
Load the custom tokenizer. Note that tokenization includes both encoding (tokenizing: converting text to token IDs) and decoding (detokenizing: converting token IDs back to text).
from transformers import PreTrainedTokenizerFast
# Load the tokenizer
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="custom_tokenizer.json", unk_token="[UNK]", pad_token="[PAD]")
# Test encoding
text = "Hello, how are you?"
tokens = hf_tokenizer.encode(text)
print("Tokens:", tokens)
# Decode back
decoded_text = hf_tokenizer.decode(tokens)
print("Decoded:", decoded_text)
Build a Chatbot with Ollama
Ollama
What is Ollama?
Ollama is an application designed to make running LLMs locally easy. It is a lightweight, fast alternative to cloud-based LLM services.
Advantages of Ollama:
- Cross-platform: Ollama is available on Windows, Linux, and macOS.
- Multiple LLMs: Ollama supports a wide range of open-source LLMs, including LLaMA, Mistral, and DeepSeek.
What can Ollama do?
- Run LLMs Locally
- Supports various open-source LLMs (e.g., LLaMA, Mistral, Gemma, Phi-2).
- No need for an internet connection once the model is downloaded.
- Efficient memory management for running LLMs on laptops and desktops.
- Easy Model Management
  - Install models with simple commands (`ollama pull <model-name>`)
  - Supports custom model creation with fine-tuned weights and configurations
- Flexible API for Developers
- Provides a CLI (Command Line Interface) and a Python API.
- Can be integrated into applications for chatbots, text generation, and NLP tasks.
- Prompt Engineering & Fine-Tuning
- Allows users to customize system prompts for better responses.
- Supports parameter tuning to control model behavior.
Use Cases:
- Chatbots & Assistants – Build local AI-powered assistants.
- Text Generation – Summarization, paraphrasing, creative writing.
- Code Generation – AI-assisted coding with models like CodeLLaMA.
- Privacy-Sensitive Applications – Run LLMs without sending data to the cloud.
Create a Python Chatbot with Ollama
import requests
from requests.exceptions import ConnectionError
import json
def chat_with_ollama(prompt, model="mistral"):
url = "http://localhost:11434/api/generate" # Changed back to default Ollama port
data = {
"model": model,
"prompt": prompt
}
try:
response = requests.post(url, json=data, stream=True)
if not response.ok:
return f"Error: API returned status code {response.status_code}"
# Handle streaming response
full_response = ""
for line in response.iter_lines():
if line:
# Decode the line and parse it as JSON
json_response = json.loads(line)
if "response" in json_response:
full_response += json_response["response"]
return full_response
except ConnectionError:
return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
except json.JSONDecodeError as e:
return f"Error: Invalid JSON response from server. Details: {str(e)}"
except Exception as e:
return f"Error: {str(e)}"
# Chat loop
print("Chatbot (type 'exit' to quit):")
while True:
user_input = input("You: ")
if user_input.lower() == "exit":
break
response = chat_with_ollama(user_input)
print(f"Bot: {response}")
How it works:
- The script sends user input to Ollama’s local API.
- The model generates a response and returns it.
- The chatbot runs in a loop until the user types “exit”.
Running the Chatbot:
ollama pull mistral
ollama serve
python chatbot.py
Useful Commands:
- `ollama pull mistral` - Pull the mistral model
- `ollama serve` - Start the Ollama server
- `ollama ps` - See which models are currently loaded into memory
- `ollama stop <model-name>` - Stop a running model
- `ollama rm <model-name>` - Remove a model
- `ollama list` - List all models. This is useful because when you pull a model, e.g., `ollama pull mistral`, it shows up as `mistral:latest`.
Issues:
- Error: listen tcp 127.0.0.1:11434: bind: address already in use
  - This means an Ollama server is already running on that port.
  - To fix this, either stop what is already running or start the server on a different port:
    - `ollama stop <model-name>` - Stop a running model
    - `OLLAMA_HOST=127.0.0.1:11500 ollama serve` - Run the server on a different port
- Setting environment variables on Linux: https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux