LLM Series - Part 3 - Build a Chatbot with Ollama
Background
Ollama
What is Ollama?
Ollama is an application designed to make running LLMs locally easy. It is a lightweight, fast alternative to relying on large cloud-based LLM services.
Advantages of Ollama:
- Cross-platform: Ollama is available on Windows, Linux, and macOS.
- Multiple LLMs: Ollama supports a wide range of open LLMs, including Llama, Mistral, Gemma, and DeepSeek.
What can Ollama do?
- Run LLMs Locally
- Supports various open-source LLMs (e.g., LLaMA, Mistral, Gemma, Phi-2).
- No need for an internet connection once the model is downloaded.
- Efficient memory management for running LLMs on laptops and desktops.
- Easy Model Management
- Install models with simple commands (ollama pull <model-name>)
- Supports custom model creation with fine-tuned weights and configurations
- Flexible API for Developers
- Provides a CLI (Command Line Interface) and a Python API.
- Can be integrated into applications for chatbots, text generation, and NLP tasks.
- Prompt Engineering & Fine-Tuning
- Allows users to customize system prompts for better responses.
- Supports parameter tuning to control model behavior.
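For example, generation parameters can be passed through the options field of Ollama's REST API (served locally by ollama serve, covered later in this post). A minimal sketch, where the parameter values are arbitrary examples:

import requests

# Ask the local Ollama server for a single (non-streaming) completion with
# custom generation parameters; temperature and num_predict are Ollama options.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Explain what a context window is in one sentence.",
        "stream": False,
        "options": {"temperature": 0.2, "num_predict": 64},  # lower randomness, cap output length
    },
)
print(response.json()["response"])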
Use Cases:
- Chatbots & Assistants – Build local AI-powered assistants.
- Text Generation – Summarization, paraphrasing, creative writing.
- Code Generation – AI-assisted coding with models like CodeLLaMA.
- Privacy-Sensitive Applications – Run LLMs without sending data to the cloud.
Create a Python Chatbot with Ollama
import requests
from requests.exceptions import ConnectionError
import json

def chat_with_ollama(prompt, model="mistral"):
    url = "http://localhost:11434/api/generate"  # Default Ollama port
    data = {
        "model": model,
        "prompt": prompt
    }
    try:
        response = requests.post(url, json=data, stream=True)
        if not response.ok:
            return f"Error: API returned status code {response.status_code}"

        # Handle the streaming response: each line is a JSON object
        full_response = ""
        for line in response.iter_lines():
            if line:
                # Decode the line and parse it as JSON
                json_response = json.loads(line)
                if "response" in json_response:
                    full_response += json_response["response"]
        return full_response
    except ConnectionError:
        return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
    except json.JSONDecodeError as e:
        return f"Error: Invalid JSON response from server. Details: {str(e)}"
    except Exception as e:
        return f"Error: {str(e)}"

# Chat loop
print("Chatbot (type 'exit' to quit):")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    response = chat_with_ollama(user_input)
    print(f"Bot: {response}")
How it works:
- The script sends user input to Ollama’s local API.
- The model generates a response and returns it.
- The chatbot runs in a loop until the user types “exit”.
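The same loop can also be written with the official ollama Python package instead of raw requests calls. A minimal sketch, assuming the package is installed with pip install ollama and that the mistral model has already been pulled:

import ollama

# ollama.chat talks to the local server; the reply exposes the generated text
# under message/content (dict-style access shown here).
reply = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(reply["message"]["content"])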
Running the Chatbot:
ollama pull mistral
ollama serve
python chatbot.py
Useful Commands:
- ollama pull mistral - Pull the mistral model
- ollama serve - Start the Ollama server
- ollama ps - See which models are currently loaded into memory
- ollama stop <model-name> - Stop a running model
- ollama rm <model-name> - Remove a model
- ollama list - List all models. This is useful because when you pull a model, e.g., ollama pull mistral, it eventually shows up as mistral:latest.
Issues:
- Error: listen tcp 127.0.0.1:11434: bind: address already in use
- This means that an Ollama server is already running on that port.
- To fix this, you can either stop the existing server or run it on a different port:
- ollama stop <model-name> - Stop a running model
- OLLAMA_HOST=127.0.0.1:11500 ollama serve - Run the server on a different port
- Setting environment variables on Linux: https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
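If the server is started on a non-default port this way, the chatbot script above has to point at the same address. A minimal sketch of reading OLLAMA_HOST in the client, assuming the variable holds a plain host:port value as in the command above:

import os
import requests

# Build the API URL from the same OLLAMA_HOST variable the server uses,
# falling back to Ollama's default address when it is not set.
host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
url = f"http://{host}/api/generate"

response = requests.post(url, json={"model": "mistral", "prompt": "Hello!", "stream": False})
print(response.json()["response"])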
vLLM
What is vLLM?
vLLM is a fast and easy-to-use library for LLM inference and serving.
What can vLLM do?
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Supports NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
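As a small illustration of the library, offline (non-server) inference with vLLM's Python API looks roughly like the sketch below; the model name is just an example, and a supported GPU/accelerator plus the Hugging Face download are assumed:

from vllm import LLM, SamplingParams

# Load an example instruction-tuned model and generate with basic sampling settings.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

The same models can instead be exposed through vLLM's OpenAI-compatible API server and queried with the request format shown later in this post.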
Prompt Engineering
Role Playing
LLMs can perform various roles depending on their context, training data, and prompting. The role can be specified in the system prompt. For example, Mistral demonstrates several useful scenarios of its prompting capabilities in this guide: https://docs.mistral.ai/guides/prompting_capabilities/.
Example: Customer Support Classification Bot
Mistral models can easily categorize text into distinct classes. Take a customer support bot for a bank as an illustration: we can establish a series of predetermined categories within the prompt and then instruct Mistral AI models to categorize the customer’s question accordingly.
In the following example, when presented with the customer inquiry, Mistral AI models correctly categorize it as “country support”:
import requests
from requests.exceptions import ConnectionError
import json

def chat_with_ollama(prompt, system_prompt="", model="mistral"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "system": system_prompt
    }
    try:
        response = requests.post(url, json=data, stream=True)
        if not response.ok:
            return f"Error: API returned status code {response.status_code}"

        # Handle streaming response
        full_response = ""
        for line in response.iter_lines():
            if line:
                # Decode the line and parse it as JSON
                json_response = json.loads(line)
                if "response" in json_response:
                    full_response += json_response["response"]
        return full_response
    except ConnectionError:
        return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
    except json.JSONDecodeError as e:
        return f"Error: Invalid JSON response from server. Details: {str(e)}"
    except Exception as e:
        return f"Error: {str(e)}"
# Initialize system prompt
system_prompt = """You are a bank customer service bot. Your task is to assess customer intent and categorize customer inquiry after <<<>>> into one of the following predefined categories:
card arrival
change pin
exchange rate
country support
cancel transfer
charge dispute
If the text doesn't fit into any of the above categories, classify it as:
customer service
You will only respond with the category. Do not include the word "Category". Do not provide explanations or notes."""
# Modified chat loop
print("Chatbot (type 'exit' to quit):")
print("To change system prompt, type 'change_prompt'")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    elif user_input.lower() == "change_prompt":
        system_prompt = input("Enter new system prompt: ")
        print("System prompt updated!")
        continue
    response = chat_with_ollama(user_input, system_prompt)
    print(f"Bot: {response}")
OpenAI’s API format
When sending requests to OpenAI’s API, we specify the model, the conversation messages, and the response settings in the data payload like this:
{
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 1,
    "n": 1,
    "stream": false
}
Where the parameters are:

Parameter | Type | Description |
---|---|---|
“model” | string | The model to use (“gpt-4”, “gpt-3.5-turbo”, etc.) |
“messages” | list | List of messages forming the conversation history |
“role” | string | Role of each message: “system”, “user”, “assistant” |
“content” | string | The actual text content of the message |
“temperature” | float | Controls randomness (0 = deterministic, 1 = highly random) |
“max_tokens” | int | The max number of tokens the response can have |
“top_p” | float | Probability mass for nucleus sampling (alternative to temperature) |
“n” | int | Number of responses to generate |
“stream” | bool | If true, streams back tokens as they are generated |
Multi-turn conversations help the model understand the context of the conversation:
- The conversation history included in each request helps maintain context
{
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are an AI that provides programming advice."},
        {"role": "user", "content": "How do I write a Python function?"},
        {"role": "assistant", "content": "You can define a function using the `def` keyword."},
        {"role": "user", "content": "Can you give me an example?"}
    ]
}
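In practice, the client keeps this messages list in memory and appends each user turn and assistant reply before the next request. A minimal sketch using the raw HTTP API, assuming an API key is set in the OPENAI_API_KEY environment variable:

import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# Conversation history: the system message plus every prior turn.
messages = [{"role": "system", "content": "You are an AI that provides programming advice."}]

for question in ["How do I write a Python function?", "Can you give me an example?"]:
    messages.append({"role": "user", "content": question})
    resp = requests.post(API_URL, headers=headers,
                         json={"model": "gpt-4", "messages": messages, "temperature": 0.7})
    answer = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})  # keep context for the next turn
    print(answer)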
Useful strategies
- Few-shot learning: Few-shot learning (also called in-context learning) is when we give a few examples in the prompt, and the LLM generates output that follows the demonstrated pattern (see the sketch after this list).
- Step-by-step instructions: This strategy is inspired by chain-of-thought prompting, which enables LLMs to use a series of intermediate reasoning steps to tackle complex tasks. Complex problems are often easier to solve when we decompose them into simpler, smaller steps, and it is also easier for us to debug and inspect the model’s behavior.
- Output formatting: We can ask LLMs to output in a certain format by directly asking, e.g., “write a report in Markdown format”.
- Example generation: We can ask LLMs to guide the reasoning and understanding process by generating examples with explanations and steps.
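A minimal few-shot sketch using the same local Ollama endpoint from earlier in this post; the reviews and labels below are made-up examples:

import requests

# A few-shot prompt: two labelled examples, then the new input to classify.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "It stopped working after a week and support never replied."
Sentiment: negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# Non-streaming request to the local Ollama server.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": few_shot_prompt, "stream": False},
)
print(response.json()["response"].strip())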
Some real-world prompting examples
- Codechain by Salesforce at https://github.com/SalesforceAIResearch/CodeChain/blob/main/prompts/codechain_gen.txt
The above prompt file is used in https://github.com/SalesforceAIResearch/CodeChain/blob/main/src/generate.py.
The flow of the code is as follows:
# Load the prompt file
with open(args.prompt_file, 'r', encoding='utf-8') as infile:
    prompt = infile.read()

# Replace the placeholders in the prompt with the actual values
curr_prompt = prompt.replace("<<problem>>", question)
if '<<starter_code>>' in prompt:
    starter_code = problem['starter_code']
    curr_prompt = curr_prompt.replace("<<starter_code>>", starter_code)
if '<<starter_code_task>>' in prompt:
    starter_code = problem['starter_code']
    if len(starter_code) > 0:
        starter_code_prompt = f"Notes:\nThe final python function should begin with: \n```python\n{starter_code}\n```"
    else:
        starter_code_prompt = ''
    curr_prompt = curr_prompt.replace("<<starter_code_task>>", starter_code_prompt)
if '<<question_guide>>' in prompt:
    starter_code = problem['starter_code']
    if len(starter_code) > 0:
        question_guide = 'use the provided function signature'
    else:
        question_guide = 'read from and write to standard IO'
    curr_prompt = curr_prompt.replace("<<question_guide>>", question_guide)
if '<<modules>>' in curr_prompt:
    if problem_id not in modules: continue
    curr_modules = list(modules[problem_id])
    module_seq = ''
    for module in curr_modules:
        module_seq += "\n```module\n" + module.strip() + "\n```\n"
    curr_prompt = curr_prompt.replace('<<modules>>', module_seq)

# Call the API
response = openai.ChatCompletion.create(
    model=model_mapping[args.model],
    messages=[
        {"role": "system",
         "content": "You are a helpful AI assistant to help developers to solve challenging coding problems."},
        {"role": "user",
         "content": curr_prompt}
    ],
    n=5 if args.num_gen_samples > 1 else 1,
)