Background

Ollama

References:

  • https://github.com/ollama/ollama
  • https://github.com/ollama/ollama/blob/main/docs/api.md

What is Ollama?

Ollama is an application designed to make running LLMs locally easy. It is a lightweight, fast alternative to relying on large cloud-hosted LLM services.

Advantages of Ollama:

  • Cross-platform: Ollama is available on Windows, Linux, and macOS.
  • Multiple LLMs: Ollama supports a wide range of open-weight LLMs, including Llama, Mistral, Gemma, and DeepSeek.

What can Ollama do?

  • Run LLMs Locally
    • Supports various open-source LLMs (e.g., LLaMA, Mistral, Gemma, Phi-2).
    • No need for an internet connection once the model is downloaded.
    • Efficient memory management for running LLMs on laptops and desktops.
  • Easy Model Management
    • Install models with simple commands (ollama pull <model-name>)
    • Supports custom model creation with fine-tuned weights and configurations
  • Flexible API for Developers
    • Provides a CLI (command-line interface), a local REST API, and an official Python library (see the sketch after this list).
    • Can be integrated into applications for chatbots, text generation, and NLP tasks.
  • Prompt Engineering & Fine-Tuning
    • Allows users to customize system prompts for better responses.
    • Supports parameter tuning to control model behavior.
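
As a quick illustration of the Python API and parameter tuning, here is a minimal sketch. It assumes the official ollama Python package is installed (pip install ollama), the Ollama server is running, and the mistral model has already been pulled; the option values are illustrative, not recommendations.

import ollama

# Ask a question with a custom system prompt and illustrative sampling options.
response = ollama.chat(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what Ollama does in one sentence."},
    ],
    options={"temperature": 0.2},  # lower temperature -> more deterministic output
)

# The reply text is in the message content of the response.
print(response["message"]["content"])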

Use Cases:

  • Chatbots & Assistants – Build local AI-powered assistants.
  • Text Generation – Summarization, paraphrasing, creative writing.
  • Code Generation – AI-assisted coding with models like CodeLLaMA.
  • Privacy-Sensitive Applications – Run LLMs without sending data to the cloud.

Create a Python Chatbot with Ollama

import requests
from requests.exceptions import ConnectionError
import json

def chat_with_ollama(prompt, model="mistral"):
    url = "http://localhost:11434/api/generate"  # Changed back to default Ollama port
    data = {
        "model": model,
        "prompt": prompt
    }
    try:
        response = requests.post(url, json=data, stream=True)
        if not response.ok:
            return f"Error: API returned status code {response.status_code}"

        # Handle streaming response
        full_response = ""
        for line in response.iter_lines():
            if line:
                # Decode the line and parse it as JSON
                json_response = json.loads(line)
                if "response" in json_response:
                    full_response += json_response["response"]
                
        return full_response
    except ConnectionError:
        return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
    except json.JSONDecodeError as e:
        return f"Error: Invalid JSON response from server. Details: {str(e)}"
    except Exception as e:
        return f"Error: {str(e)}"

# Chat loop
print("Chatbot (type 'exit' to quit):")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    response = chat_with_ollama(user_input)
    print(f"Bot: {response}")

How it works:

  • The script sends user input to Ollama’s local API.
  • The model generates a response and returns it.
  • The chatbot runs in a loop until the user types “exit”.

Running the Chatbot:

ollama pull mistral   # download the model (first time only)
ollama serve          # start the Ollama server (skip if it is already running as a service)
python chatbot.py     # run the chatbot in another terminal

Useful Commands:

  • ollama pull mistral - Pull the mistral model.
  • ollama serve - Start the Ollama server.
  • ollama ps - See which models are currently loaded into memory.
  • ollama stop <model-name> - Unload a running model from memory.
  • ollama rm <model-name> - Remove a downloaded model.
  • ollama list - List all downloaded models. This is useful because after you pull a model, e.g., ollama pull mistral, it shows up in this list as mistral:latest.

Issues:

  • Error: listen tcp 127.0.0.1:11434: bind: address already in use
    • This means an Ollama server (or another process) is already listening on that port.
    • To fix it, either use the already-running server, stop it, or start a new server on a different port.
    • To stop a server started in a terminal, press Ctrl+C; on Linux installs where Ollama runs as a systemd service, sudo systemctl stop ollama stops it.
    • OLLAMA_HOST=127.0.0.1:11500 ollama serve - Run the server on a different port (the Python client must then target that port; see the snippet after this list).
    • Setting environment variables on Linux: https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
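
If the server runs on a non-default port, the Python client from the earlier example only needs its URL changed. A minimal sketch, assuming the port 11500 from the OLLAMA_HOST example above and that mistral has been pulled:

import requests

# Point the client at the non-default port chosen via OLLAMA_HOST above.
OLLAMA_URL = "http://localhost:11500/api/generate"

# "stream": False asks Ollama to return a single JSON object instead of a stream.
resp = requests.post(
    OLLAMA_URL,
    json={"model": "mistral", "prompt": "Say hello.", "stream": False},
)
print(resp.json()["response"])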

vLLM

References:

  • https://github.com/vllm-project/vllm
  • https://docs.vllm.ai

What is vLLM?

vLLM is a fast and easy-to-use library for LLM inference and serving.

What can vLLM do?

  • Seamless integration with popular HuggingFace models (see the offline-inference sketch after this list)
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Supports NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support
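
A minimal offline-inference sketch with the vLLM Python API; the model name is illustrative, and any Hugging Face causal LM that fits on your hardware works:

from vllm import LLM, SamplingParams

# Load a model from the Hugging Face Hub (downloaded on first use).
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Sampling parameters control decoding (temperature, nucleus sampling, length).
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = ["What is the capital of France?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each output holds the prompt and one or more generated completions.
    print(output.outputs[0].text)

vLLM can also expose the same model behind an OpenAI-compatible HTTP server (e.g., python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2), which is what the “OpenAI-compatible API server” bullet above refers to.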

Prompt Engineering

Role Playing

LLMs can perform various roles depending on their context, training data, and prompting. The role can be specified in the system prompt. For example, Mistral's prompting guide walks through several useful scenarios: https://docs.mistral.ai/guides/prompting_capabilities/.

Example: Customer Support Classification Bot

Mistral models can easily categorize text into distinct classes. Take a customer support bot for a bank as an illustration: we can establish a series of predetermined categories within the prompt and then instruct Mistral AI models to categorize the customer’s question accordingly.

In the following example, when presented with a customer inquiry, the Mistral model correctly categorizes it as “country support”:

import requests
from requests.exceptions import ConnectionError
import json

def chat_with_ollama(prompt, system_prompt="", model="mistral"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "system": system_prompt
    }
    try:
        response = requests.post(url, json=data, stream=True)
        if not response.ok:
            return f"Error: API returned status code {response.status_code}"

        # Handle streaming response
        full_response = ""
        for line in response.iter_lines():
            if line:
                # Decode the line and parse it as JSON
                json_response = json.loads(line)
                if "response" in json_response:
                    full_response += json_response["response"]
                
        return full_response
    except ConnectionError:
        return "Error: Cannot connect to Ollama server. Please ensure it's running on port 11434."
    except json.JSONDecodeError as e:
        return f"Error: Invalid JSON response from server. Details: {str(e)}"
    except Exception as e:
        return f"Error: {str(e)}"

# Initialize system prompt
system_prompt = """You are a bank customer service bot. Your task is to assess customer intent and categorize customer inquiry after <<<>>> into one of the following predefined categories:

card arrival
change pin
exchange rate
country support
cancel transfer
charge dispute

If the text doesn't fit into any of the above categories, classify it as:
customer service

You will only respond with the category. Do not include the word "Category". Do not provide explanations or notes."""

# Modified chat loop
print("Chatbot (type 'exit' to quit):")
print("To change system prompt, type 'change_prompt'")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break
    elif user_input.lower() == "change_prompt":
        system_prompt = input("Enter new system prompt: ")
        print("System prompt updated!")
        continue
    
    response = chat_with_ollama(user_input, system_prompt)
    print(f"Bot: {response}")

OpenAI’s API format

When sending requests to OpenAI’s chat completions API, we specify the model, the conversation messages, and the generation parameters in the JSON request payload, like this:

{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "top_p": 1,
  "n": 1,
  "stream": false
}

Where the parameters are:

Parameter       Type     Description
“model”         string   The model to use (“gpt-4”, “gpt-3.5-turbo”, etc.)
“messages”      list     List of messages forming the conversation history
“role”          string   Role of each message: “system”, “user”, or “assistant”
“content”       string   The actual text content of the message
“temperature”   float    Controls randomness (0 = deterministic, 1 = highly random)
“max_tokens”    int      The maximum number of tokens the response can have
“top_p”         float    Probability mass for nucleus sampling (an alternative to temperature)
“n”             int      Number of responses to generate
“stream”        bool     If true, streams tokens back as they are generated
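
A minimal sketch of sending such a payload from Python with the requests library; it assumes an OPENAI_API_KEY environment variable holds a valid key:

import os
import requests

payload = {
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_tokens": 100,
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
)

# The generated text lives in the first choice's message content.
print(response.json()["choices"][0]["message"]["content"])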

Multi-turn conversations help the model understand the context of the conversation:

  • The conversation history (earlier user and assistant turns) is resent with each request, which is what maintains context.

{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are an AI that provides programming advice."},
    {"role": "user", "content": "How do I write a Python function?"},
    {"role": "assistant", "content": "You can define a function using the `def` keyword."},
    {"role": "user", "content": "Can you give me an example?"}
  ]
}
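
A sketch of keeping that history up to date across turns. Here it targets Ollama's OpenAI-compatible endpoint so it runs locally against a pulled mistral model; swap in https://api.openai.com/v1/chat/completions and an API key header to use OpenAI instead.

import requests

URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

messages = [
    {"role": "system", "content": "You are an AI that provides programming advice."},
]

for question in ["How do I write a Python function?", "Can you give me an example?"]:
    # Append the new user turn, send the whole history, then append the reply,
    # so each request carries the full conversation so far.
    messages.append({"role": "user", "content": question})
    resp = requests.post(URL, json={"model": "mistral", "messages": messages})
    answer = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})
    print(f"Bot: {answer}")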

Useful strategies

  • Few-shot learning: Few-shot learning, also called in-context learning, means giving a few worked examples in the prompt so the LLM can produce the corresponding output for a new input by following the demonstrations (see the sketch after this list).
  • Step-by-step instructions: This strategy is inspired by chain-of-thought prompting, which lets LLMs work through a series of intermediate reasoning steps to tackle complex tasks. Complex problems are often easier to solve when decomposed into smaller steps, and the intermediate steps make it easier to debug and inspect the model's behavior.
  • Output formatting: We can ask LLMs to output in a certain format by stating it directly, e.g., “write the report in Markdown format”.
  • Example generation: We can ask LLMs to guide their own reasoning and understanding by generating examples together with explanations and steps.
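
A minimal few-shot sketch against the local Ollama endpoint used earlier; the reviews and labels are made-up illustrations, and it assumes the server is running with mistral pulled:

import requests

# A few labeled demonstrations followed by the new input to classify.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day." -> Positive
Review: "The screen cracked after a week." -> Negative
Review: "Setup was quick and painless." ->"""

# "stream": False returns a single JSON object instead of a stream of lines.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": few_shot_prompt, "stream": False},
)
print(resp.json()["response"].strip())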

Some real-world prompting examples

For example, the CodeChain project stores its prompts as template files with placeholders, which are loaded and filled in by https://github.com/SalesforceAIResearch/CodeChain/blob/main/src/generate.py.

The flow of the code (a lightly annotated excerpt from that file) is as follows:


# Excerpt from CodeChain's generate.py; it assumes `args`, `question`, `problem`,
# `problem_id`, `modules`, and `model_mapping` are defined by the surrounding
# script (which also imports openai), and the placeholder-filling below runs
# inside a loop over problems.

# Load the prompt template file
with open(args.prompt_file, 'r', encoding='utf-8') as infile:
    prompt = infile.read()

# replace the placeholders in the prompt with the actual values

curr_prompt = prompt.replace("<<problem>>", question)  

if '<<starter_code>>' in prompt:
    starter_code = problem['starter_code'] 
    curr_prompt = curr_prompt.replace("<<starter_code>>", starter_code)
    
if '<<starter_code_task>>' in prompt:
    starter_code = problem['starter_code'] 
    if len(starter_code)>0:
        starter_code_prompt = f"Notes:\nThe final python function should begin with: \n```python\n{starter_code}\n```"
    else:
        starter_code_prompt = ''
    curr_prompt = curr_prompt.replace("<<starter_code_task>>", starter_code_prompt)

if '<<question_guide>>' in prompt:
    starter_code = problem['starter_code'] 
    if len(starter_code)>0:
        question_guide = 'use the provided function signature'
    else:
        question_guide = 'read from and write to standard IO'
    curr_prompt = curr_prompt.replace("<<question_guide>>", question_guide)    

if '<<modules>>' in curr_prompt: 
    # skip this problem if no modules were extracted for it (`continue` moves
    # to the next iteration of the script's loop over problems)
    if problem_id not in modules: continue
    curr_modules = list(modules[problem_id])
    module_seq = ''
    for module in curr_modules: 
        module_seq += "\n```module\n" + module.strip() + "\n```\n"
    curr_prompt = curr_prompt.replace('<<modules>>', module_seq)

# Call the OpenAI API (the repo uses the legacy openai<1.0 ChatCompletion interface)

response = openai.ChatCompletion.create(
                  model=model_mapping[args.model], 
                  messages=[
                        {"role": "system", 
                         "content": "You are a helpful AI assistant to help developers to solve challenging coding problems."},
                        {"role": "user", 
                         "content": curr_prompt}
                    ],
                  n=5 if args.num_gen_samples > 1 else 1,
)