Local AI Tools
Ollama - Offline AI Models
Run large language models locally on your machine with full offline capabilities and privacy protection
Overview
Ollama is an open-source framework for running large language models locally on your own machine. It provides a simple command-line interface and a REST API for interacting with a variety of models; once a model has been downloaded, everything runs without an internet connection, keeping your data fully under your control.
Complete Offline Operation
Run AI models entirely on your local hardware without internet dependency
Easy Model Management
Simple commands to pull, run, and manage different AI models
Privacy Focused
Your data never leaves your machine, ensuring complete privacy
Getting Started
Installation
- Download Ollama from ollama.com
- Install for your operating system (Windows, macOS, Linux)
- Verify the installation by running ollama --version
- Pull your first model with ollama pull model-name
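If you prefer to verify the installation from code rather than the shell, the sketch below (an illustration, not part of the official tooling) checks that the local server is reachable on the default port 11434 and lists any installed models via the /api/tags endpoint described later on this page.
# Minimal connectivity check against the local Ollama server (default port 11434).
# Uses the /api/tags endpoint, which lists locally installed models.
import requests

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=2)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Installed models:", models or "none yet - run 'ollama pull <model>'")
    return True

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())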
System Requirements
- Minimum: 8GB RAM, 10GB storage
- Recommended: 16GB+ RAM, 20GB+ storage, GPU support
- Optimal: 32GB+ RAM, dedicated GPU with 8GB+ VRAM
- Supported GPUs: NVIDIA, AMD, Apple Silicon
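As a rough rule of thumb only (these tiers are an assumption derived from the requirements above, not an official sizing guide), you can map installed RAM to a sensible model size before pulling anything:
# Rough heuristic: suggest a model size tier from installed RAM,
# mirroring the minimum/recommended/optimal tiers listed above.
import psutil

def suggest_model_tier() -> str:
    total_gb = psutil.virtual_memory().total / 1024**3
    if total_gb >= 32:
        return "13B-34B models (or 70B quantized, ideally with a GPU)"
    if total_gb >= 16:
        return "7B-13B models"
    if total_gb >= 8:
        return "7B models, preferably quantized (e.g. q4_0)"
    return "below the minimum requirements for local models"

if __name__ == "__main__":
    total = psutil.virtual_memory().total / 1024**3
    print(f"Detected {total:.0f} GB RAM -> {suggest_model_tier()}")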
Available Models
Llama 2
- Variants: 7B, 13B, 70B parameters
- Use Cases: General purpose, chat
- RAM Required: 4GB - 40GB
- License: Llama 2 Community License (commercial use permitted with conditions)
Mistral
- Variants: 7B, 8x7B (Mixtral)
- Use Cases: Efficient reasoning
- RAM Required: 4GB - 30GB
- License: Apache 2.0
Code Llama
- Variants: 7B, 13B, 34B
- Use Cases: Code generation
- RAM Required: 4GB - 40GB
- License: Llama 2 Community License (custom terms, commercial use permitted)
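Any of these models can be downloaded from the CLI (shown in the next section) or programmatically through the REST API's /api/pull endpoint, which streams JSON progress lines while it downloads. A minimal sketch, assuming the endpoint behaves as in current Ollama releases:
# Sketch: pull a model through the REST API instead of the CLI.
# /api/pull streams one JSON object per line with a "status" field (plus byte counts
# while layers download); exact fields may vary between Ollama versions.
import json
import requests

def pull_model(name: str, base_url: str = "http://localhost:11434") -> None:
    with requests.post(f"{base_url}/api/pull", json={"name": name}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            status = json.loads(line)
            print(status.get("status", ""), end="\r")
    print(f"\nFinished pulling {name}")

if __name__ == "__main__":
    pull_model("mistral")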
Basic Usage
Command Line Interface
# Pull a model (downloads to local storage)
ollama pull llama2
ollama pull mistral
ollama pull codellama
# Run a model interactively
ollama run llama2
# This starts an interactive chat session
# Run with specific prompt
ollama run llama2 "Explain quantum computing"
# List downloaded models
ollama list
# Remove a model
ollama rm llama2
# Show model information
ollama show llama2
# Create custom model from Modelfile
ollama create my-model -f ./Modelfile
# GPU acceleration (if available) is used automatically when a supported GPU
# and its drivers are detected; no extra flag is needed (see Performance Optimization below)
API Usage
# Ollama provides a REST API on port 11434
# Generate completion via API
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# Chat completion
curl -X POST http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false
}'
# List models via API
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{
  "name": "llama2"
}'
Python Integration
# Using Ollama with Python
import requests
import json
class OllamaClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="llama2", stream=False):
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/api/generate",
            json=payload,
            stream=stream
        )
        if stream:
            return self._handle_stream(response)
        else:
            return response.json()["response"]

    def chat(self, messages, model="llama2", stream=False):
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/api/chat",
            json=payload,
            stream=stream
        )
        if stream:
            return self._handle_chat_stream(response)
        else:
            return response.json()["message"]["content"]

    def _handle_stream(self, response):
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "response" in data:
                    yield data["response"]
                if data.get("done", False):
                    break

    def _handle_chat_stream(self, response):
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "message" in data:
                    yield data["message"]["content"]
                if data.get("done", False):
                    break

# Example usage
client = OllamaClient()

# Simple generation
response = client.generate("Explain machine learning")
print(response)

# Chat conversation
messages = [
    {"role": "user", "content": "Hello, introduce yourself"}
]
response = client.chat(messages)
print(response)

# Streaming response
print("Streaming response:")
for chunk in client.generate("Tell me a story", stream=True):
    print(chunk, end="", flush=True)
Custom Model Configuration
Creating Modelfiles
# Modelfile for custom model configuration
# Base model to build upon
FROM llama2
# Prompt template (Llama 2 chat format)
TEMPLATE """[INST] <<SYS>>
{{- if .System }}
{{ .System }}
{{- else }}
You are a helpful, respectful, and honest assistant.
Always answer as helpfully as possible.
{{- end }}
<</SYS>>

{{ .Prompt }} [/INST]
"""
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
# System message
SYSTEM """You are a technical expert specializing in software development
and computer science. Provide detailed, accurate explanations and
always include code examples when relevant."""
# Custom stop tokens
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
# Example: Create and use custom model
# ollama create my-llama -f ./Modelfile
# ollama run my-llama
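If you want to script the create-and-test loop rather than run the commands by hand, a small sketch along these lines works. It assumes the ollama CLI is on your PATH and the server is listening on the default port; the helper names and the sample Modelfile content are illustrative.
# Sketch: automate "ollama create" from Python and send a test prompt to the result.
import subprocess
import requests
from pathlib import Path

MODELFILE = '''FROM llama2
SYSTEM """You are a technical expert specializing in software development."""
PARAMETER temperature 0.7
'''

def create_custom_model(name: str, modelfile_text: str) -> None:
    path = Path("Modelfile")
    path.write_text(modelfile_text)
    # Equivalent to: ollama create <name> -f ./Modelfile
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)

def test_model(name: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": name, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    create_custom_model("my-llama", MODELFILE)
    print(test_model("my-llama", "Summarize what a Modelfile is."))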
Advanced Modelfile Examples
# Modelfile for code assistant
FROM codellama:13b
SYSTEM """You are an expert programming assistant. Follow these rules:
1. Always provide working code examples
2. Include comments explaining the code
3. Suggest best practices and alternatives
4. Point out potential issues and edge cases
5. Format code properly with syntax highlighting when possible
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
# Modelfile for creative writing
FROM mistral:7b
SYSTEM """You are a creative writing assistant. Your style should be:
- Engaging and descriptive
- Emotionally resonant
- Varied in sentence structure
- Rich in sensory details
- Appropriate for the requested genre
"""
PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Modelfile for technical documentation
FROM llama2:13b
SYSTEM """You are a technical documentation specialist. Your responses should be:
- Clear and concise
- Well-structured with headings
- Include practical examples
- Use appropriate technical terminology
- Follow documentation best practices
"""
PARAMETER temperature 0.2
PARAMETER top_p 0.85
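PARAMETER values in a Modelfile only set defaults; the REST API also accepts a per-request options object whose keys mirror the parameter names above (temperature, top_p, num_ctx, and so on). A brief sketch of overriding them for a single call, assuming the standard /api/generate endpoint:
# Sketch: overriding Modelfile defaults per request via the API "options" field.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Write a two-sentence scene set in a lighthouse.",
        "stream": False,
        # Request-time overrides; these take precedence over the Modelfile for this call
        "options": {"temperature": 0.9, "top_p": 0.9, "repeat_penalty": 1.1},
    },
)
print(resp.json()["response"])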
Performance Optimization
GPU Acceleration
# GPU acceleration on different platforms
# Linux/Windows with NVIDIA GPU: Ollama uses the GPU automatically when the
# NVIDIA drivers (and CUDA runtime) are installed; no extra flag is needed
ollama serve
# macOS with Apple Silicon
# GPU (Metal) acceleration is automatic on Apple Silicon
# Limit which NVIDIA GPUs Ollama may use
# (setting an invalid ID such as -1 effectively forces CPU-only mode)
export CUDA_VISIBLE_DEVICES=0
ollama serve
# Check loaded models and whether they are running on GPU or CPU
ollama ps
# Monitor GPU usage
nvidia-smi # For NVIDIA GPUs
# Memory optimization for larger models
# Use quantization to reduce memory usage
ollama pull llama2:13b-q4_0    # 4-bit quantization
ollama pull codellama:34b-q2_K # 2-bit quantization
# (exact tag names vary by model; check the model's page in the Ollama library)
# Available quantization levels:
# q2_K, q3_K, q4_0, q4_1, q5_0, q5_1, q6_K, q8_0
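To decide which quantization level fits your hardware, a back-of-the-envelope estimate of weight memory is parameters x bits-per-weight / 8. The bits-per-weight figures below are approximations (they include quantization scales but ignore the KV cache and runtime overhead), so treat the results as lower bounds.
# Rough estimate of model weight memory at different quantization levels.
QUANT_BITS = {"q2_K": 2.6, "q4_0": 4.5, "q5_1": 5.5, "q8_0": 8.5, "f16": 16.0}  # approx. bits/weight

def estimated_gb(params_billions: float, quant: str) -> float:
    bits = QUANT_BITS[quant]
    return params_billions * 1e9 * bits / 8 / 1024**3

for quant in ("q2_K", "q4_0", "q8_0", "f16"):
    print(f"13B model at {quant}: ~{estimated_gb(13, quant):.1f} GB of weights")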
System Configuration
# Optimize system settings for better performance
# Increase system limits (Linux/macOS)
sudo sysctl -w vm.max_map_count=262144
# Set environment variables for optimization
export OLLAMA_NUM_PARALLEL=4 # Parallel processing
export OLLAMA_MAX_LOADED_MODELS=2 # Limit loaded models
# Docker configuration for Ollama
docker run -d \
--name ollama \
-p 11434:11434 \
--gpus=all \
-v ollama:/root/.ollama \
ollama/ollama
# Kubernetes deployment (ollama-deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
Integration Examples
Web Application
# Flask web app with Ollama integration
from flask import Flask, request, jsonify, render_template_string
import requests
app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434"
HTML_TEMPLATE = '''
<!DOCTYPE html>
<html>
<head>
<title>Ollama Chat</title>
<style>
.chat-container { max-width: 800px; margin: 0 auto; padding: 20px; }
.message { margin: 10px 0; padding: 10px; border-radius: 5px; }
.user { background: #e3f2fd; text-align: right; }
.assistant { background: #f5f5f5; }
#response { white-space: pre-wrap; }
</style>
</head>
<body>
<div class="chat-container">
<h1>Chat with Local AI</h1>
<form id="chatForm">
<textarea name="message" rows="4" style="width: 100%" placeholder="Enter your message..."></textarea>
<br>
<select name="model">
<option value="llama2">Llama 2</option>
<option value="mistral">Mistral</option>
<option value="codellama">Code Llama</option>
</select>
<button type="submit">Send</button>
</form>
<div id="response"></div>
</div>
<script>
document.getElementById('chatForm').addEventListener('submit', async (e) => {
e.preventDefault();
const formData = new FormData(e.target);
const response = await fetch('/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: formData.get('message'),
model: formData.get('model')
})
});
const data = await response.json();
document.getElementById('response').innerHTML =
'<div class="message user">' + formData.get('message') + '</div>' +
'<div class="message assistant">' + data.response + '</div>';
});
</script>
</body>
</html>
'''
@app.route('/')
def index():
    return render_template_string(HTML_TEMPLATE)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    message = data.get('message')
    model = data.get('model', 'llama2')
    try:
        response = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json={
                "model": model,
                "prompt": message,
                "stream": False
            }
        )
        return jsonify({"response": response.json()["response"]})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True, port=5000)
Command Line Tool
#!/usr/bin/env python3
# Advanced CLI tool for Ollama
import argparse

# OllamaClient is the class defined in the Python Integration section above,
# assumed to be saved alongside this script as ollama_client.py
from ollama_client import OllamaClient

def main():
    parser = argparse.ArgumentParser(description='Ollama CLI Tool')
    parser.add_argument('--model', default='llama2', help='Model to use')
    # Note: temperature/max-tokens are placeholders; passing them through would
    # require an options parameter on OllamaClient.generate()
    parser.add_argument('--temperature', type=float, default=0.7, help='Temperature')
    parser.add_argument('--max-tokens', type=int, default=1000, help='Max tokens')
    parser.add_argument('--stream', action='store_true', help='Stream response')
    parser.add_argument('prompt', nargs='?', help='Prompt to send')
    args = parser.parse_args()

    client = OllamaClient()

    if args.prompt:
        # Single prompt mode
        if args.stream:
            for chunk in client.generate(args.prompt, model=args.model, stream=True):
                print(chunk, end='', flush=True)
            print()
        else:
            response = client.generate(args.prompt, model=args.model)
            print(response)
    else:
        # Interactive mode
        print(f"Starting chat with {args.model}. Type 'quit' to exit.")
        while True:
            try:
                user_input = input("You: ")
                if user_input.lower() in ['quit', 'exit', 'q']:
                    break
                if args.stream:
                    print("AI: ", end='', flush=True)
                    for chunk in client.generate(user_input, model=args.model, stream=True):
                        print(chunk, end='', flush=True)
                    print()
                else:
                    response = client.generate(user_input, model=args.model)
                    print(f"AI: {response}")
            except KeyboardInterrupt:
                print("\nGoodbye!")
                break

if __name__ == '__main__':
    main()
Best Practices
Model Management
- Choose Right Size: Select model size based on your hardware capabilities
- Use Quantization: Employ quantized models to reduce memory usage
- Monitor Resources: Keep track of RAM and GPU usage
- Regular Updates: Update Ollama and models regularly
- Backup Custom Models: Export and back up your custom model configurations (see the sketch below)
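One way to automate that last point is to dump each installed model's Modelfile via the CLI. The sketch below assumes ollama list and ollama show <model> --modelfile behave as in current releases; the output file naming is just illustrative.
# Sketch: back up the Modelfile of every locally installed model.
import subprocess
from pathlib import Path

def backup_modelfiles(dest: str = "modelfile-backups") -> None:
    out_dir = Path(dest)
    out_dir.mkdir(exist_ok=True)
    # "ollama list" prints a table whose first column is the model name
    listing = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    names = [line.split()[0] for line in listing.stdout.splitlines()[1:] if line.strip()]
    for name in names:
        modelfile = subprocess.run(
            ["ollama", "show", name, "--modelfile"],
            capture_output=True, text=True, check=True,
        ).stdout
        safe_name = name.replace(":", "_").replace("/", "_")
        (out_dir / f"{safe_name}.Modelfile").write_text(modelfile)
        print(f"Saved Modelfile for {name}")

if __name__ == "__main__":
    backup_modelfiles()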
Performance Tips
# Performance optimization script
import psutil

# GPUtil is an optional dependency (pip install gputil); fall back gracefully if missing
try:
    import GPUtil
except ImportError:
    GPUtil = None

def check_system_resources():
    # Check RAM usage
    ram = psutil.virtual_memory()
    print(f"RAM Usage: {ram.percent}% ({ram.used // 1024**3}GB/{ram.total // 1024**3}GB)")

    # Check GPU usage if available
    if GPUtil is not None:
        try:
            for gpu in GPUtil.getGPUs():
                print(f"GPU {gpu.id}: {gpu.load * 100:.0f}% load, {gpu.memoryUtil * 100:.0f}% memory")
        except Exception:
            print("GPU monitoring not available")
    else:
        print("No GPU detected or GPU monitoring not available")

    # Check disk space for models
    disk = psutil.disk_usage('/')
    print(f"Disk Space: {disk.percent}% used")

def optimize_ollama_settings():
    settings = {
        "OLLAMA_NUM_PARALLEL": min(4, psutil.cpu_count()),
        "OLLAMA_MAX_LOADED_MODELS": 2,
    }
    if psutil.virtual_memory().total < 16 * 1024**3:  # Less than 16GB RAM
        settings["OLLAMA_NUM_PARALLEL"] = 2
        settings["OLLAMA_MAX_LOADED_MODELS"] = 1
    return settings

# Usage
if __name__ == "__main__":
    check_system_resources()
    optimal_settings = optimize_ollama_settings()
    print("Recommended settings:", optimal_settings)