Local AI Tools
Ollama - Offline AI Models
Run large language models locally on your machine with full offline capabilities and privacy protection
Overview
Ollama is an open-source framework for running large language models locally on your own machine. It provides a simple command-line interface and a REST API for interacting with a variety of models; once a model has been downloaded, everything runs without an internet connection, keeping your data fully under your control.
Complete Offline Operation
Run AI models entirely on your local hardware without internet dependency
Easy Model Management
Simple commands to pull, run, and manage different AI models
Privacy Focused
Your data never leaves your machine, ensuring complete privacy
Getting Started
Installation
- Download Ollama from ollama.com
- Install for your operating system (Windows, macOS, Linux)
- Verify the installation by running ollama --version
- Pull your first model with ollama pull model-name
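If you prefer to verify the installation from code rather than the shell, the sketch below (an illustration, not part of the official tooling) checks that the local server is reachable on the default port 11434 and lists any installed models via the /api/tags endpoint described later on this page.
# Minimal connectivity check against the local Ollama server (default port 11434).
# Uses the /api/tags endpoint, which lists locally installed models.
import requests

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=2)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Installed models:", models or "none yet - run 'ollama pull <model>'")
    return True

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())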
System Requirements
- Minimum: 8GB RAM, 10GB storage
- Recommended: 16GB+ RAM, 20GB+ storage, GPU support
- Optimal: 32GB+ RAM, dedicated GPU with 8GB+ VRAM
- Supported GPUs: NVIDIA, AMD, Apple Silicon
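As a rough rule of thumb only (these tiers are an assumption derived from the requirements above, not an official sizing guide), you can map installed RAM to a sensible model size before pulling anything:
# Rough heuristic: suggest a model size tier from installed RAM,
# mirroring the minimum/recommended/optimal tiers listed above.
import psutil

def suggest_model_tier() -> str:
    total_gb = psutil.virtual_memory().total / 1024**3
    if total_gb >= 32:
        return "13B-34B models (or 70B quantized, ideally with a GPU)"
    if total_gb >= 16:
        return "7B-13B models"
    if total_gb >= 8:
        return "7B models, preferably quantized (e.g. q4_0)"
    return "below the minimum requirements for local models"

if __name__ == "__main__":
    total = psutil.virtual_memory().total / 1024**3
    print(f"Detected {total:.0f} GB RAM -> {suggest_model_tier()}")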
Available Models
Llama 2
- Variants: 7B, 13B, 70B parameters
- Use Cases: General purpose, chat
- RAM Required: 4GB - 40GB
- License: Llama 2 Community License (commercial use permitted with conditions)
Mistral
- Variants: 7B, 8x7B (Mixtral)
- Use Cases: Efficient reasoning
- RAM Required: 4GB - 30GB
- License: Apache 2.0
Code Llama
- Variants: 7B, 13B, 34B
- Use Cases: Code generation
- RAM Required: 4GB - 40GB
- License: Llama 2 Community License (custom terms, commercial use permitted)
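Any of these models can be downloaded from the CLI (shown in the next section) or programmatically through the REST API's /api/pull endpoint, which streams JSON progress lines while it downloads. A minimal sketch, assuming the endpoint behaves as in current Ollama releases:
# Sketch: pull a model through the REST API instead of the CLI.
# /api/pull streams one JSON object per line with a "status" field (plus byte counts
# while layers download); exact fields may vary between Ollama versions.
import json
import requests

def pull_model(name: str, base_url: str = "http://localhost:11434") -> None:
    with requests.post(f"{base_url}/api/pull", json={"name": name}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            status = json.loads(line)
            print(status.get("status", ""), end="\r")
    print(f"\nFinished pulling {name}")

if __name__ == "__main__":
    pull_model("mistral")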
Basic Usage
Command Line Interface
# Pull a model (downloads to local storage)
ollama pull llama2
ollama pull mistral
ollama pull codellama
# Run a model interactively
ollama run llama2
# This starts an interactive chat session
# Run with specific prompt
ollama run llama2 "Explain quantum computing"
# List downloaded models
ollama list
# Remove a model
ollama rm llama2
# Show model information
ollama show llama2
# Create custom model from Modelfile
ollama create my-model -f ./Modelfile
# GPU acceleration (if available) is used automatically when a supported GPU
# and its drivers are detected; no extra flag is needed (see Performance Optimization below)
API Usage
# Ollama provides a REST API on port 11434
# Generate completion via API
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# Chat completion
curl -X POST http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false
}'
# List models via API
curl http://localhost:11434/api/tags
# Show model info
curl http://localhost:11434/api/show -d '{
  "name": "llama2"
}'
Python Integration
# Using Ollama with Python
import requests
import json
class OllamaClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="llama2", stream=False):
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/api/generate",
            json=payload,
            stream=stream
        )
        if stream:
            return self._handle_stream(response)
        else:
            return response.json()["response"]

    def chat(self, messages, model="llama2", stream=False):
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream
        }
        response = requests.post(
            f"{self.base_url}/api/chat",
            json=payload,
            stream=stream
        )
        if stream:
            return self._handle_chat_stream(response)
        else:
            return response.json()["message"]["content"]

    def _handle_stream(self, response):
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "response" in data:
                    yield data["response"]
                if data.get("done", False):
                    break

    def _handle_chat_stream(self, response):
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "message" in data:
                    yield data["message"]["content"]
                if data.get("done", False):
                    break

# Example usage
client = OllamaClient()

# Simple generation
response = client.generate("Explain machine learning")
print(response)

# Chat conversation
messages = [
    {"role": "user", "content": "Hello, introduce yourself"}
]
response = client.chat(messages)
print(response)

# Streaming response
print("Streaming response:")
for chunk in client.generate("Tell me a story", stream=True):
    print(chunk, end="", flush=True)
Custom Model Configuration
Creating Modelfiles
# Modelfile for custom model configuration
# Base model to build upon
FROM llama2
# Prompt template (Llama 2 chat format)
TEMPLATE """[INST] <<SYS>>
{{- if .System }}
{{ .System }}
{{- else }}
You are a helpful, respectful, and honest assistant.
Always answer as helpfully as possible.
{{- end }}
<</SYS>>

{{ .Prompt }} [/INST]
"""
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
# System message
SYSTEM """You are a technical expert specializing in software development
and computer science. Provide detailed, accurate explanations and
always include code examples when relevant."""
# Custom stop tokens
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
# Example: Create and use custom model
# ollama create my-llama -f ./Modelfile
# ollama run my-llama
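If you want to script the create-and-test loop rather than run the commands by hand, a small sketch along these lines works. It assumes the ollama CLI is on your PATH and the server is listening on the default port; the helper names and the sample Modelfile content are illustrative.
# Sketch: automate "ollama create" from Python and send a test prompt to the result.
import subprocess
import requests
from pathlib import Path

MODELFILE = '''FROM llama2
SYSTEM """You are a technical expert specializing in software development."""
PARAMETER temperature 0.7
'''

def create_custom_model(name: str, modelfile_text: str) -> None:
    path = Path("Modelfile")
    path.write_text(modelfile_text)
    # Equivalent to: ollama create <name> -f ./Modelfile
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)

def test_model(name: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": name, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    create_custom_model("my-llama", MODELFILE)
    print(test_model("my-llama", "Summarize what a Modelfile is."))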
Advanced Modelfile Examples
# Modelfile for code assistant
FROM codellama:13b
SYSTEM """You are an expert programming assistant. Follow these rules:
1. Always provide working code examples
2. Include comments explaining the code
3. Suggest best practices and alternatives
4. Point out potential issues and edge cases
5. Format code properly with syntax highlighting when possible
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
# Modelfile for creative writing
FROM mistral:7b
SYSTEM """You are a creative writing assistant. Your style should be:
- Engaging and descriptive
- Emotionally resonant
- Varied in sentence structure
- Rich in sensory details
- Appropriate for the requested genre
"""
PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Modelfile for technical documentation
FROM llama2:13b
SYSTEM """You are a technical documentation specialist. Your responses should be:
- Clear and concise
- Well-structured with headings
- Include practical examples
- Use appropriate technical terminology
- Follow documentation best practices
"""
PARAMETER temperature 0.2
PARAMETER top_p 0.85
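PARAMETER values in a Modelfile only set defaults; the REST API also accepts a per-request options object whose keys mirror the parameter names above (temperature, top_p, num_ctx, and so on). A brief sketch of overriding them for a single call, assuming the standard /api/generate endpoint:
# Sketch: overriding Modelfile defaults per request via the API "options" field.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Write a two-sentence scene set in a lighthouse.",
        "stream": False,
        # Request-time overrides; these take precedence over the Modelfile for this call
        "options": {"temperature": 0.9, "top_p": 0.9, "repeat_penalty": 1.1},
    },
)
print(resp.json()["response"])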
Performance Optimization
GPU Acceleration
# GPU acceleration on different platforms
# Linux/Windows with NVIDIA GPU: Ollama uses the GPU automatically when the
# NVIDIA drivers (and CUDA runtime) are installed; no extra flag is needed
ollama serve
# macOS with Apple Silicon
# GPU (Metal) acceleration is automatic on Apple Silicon
# Limit which NVIDIA GPUs Ollama may use
# (setting an invalid ID such as -1 effectively forces CPU-only mode)
export CUDA_VISIBLE_DEVICES=0
ollama serve
# Check loaded models and whether they are running on GPU or CPU
ollama ps
# Monitor GPU usage
nvidia-smi # For NVIDIA GPUs
# Memory optimization for larger models
# Use quantization to reduce memory usage
ollama pull llama2:13b-q4_0    # 4-bit quantization
ollama pull codellama:34b-q2_K # 2-bit quantization
# (exact tag names vary by model; check the model's page in the Ollama library)
# Available quantization levels:
# q2_K, q3_K, q4_0, q4_1, q5_0, q5_1, q6_K, q8_0
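To decide which quantization level fits your hardware, a back-of-the-envelope estimate of weight memory is parameters x bits-per-weight / 8. The bits-per-weight figures below are approximations (they include quantization scales but ignore the KV cache and runtime overhead), so treat the results as lower bounds.
# Rough estimate of model weight memory at different quantization levels.
QUANT_BITS = {"q2_K": 2.6, "q4_0": 4.5, "q5_1": 5.5, "q8_0": 8.5, "f16": 16.0}  # approx. bits/weight

def estimated_gb(params_billions: float, quant: str) -> float:
    bits = QUANT_BITS[quant]
    return params_billions * 1e9 * bits / 8 / 1024**3

for quant in ("q2_K", "q4_0", "q8_0", "f16"):
    print(f"13B model at {quant}: ~{estimated_gb(13, quant):.1f} GB of weights")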
System Configuration
# Optimize system settings for better performance
# Increase system limits (Linux/macOS)
sudo sysctl -w vm.max_map_count=262144
# Set environment variables for optimization
export OLLAMA_NUM_PARALLEL=4 # Parallel processing
export OLLAMA_MAX_LOADED_MODELS=2 # Limit loaded models
# Docker configuration for Ollama
docker run -d \
--name ollama \
-p 11434:11434 \
--gpus=all \
-v ollama:/root/.ollama \
ollama/ollama
# Kubernetes deployment (ollama-deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
Integration Examples
Web Application
# Flask web app with Ollama integration
from flask import Flask, request, jsonify, render_template_string
import requests
app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434"
HTML_TEMPLATE = '''
<!DOCTYPE html>
<html>
<head>
<title>Ollama Chat</title>
<style>
.chat-container { max-width: 800px; margin: 0 auto; padding: 20px; }
.message { margin: 10px 0; padding: 10px; border-radius: 5px; }
.user { background: #e3f2fd; text-align: right; }
.assistant { background: #f5f5f5; }
#response { white-space: pre-wrap; }
</style>
</head>
<body>
<div class="chat-container">
<h1>Chat with Local AI</h1>
<form id="chatForm">
<textarea name="message" rows="4" style="width: 100%" placeholder="Enter your message..."></textarea>
<br>
<select name="model">
<option value="llama2">Llama 2</option>
<option value="mistral">Mistral</option>
<option value="codellama">Code Llama</option>
</select>
<button type="submit">Send</button>
</form>
<div id="response"></div>
</div>
<script>
document.getElementById('chatForm').addEventListener('submit', async (e) => {
e.preventDefault();
const formData = new FormData(e.target);
const response = await fetch('/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: formData.get('message'),
model: formData.get('model')
})
});
const data = await response.json();
document.getElementById('response').innerHTML =
'<div class="message user">' + formData.get('message') + '</div>' +
'<div class="message assistant">' + data.response + '</div>';
});
</script>
</body>
</html>
'''
@app.route('/')
def index():
    return render_template_string(HTML_TEMPLATE)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    message = data.get('message')
    model = data.get('model', 'llama2')
    try:
        response = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json={
                "model": model,
                "prompt": message,
                "stream": False
            }
        )
        return jsonify({"response": response.json()["response"]})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True, port=5000)
Command Line Tool
#!/usr/bin/env python3
# Advanced CLI tool for Ollama
import argparse

# OllamaClient is the class defined in the Python Integration section above,
# assumed to be saved alongside this script as ollama_client.py
from ollama_client import OllamaClient

def main():
    parser = argparse.ArgumentParser(description='Ollama CLI Tool')
    parser.add_argument('--model', default='llama2', help='Model to use')
    # Note: temperature/max-tokens are placeholders; passing them through would
    # require an options parameter on OllamaClient.generate()
    parser.add_argument('--temperature', type=float, default=0.7, help='Temperature')
    parser.add_argument('--max-tokens', type=int, default=1000, help='Max tokens')
    parser.add_argument('--stream', action='store_true', help='Stream response')
    parser.add_argument('prompt', nargs='?', help='Prompt to send')
    args = parser.parse_args()

    client = OllamaClient()

    if args.prompt:
        # Single prompt mode
        if args.stream:
            for chunk in client.generate(args.prompt, model=args.model, stream=True):
                print(chunk, end='', flush=True)
            print()
        else:
            response = client.generate(args.prompt, model=args.model)
            print(response)
    else:
        # Interactive mode
        print(f"Starting chat with {args.model}. Type 'quit' to exit.")
        while True:
            try:
                user_input = input("You: ")
                if user_input.lower() in ['quit', 'exit', 'q']:
                    break
                if args.stream:
                    print("AI: ", end='', flush=True)
                    for chunk in client.generate(user_input, model=args.model, stream=True):
                        print(chunk, end='', flush=True)
                    print()
                else:
                    response = client.generate(user_input, model=args.model)
                    print(f"AI: {response}")
            except KeyboardInterrupt:
                print("\nGoodbye!")
                break

if __name__ == '__main__':
    main()
Best Practices
Model Management
- Choose Right Size: Select model size based on your hardware capabilities
- Use Quantization: Employ quantized models to reduce memory usage
- Monitor Resources: Keep track of RAM and GPU usage
- Regular Updates: Update Ollama and models regularly
- Backup Custom Models: Export and back up your custom model configurations (see the sketch below)
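One way to automate that last point is to dump each installed model's Modelfile via the CLI. The sketch below assumes ollama list and ollama show <model> --modelfile behave as in current releases; the output file naming is just illustrative.
# Sketch: back up the Modelfile of every locally installed model.
import subprocess
from pathlib import Path

def backup_modelfiles(dest: str = "modelfile-backups") -> None:
    out_dir = Path(dest)
    out_dir.mkdir(exist_ok=True)
    # "ollama list" prints a table whose first column is the model name
    listing = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    names = [line.split()[0] for line in listing.stdout.splitlines()[1:] if line.strip()]
    for name in names:
        modelfile = subprocess.run(
            ["ollama", "show", name, "--modelfile"],
            capture_output=True, text=True, check=True,
        ).stdout
        safe_name = name.replace(":", "_").replace("/", "_")
        (out_dir / f"{safe_name}.Modelfile").write_text(modelfile)
        print(f"Saved Modelfile for {name}")

if __name__ == "__main__":
    backup_modelfiles()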
Performance Tips
# Performance optimization script
import psutil

# GPUtil is an optional dependency (pip install gputil); fall back gracefully if missing
try:
    import GPUtil
except ImportError:
    GPUtil = None

def check_system_resources():
    # Check RAM usage
    ram = psutil.virtual_memory()
    print(f"RAM Usage: {ram.percent}% ({ram.used // 1024**3}GB/{ram.total // 1024**3}GB)")

    # Check GPU usage if available
    if GPUtil is not None:
        try:
            for gpu in GPUtil.getGPUs():
                print(f"GPU {gpu.id}: {gpu.load * 100:.0f}% load, {gpu.memoryUtil * 100:.0f}% memory")
        except Exception:
            print("GPU monitoring not available")
    else:
        print("No GPU detected or GPU monitoring not available")

    # Check disk space for models
    disk = psutil.disk_usage('/')
    print(f"Disk Space: {disk.percent}% used")

def optimize_ollama_settings():
    settings = {
        "OLLAMA_NUM_PARALLEL": min(4, psutil.cpu_count()),
        "OLLAMA_MAX_LOADED_MODELS": 2,
    }
    if psutil.virtual_memory().total < 16 * 1024**3:  # Less than 16GB RAM
        settings["OLLAMA_NUM_PARALLEL"] = 2
        settings["OLLAMA_MAX_LOADED_MODELS"] = 1
    return settings

# Usage
if __name__ == "__main__":
    check_system_resources()
    optimal_settings = optimize_ollama_settings()
    print("Recommended settings:", optimal_settings)