Ollama — local LLM inference
What it is. A daemon that runs open-weight LLMs (Qwen, Llama, DeepSeek, etc.) on your GPU and exposes them over an OpenAI-compatible HTTP API. It is a drop-in replacement for OpenAI clients, and a small SDK swap for Claude clients (see recipe 2), when you don't need top-tier reasoning.
Where it lives
| Location | Endpoint | Use case |
|---|---|---|
| sps-srv1 LAN | http://vigil-server:11434 or http://192.168.0.21:11434 | From other apps on home LAN / Tailscale |
| sps-srv1 public (vigil-gateway) | https://ai.teaganwins.net | Dev team, off-LAN, requires API key |
| laptop local | http://localhost:11434 | Portable offline use |
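To check which endpoint is reachable and what it serves, a small probe against Ollama's model-listing route (`/api/tags`) can help. The sketch below is standard-library only. The dictionary keys (`lan`, `public`, `laptop`) are made-up labels, and the bearer-token header is an assumption about how vigil-gateway authenticates; it is also an assumption that the gateway forwards `/api/tags` at all.

```python
import json
import urllib.request

# Endpoints copied from the table above; the keys are illustrative labels.
ENDPOINTS = {
    "lan": "http://vigil-server:11434",
    "public": "https://ai.teaganwins.net",
    "laptop": "http://localhost:11434",
}


def tags_request(location, api_key=None):
    """Build a GET request for Ollama's model-listing route (/api/tags)."""
    req = urllib.request.Request(ENDPOINTS[location] + "/api/tags")
    if api_key:
        # Assumption: the gateway accepts a bearer token. Adjust if it differs.
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


def list_models(location, api_key=None, timeout=5):
    """Return the model names the endpoint reports, e.g. 'qwen3:14b'."""
    with urllib.request.urlopen(tags_request(location, api_key), timeout=timeout) as resp:
        return [m["name"] for m in json.load(resp)["models"]]
```

Calling `list_models("lan")` from a machine on the home network should return the tags listed in the next section if the daemon is up.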
Models available
sps-srv1 (32 GB VRAM via 5090 + 8 GB via 2060S)
| Model | Size | Best for |
|---|---|---|
| llama3.3:70b-instruct-q4_K_M | 42 GB | Hardest reasoning (closest to Claude Sonnet on quality) |
| qwen2.5:32b-instruct-q4_K_M | 20 GB | General-purpose strong reasoning |
| qwen3:32b | 20 GB | Newer Qwen, often better than 2.5 for code |
| qwen2.5-coder:32b-instruct-q4_K_M | 19 GB | Code generation specifically |
| qwen3:14b | 9 GB | Faster general-purpose |
| qwen2.5-coder:7b | 5 GB | Fastest code completion |
| llama3.1:8b | 5 GB | Fast general-purpose |
| llama3.2-vision:11b | 8 GB | Image → text (OCR, photo description) |
| moondream | 1.7 GB | Tiny vision model, fast iteration |
| minicpm-v | 5 GB | Strong multimodal |
| deepseek-coder:6.7b | 4 GB | Code-tuned small model |
| nomic-embed-text | 274 MB | Embeddings for RAG (NOT chat) |
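Because `nomic-embed-text` is an embedding model rather than a chat model, it is called through the embeddings route instead of chat completions. A minimal sketch using only the standard library, assuming the LAN endpoint and the OpenAI-compatible `/v1/embeddings` route; the helper names `embed` and `cosine` are illustrative, not part of any SDK.

```python
import json
import math
import urllib.request


def embed(texts, base_url="http://vigil-server:11434", model="nomic-embed-text"):
    """Return one embedding vector per input string."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        base_url + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

For retrieval, embed the query and each document chunk, then rank chunks by `cosine(query_vec, chunk_vec)`.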
laptop (16 GB VRAM)
qwen2.5:14b, qwen2.5-coder:14b, qwen3:14b, moondream, llama3.2-vision:11b, llama3.1:8b, nomic-embed-text. No 32B+ models — won't fit.
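The "won't fit" rule is simple arithmetic on the sizes in the table. A rough sketch, assuming about 2 GB of headroom for the KV cache and runtime overhead (that figure is a guess, not a measurement, and the `fits` helper is hypothetical):

```python
# Download sizes in GB, copied from the sps-srv1 table above.
MODEL_SIZE_GB = {
    "llama3.3:70b-instruct-q4_K_M": 42,
    "qwen2.5:32b-instruct-q4_K_M": 20,
    "qwen3:32b": 20,
    "qwen2.5-coder:32b-instruct-q4_K_M": 19,
    "qwen3:14b": 9,
    "llama3.2-vision:11b": 8,
    "llama3.1:8b": 5,
    "moondream": 1.7,
}


def fits(model, vram_gb, headroom_gb=2.0):
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return MODEL_SIZE_GB[model] + headroom_gb <= vram_gb


# The laptop has 16 GB of VRAM, so 14B fits and 32B does not.
laptop_ok = [name for name in MODEL_SIZE_GB if fits(name, 16)]
```

By the same check, the 70B model at 42 GB exceeds the 5090's 32 GB on its own.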
When to use Ollama vs. Claude API
Use Ollama when:
- Iterating on a prompt (running it 20 times to A/B test wording)
- Code completion / autocomplete in an editor
- Embeddings for RAG (nomic-embed-text — no reason to pay OpenAI)
- Throwaway "what's a regex for X" or "explain this snippet"
- Tasks where wrong answers are cheap (validation happens elsewhere)
- Privacy-sensitive: input contains FERPA data, secrets, or anything you don't want leaving the home network
Keep using Claude API when:
- Production code paths (LedgerLearner grading, Vigil Steward, etc.)
- Long-context (>32k tokens routinely)
- Anything where Sonnet's reasoning quality is the load-bearing piece
- Agent loops with tool use (Claude is dramatically better at tool calling than open models)
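The two lists above can be condensed into a routing helper for scripts that support both backends. This is only a sketch: the function and parameter names are hypothetical, and letting the privacy flag win over every other consideration is an assumption, not a stated policy.

```python
def choose_backend(*, production=False, needs_tools=False, est_tokens=0, sensitive=False):
    """Return 'ollama' or 'claude' for a request, following the lists above."""
    # Assumption: sensitive data must stay on the home network, so it wins.
    if sensitive:
        return "ollama"
    # Production paths, tool-use loops, and routinely long contexts go to Claude.
    if production or needs_tools or est_tokens > 32_000:
        return "claude"
    # Everything else (prompt iteration, throwaway questions) stays local.
    return "ollama"
```

For example, `choose_backend(needs_tools=True)` returns `"claude"`, while a call with no flags set returns `"ollama"`.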
3 recipes
1. Quick CLI test from anywhere
From the laptop or sps-srv1:
ollama run qwen2.5-coder:32b-instruct-q4_K_M
# (interactive prompt — type, get response, /bye to exit)
One-shot:
ollama run qwen2.5:32b-instruct-q4_K_M "Explain CTEs in Postgres in 3 bullets"
2. Swap a Python script from Anthropic to local Ollama
Before:
from anthropic import Anthropic
client = Anthropic()
resp = client.messages.create(model="claude-sonnet-4-20250514", ...)
After (uses the OpenAI SDK pointed at Ollama):
from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "..."}],
)
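The call is the standard client.chat.completions.create() from the OpenAI SDK, so only the base URL changes between endpoints: use http://localhost:11434/v1 on the laptop, or https://ai.teaganwins.net/v1 from off-LAN (pass the API key stored in BW as api_key). Response parsing changes too: read resp.choices[0].message.content instead of Anthropic's resp.content[0].text.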
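To avoid editing the script each time it moves between the LAN, the laptop, and the gateway, the base URL and key can come from the environment. A minimal sketch: `OPENAI_BASE_URL` and `OPENAI_API_KEY` are the variable names the OpenAI Python SDK itself reads, while `client_kwargs` and the LAN default are assumptions made for illustration.

```python
import os

# Assumed default: the LAN endpoint from the recipe above.
DEFAULT_BASE_URL = "http://vigil-server:11434/v1"


def client_kwargs(env=None):
    """Build keyword arguments for OpenAI(...) from environment variables."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL),
        # "ollama" is the placeholder key used above for keyless endpoints.
        "api_key": env.get("OPENAI_API_KEY", "ollama"),
    }
```

Usage is `client = OpenAI(**client_kwargs())`. Off-LAN, export `OPENAI_BASE_URL=https://ai.teaganwins.net/v1` and the gateway key before running the script.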
3. Code completion in VS Code
Install the Continue extension (continue.continue in marketplace). Add to ~/.continue/config.json:
{
  "models": [
    {
      "title": "qwen2.5-coder-32b (sps-srv1)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b-instruct-q4_K_M",
      "apiBase": "http://vigil-server:11434"
    }
  ]
}
Cmd/Ctrl+I to invoke. Free, no API key, no rate limits.
Gotchas
- First request is slow. Ollama loads the model into VRAM on the first call; later calls are fast while the model stays loaded (by default it is unloaded after about 5 minutes idle). Switching models forces an unload/reload, so pin one model per workflow.
- 70B is slow even at Q4. Its 42 GB of weights do not fit in the 5090's 32 GB of VRAM, which is the likely reason throughput drops to ~5-10 tok/s. Use 32B unless you specifically need the quality jump.
- vigil-gateway requires API key, sps-srv1 LAN doesn't. For team scripts that run off-LAN (CI, Raghav's laptop on Tailscale), use the gateway endpoint with a key from BW.
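- Context length varies by model. Qwen2.5 = 32k. Llama 3.3 = 128k. These are model maximums; Ollama's default context window (num_ctx) is much smaller, so raise it explicitly and check the limit before piping huge documents.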
- Vision models want images base64-encoded in the request — use the SDK helpers, don't hand-craft.
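The load-time and context-length gotchas can both be handled per request through Ollama's native `/api/chat` route, which accepts a `keep_alive` duration and an `options.num_ctx` value. A sketch under stated assumptions: the 16384-token window and the 30-minute keep-alive are arbitrary illustrative values, and `chat_request` / `chat` are hypothetical helper names.

```python
import json
import urllib.request


def chat_request(prompt, model="qwen2.5:32b-instruct-q4_K_M",
                 base_url="http://vigil-server:11434",
                 num_ctx=16384, keep_alive="30m"):
    """Build a non-streaming request for Ollama's native chat route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Raise the context window above Ollama's default.
        "options": {"num_ctx": num_ctx},
        # Keep the model in VRAM this long after the request finishes.
        "keep_alive": keep_alive,
    }
    return urllib.request.Request(
        base_url + "/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(prompt, **kwargs), timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

A larger `num_ctx` costs extra VRAM for the KV cache, so it is worth raising only for workflows that actually need long inputs.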