Ollama — local LLM inference

What it is. A daemon that runs open-weight LLMs (Qwen, Llama, DeepSeek, etc.) on your GPU and exposes them over an OpenAI-compatible HTTP API. Drop-in replacement for OpenAI clients, and a small SDK swap for Claude clients (see recipe 2), when you don't need top-tier reasoning.

Where it lives

Location	Endpoint	Use case
sps-srv1 LAN	`http://vigil-server:11434` or `http://192.168.0.21:11434`	From other apps on home LAN / Tailscale
sps-srv1 public (vigil-gateway)	`https://ai.teaganwins.net`	Dev team, off-LAN, requires API key

Models available

sps-srv1 — 32 GB VRAM via 5090 + 8 GB via 2060S.

Model	Size	Best for
`llama3.3:70b-instruct-q4_K_M`	42 GB	Hardest reasoning (closest to Claude Sonnet on quality)
`qwen2.5:32b-instruct-q4_K_M`	20 GB	General-purpose strong reasoning
`qwen3:32b`	20 GB	Newer Qwen, often better than 2.5 for code
`qwen2.5-coder:32b-instruct-q4_K_M`	19 GB	Code generation specifically
`qwen3:14b`	9 GB	Faster general-purpose
`qwen2.5-coder:7b`	5 GB	Fastest code completion
`llama3.1:8b`	5 GB	Fast general-purpose
`llama3.2-vision:11b`	8 GB	Image → text (OCR, photo description)
`moondream`	1.7 GB	Tiny vision model, fast iteration
`minicpm-v`	5 GB	Strong multimodal
`deepseek-coder:6.7b`	4 GB	Code-tuned small model
`nomic-embed-text`	274 MB	Embeddings for RAG (NOT chat)

Need a model not in the list? Pull it on sps-srv1 with `ssh teagan@vigil-server 'docker exec ollama ollama pull <model>'`, or ask Teagan.
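To check what is actually installed before asking for a pull, you can query the server's native `/api/tags` endpoint. The sketch below assumes you are on the LAN or Tailscale; the helper names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://vigil-server:11434"  # LAN endpoint from the table above


def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    models = tags_response.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


def fetch_models(base_url: str = OLLAMA_URL) -> list[tuple[str, float]]:
    """Ask the server which models it has pulled (needs LAN/Tailscale access)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return summarize_models(json.load(resp))
```

Calling `fetch_models()` returns the installed tags with their on-disk sizes, which should roughly match the table above.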

When to use Ollama vs. Claude API

Use Ollama when:
- Iterating on a prompt (running it 20 times to A/B test wording)
- Code completion / autocomplete in an editor
- Embeddings for RAG (`nomic-embed-text`; no reason to pay OpenAI)
- Throwaway "what's a regex for X" or "explain this snippet"
- Tasks where wrong answers are cheap (validation happens elsewhere)
- Privacy-sensitive: input contains FERPA data, secrets, or anything you don't want leaving the home network

Keep using Claude API when:
- Production code paths (LedgerLearner grading, Vigil Steward, etc.)
- Long-context (>32k tokens routinely)
- Anything where Sonnet's reasoning quality is the load-bearing piece
- Agent loops with tool use (Claude is dramatically better at tool calling than open models)
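For the embeddings case in the first list, a minimal sketch of local RAG retrieval might look like this. It assumes a reasonably recent Ollama that exposes `/api/embed` (older versions use `/api/embeddings` with a single `prompt` field); `embed`, `cosine`, and `rank` are illustrative helper names, not part of any library.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://vigil-server:11434"  # LAN endpoint


def embed(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Embed a batch of strings via Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

Typical use is to embed the documents and the query with one `embed(...)` call, then pass the query vector and the document vectors to `rank`. Nothing leaves the home network, which also covers the privacy-sensitive case.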

3 recipes

1. Quick CLI test (curl, anywhere on the LAN/Tailscale)

curl http://vigil-server:11434/api/generate -d '{
  "model": "qwen2.5:32b-instruct-q4_K_M",
  "prompt": "Explain CTEs in Postgres in 3 bullets",
  "stream": false
}'

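Or, for an interactive session, SSH into sps-srv1 and run `docker exec -it ollama ollama run qwen2.5-coder:32b-instruct-q4_K_M` (use a tag from the table above so Ollama doesn't try to pull a new one).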
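If you would rather run the same quick test from Python without installing any SDK, the curl call above translates roughly as follows. This is a standard-library sketch; the function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str,
    base_url: str = "http://vigil-server:11434",
) -> urllib.request.Request:
    """Build the same POST /api/generate request that the curl recipe sends."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5:32b-instruct-q4_K_M") -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(prompt, model)
    # Generous timeout: the first call may include loading the model into VRAM.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```

With `"stream": false` the server returns a single JSON object, and the generated text is in its `response` field.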

2. Swap a Python script from Anthropic to local Ollama

Before:

from anthropic import Anthropic
client = Anthropic()
resp = client.messages.create(model="claude-sonnet-4-20250514", ...)

After (uses the OpenAI SDK pointed at Ollama):

from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "..."}],
)

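This is the standard OpenAI `client.chat.completions.create()` call, so the reply text is in `resp.choices[0].message.content` rather than Anthropic's `resp.content[0].text`. Off-LAN? Swap the base URL for `https://ai.teaganwins.net/v1` and pass your gateway API key (mint one in vigil-gateway admin or grab from BW).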
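One way to keep a script working both on and off the LAN is to pick the base URL and key in one place. The sketch below is an assumption-laden example: the `VIGIL_GATEWAY_API_KEY` environment variable is a hypothetical name, and it assumes the gateway accepts the key the way the OpenAI SDK normally sends it.

```python
import os

LAN_BASE_URL = "http://vigil-server:11434/v1"
GATEWAY_BASE_URL = "https://ai.teaganwins.net/v1"


def client_settings(off_lan: bool, env=os.environ) -> dict:
    """Return keyword arguments for OpenAI(...), choosing LAN or gateway access."""
    if not off_lan:
        # The LAN endpoint needs no key, but the SDK wants a non-empty string.
        return {"base_url": LAN_BASE_URL, "api_key": "ollama"}
    key = env.get("VIGIL_GATEWAY_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("Set VIGIL_GATEWAY_API_KEY to a gateway key first")
    return {"base_url": GATEWAY_BASE_URL, "api_key": key}


def ask(prompt: str, off_lan: bool = False,
        model: str = "qwen2.5:32b-instruct-q4_K_M") -> str:
    """Send one chat message and return the reply text."""
    # Imported here so client_settings() works without the openai package.
    from openai import OpenAI

    client = OpenAI(**client_settings(off_lan))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

A CI job would then call `ask("...", off_lan=True)` with the key injected as a secret, while the same script on the home network runs with the default.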

3. Code completion in VS Code

Install the Continue extension (continue.continue in marketplace). Add to ~/.continue/config.json:

{
  "models": [
    {
      "title": "qwen2.5-coder-32b (sps-srv1)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b-instruct-q4_K_M",
      "apiBase": "http://vigil-server:11434"
    }
  ]
}

Cmd/Ctrl+I to invoke. Free, no API key, no rate limits.

Gotchas

First request is slow. Ollama loads the model into VRAM on the first call, and later calls are fast while it stays loaded (by default it is unloaded after about 5 minutes idle). If you switch models, it unloads/reloads. Pin one model per workflow.
70B is slow even at Q4. Its 42 GB of weights exceed the 5090's 32 GB of VRAM, so expect ~5-10 tok/s. Use 32B unless you specifically need the quality jump.
vigil-gateway requires an API key; the sps-srv1 LAN endpoint doesn't. For team scripts that run off-LAN (CI, Raghav's laptop on Tailscale), use the gateway endpoint with a key from BW.
Context length varies by model. Qwen2.5 = 32k. Llama 3.3 = 128k. These are model maximums; Ollama's default context window (`num_ctx`) is usually much smaller unless the request or model config raises it. Check before piping huge documents.
Vision models want images base64-encoded in the request — use the SDK helpers, don't hand-craft.
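Two of these gotchas (slow first request, context length) can be handled in the request body of the native `/api/generate` endpoint through its `keep_alive` and `options.num_ctx` fields. The sketch below only builds the JSON payloads; the function names and the `"30m"` default are illustrative choices, not settings taken from sps-srv1.

```python
import json


def warmup_payload(model: str, keep_alive: str = "30m") -> dict:
    """Payload that loads a model without generating anything.

    Posting a model name with no prompt to /api/generate makes Ollama load
    the model; keep_alive controls how long it stays in VRAM afterwards.
    """
    return {"model": model, "keep_alive": keep_alive}


def generate_payload(model: str, prompt: str, keep_alive: str = "30m",
                     num_ctx=None) -> dict:
    """Payload for a normal request, optionally with a larger context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }
    if num_ctx is not None:
        # A bigger context window costs more VRAM, so raise it only when needed.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


# Print a body that can be passed to curl -d or any HTTP client.
print(json.dumps(generate_payload("qwen2.5:32b-instruct-q4_K_M",
                                  "Summarize this document",
                                  num_ctx=16384)))
```

Sending `warmup_payload(...)` at the start of a work session moves the model-load delay off your first real request.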