
SPS Dev Tool Guides

Shared AI infrastructure for the Sand Point Studios dev team

Ollama — local LLM inference

What it is. A daemon that runs open-weight LLMs (Qwen, Llama, DeepSeek, etc.) on your GPU and exposes them over an OpenAI-compatible HTTP API. Drop-in replacement for Claude/OpenAI clients when you don't need top-tier reasoning.

Where it lives

| Location | Endpoint | Use case |
| --- | --- | --- |
| sps-srv1 LAN | http://vigil-server:11434 or http://192.168.0.21:11434 | From other apps on home LAN / Tailscale |
| sps-srv1 public (vigil-gateway) | https://ai.teaganwins.net | Dev team, off-LAN, requires API key |
| laptop local | http://localhost:11434 | Portable offline use |
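A quick way to find out which of these endpoints is reachable from where you are is to ask each one for its model list. The sketch below uses Ollama's `GET /api/tags` route. The preference order, the helper names, and the bearer-token header are assumptions for illustration; check how vigil-gateway expects the key and whether it forwards `/api/tags` at all.

```python
import json
import urllib.error
import urllib.request

# Endpoints copied from the table above. The order is an arbitrary preference.
ENDPOINTS = [
    "http://localhost:11434",
    "http://vigil-server:11434",
    "https://ai.teaganwins.net",
]


def list_models(base_url, api_key=None, timeout=3.0):
    """Return model names from Ollama's GET /api/tags, or None if unreachable."""
    request = urllib.request.Request(f"{base_url}/api/tags")
    if api_key:
        # Assumption: the gateway accepts the key as a bearer token.
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or non-JSON body.
        return None
    return [model["name"] for model in payload.get("models", [])]


def first_reachable(endpoints, probe=list_models):
    """Return (base_url, models) for the first endpoint that answers, else None."""
    for base_url in endpoints:
        models = probe(base_url)
        if models is not None:
            return base_url, models
    return None


# Usage: first_reachable(ENDPOINTS) returns the first live endpoint and its models.
```

The `probe` parameter exists only so the selection logic can be exercised without a network.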

Models available

sps-srv1 (32 GB VRAM via 5090 + 8 GB via 2060S)

| Model | Size | Best for |
| --- | --- | --- |
| llama3.3:70b-instruct-q4_K_M | 42 GB | Hardest reasoning (closest to Claude Sonnet on quality) |
| qwen2.5:32b-instruct-q4_K_M | 20 GB | General-purpose strong reasoning |
| qwen3:32b | 20 GB | Newer Qwen, often better than 2.5 for code |
| qwen2.5-coder:32b-instruct-q4_K_M | 19 GB | Code generation specifically |
| qwen3:14b | 9 GB | Faster general-purpose |
| qwen2.5-coder:7b | 5 GB | Fastest code completion |
| llama3.1:8b | 5 GB | Fast general-purpose |
| llama3.2-vision:11b | 8 GB | Image → text (OCR, photo description) |
| moondream | 1.7 GB | Tiny vision model, fast iteration |
| minicpm-v | 5 GB | Strong multimodal |
| deepseek-coder:6.7b | 4 GB | Code-tuned small model |
| nomic-embed-text | 274 MB | Embeddings for RAG (NOT chat) |

laptop (16 GB VRAM)

qwen2.5:14b, qwen2.5-coder:14b, qwen3:14b, moondream, llama3.2-vision:11b, llama3.1:8b, nomic-embed-text. No 32B+ models — won't fit.
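The "won't fit" rule can be sanity-checked with a little arithmetic before pulling a model. This is a rough sketch only: the sizes come from the Size column above, and the 20% headroom for KV cache and runtime overhead is an assumed figure, not a measured one.

```python
# Sizes in GB, copied from the Size column of the sps-srv1 table.
MODEL_SIZES_GB = {
    "llama3.3:70b-instruct-q4_K_M": 42,
    "qwen2.5:32b-instruct-q4_K_M": 20,
    "qwen3:32b": 20,
    "qwen2.5-coder:32b-instruct-q4_K_M": 19,
    "qwen3:14b": 9,
    "qwen2.5-coder:7b": 5,
    "llama3.1:8b": 5,
    "llama3.2-vision:11b": 8,
    "moondream": 1.7,
    "minicpm-v": 5,
    "deepseek-coder:6.7b": 4,
    "nomic-embed-text": 0.274,
}

# Assumption: keep about 20% of VRAM free for KV cache and runtime overhead.
HEADROOM = 0.20


def fits(model, vram_gb, headroom=HEADROOM):
    """True if the model's weights fit in VRAM with the given headroom left over."""
    return MODEL_SIZES_GB[model] <= vram_gb * (1 - headroom)


def models_that_fit(vram_gb, headroom=HEADROOM):
    """List the models from the table that fit in vram_gb of VRAM."""
    return [name for name in MODEL_SIZES_GB if fits(name, vram_gb, headroom)]


# Usage: models_that_fit(16) lists candidates for the 16 GB laptop.
```

Longer contexts need more headroom, so treat a borderline result as "probably too tight".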

When to use Ollama vs. Claude API

Use Ollama when:
- Iterating on a prompt (running it 20 times to A/B test wording)
- Code completion / autocomplete in an editor
- Embeddings for RAG (nomic-embed-text; no reason to pay OpenAI)
- Throwaway "what's a regex for X" or "explain this snippet"
- Tasks where wrong answers are cheap (validation happens elsewhere)
- Privacy-sensitive: input contains FERPA data, secrets, or anything you don't want leaving the home network
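For the embeddings case, a minimal sketch of the retrieval half of a RAG pipeline might look like the following. It assumes a reasonably recent Ollama that serves `POST /api/embed` and the LAN endpoint from the table; the `embed`, `cosine`, and `rank` helpers are illustrative names, not part of any SPS codebase.

```python
import json
import math
import urllib.request

BASE_URL = "http://vigil-server:11434"  # LAN endpoint; swap for localhost on the laptop


def embed(texts, model="nomic-embed-text", base_url=BASE_URL):
    """POST to Ollama's /api/embed and return one vector per input string."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embeddings"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )


# Usage (needs a reachable server):
#   docs = ["CTEs in Postgres", "VRAM sizing notes"]
#   vectors = embed(docs + ["how do I write a WITH query?"])
#   best = rank(vectors[-1], vectors[:-1])[0]
```

For more than a few thousand documents a vector store is the better tool; the brute-force `rank` here is only meant to show the shape of the data.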

Keep using Claude API when:
- Production code paths (LedgerLearner grading, Vigil Steward, etc.)
- Long-context (>32k tokens routinely)
- Anything where Sonnet's reasoning quality is the load-bearing piece
- Agent loops with tool use (Claude is dramatically better at tool calling than open models)
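If a script has to make this choice at runtime, the two lists can be folded into a small routing function. The sketch below is one possible encoding; the `Task` fields are hypothetical, and letting privacy override every other rule is an assumption the lists above do not state outright.

```python
from dataclasses import dataclass

LONG_CONTEXT_TOKENS = 32_000  # the ">32k tokens" threshold from the list above


@dataclass
class Task:
    """Hypothetical description of a job; field names are illustrative only."""
    production: bool = False
    needs_tool_calls: bool = False
    reasoning_critical: bool = False
    sensitive: bool = False
    context_tokens: int = 0


def choose_backend(task):
    """Return "ollama" or "claude" following the rules of thumb above."""
    # Assumption: data that must not leave the home network always stays local.
    if task.sensitive:
        return "ollama"
    if (
        task.production
        or task.needs_tool_calls
        or task.reasoning_critical
        or task.context_tokens > LONG_CONTEXT_TOKENS
    ):
        return "claude"
    # Prompt iteration, autocomplete, throwaway questions, cheap-to-be-wrong tasks.
    return "ollama"
```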

3 recipes

1. Quick CLI test from anywhere

From sps-srv1 (on the laptop, substitute a model that fits, such as qwen2.5-coder:14b):

ollama run qwen2.5-coder:32b-instruct-q4_K_M
# (interactive prompt — type, get response, /bye to exit)

One-shot:

ollama run qwen2.5:32b-instruct-q4_K_M "Explain CTEs in Postgres in 3 bullets"
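The same one-shot can be issued from a script without the CLI by calling Ollama's native `POST /api/generate` route. A minimal sketch, assuming the daemon is listening on localhost and the model tag from the table is installed; the helper names are made up for this example.

```python
import json
import urllib.request


def build_generate_request(prompt, model="qwen2.5:32b-instruct-q4_K_M",
                           base_url="http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/generate route."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the model's full reply as a string."""
    request = build_generate_request(prompt, **kwargs)
    # Large models can take a while on the first call while weights load.
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["response"]


# Usage (needs a running daemon):
#   print(generate("Explain CTEs in Postgres in 3 bullets"))
```

Setting `"stream": False` makes the server return a single JSON object instead of a stream of partial chunks, which keeps the parsing trivial.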

2. Swap a Python script from Anthropic to local Ollama

Before:

from anthropic import Anthropic
client = Anthropic()
resp = client.messages.create(model="claude-sonnet-4-20250514", ...)

After (uses the OpenAI SDK pointed at Ollama):

from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "..."}],
)

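The call is the standard OpenAI-style client.chat.completions.create() API, so the reply is read from resp.choices[0].message.content rather than Anthropic's resp.content[0].text. Swap the base URL for http://localhost:11434/v1 on the laptop, or https://ai.teaganwins.net/v1 from off-LAN (this one needs the API key, which is in BW).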
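To avoid hard-coding the base URL in every script, the three endpoints can be kept in one place and selected by name. This is a sketch under assumptions: the `SPS_OLLAMA_API_KEY` environment variable is a hypothetical name, and it presumes the gateway accepts the key the way the OpenAI SDK sends it (as a bearer token).

```python
import os

# Base URLs from this guide, with the /v1 suffix the OpenAI SDK expects.
BASE_URLS = {
    "lan": "http://vigil-server:11434/v1",
    "laptop": "http://localhost:11434/v1",
    "public": "https://ai.teaganwins.net/v1",
}


def client_kwargs(location, env=os.environ):
    """Return the base_url/api_key pair to pass to OpenAI(...) for a location."""
    base_url = BASE_URLS[location]
    if location == "public":
        # Hypothetical variable name; export it from wherever you keep the key.
        api_key = env.get("SPS_OLLAMA_API_KEY")
        if not api_key:
            raise RuntimeError("SPS_OLLAMA_API_KEY must be set for the public endpoint")
    else:
        # Ollama ignores the key on LAN/local, but the SDK requires a non-empty value.
        api_key = "ollama"
    return {"base_url": base_url, "api_key": api_key}


# Usage (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("lan"))
#   resp = client.chat.completions.create(
#       model="qwen2.5:32b-instruct-q4_K_M",
#       messages=[{"role": "user", "content": "..."}],
#   )
#   print(resp.choices[0].message.content)
```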

3. Code completion in VS Code

Install the Continue extension (continue.continue in the marketplace), then add this to ~/.continue/config.json:

{
  "models": [
    {
      "title": "qwen2.5-coder-32b (sps-srv1)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b-instruct-q4_K_M",
      "apiBase": "http://vigil-server:11434"
    }
  ]
}

Press Cmd/Ctrl+I to invoke it. It's free, with no API key and no rate limits.
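If you already have a config.json with other models in it, pasting the block by hand risks clobbering them. The sketch below merges the entry instead. It assumes the file is strict JSON (no comments) and uses the path and model entry shown above; the function names are illustrative.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".continue" / "config.json"

# The model entry from the recipe above.
NEW_MODEL = {
    "title": "qwen2.5-coder-32b (sps-srv1)",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "apiBase": "http://vigil-server:11434",
}


def add_model(config, model):
    """Return a copy of config with model appended unless its title already exists."""
    models = list(config.get("models", []))
    if all(existing.get("title") != model["title"] for existing in models):
        models.append(model)
    return {**config, "models": models}


def update_config_file(path=CONFIG_PATH, model=NEW_MODEL):
    """Read the config (or start empty), merge the model, and write it back."""
    # Assumption: the existing file parses as plain JSON.
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_model(config, model), indent=2) + "\n")


# Usage: update_config_file() edits ~/.continue/config.json in place.
```

Running it twice is harmless because the title check makes the merge idempotent.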

Gotchas