SPS Dev Tool Guides

Shared AI infrastructure for the Sand Point Studios dev team

Kokoro TTS — text-to-speech

What it is. A small (82M-parameter), fast TTS model behind an OpenAI-compatible API. About 50 voices covering American and British accents, male and female. Runs on CPU on the Ryzen 3700X at roughly 1-2× realtime, i.e. 60 seconds of audio takes about 30-60 seconds of compute. Drop-in replacement for OpenAI's TTS API at zero per-character cost.
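To budget a bulk job, the realtime figure can be turned into a rough compute estimate. This is a minimal sketch: the helper name `compute_seconds` is hypothetical, and it assumes "N× realtime" means N seconds of audio produced per second of compute, which matches the 60-seconds-in-30-60-seconds figure quoted here.

```python
def compute_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of speech.

    realtime_factor is audio seconds produced per compute second,
    so 2.0 means synthesis runs twice as fast as playback.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 45-minute lecture at the slow (1x) and fast (2x) ends of the quoted range.
lecture = 45 * 60
slow = compute_seconds(lecture, 1.0)
fast = compute_seconds(lecture, 2.0)
print(f"{fast / 60:.1f} to {slow / 60:.1f} minutes of compute")
```

Actual throughput will vary with server load and text content, so treat the result as an order-of-magnitude guide.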

Where it lives

| Location | Endpoint | Auth |
| --- | --- | --- |
| sps-srv1 service (shared) | https://tts.sandpointstudios.ltd/v1/audio/speech | CF Access (Google SSO, @sandpointstudios.ltd) |
| sps-srv1 LAN | http://vigil-server:8007/v1/audio/speech | None |
| laptop CLI | python tts.py "text" at C:\Users\twist\AppData\Local\Kokoro\ | None (fully offline) |

Voices

50+ voices are available. Naming pattern: <accent><gender>_<name>, so af_heart is American (a), female (f), "heart". Common ones:

| Voice | Notes |
| --- | --- |
| af_heart | American female, warm; good default for narration |
| af_nicole | American female, neutral |
| af_sky | American female, brighter |
| am_michael | American male, default newsreader |
| am_adam | American male, slightly higher |
| bf_emma | British female |
| bm_george | British male |

Full list: curl http://vigil-server:8007/v1/audio/voices
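The naming pattern can be decoded mechanically, which helps when filtering the full voice list. The sketch below is illustrative only: `parse_voice` and the `Voice` tuple are hypothetical names, and the lookup tables cover just the American/British and female/male codes listed in this guide. Any other prefix is reported as unknown rather than guessed.

```python
from typing import NamedTuple

# Codes taken from the voices listed in this guide; others may exist.
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


class Voice(NamedTuple):
    accent: str
    gender: str
    name: str


def parse_voice(voice_id: str) -> Voice:
    """Split an ID like 'af_heart' into accent, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"not in <accent><gender>_<name> form: {voice_id!r}")
    accent_code, gender_code = prefix
    return Voice(
        accent=ACCENTS.get(accent_code, f"unknown ({accent_code})"),
        gender=GENDERS.get(gender_code, f"unknown ({gender_code})"),
        name=name,
    )


print(parse_voice("bm_george"))
```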

When to use Kokoro vs. OpenAI TTS / ElevenLabs

Use Kokoro when:
- Generating bulk audio (lecture narration, demo videos, audiobook drafts): unlimited, no cost
- Privacy: course content with student names, FERPA-adjacent
- Quick iteration: regenerate the same line with different voices to A/B
- Anywhere "good but not perfect" voice quality is fine

Keep using ElevenLabs/OpenAI TTS when:
- Production-grade narration for a marketing video (ElevenLabs is still more emotive)
- Voice cloning (Kokoro doesn't clone; for that, Coqui XTTS is the future upgrade path)
- Specific accents or non-English support beyond what Kokoro ships

3 recipes

1. One-line voice sample from the laptop

cd C:\Users\twist\AppData\Local\Kokoro
.venv\Scripts\python.exe tts.py "Welcome to BUS four ninety one." af_heart
# writes out.wav in the same dir

Usage: python tts.py <text> [voice]. The voice defaults to af_heart.
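Because each run writes out.wav to the same place, comparing voices means renaming the file between runs. The following is a rough sketch of a wrapper for that, not part of tts.py itself: `tts_command` and `render_voices` are hypothetical helpers, and the code assumes the install path and the out.wav behaviour described in this recipe.

```python
import subprocess
from pathlib import Path

# Install location from this guide; adjust if yours differs.
KOKORO_DIR = Path(r"C:\Users\twist\AppData\Local\Kokoro")


def tts_command(text: str, voice: str = "af_heart") -> list[str]:
    """Build the argv for one tts.py run using the bundled venv interpreter."""
    python = KOKORO_DIR / ".venv" / "Scripts" / "python.exe"
    return [str(python), "tts.py", text, voice]


def render_voices(text: str, voices: list[str]) -> list[Path]:
    """Render the same line once per voice, keeping each result."""
    outputs = []
    for voice in voices:
        # tts.py is assumed to write out.wav into its own directory.
        subprocess.run(tts_command(text, voice), cwd=KOKORO_DIR, check=True)
        target = KOKORO_DIR / f"sample-{voice}.wav"
        (KOKORO_DIR / "out.wav").replace(target)  # overwrites an older sample
        outputs.append(target)
    return outputs


# Example (run on the laptop):
# render_voices("Welcome to BUS four ninety one.", ["af_heart", "af_sky"])
```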

2. Narrate a slide deck (Python, from any project)

from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:8007/v1", api_key="kokoro")

slides = [
    "Slide one: introduction to managerial accounting.",
    "Slide two: cost behavior — fixed versus variable.",
    # ...
]
for i, text in enumerate(slides, 1):
    audio = client.audio.speech.create(
        model="kokoro",
        voice="af_heart",
        input=text,
        response_format="mp3",
    )
    with open(f"slide-{i:02d}.mp3", "wb") as f:
        f.write(audio.read())

The endpoint speaks OpenAI's /v1/audio/speech protocol, so existing OpenAI TTS code works after swapping base_url and substituting a Kokoro model and voice name.
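For projects that would rather not depend on the openai package, the same call can be made with the standard library. This is a sketch under assumptions: `build_speech_request` and `synthesize` are hypothetical helper names, the JSON fields mirror the ones used in the recipe above, and the LAN endpoint is assumed to need no auth header.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str = "af_heart",
    response_format: str = "mp3",
    base_url: str = "http://vigil-server:8007/v1",
) -> urllib.request.Request:
    """Build a POST to <base_url>/audio/speech with an OpenAI-style JSON body."""
    payload = {
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, path: str, **kwargs) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as f:
        f.write(response.read())


# Example (needs LAN access to the server):
# synthesize("Slide one: introduction.", "slide-01.mp3", voice="am_michael")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without the server being reachable.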

3. Generate a demo voiceover with ffmpeg

# 1. Get the audio
curl -X POST http://vigil-server:8007/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Your script here.","voice":"am_michael","response_format":"wav"}' \
  --output narration.wav

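# 2. Mux with the screen recording. The -map flags take video from the first input
#    and audio from the narration, so any audio already in the recording is dropped.
ffmpeg -i screen-recording.mp4 -i narration.wav -map 0:v:0 -map 1:a:0 -c:v copy -c:a aac -shortest demo.mp4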

Gotchas