Kokoro TTS — text-to-speech

What it is. A small (82M-parameter), fast TTS model with an OpenAI-compatible API. ~50 voices across American/British accents, male and female. Fast on CPU: roughly 1-2× realtime on the Ryzen 3700X, i.e. 60 seconds of audio takes ~30-60 seconds of compute. A drop-in replacement for OpenAI's TTS API at zero per-character cost.
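To budget a batch job, the realtime figure above can be turned into a rough estimate. This is only a sketch of the arithmetic: the function name and the default speed range are illustrative, taken from the ~1-2× realtime figure quoted here, and actual throughput will vary with load.

```python
def estimate_compute_seconds(audio_seconds, speed_low=1.0, speed_high=2.0):
    """Return (best_case, worst_case) compute time in seconds.

    speed_low/speed_high are realtime multiples (assumed 1-2x, per the
    figure quoted for CPU inference); 2x means audio is produced twice
    as fast as it plays back.
    """
    return audio_seconds / speed_high, audio_seconds / speed_low


# 60 seconds of audio: between 30 and 60 seconds of compute
print(estimate_compute_seconds(60))
```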

Where it lives

| Location | Endpoint | Auth |
|---|---|---|
| sps-srv1 service (shared) | `https://tts.sandpointstudios.ltd/v1/audio/speech` | CF Access (Google SSO, `@sandpointstudios.ltd`) |
| sps-srv1 LAN | `http://vigil-server:8007/v1/audio/speech` | None |
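Scripts running off-LAN cannot complete the Google sign-in, so they need a CF Access service token instead. The sketch below shows one way to attach it using only the standard library. Cloudflare Access reads service tokens from the `CF-Access-Client-Id` and `CF-Access-Client-Secret` headers. The environment variable names and function names are assumptions for illustration, not an established convention on this setup.

```python
import json
import os
import urllib.request

PUBLIC_URL = "https://tts.sandpointstudios.ltd/v1/audio/speech"


def build_speech_request(text, voice="af_heart", client_id=None, client_secret=None):
    """Build (but do not send) a POST carrying CF Access service-token headers."""
    # Hypothetical env var names; use whatever your secret store exports.
    client_id = client_id or os.environ["CF_ACCESS_CLIENT_ID"]
    client_secret = client_secret or os.environ["CF_ACCESS_CLIENT_SECRET"]
    payload = {"model": "kokoro", "input": text, "voice": voice, "response_format": "mp3"}
    return urllib.request.Request(
        PUBLIC_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "CF-Access-Client-Id": client_id,
            "CF-Access-Client-Secret": client_secret,
        },
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=300) as resp, open(path, "wb") as f:
        f.write(resp.read())
```

Calling `synthesize_to_file("Hello from off-LAN.", "hello.mp3")` would then fetch one clip through the public URL, assuming the token has been granted access to the application.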

Voices

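50+ available. Naming pattern: `<accent><gender>_<name>`, where the accent letter is `a` (American) or `b` (British) and the gender letter is `f` (female) or `m` (male). Common ones: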

| Voice | Notes |
|---|---|
| `af_heart` | American female, warm; good default for narration |
| `af_nicole` | American female, neutral |
| `af_sky` | American female, brighter |
| `am_michael` | American male, default newsreader |
| `am_adam` | American male, slightly higher |
| `bf_emma` | British female |
| `bm_george` | British male |

Full list: `curl http://vigil-server:8007/v1/audio/voices`
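When sorting through the full list, the naming pattern can be decoded mechanically. The helper below is a minimal sketch that only knows the two accents and two genders documented in the table above. Any other prefix letter is reported as unknown rather than guessed, and the function name is illustrative.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id):
    """Split an ID like 'af_heart' into (accent, gender, name)."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id: {voice_id!r}")
    # Prefix letters outside the documented set are labeled, not guessed.
    accent = ACCENTS.get(prefix[0], "unknown accent")
    gender = GENDERS.get(prefix[1], "unknown gender")
    return accent, gender, name


print(describe_voice("bm_george"))
```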

When to use Kokoro vs. OpenAI TTS / ElevenLabs

Use Kokoro when:
- Generating bulk audio (lecture narration, demo videos, audiobook drafts): unlimited, no cost
- Privacy: course content with student names, FERPA-adjacent
- Quick iteration: regenerate the same line with different voices to A/B
- Anywhere "good but not perfect" voice quality is fine

Keep using ElevenLabs/OpenAI TTS when:
- Production-grade narration for a marketing video (ElevenLabs is still better at emotive delivery)
- Voice cloning (Kokoro doesn't clone; for that, Coqui XTTS is the future upgrade path)
- Specific accents or non-English support beyond what Kokoro ships

3 recipes

1. One-line voice sample (curl, any machine on the LAN/Tailscale)

curl -X POST http://vigil-server:8007/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Welcome to BUS four ninety one.","voice":"af_heart","response_format":"wav"}' \
  --output out.wav

Voice defaults to af_heart server-side if you omit it. See the voices table above.
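For A/B testing voices, the same request can be repeated once per voice. This is a rough standard-library sketch that reuses the payload from the curl command above. The `sample-<voice>.wav` filename scheme and the function names are illustrative choices.

```python
import json
import urllib.request

LAN_URL = "http://vigil-server:8007/v1/audio/speech"


def audition_requests(sentence, voices):
    """Return (filename, Request) pairs, one per voice, without sending anything."""
    jobs = []
    for voice in voices:
        payload = {"model": "kokoro", "input": sentence, "voice": voice, "response_format": "wav"}
        req = urllib.request.Request(
            LAN_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        jobs.append((f"sample-{voice}.wav", req))
    return jobs


def run_audition(sentence, voices):
    """Send each request in turn and save one WAV file per voice."""
    for filename, req in audition_requests(sentence, voices):
        with urllib.request.urlopen(req, timeout=300) as resp, open(filename, "wb") as f:
            f.write(resp.read())
```

For example, `run_audition("Welcome to BUS four ninety one.", ["af_heart", "af_nicole", "am_michael"])` would leave three WAV files in the working directory to compare by ear.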

2. Narrate a slide deck (Python, from any project)

from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:8007/v1", api_key="kokoro")

slides = [
    "Slide one: introduction to managerial accounting.",
    "Slide two: cost behavior — fixed versus variable.",
    # ...
]
for i, text in enumerate(slides, 1):
    audio = client.audio.speech.create(
        model="kokoro",
        voice="af_heart",
        input=text,
        response_format="mp3",
    )
    with open(f"slide-{i:02d}.mp3", "wb") as f:
        f.write(audio.read())

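The endpoint speaks OpenAI's /v1/audio/speech protocol, so existing non-streaming OpenAI TTS code works after swapping base_url. The LAN endpoint has no auth, so api_key is only a placeholder; the OpenAI client just requires that it be set.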

3. Generate a demo voiceover with ffmpeg

# 1. Get the audio
curl -X POST http://vigil-server:8007/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Your script here.","voice":"am_michael","response_format":"wav"}' \
  --output narration.wav

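# 2. Mux the narration onto the video. The -map flags take video from input 0 and
#    audio from input 1, so any audio already in the recording is ignored.
ffmpeg -i screen-recording.mp4 -i narration.wav -map 0:v:0 -map 1:a:0 -c:v copy -c:a aac -shortest demo.mp4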

Gotchas

CPU inference is ~1-2× realtime, so generating a 5-minute narration takes roughly 2.5-5 minutes. Fine for batch jobs; not interactive. (GPU-accelerated Kokoro on the 5090 is a planned upgrade pending a Blackwell-capable container image.)
No voice cloning. If you want to narrate content in your own voice, Kokoro can't do that — you'd need Coqui XTTS. Open a ticket if this becomes a need; XTTS-on-sps-srv1 is a feasible follow-up.
Voice selection matters more than people expect. Test 3-4 voices on a sample sentence before committing to one for a long project.
Accent fidelity: the b*_ British voices are noticeably less natural than American ones. Stick with af_* / am_* unless British is required.
CF Access for the public URL: team members who open https://tts.sandpointstudios.ltd/... in a browser get a Google sign-in page. Programmatic access from off-LAN scripts needs a CF Access service token (mint one in the CF dashboard or via API; store it in BW), sent as the CF-Access-Client-Id and CF-Access-Client-Secret request headers.
No streaming. The endpoint returns the full audio file. For long inputs, chunk them client-side and stitch with ffmpeg.
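The chunk-and-stitch approach from the last gotcha could look roughly like this. It is a sketch under assumptions: the 400-character limit is an arbitrary starting point rather than a documented server limit, the sentence splitting is a naive regex, and the function names are illustrative. Stitching relies on ffmpeg's concat demuxer, which reads a text file of `file '...'` lines.

```python
import re


def chunk_text(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths):
    """Build the contents of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)
```

After synthesizing each chunk to its own WAV file and writing `concat_list(...)` to, say, `list.txt`, the pieces can be joined with `ffmpeg -f concat -safe 0 -i list.txt -c copy narration.wav`. Stream copy works here only because every chunk comes from the same endpoint with the same format.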