Kokoro TTS — text-to-speech
What it is. A small (82M-parameter), fast TTS model served behind an OpenAI-compatible API. ~50 voices across American/British, male/female. Fast enough on CPU: the Ryzen 3700X runs it at ~1-2× realtime (60 seconds of audio takes ~30-60 seconds of compute). A drop-in replacement for OpenAI's TTS API at zero per-character cost.
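To make the realtime-factor arithmetic concrete, here is a minimal sketch. It assumes "N× realtime" means N seconds of audio per second of compute, as in the example above; the function name and default factors are illustrative, not part of Kokoro.

```python
def compute_time_range(audio_seconds: float,
                       slow_factor: float = 1.0,
                       fast_factor: float = 2.0) -> tuple[float, float]:
    """Return (best_case, worst_case) compute seconds for a clip.

    A realtime factor of N means N seconds of audio per second of compute,
    so compute time is audio length divided by the factor.
    """
    if audio_seconds < 0 or slow_factor <= 0 or fast_factor < slow_factor:
        raise ValueError("expected audio_seconds >= 0 and 0 < slow_factor <= fast_factor")
    return audio_seconds / fast_factor, audio_seconds / slow_factor


best, worst = compute_time_range(60)
print(f"60 s of audio: {best:.0f}-{worst:.0f} s of compute")
```

The same function gives a rough budget for batch jobs: pass the total script length in seconds of expected audio.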
Where it lives
| Location | Endpoint | Auth |
|---|---|---|
| sps-srv1 service (shared) | https://tts.sandpointstudios.ltd/v1/audio/speech | CF Access (Google SSO, @sandpointstudios.ltd) |
| sps-srv1 LAN | http://vigil-server:8007/v1/audio/speech | None |
| laptop CLI | python tts.py "text" at C:\Users\twist\AppData\Local\Kokoro\ | None (fully offline) |
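Scripts running off-LAN cannot complete the browser sign-in on the shared endpoint, so they need a CF Access service token. The sketch below builds (but does not send) such a request using only the standard library. It assumes the token is passed in the `CF-Access-Client-Id` and `CF-Access-Client-Secret` headers that Cloudflare Access uses for service tokens; the environment variable names are hypothetical, and the request body mirrors the OpenAI-style payload used elsewhere on this page.

```python
import json
import os
import urllib.request

SHARED_URL = "https://tts.sandpointstudios.ltd/v1/audio/speech"


def build_speech_request(text, voice="af_heart", response_format="mp3", env=os.environ):
    """Build (but do not send) a POST to the shared endpoint with CF Access headers."""
    # Hypothetical variable names; load the real values from wherever the token is stored.
    client_id = env["CF_ACCESS_CLIENT_ID"]
    client_secret = env["CF_ACCESS_CLIENT_SECRET"]
    body = json.dumps({
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }).encode("utf-8")
    return urllib.request.Request(
        SHARED_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Service-token headers checked by Cloudflare Access.
            "CF-Access-Client-Id": client_id,
            "CF-Access-Client-Secret": client_secret,
        },
    )


def send(request, out_path):
    """Send a prepared request and write the returned audio to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

A call would look like `send(build_speech_request("Hello from off the LAN."), "hello.mp3")`. On the LAN endpoint the two CF headers are unnecessary.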
Voices
50+ available. Naming pattern: <accent><gender>_<name>. Common ones:
| Voice | Notes |
|---|---|
| af_heart | American female, warm; good default for narration |
| af_nicole | American female, neutral |
| af_sky | American female, brighter |
| am_michael | American male, default newsreader |
| am_adam | American male, slightly higher |
| bf_emma | British female |
| bm_george | British male |
Full list: curl http://vigil-server:8007/v1/audio/voices
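The naming pattern can be decoded mechanically. This is a small sketch, assuming only the prefixes that appear in the table above (`a`/`b` for American/British, `f`/`m` for female/male); `describe_voice` is an illustrative helper, not something Kokoro provides, and any other prefix letter is reported as "unknown" rather than guessed.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> dict:
    """Split a voice id of the form <accent><gender>_<name> into its parts."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"not in <accent><gender>_<name> form: {voice_id!r}")
    return {
        "accent": ACCENTS.get(prefix[0], "unknown"),
        "gender": GENDERS.get(prefix[1], "unknown"),
        "name": name,
    }


for voice in ("af_heart", "am_michael", "bf_emma", "bm_george"):
    print(voice, describe_voice(voice))
```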
When to use Kokoro vs. OpenAI TTS / ElevenLabs
Use Kokoro when:
- Generating bulk audio (lecture narration, demo videos, audiobook drafts): unlimited, no cost
- Privacy: course content with student names, FERPA-adjacent
- Quick iteration: regenerate the same line with different voices to A/B
- Anywhere "good but not perfect" voice quality is fine

Keep using ElevenLabs/OpenAI TTS when:
- Production-grade narration for a marketing video (ElevenLabs is still more expressive)
- Voice cloning (Kokoro doesn't clone; Coqui XTTS is the planned upgrade path for that)
- Specific accents or non-English support beyond what Kokoro ships
3 recipes
1. One-line voice sample from the laptop
cd C:\Users\twist\AppData\Local\Kokoro
.venv\Scripts\python.exe tts.py "Welcome to BUS four ninety one." af_heart
# writes out.wav in the same dir
Usage: python tts.py <text> [voice]. The voice argument is optional and defaults to af_heart.
2. Narrate a slide deck (Python, from any project)
from openai import OpenAI

# The LAN endpoint needs no auth, but the SDK requires a non-empty api_key.
client = OpenAI(base_url="http://vigil-server:8007/v1", api_key="kokoro")

slides = [
    "Slide one: introduction to managerial accounting.",
    "Slide two: cost behavior — fixed versus variable.",
    # ...
]

# Write one MP3 per slide: slide-01.mp3, slide-02.mp3, ...
for i, text in enumerate(slides, 1):
    audio = client.audio.speech.create(
        model="kokoro",
        voice="af_heart",
        input=text,
        response_format="mp3",
    )
    with open(f"slide-{i:02d}.mp3", "wb") as f:
        f.write(audio.read())
The endpoint speaks OpenAI's /v1/audio/speech protocol verbatim — any OpenAI TTS code works after swapping base_url.
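Because every request has the same shape, A/B testing voices is a short loop. The sketch below renders one line once per voice against the LAN endpoint using only the standard library. The `synthesize` and `ab_test` helpers and the `sample-<voice>.wav` file naming are illustrative assumptions; the JSON body matches the curl recipe on this page.

```python
import json
import os
import urllib.request

LAN_URL = "http://vigil-server:8007/v1/audio/speech"


def synthesize(text, voice, response_format="wav", url=LAN_URL):
    """POST one request to the speech endpoint and return the audio bytes."""
    body = json.dumps({
        "model": "kokoro",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def ab_test(text, voices, fetch=synthesize, out_dir="."):
    """Render the same line once per voice and return the paths written."""
    written = []
    for voice in voices:
        path = os.path.join(out_dir, f"sample-{voice}.wav")
        with open(path, "wb") as f:
            f.write(fetch(text, voice))
        written.append(path)
    return written


# Example (requires LAN access to vigil-server):
# ab_test("Welcome to BUS four ninety one.", ["af_heart", "af_nicole", "am_michael"])
```

Passing `fetch` as a parameter keeps the loop testable without a server and makes it easy to swap in the OpenAI SDK client instead.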
3. Generate a demo voiceover with ffmpeg
# 1. Get the audio
curl -X POST http://vigil-server:8007/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","input":"Your script here.","voice":"am_michael","response_format":"wav"}' \
--output narration.wav
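# 2. Mux onto the video: -map takes video from the recording and audio from the narration,
#    even if the recording already has its own audio track
ffmpeg -i screen-recording.mp4 -i narration.wav -map 0:v:0 -map 1:a:0 -c:v copy -c:a aac -shortest demo.mp4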
Gotchas
- CPU inference is ~1-2× realtime. Generating a 5-minute narration takes ~2.5-5 minutes. Fine for batch jobs; not interactive. (If/when sps-srv1 gets a Blackwell-capable image, GPU acceleration should bring it to ~10-20× realtime.)
- No voice cloning. If you want to narrate course content in your own voice, Kokoro can't do that — you'd need Coqui XTTS (deferred upgrade, see laptop-5080-utilization plan).
- Voice selection matters more than people expect. Test 3-4 voices on a sample sentence before committing to one for a long project.
- Accent fidelity: the bf_*/bm_* British voices are noticeably less natural than American ones. Stick with af_*/am_* unless British is required.
- CF Access for the public URL: team members hit https://tts.sandpointstudios.ltd/... in a browser → they get a Google sign-in page. Programmatic access from off-LAN scripts needs a CF Access service token (mint one in the CF dashboard or via API; store in BW).
- No streaming. The endpoint returns the full audio file. For long inputs, chunk them client-side and stitch with ffmpeg.
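For the last gotcha, client-side chunking can be sketched as follows. This is one possible approach, not a tested pipeline: it splits on sentence-ending punctuation, packs sentences into chunks up to a character budget, and emits a list file for ffmpeg's concat demuxer. The 500-character default and the chunk file names are arbitrary assumptions; a single sentence longer than the budget is kept whole rather than cut mid-sentence.

```python
import re


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars or less."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is an oversized first sentence we keep whole.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def concat_list(paths) -> str:
    """Body of a list file for ffmpeg's concat demuxer (simple file names assumed)."""
    return "".join(f"file '{p}'\n" for p in paths)


script = "First sentence. Second sentence! A third one? And a fourth."
for i, chunk in enumerate(chunk_text(script, max_chars=35), 1):
    print(f"chunk-{i:02d}.wav <- {chunk!r}")
```

After synthesizing each chunk to its own WAV, write `concat_list([...])` to a file such as `chunks.txt` and stitch with `ffmpeg -f concat -safe 0 -i chunks.txt -c copy narration.wav`. Stream copy works here because every chunk comes from the same endpoint with the same audio format.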