Whisper — speech to text

What it is. OpenAI's Whisper large-v3, served via faster-whisper-server on sps-srv1's RTX 5090. It transcribes audio/video to text over an OpenAI-compatible HTTP API, and runs much faster than real time: a 60-minute lecture transcribes in ~3-5 minutes on Blackwell. Free, private (audio stays on the home network), no cloud upload.

Where it lives

Location	Endpoint	Auth
sps-srv1 LAN	`http://vigil-server:8008/v1/audio/transcriptions`	None (LAN-only)
sps-srv1 Tailscale	`http://100.98.48.21:8008/v1/audio/transcriptions`	Tailscale ACL

The container runs the faster-whisper-server:latest-cuda image and is pinned to GPU 0 (the 5090). The large-v3 model is cached at /root/.cache inside the container, which is backed by the named volume whisper-models.
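Since the same service is reachable under two addresses, a script that runs both at home and off-site may want to pick whichever one answers. The following is a minimal sketch, assuming a plain TCP connect is a good enough reachability check; the function name first_reachable and the timeout value are illustrative, not part of the server.

```python
import socket

# Endpoints from the table above: LAN hostname first, Tailscale IP as fallback.
CANDIDATES = [("vigil-server", 8008), ("100.98.48.21", 8008)]


def first_reachable(candidates=CANDIDATES, timeout=1.5):
    """Return an OpenAI-style base URL for the first host that accepts a TCP connection."""
    for host, port in candidates:
        try:
            # A successful connect only proves the port is open,
            # not that the model is loaded or the API is healthy.
            with socket.create_connection((host, port), timeout=timeout):
                return f"http://{host}:{port}/v1"
        except OSError:
            continue  # DNS failure, refused, or timed out: try the next candidate
    return None
```

The returned string can be passed directly as base_url to the OpenAI client used in the recipes; a None result means neither address answered from the current machine.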

When to use Whisper vs. OpenAI Whisper API / others

Use sps-srv1 Whisper when:
- Course content with student names / FERPA-adjacent (don't upload to OpenAI)
- Anything you don't want sitting on a 3rd party's drive
- Bulk transcription (no per-minute API cost)
- Have time to wait (a 90-min lecture = ~5 min on the 5090)

Use OpenAI Whisper API when:
- Off-network and no Tailscale handy
- Diarization is the load-bearing piece (their API has it; local needs WhisperX upgrade)

3 recipes

1. Transcribe a single lecture file (curl, any machine on the LAN/Tailscale)

curl -X POST http://vigil-server:8008/v1/audio/transcriptions \
  -H "Authorization: Bearer none" \
  -F "file=@BUS-491-lesson-12-Professional-Toolkit.mp4" \
  -F "model=large-v3" \
  -F "language=en" \
  -F "response_format=text"

The endpoint speaks OpenAI's /v1/audio/transcriptions protocol verbatim — any OpenAI SDK code works after swapping base_url.

2. Python — bulk transcribe a folder, write SRT captions

import pathlib
from openai import OpenAI

client = OpenAI(base_url="http://vigil-server:8008/v1", api_key="none")

inbox = pathlib.Path("./inbox")
outbox = pathlib.Path("./transcripts")
outbox.mkdir(exist_ok=True)

for video in sorted(inbox.glob("*.mp4")):
    srt_path = outbox / video.with_suffix(".srt").name
    if srt_path.exists():
        continue  # already transcribed on an earlier run
    with video.open("rb") as f:
        srt = client.audio.transcriptions.create(
            file=(video.name, f, "video/mp4"),
            model="large-v3",
            language="en",
            response_format="srt",
        )
    # Non-JSON formats normally come back as a plain str; fall back to .text otherwise.
    srt_path.write_text(srt if isinstance(srt, str) else srt.text, encoding="utf-8")
    print(f"  → {srt_path.name}")

Then ffmpeg -i demo.mp4 -vf subtitles=demo.srt demo-captioned.mp4 to burn captions in, or upload the .srt separately.
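The burn-in command can be scripted for a whole folder in the same style as the transcription loop. This is a sketch only: the ./captioned output folder is an assumption, and it assumes file names without characters that are special in ffmpeg's filter syntax (colons, quotes, commas, brackets), since those would need escaping inside the subtitles= argument.

```python
import pathlib
import subprocess


def burn_in_command(video, srt, out_dir):
    """Build the ffmpeg argv that hard-codes `srt` into a copy of `video`."""
    video, srt, out_dir = map(pathlib.Path, (video, srt, out_dir))
    target = out_dir / f"{video.stem}-captioned{video.suffix}"
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"subtitles={srt.as_posix()}",  # forward slashes keep the filter string simple
        str(target),
    ]


def burn_in_all(inbox="./inbox", transcripts="./transcripts", out_dir="./captioned"):
    """Caption every .mp4 in inbox that already has a matching .srt."""
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    for video in sorted(pathlib.Path(inbox).glob("*.mp4")):
        srt = pathlib.Path(transcripts) / f"{video.stem}.srt"
        if not srt.exists():
            continue  # not transcribed yet
        cmd = burn_in_command(video, srt, out_dir)
        if pathlib.Path(cmd[-1]).exists():
            continue  # captioned on an earlier run
        subprocess.run(cmd, check=True)
```

Calling burn_in_all() after the transcription loop captions everything that has an .srt; burning in re-encodes the video, so it is much slower than the transcription itself.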

3. JSON segments with timestamps (when you want segment-level timing)

from openai import OpenAI
client = OpenAI(base_url="http://vigil-server:8008/v1", api_key="none")

with open("lecture.mp4", "rb") as f:
    result = client.audio.transcriptions.create(
        file=f,
        model="large-v3",
        language="en",
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

for seg in result.segments:
    print(f"[{seg.start:7.2f} → {seg.end:7.2f}] {seg.text}")
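One common use of the segment list is finding where a topic comes up in a long recording. The sketch below works on any objects with start and text attributes, which is the shape of the segments printed above; the helper names hms and find_mentions and the sample data are made up for illustration.

```python
from types import SimpleNamespace


def hms(seconds):
    """Format seconds as H:MM:SS, the form most video players accept for seeking."""
    total = int(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


def find_mentions(segments, term):
    """Return (timestamp, text) for each segment containing term, case-insensitively."""
    needle = term.lower()
    return [(hms(s.start), s.text.strip()) for s in segments if needle in s.text.lower()]


# Stand-in data shaped like result.segments (invented for this example).
demo = [
    SimpleNamespace(start=0.0, end=4.2, text=" Welcome back to lesson twelve."),
    SimpleNamespace(start=3725.5, end=3731.0, text=" Update your LinkedIn profile first."),
]
print(find_mentions(demo, "linkedin"))
# → [('1:02:05', 'Update your LinkedIn profile first.')]
```

With a real response, find_mentions(result.segments, "some term") gives the same kind of list.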

Gotchas

First request is slow. The model loads into VRAM on first call. Subsequent calls in the same session are fast.
large-v3 is the right pick. Smaller models (medium, small, tiny) are tempting but the quality drop is noticeable for course content with terminology.
Speaker diarization is NOT included. For "who said what" you need WhisperX (which wraps faster-whisper + pyannote). Open a ticket if you regularly transcribe meetings with 3+ speakers.
Audio preprocessing matters. Quiet ambient hum, music under speech, multiple overlapping speakers → degraded transcript. Strip background noise with ffmpeg -af afftdn before transcribing if recording quality is low.
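VAD (voice activity detection) filter is available via vad_filter=true. It is useful for long files with silence: silent stretches are skipped, so the job finishes faster. Pass it as a form field if calling curl directly (-F "vad_filter=true"); from the OpenAI SDK, pass it as an extra request field, e.g. extra_body={"vad_filter": True}.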
Container restarts drop the in-VRAM model. First call after docker restart faster-whisper-server will be slow again. The model cache on disk survives restarts so re-load is from disk (fast), not re-download.
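The preprocessing and VAD gotchas can be combined into one helper for low-quality recordings. This is a sketch under assumptions: it reads "extra params" as the SDK's extra_body argument, the helper names and the -clean.wav suffix are invented, and the 16 kHz mono downmix is an optional choice to keep the intermediate file small (Whisper works on 16 kHz mono audio internally).

```python
import pathlib
import subprocess


def denoise_command(src, dst):
    """ffmpeg argv: drop the video stream, denoise with afftdn, write 16 kHz mono audio."""
    return [
        "ffmpeg", "-y",          # -y: overwrite a stale cleaned file from an earlier run
        "-i", str(src),
        "-vn",                   # audio only
        "-af", "afftdn",
        "-ac", "1", "-ar", "16000",
        str(dst),
    ]


def transcription_kwargs(vad=True):
    """Keyword arguments for client.audio.transcriptions.create()."""
    kwargs = {"model": "large-v3", "language": "en", "response_format": "text"}
    if vad:
        # vad_filter is a server-side extension, not part of OpenAI's own API,
        # so it travels in extra_body next to the standard form fields.
        kwargs["extra_body"] = {"vad_filter": True}
    return kwargs


def transcribe_clean(src, vad=True):
    """Denoise src into a sibling WAV file, then transcribe that file."""
    from openai import OpenAI  # lazy import: the helpers above work without the SDK

    src = pathlib.Path(src)
    cleaned = src.with_name(f"{src.stem}-clean.wav")
    subprocess.run(denoise_command(src, cleaned), check=True)
    client = OpenAI(base_url="http://vigil-server:8008/v1", api_key="none")
    with cleaned.open("rb") as f:
        return client.audio.transcriptions.create(file=f, **transcription_kwargs(vad))
```

For clean studio recordings the denoise step is unnecessary and only adds time; transcription_kwargs() can be reused on its own with the original file.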