Whisper — speech to text
What it is. OpenAI's Whisper-large-v3 running locally via faster-whisper (CTranslate2 backend) on the laptop's RTX 5080. Transcribes audio/video to text. Real-time-or-better — a 60-minute lecture transcribes in ~5-10 minutes on Blackwell. Free, private, no cloud upload.
Set up in April from t041. This is the only tool in this suite that's been operational longest.
Where it lives
| Location | Setup | Use case |
|---|---|---|
| laptop | C:\Users\twist\<path to t041 setup — see project_laptop_5080_utilization.md / t041 conversation log> |
All transcription happens here |
Check the t041 setup log at ~/dev-context/conversation-logs/2026-04-23_t041-whisper-pipeline-setup.md for the exact venv path and CUDA DLL handling on Windows.
When to use Whisper vs. OpenAI Whisper API / others
Use local Whisper when: - Course content with student names / FERPA-adjacent (don't upload to OpenAI) - Anything you don't want sitting on a 3rd party's drive - Bulk transcription (no per-minute API cost) - Have time to wait (a 90-min lecture = ~10 min)
Use OpenAI Whisper API when: - One-off, low-sensitivity, want it back NOW - Diarization is the load-bearing piece (their API has it; local needs WhisperX upgrade)
3 recipes
1. Transcribe a single lecture file
From the laptop, in the t041 venv:
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
"BUS-491-lesson-12-Professional-Toolkit.mp4",
beam_size=5,
language="en",
)
print(f"Detected language: {info.language} ({info.language_probability:.2f})")
for seg in segments:
print(f"[{seg.start:7.2f} → {seg.end:7.2f}] {seg.text}")
Pipe to a text file:
with open("transcript.txt", "w") as f:
for seg in segments:
f.write(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}\n")
2. Generate SRT captions for a demo video
segments, _ = model.transcribe("demo.mp4", language="en")
with open("demo.srt", "w") as f:
for i, seg in enumerate(segments, 1):
start = format_timestamp(seg.start) # implement HH:MM:SS,mmm
end = format_timestamp(seg.end)
f.write(f"{i}\n{start} --> {end}\n{seg.text.strip()}\n\n")
def format_timestamp(s):
h, s = divmod(s, 3600)
m, s = divmod(s, 60)
return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace('.', ',')
Then ffmpeg -i demo.mp4 -vf subtitles=demo.srt demo-captioned.mp4 to burn them in, or upload the .srt separately.
3. Folder-watcher: drop in MP4s, get transcripts back
import pathlib, time
from faster_whisper import WhisperModel
INBOX = pathlib.Path("C:/Users/twist/Recordings/inbox")
OUTBOX = pathlib.Path("C:/Users/twist/Recordings/transcripts")
OUTBOX.mkdir(exist_ok=True)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
while True:
for f in INBOX.glob("*.mp4"):
out = OUTBOX / f.with_suffix(".txt").name
if out.exists():
continue # already done
print(f"Transcribing {f.name}...")
segments, _ = model.transcribe(str(f), language="en")
out.write_text("\n".join(s.text for s in segments))
print(f" → {out.name}")
time.sleep(30)
Run this in a terminal you keep open. Drop lecture recordings into inbox/, transcripts appear in transcripts/.
Gotchas
- CUDA DLLs on Windows.
faster-whisperwheels for Windows DON'T bundle CUDA; you neednvidia-cublas-cu12,nvidia-cudnn-cu12,nvidia-cuda-nvrtc-cu12pip-installed, plus the path-adding shim in your script (see the t041 setup log for the exact_subdirectories). Without this you'll get "libcublas.so not found" errors. large-v3is the right pick. Smaller models (medium, small, tiny) are tempting but the quality drop is noticeable for course content with terminology. Storage is 3 GB on disk — not a constraint.beam_size=5is the sweet spot. Defaultbeam_size=1(greedy) is faster but misses words. 5 is the published OpenAI recommendation; >10 has diminishing returns.- Speaker diarization is NOT in faster-whisper. For "who said what" you need WhisperX (which wraps faster-whisper + pyannote). Add that when you have meetings with 3+ speakers regularly.
- Audio preprocessing matters. Quiet ambient hum, music under speech, multiple overlapping speakers → degraded transcript. Strip background noise with
ffmpeg -af afftdnbefore transcribing if recording quality is low. - VAD (voice activity detection) filter exists:
model.transcribe(..., vad_filter=True)— useful for long files with silence (skips it, faster). Not always wanted (might trim short utterances).