Exploring Speech-to-Text

STT Pipeline Visualized

How does speech-to-text actually work under the hood?

Speech-to-text is five stages of signal processing, neural inference, and post-processing. All of it runs in under 500 milliseconds. Here's what each stage actually does:


Stage 01

Audio Capture

From air to numbers

Sound starts as air pressure vibrations from your vocal cords. Your device's microphone converts those vibrations into electrical voltage. Sampling measures that voltage thousands of times per second (the sample rate), turning analog sound into a list of numbers: your voice, mapped to a digital signal.

One second of speech at 16kHz mono is 64KB: just 16,000 32-bit floating point numbers (4 bytes each) between −1.0 and 1.0. That's the raw material the entire pipeline works with.

The microphone on a Mac records at 48kHz, far more information than the speech recognition model needs. The model requires 16kHz, so we first resample via linear interpolation: blending the two nearest original values at evenly-spaced positions. The captured audio typically arrives in chunks of 4,096 samples every ~85ms through an installTap() callback. Think of it as a tap on the audio pipe, delivering data as it flows.
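
A minimal sketch of that resampling step in Swift, assuming Float samples already pulled out of the tap's buffer (a production app might hand this job to AVAudioConverter instead):

```swift
import Foundation

/// Linear-interpolation resampler: for each evenly-spaced output position,
/// blend the two nearest input samples.
func resample(_ input: [Float],
              from inputRate: Double = 48_000,
              to outputRate: Double = 16_000) -> [Float] {
    let ratio = inputRate / outputRate                 // 3.0 for 48kHz → 16kHz
    let outputCount = Int(Double(input.count) / ratio)
    var output = [Float](repeating: 0, count: outputCount)
    for i in 0..<outputCount {
        let position = Double(i) * ratio               // where this output sample falls in the input
        let left = Int(position)                       // nearest original value before
        let right = min(left + 1, input.count - 1)     // nearest original value after
        let t = Float(position - Double(left))         // blend factor between the two
        output[i] = input[left] * (1 - t) + input[right] * t
    }
    return output
}
```

Each 4,096-sample chunk from the tap comes out the other side as roughly 1,365 samples of 16kHz audio.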

16,000 Hz · 4,096-sample chunks · ~85ms per delivery

Stage 02

Voice Detection

Filtering the silence

When are you actually talking? When should the microphone start and stop recording? That's the first engineering challenge. With local speech recognition models, voice activity detection (VAD) classifies each 30ms audio frame as speech or silence. Without it, the inference engine wastes cycles processing keyboard clicks, fan noise, and dead air.

A small neural network then outputs a probability score per frame. Above the threshold, it's speech. Below, it's silence. Only speech frames pass through to inference.
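
A minimal sketch of that gating loop, with the per-frame scorer left as a closure since the actual network (often a model like Silero VAD) varies by app:

```swift
/// Split audio into 30ms frames, score each one, and keep only speech.
/// `scorer` stands in for the VAD network: frame in, probability out.
func speechFrames(in samples: [Float],
                  scorer: ([Float]) -> Float,
                  threshold: Float = 0.5) -> [[Float]] {
    let frameSize = 480                        // 30ms at 16kHz
    var kept: [[Float]] = []
    var start = 0
    while start + frameSize <= samples.count {
        let frame = Array(samples[start..<(start + frameSize)])
        if scorer(frame) > threshold {         // above the threshold: speech
            kept.append(frame)                 // below: silence, dropped
        }
        start += frameSize
    }
    return kept
}
```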
30ms frames · 0.0–1.0 speech probability · 0.5 threshold

Stage 03

Inference

A picture of sound, then words

First, the model converts raw audio into a mel spectrogram. It's a heatmap where X is time, Y is frequency scaled to human hearing, and brightness is loudness. Like freezing a music equalizer display every few milliseconds and stacking the snapshots.

Your list of numbers just became a picture. Speech recognition is a computer vision problem!
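
The "scaled to human hearing" part is the mel scale, a standard formula that compresses high frequencies the way our ears do. A quick sketch of the conversion (the 80-bin layout matches Whisper; other models differ):

```swift
import Foundation

/// Standard (HTK) mel-scale formula: equal mel steps sound equally spaced to humans.
func hzToMel(_ hz: Double) -> Double {
    2595.0 * log10(1.0 + hz / 700.0)
}

/// Inverse: map a mel value back to Hz.
func melToHz(_ mel: Double) -> Double {
    700.0 * (pow(10.0, mel / 2595.0) - 1.0)
}

// 80 bins spaced evenly in mel between 0 Hz and the 8kHz Nyquist limit of 16kHz audio.
let melMax = hzToMel(8_000)
let binEdgesHz = (0...80).map { melToHz(Double($0) / 80.0 * melMax) }
// The lowest bins land ~20 Hz apart; the top bins span hundreds of Hz,
// matching how coarse our hearing gets at high frequencies.
```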

An encoder (neural network) studies that picture and produces a compressed summary. Not text yet. Think of a court stenographer pressing shorthand keys, capturing phonetic structure rather than English words: "breathy onset, strong mid-vowel, liquid consonant". The summary captures what the sound means without committing to specific words.

Then a decoder (small language model) generates text one token at a time. It uses attention to check which part of the audio corresponds to the next word. Like a translator who listens to the full speech, then translates sentence by sentence.
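
Put together, inference looks roughly like this. The encoder and decoder are stand-ins passed as closures, since the real internals live in the inference engine; the 448-token cap is Whisper's per-30s-chunk decoder limit:

```swift
/// One hypothetical token of decoder output.
struct Token {
    let id: Int
    let isEnd: Bool    // end-of-transcript marker
}

/// Greedy decoding: encode the spectrogram once, then emit tokens one at a
/// time, letting the decoder attend back to the audio summary at each step.
func transcribe(mel: [[Float]],
                encode: ([[Float]]) -> [Float],
                nextToken: ([Float], [Token]) -> Token) -> [Token] {
    let audioSummary = encode(mel)         // encoder: one pass over the whole picture
    var tokens: [Token] = []
    while tokens.count < 448 {             // Whisper-style cap per 30s chunk
        let token = nextToken(audioSummary, tokens)
        if token.isEnd { break }
        tokens.append(token)
    }
    return tokens
}
```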

6.65% WER · 30s chunk limit · spectrogram → tokens · encoder-decoder

Stage 04

Text Cleanup

Cleaning up raw speech

Even a perfect transcription captures every filler word with perfect accuracy. "Um so I was like you know thinking about it" transcribed exactly as spoken reads terribly.

100% transcription accuracy doesn't mean 100% useful output. The fillers are noise, not signal.

Post-processing fixes this. Filler words get removed ("um", "uh", "like"). Punctuation gets predicted. Capitalization gets fixed. Inverse text normalization turns "five dollars" into "$5.00" and "twenty second of may" into "May 22nd." Simple implementations use about 15 regex patterns at sub-millisecond speed. More sophisticated systems use a language model to rewrite the full output.
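
A sketch of the simple approach, with a handful of illustrative patterns standing in for the full ~15 (the naive "like" rule also shows why real systems need context checks):

```swift
import Foundation

/// Regex-based cleanup: strip fillers, tidy whitespace, fix capitalization,
/// and close with a period. Patterns here are illustrative, not exhaustive.
func cleanTranscript(_ raw: String) -> String {
    var text = raw
    let fillerPatterns = [
        "^[Ss]o,?\\s+",          // leading "So"
        "\\b[Uu]m+\\b,?\\s*",    // "um", "umm"
        "\\b[Uu]h+\\b,?\\s*",    // "uh", "uhh"
        "\\b[Ll]ike\\b,?\\s*",   // naive: would also hit "I like pizza"
        "\\b[Yy]ou know\\b,?\\s*"
    ]
    for pattern in fillerPatterns {
        text = text.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
    }
    text = text.replacingOccurrences(of: "\\s{2,}", with: " ", options: .regularExpression)
               .trimmingCharacters(in: .whitespaces)
    text = text.prefix(1).uppercased() + String(text.dropFirst())
    if !text.hasSuffix(".") { text += "." }
    return text
}
```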

~15 filler patterns · <1ms regex
Raw: So um I was thinking uh that we should like probably um schedule a meeting for like next Tuesday
Cleaned: I was thinking that we should probably schedule a meeting for next Tuesday.

Stage 05

Text Insertion

The invisible UX

How does the transcribed text get from the STT app into your email, Slack message, or code editor? It's simple: it's literally a "Paste" action from your clipboard! On macOS, every keypress creates a Core Graphics Event (CGEvent): a digital record saying "this key was pressed." You can create fake CGEvents in code, and they're indistinguishable from real keypresses. Puppet strings attached to the keyboard.

Save the clipboard. Write the transcription to it. Simulate ⌘V. Restore the clipboard. It's a hack. It requires Accessibility permissions. And it works everywhere.

Real ⌘V is four physical actions: Command DOWN, V DOWN, V UP, Command UP. Skip any one and it breaks. This clipboard hijack is also why every dictation app (Wispr Flow, VoiceInk, SuperWhisper) distributes outside the Mac App Store. Sandboxing forbids CGEvent posting, Apple Events, and global hotkeys.
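
A minimal sketch of the full sequence in Swift. The CGEvent and NSPasteboard calls are real macOS APIs; the 50ms restore delay is an assumption, and real apps tune it:

```swift
import AppKit
import CoreGraphics

/// Save clipboard → write transcript → simulate ⌘V → restore clipboard.
/// Requires Accessibility permission, or the posted events go nowhere.
func insertText(_ transcript: String) {
    let pasteboard = NSPasteboard.general
    let saved = pasteboard.string(forType: .string)    // 1. save what the user had

    pasteboard.clearContents()                         // 2. write the transcription
    pasteboard.setString(transcript, forType: .string)

    // 3. Simulate ⌘V: all four physical actions, in order.
    let source = CGEventSource(stateID: .hidSystemState)
    let cmd: CGKeyCode = 0x37                          // kVK_Command
    let v: CGKeyCode = 0x09                            // kVK_ANSI_V
    let cmdDown = CGEvent(keyboardEventSource: source, virtualKey: cmd, keyDown: true)
    let vDown = CGEvent(keyboardEventSource: source, virtualKey: v, keyDown: true)
    vDown?.flags = .maskCommand
    let vUp = CGEvent(keyboardEventSource: source, virtualKey: v, keyDown: false)
    vUp?.flags = .maskCommand
    let cmdUp = CGEvent(keyboardEventSource: source, virtualKey: cmd, keyDown: false)
    [cmdDown, vDown, vUp, cmdUp].forEach { $0?.post(tap: .cghidEventTap) }

    // 4. Restore the original clipboard once the paste has landed.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.05) {
        pasteboard.clearContents()
        if let saved { pasteboard.setString(saved, forType: .string) }
    }
}
```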

4 CGEvents · <50ms latency · Accessibility required
1. Save clipboard
2. Write text
3. Simulate ⌘V
4. Restore