Stage 01
Audio Capture
From air to numbers
Sound starts as air-pressure vibrations from your vocal cords. Your device's microphone converts those vibrations into an electrical voltage. Sampling measures that voltage thousands of times per second (the sample rate), turning analog sound into a list of numbers. Your voice is now a digital signal.
One second of speech at 16 kHz mono is 64 KB: just 16,000 floating-point numbers between −1.0 and 1.0, at 4 bytes each. That's the raw material the entire pipeline works with.
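The 64 KB figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each sample is stored as a 32-bit (4-byte) float. The text does not state the sample width, so that value is an assumption.

```python
SAMPLE_RATE_HZ = 16_000   # samples per second, from the text
BYTES_PER_SAMPLE = 4      # assumption: 32-bit float samples
CHANNELS = 1              # mono


def bytes_per_second(sample_rate_hz, bytes_per_sample, channels):
    """Raw (uncompressed) audio size for one second of capture."""
    return sample_rate_hz * bytes_per_sample * channels


one_second = bytes_per_second(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
one_minute_mb = one_second * 60 / 1_000_000

print(one_second)      # 64000
print(one_minute_mb)   # 3.84
```

At that rate, a full minute of dictation is still under 4 MB of raw samples.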
The microphone on a Mac typically records at 48 kHz, far more information than the speech recognition model needs. The model requires 16 kHz, so we first have to resample via linear interpolation: blending the two nearest original values at evenly spaced positions. The captured audio typically arrives in chunks of 4,096 samples every ~85 ms (4,096 ÷ 48,000) through an installTap() callback. Think of it as a tap on the audio pipe, delivering data as it flows.
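Linear-interpolation resampling can be sketched in plain Python as follows. The function name resample_linear and the ramp used as input are made up for illustration. The sketch also skips the low-pass filtering that production resamplers usually apply before downsampling to limit aliasing.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by blending the two nearest input samples at each output position."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step                            # fractional position in the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# A ramp standing in for one 4,096-sample buffer captured at 48 kHz.
chunk = [i / 4096 for i in range(4096)]
resampled = resample_linear(chunk, 48_000, 16_000)
print(len(resampled))  # 1365
```

Because 48 kHz to 16 kHz is an exact 3:1 ratio, every output position lands on an input sample, so the blend reduces to keeping every third value. Non-integer ratios (44.1 kHz to 16 kHz, for example) are where the interpolation does real work.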
Stage 02
Voice Detection
Filtering the silence
When are you actually talking? When should the microphone start and stop recording? That's the first engineering challenge. With local speech recognition models, voice activity detection (VAD) classifies 30 ms audio frames as speech or silence. Without it, the inference engine wastes cycles processing keyboard clicks, fan noise, and dead air.
A small neural network then outputs a probability score per frame. Above the threshold, it's speech. Below, it's silence. Only speech frames pass through to inference.
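The framing and thresholding can be sketched like this. The speech_probability function below is a hypothetical stand-in that scores frames by loudness, not the neural network the text describes. The 0.5 threshold and the 0.1 scaling constant are illustrative assumptions.

```python
import math

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 30
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 480 samples per 30 ms frame


def split_frames(samples, frame_len=FRAME_LEN):
    """Cut the signal into non-overlapping frames, dropping a trailing partial frame."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def speech_probability(frame):
    """Hypothetical stand-in for the neural network: squash RMS energy into 0..1."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return min(1.0, rms / 0.1)  # 0.1 is an arbitrary illustrative scale


def keep_speech(samples, threshold=0.5):
    """Return only the frames whose score reaches the threshold."""
    return [f for f in split_frames(samples) if speech_probability(f) >= threshold]


silence = [0.0] * FRAME_LEN
tone = [0.3 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE_HZ) for n in range(FRAME_LEN)]
kept = keep_speech(silence + tone + silence)
print(len(kept))  # 1
```

An energy score like this would also pass loud keyboard clicks, which is one reason a learned classifier is used instead.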
Stage 03
Inference
A picture of sound, then words
First, the model converts raw audio into a mel spectrogram. It's a heatmap where X is time, Y is frequency scaled to human hearing, and brightness is loudness. Like freezing a music equalizer display every few milliseconds and stacking the snapshots.
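"Frequency scaled to human hearing" refers to the mel scale. Below is a minimal sketch of one widely used mel formula, along with band centers spaced evenly in mel. The band count of 10 and the 0 to 8,000 Hz range (the Nyquist limit for 16 kHz audio) are illustrative assumptions, not values taken from any particular model.

```python
import math


def hz_to_mel(hz):
    """One common mel formula; other variants exist."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_bands, f_min=0.0, f_max=8000.0):
    """Band centers evenly spaced in mel, so they crowd together at low Hz."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


centers = mel_center_frequencies(10)
for hz in centers:
    print(round(hz))
```

The printed centers sit close together at low frequencies and spread out toward 8 kHz, mirroring how hearing resolves low pitches more finely than high ones.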
Your list of numbers just became a picture. Speech recognition now looks a lot like a computer vision problem!
An encoder (neural network) studies that picture and produces a compressed summary. Not text yet. Think of a court stenographer pressing shorthand keys, capturing phonetic structure rather than English words - "breathy onset, strong mid-vowel, liquid consonant". The summary captures what the sound means without committing to specific words.
Then a decoder (small language model) generates text one token at a time. It uses attention to check which part of the audio corresponds to the next word. Like a translator who listens to the full speech, then translates sentence by sentence.
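The attention step can be illustrated with toy numbers. In this sketch the two-dimensional frame vectors and the query are made up. Real models learn these vectors and run many attention heads in parallel. The scoring shown is scaled dot-product attention, the form used in Transformer-style models.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, encoder_frames):
    """Scaled dot-product score between one decoder query and every encoder frame."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale
              for frame in encoder_frames]
    return softmax(scores)


# Made-up summaries of three stretches of audio.
encoder_frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
# Made-up decoder state asking "which audio matches the next token?"
query = [0.0, 4.0]

weights = attention_weights(query, encoder_frames)
focus = max(range(len(weights)), key=weights.__getitem__)
print(focus)  # 1
```

The decoder would repeat this lookup for every token it generates, each time with a new query.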
Stage 04
Text Cleanup
Cleaning up raw speech
A perfect transcription captures every filler word too. "Um so I was like you know thinking about it" transcribed exactly as spoken reads terribly.
100% transcription accuracy doesn't mean 100% useful output. The fillers are noise, not signal.
Post-processing fixes this. Filler words get removed ("um", "uh", "like"). Punctuation gets predicted. Capitalization gets fixed. Inverse text normalization turns "five dollars" into "$5.00" and "twenty second of may" into "May 22nd." Simple implementations use about 15 regex patterns at sub-millisecond speed. More sophisticated systems use a language model to rewrite the full output.
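A regex-based cleanup pass might look like the sketch below. The patterns, the tiny number-word table, and the function name clean_transcript are all illustrative assumptions. "Like" is deliberately left alone here, because a plain regex cannot tell the filler from the verb.

```python
import re

# Fillers plus any trailing comma or whitespace.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b[,\s]*", re.IGNORECASE)

# Deliberately tiny table; a real normalizer covers far more.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10}
DOLLARS = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r") dollars?\b", re.IGNORECASE)


def _format_dollars(match):
    """Replace a spelled-out amount with its numeric form."""
    return f"${NUMBER_WORDS[match.group(1).lower()]}.00"


def clean_transcript(text):
    text = FILLERS.sub("", text)               # drop filler words
    text = re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace
    text = DOLLARS.sub(_format_dollars, text)  # inverse text normalization
    if text:
        text = text[0].upper() + text[1:]      # capitalize the first letter
        if text[-1] not in ".!?":
            text += "."                        # naive end-of-sentence punctuation
    return text


print(clean_transcript("Um so I was like you know thinking about it"))
# So I was like thinking about it.
print(clean_transcript("it cost um five dollars"))
# It cost $5.00.
```

The leftover "like" in the first result shows the limits of the pattern approach. Context-dependent fillers are where a language-model rewrite tends to do better.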
Stage 05
Text Insertion
The invisible UX
How does the transcribed text get from the STT app into your email, Slack message, or code editor? It's simple: it's literally a "Paste" action from your clipboard! On macOS, every keypress is delivered as a Core Graphics event (CGEvent): a digital record saying "this key was pressed." You can create synthetic CGEvents in code. To the receiving app, they look like real keypresses. Puppet strings attached to the keyboard.
Save the clipboard. Write the transcription to it. Simulate ⌘V. Restore the clipboard. It's a hack. It requires Accessibility permissions. And it works everywhere.
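The order of those four steps matters, and the restore should happen even if the paste fails. Here is a minimal sketch of that control flow, using a hypothetical in-memory FakeClipboard and an injected send_paste callback in place of the real macOS pasteboard and CGEvent calls.

```python
class FakeClipboard:
    """Hypothetical in-memory clipboard so the sketch runs anywhere."""

    def __init__(self, contents=""):
        self.contents = contents

    def read(self):
        return self.contents

    def write(self, text):
        self.contents = text


def insert_text(text, clipboard, send_paste):
    """Save, write, paste, restore. send_paste stands in for simulating Cmd+V."""
    saved = clipboard.read()        # 1. save whatever the user had copied
    clipboard.write(text)           # 2. put the transcription on the clipboard
    try:
        send_paste()                # 3. trigger the paste in the focused app
    finally:
        clipboard.write(saved)      # 4. restore, even if the paste raised


clipboard = FakeClipboard("user's copied link")
pasted = []
insert_text("Hello from dictation.", clipboard,
            lambda: pasted.append(clipboard.read()))

print(pasted)            # ['Hello from dictation.']
print(clipboard.read())  # user's copied link
```

In a real app the target application reads the clipboard on its own schedule, so an implementation would likely need a short wait between steps 3 and 4. This sketch omits that.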
A real ⌘V is four key events: Command DOWN, V DOWN, V UP, Command UP. Skip any one and it breaks. This clipboard hijack is also why dictation apps such as Wispr Flow, VoiceInk, and SuperWhisper distribute outside the Mac App Store: sandboxing restricts CGEvent posting, Apple Events, and global hotkeys.