Exploring Speech-to-Text

STT Pipeline Visualized

How does speech-to-text actually work under the hood?

Speech-to-text is five stages of signal processing, neural inference, and post-processing. All of it runs in under 500 milliseconds. Here's what each stage actually does:


Stage 01

Audio Capture

From air to numbers

Sound starts as air pressure vibrations from your vocal cords. Your device's microphone converts those vibrations into electrical voltage. Sampling measures that voltage thousands of times per second (the sample rate), turning analog sound into a list of numbers: your voice, mapped to a digital signal.

One second of speech at 16kHz mono is 64KB: just 16,000 32-bit floating point numbers (4 bytes each) between −1.0 and 1.0. That's the raw material the entire pipeline works with.

The microphone on a Mac records at 48kHz, far more information than the speech recognition model needs. The model requires 16kHz, so we first resample via linear interpolation: blending the two nearest original values at evenly-spaced positions. The captured audio typically arrives in chunks of 4,096 samples every ~85ms through an installTap() callback. Think of it as a tap on the audio pipe, delivering data as it flows.
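
A minimal sketch of that resampling step in Swift, assuming Float samples already pulled out of the tap's buffer (a production app might hand this job to AVAudioConverter instead):

```swift
import Foundation

/// Linear-interpolation resampler: for each evenly-spaced output position,
/// blend the two nearest input samples.
func resample(_ input: [Float],
              from inputRate: Double = 48_000,
              to outputRate: Double = 16_000) -> [Float] {
    let ratio = inputRate / outputRate                 // 3.0 for 48kHz → 16kHz
    let outputCount = Int(Double(input.count) / ratio)
    var output = [Float](repeating: 0, count: outputCount)
    for i in 0..<outputCount {
        let position = Double(i) * ratio               // where this output sample falls in the input
        let left = Int(position)                       // nearest original value before
        let right = min(left + 1, input.count - 1)     // nearest original value after
        let t = Float(position - Double(left))         // blend factor between the two
        output[i] = input[left] * (1 - t) + input[right] * t
    }
    return output
}
```

Each 4,096-sample chunk from the tap comes out the other side as roughly 1,365 samples of 16kHz audio.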

16,000 Hz · 4,096-sample chunks · ~85ms per delivery

Stage 02

Voice Detection

Filtering the silence

When are you actually talking? When should the microphone start and stop recording? That's the first engineering challenge. With local speech recognition models, voice activity detection (VAD) classifies each 30ms audio frame as speech or silence. Without it, the inference engine wastes cycles processing keyboard clicks, fan noise, and dead air.

A small neural network then outputs a probability score per frame. Above the threshold, it's speech. Below, it's silence. Only speech frames pass through to inference.
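
A minimal sketch of that gating loop, with the per-frame scorer left as a closure since the actual network (often a model like Silero VAD) varies by app:

```swift
/// Split audio into 30ms frames, score each one, and keep only speech.
/// `scorer` stands in for the VAD network: frame in, probability out.
func speechFrames(in samples: [Float],
                  scorer: ([Float]) -> Float,
                  threshold: Float = 0.5) -> [[Float]] {
    let frameSize = 480                        // 30ms at 16kHz
    var kept: [[Float]] = []
    var start = 0
    while start + frameSize <= samples.count {
        let frame = Array(samples[start..<(start + frameSize)])
        if scorer(frame) > threshold {         // above the threshold: speech
            kept.append(frame)                 // below: silence, dropped
        }
        start += frameSize
    }
    return kept
}
```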
30ms frames · 0.0–1.0 speech probability · 0.5 threshold

Stage 03

Inference

A picture of sound, then words

First, the model converts raw audio into a mel spectrogram. It's a heatmap where X is time, Y is frequency scaled to human hearing, and brightness is loudness. Like freezing a music equalizer display every few milliseconds and stacking the snapshots.

Your list of numbers just became a picture. Speech recognition is a computer vision problem!
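
The "scaled to human hearing" part is the mel scale, a standard formula that compresses high frequencies the way our ears do. A quick sketch of the conversion (the 80-bin layout matches Whisper; other models differ):

```swift
import Foundation

/// Standard (HTK) mel-scale formula: equal mel steps sound equally spaced to humans.
func hzToMel(_ hz: Double) -> Double {
    2595.0 * log10(1.0 + hz / 700.0)
}

/// Inverse: map a mel value back to Hz.
func melToHz(_ mel: Double) -> Double {
    700.0 * (pow(10.0, mel / 2595.0) - 1.0)
}

// 80 bins spaced evenly in mel between 0 Hz and the 8kHz Nyquist limit of 16kHz audio.
let melMax = hzToMel(8_000)
let binEdgesHz = (0...80).map { melToHz(Double($0) / 80.0 * melMax) }
// The lowest bins land ~20 Hz apart; the top bins span hundreds of Hz,
// matching how coarse our hearing gets at high frequencies.
```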

An encoder (neural network) studies that picture and produces a compressed summary. Not text yet. Think of a court stenographer pressing shorthand keys, capturing phonetic structure rather than English words: "breathy onset, strong mid-vowel, liquid consonant". The summary captures what the sound means without committing to specific words.

Then a decoder (small language model) generates text one token at a time. It uses attention to check which part of the audio corresponds to the next word. Like a translator who listens to the full speech, then translates sentence by sentence.
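
Put together, inference looks roughly like this. The encoder and decoder are stand-ins passed as closures, since the real internals live in the inference engine; the 448-token cap is Whisper's per-30s-chunk decoder limit:

```swift
/// One hypothetical token of decoder output.
struct Token {
    let id: Int
    let isEnd: Bool    // end-of-transcript marker
}

/// Greedy decoding: encode the spectrogram once, then emit tokens one at a
/// time, letting the decoder attend back to the audio summary at each step.
func transcribe(mel: [[Float]],
                encode: ([[Float]]) -> [Float],
                nextToken: ([Float], [Token]) -> Token) -> [Token] {
    let audioSummary = encode(mel)         // encoder: one pass over the whole picture
    var tokens: [Token] = []
    while tokens.count < 448 {             // Whisper-style cap per 30s chunk
        let token = nextToken(audioSummary, tokens)
        if token.isEnd { break }
        tokens.append(token)
    }
    return tokens
}
```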

6.65% WER · 30s chunk limit · spectrogram → tokens · encoder-decoder

Stage 04

Text Cleanup

Cleaning up raw speech

Even a perfect transcription captures every filler word with perfect accuracy. "Um so I was like you know thinking about it" transcribed exactly as spoken reads terribly.

100% transcription accuracy doesn't mean 100% useful output. The fillers are noise, not signal.

Post-processing fixes this. Filler words get removed ("um", "uh", "like"). Punctuation gets predicted. Capitalization gets fixed. Inverse text normalization turns "five dollars" into "$5.00" and "twenty second of may" into "May 22nd." Simple implementations use about 15 regex patterns at sub-millisecond speed. More sophisticated systems use a language model to rewrite the full output.
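
A sketch of the simple approach, with a handful of illustrative patterns standing in for the full ~15 (the naive "like" rule also shows why real systems need context checks):

```swift
import Foundation

/// Regex-based cleanup: strip fillers, tidy whitespace, fix capitalization,
/// and close with a period. Patterns here are illustrative, not exhaustive.
func cleanTranscript(_ raw: String) -> String {
    var text = raw
    let fillerPatterns = [
        "^[Ss]o,?\\s+",          // leading "So"
        "\\b[Uu]m+\\b,?\\s*",    // "um", "umm"
        "\\b[Uu]h+\\b,?\\s*",    // "uh", "uhh"
        "\\b[Ll]ike\\b,?\\s*",   // naive: would also hit "I like pizza"
        "\\b[Yy]ou know\\b,?\\s*"
    ]
    for pattern in fillerPatterns {
        text = text.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
    }
    text = text.replacingOccurrences(of: "\\s{2,}", with: " ", options: .regularExpression)
               .trimmingCharacters(in: .whitespaces)
    text = text.prefix(1).uppercased() + String(text.dropFirst())
    if !text.hasSuffix(".") { text += "." }
    return text
}
```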

~15 filler patterns · <1ms regex
Raw: So um I was thinking uh that we should like probably um schedule a meeting for like next Tuesday
Cleaned: I was thinking that we should probably schedule a meeting for next Tuesday.

Stage 05

Text Insertion

The invisible UX

How does the transcribed text get from the STT app into your email, Slack message, or code editor? It's simple: it's literally a "Paste" action from your clipboard! On macOS, every keypress creates a Core Graphics Event (CGEvent): a digital record saying "this key was pressed." You can create fake CGEvents in code, and they're indistinguishable from real keypresses. Puppet strings attached to the keyboard.

Save the clipboard. Write the transcription to it. Simulate ⌘V. Restore the clipboard. It's a hack. It requires Accessibility permissions. And it works everywhere.

Real ⌘V is four physical actions: Command DOWN, V DOWN, V UP, Command UP. Skip any one and it breaks. This clipboard hijack is also why every dictation app (Wispr Flow, VoiceInk, SuperWhisper) distributes outside the Mac App Store. Sandboxing forbids CGEvent posting, Apple Events, and global hotkeys.
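
A minimal sketch of the full sequence in Swift. The CGEvent and NSPasteboard calls are real macOS APIs; the 50ms restore delay is an assumption, and real apps tune it:

```swift
import AppKit
import CoreGraphics

/// Save clipboard → write transcript → simulate ⌘V → restore clipboard.
/// Requires Accessibility permission, or the posted events go nowhere.
func insertText(_ transcript: String) {
    let pasteboard = NSPasteboard.general
    let saved = pasteboard.string(forType: .string)    // 1. save what the user had

    pasteboard.clearContents()                         // 2. write the transcription
    pasteboard.setString(transcript, forType: .string)

    // 3. Simulate ⌘V: all four physical actions, in order.
    let source = CGEventSource(stateID: .hidSystemState)
    let cmd: CGKeyCode = 0x37                          // kVK_Command
    let v: CGKeyCode = 0x09                            // kVK_ANSI_V
    let cmdDown = CGEvent(keyboardEventSource: source, virtualKey: cmd, keyDown: true)
    let vDown = CGEvent(keyboardEventSource: source, virtualKey: v, keyDown: true)
    vDown?.flags = .maskCommand
    let vUp = CGEvent(keyboardEventSource: source, virtualKey: v, keyDown: false)
    vUp?.flags = .maskCommand
    let cmdUp = CGEvent(keyboardEventSource: source, virtualKey: cmd, keyDown: false)
    [cmdDown, vDown, vUp, cmdUp].forEach { $0?.post(tap: .cghidEventTap) }

    // 4. Restore the original clipboard once the paste has landed.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.05) {
        pasteboard.clearContents()
        if let saved { pasteboard.setString(saved, forType: .string) }
    }
}
```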

4 CGEvents · <50ms latency · Accessibility required
1. Save clipboard
2. Write text
3. Simulate ⌘V
4. Restore