How I built an evaluation pipeline for speech-to-text models across 3 providers, 5 languages, and 28 audio files.