Speech-to-text streaming (streaming STT) is the real-time conversion of spoken audio into written text, with partial transcriptions emitted continuously as the speaker talks. Unlike batch STT, which processes complete audio files, streaming STT delivers results incrementally, with latency under 1 second.
Audio is sent to an STT engine in small chunks (100-500 ms). The engine processes each chunk and emits partial transcriptions, refining them as more audio arrives. Final transcriptions are confirmed when the speaker pauses.
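The chunking and partial/final flow described here can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: the names `chunk_pcm`, `TranscriptEvent`, and `caption_states` are hypothetical, the audio is assumed to be raw 16 kHz, 16-bit mono PCM, and the transcript events are hand-written stand-ins for what an engine might emit.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              sample_width: int = 2, chunk_ms: int = 200) -> Iterator[bytes]:
    """Yield consecutive slices of raw mono PCM, each covering chunk_ms of audio."""
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for start in range(0, len(audio), bytes_per_chunk):
        yield audio[start:start + bytes_per_chunk]


@dataclass
class TranscriptEvent:
    """Hypothetical engine result: a partial guess or a confirmed final segment."""
    text: str
    is_final: bool


def caption_states(events: Iterable[TranscriptEvent]) -> Iterator[str]:
    """Yield the caption to display after each event.

    Partial results replace the previous partial; final results are
    committed and never revised again.
    """
    committed: List[str] = []
    for event in events:
        if event.is_final:
            committed.append(event.text)
            yield " ".join(committed)
        else:
            yield " ".join(committed + [event.text])


# One second of silence, split into 200 ms chunks (6400 bytes each).
one_second = bytes(16000 * 2)
chunks = list(chunk_pcm(one_second, chunk_ms=200))

# Hand-written events standing in for an engine's output.
events = [
    TranscriptEvent("hello", False),
    TranscriptEvent("hello wor", False),
    TranscriptEvent("hello world", True),   # speaker paused: segment confirmed
    TranscriptEvent("how", False),
    TranscriptEvent("how are you", True),
]
states = list(caption_states(events))

print(len(chunks))
for state in states:
    print(state)
```

In a real integration each chunk would be sent to the engine over a persistent connection as it is captured, and the events would arrive asynchronously. The display logic stays the same: overwrite the pending text on every partial, and commit it on a final.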
Top streaming STT engines in 2026, with their word error rates (WER): Deepgram Nova-2 (6.3% WER, fastest), Google Speech-to-Text (7.1% WER), Whisper v3 streaming (8.2% WER).
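For context on these figures, WER (word error rate) is conventionally the number of word substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name `word_error_rate` is illustrative, and published benchmarks usually also normalize case and punctuation, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(f"{wer:.1%}")
```

By this measure, a WER of 6.3% corresponds to roughly six word errors per 100 reference words.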
Common uses include live stream captions, real-time translation, voice assistants, meeting transcription, and accessibility tools. StreamTranslate uses streaming STT to keep caption latency under 2 seconds.