Glossary
What is Speech-to-Text (STT)?
Speech-to-Text is the engine behind live stream captions and translation. StreamTranslate uses Deepgram's STT to convert your voice to text within a few hundred milliseconds.
Definition
Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), is the technology that converts spoken audio into written text. For live streams it runs in real time, and it is the first step in any live caption or translation pipeline.
How STT Works
- Your voice is captured as an audio stream
- An acoustic model identifies phonemes and words from the audio waveform
- A language model provides context to improve word prediction accuracy
- The result is a text transcript produced in near-real-time
- For streaming, these steps run continuously, so words appear as soon as they are recognized
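The streaming behavior described in the list above can be illustrated with a toy simulation. This is a minimal sketch, not StreamTranslate's or Deepgram's implementation: `fake_recognize`, `Transcript`, and `stream_transcripts` are hypothetical names, and the "audio chunks" are simply words encoded as bytes, with an empty chunk standing in for silence.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Transcript:
    text: str
    is_final: bool  # False for rolling interim output, True once the utterance ends


def fake_recognize(chunk: bytes) -> str:
    """Hypothetical stand-in for the acoustic + language model step.

    Each audio chunk simply carries its word as UTF-8 bytes; an empty
    chunk stands for silence.
    """
    return chunk.decode("utf-8")


def stream_transcripts(chunks: Iterable[bytes]) -> Iterator[Transcript]:
    """Yield a rolling interim transcript per chunk, and a final one on silence."""
    words: List[str] = []
    for chunk in chunks:
        word = fake_recognize(chunk)
        if word:
            words.append(word)
            yield Transcript(" ".join(words), is_final=False)
        elif words:
            # Silence ends the utterance: emit the final text and start over.
            yield Transcript(" ".join(words), is_final=True)
            words = []
    if words:
        # Flush whatever is left when the stream closes.
        yield Transcript(" ".join(words), is_final=True)


if __name__ == "__main__":
    audio = [b"good", b"game", b"everyone", b"", b"see", b"you"]
    for t in stream_transcripts(audio):
        print("FINAL  " if t.is_final else "interim", t.text)
```

A caption overlay would typically redraw on every interim result and hand only final results to the translation step, although that split is a design assumption here rather than something this page specifies.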
STT in StreamTranslate
StreamTranslate uses Deepgram's Nova-2 model for speech recognition, one of the fastest and most accurate STT engines available. Deepgram reports industry-leading (that is, very low) word error rates for English and many other languages.
- Deepgram Nova-2: sub-300ms transcription latency
- Accurate recognition of gaming terminology, accents, and fast speech
- Supports 50+ languages for multilingual streaming setups
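Word error rate (WER), the accuracy metric mentioned above, is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the STT output into a reference transcript, divided by the number of words in the reference. The sketch below shows that standard calculation with a word-level edit distance. The function name and the lowercase-and-split normalization are assumptions for illustration, and this is not how Deepgram measures its published figures.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: a reference word is missing
                curr[j - 1] + 1,     # insertion: an extra word in the output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missing word out of four reference words.
    print(word_error_rate("push the objective now", "push objective now"))
```

Lower is better: 0.0 means the transcript matches the reference word for word, and values above 1.0 are possible when the output contains many extra words.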
Pricing
- Stream Pass — $9.99: One full stream session, all languages
- Starter — $14.99/mo: 25 hours/month, single language
- Pro — $34.99/mo: 40 hours/month, dual language
- Unlimited — $79.99/mo: Unlimited hours, dual language