What is Automatic Speech Recognition (ASR)?
ASR is the AI technology that converts speech to text in real time. It powers live captions, translation, and accessibility features for streamers.
Definition
Automatic Speech Recognition (ASR), also called Speech-to-Text (STT), is technology that automatically converts audio input into written text using machine learning models.
Modern ASR systems can approach human-level accuracy on clear speech in common languages, and the fastest streaming systems return text with sub-second latency.
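Accuracy claims like the one above are usually expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that calculation, not any provider's scoring tool; the function name `word_error_rate` and the sample sentences are made up for this example, and real evaluations typically also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one dropped word and one wrong word
# out of five reference words.
print(word_error_rate("turn on the live captions", "turn on live caption"))
```

A lower WER means a more accurate transcript; 0.0 means the output matches the reference word for word.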
How ASR Works
- Audio preprocessing: Noise reduction and audio normalization
- Acoustic modeling: Maps audio signals to phonemes
- Language modeling: Predicts word sequences using context
- Decoding: Combines acoustic and language models to produce text output
- Streaming output: Real-time word delivery as speech continues
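The decoding and streaming steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Deepgram or any other provider implements decoding: the candidate transcripts and their probabilities are invented, the names `decode`, `stream_transcript`, and `lm_weight` are hypothetical, and production decoders search over far larger hypothesis spaces (for example with beam search) rather than choosing between two fixed options.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

# Each candidate transcript maps to (acoustic log-probability, language-model
# log-probability). All numbers are invented for illustration.
Scores = Dict[str, Tuple[float, float]]


def decode(candidates: Scores, lm_weight: float = 1.0) -> str:
    """Pick the candidate with the best combined acoustic + language score."""
    def combined(text: str) -> float:
        acoustic, language = candidates[text]
        return acoustic + lm_weight * language
    return max(candidates, key=combined)


def stream_transcript(chunks: Iterable[Scores], lm_weight: float = 1.0) -> Iterator[str]:
    """Yield the growing transcript as soon as each audio chunk is decoded."""
    parts = []
    for candidates in chunks:
        parts.append(decode(candidates, lm_weight))
        yield " ".join(parts)


# Two audio chunks, each with two similar-sounding candidates.
CHUNKS = [
    {
        "recognize speech": (math.log(0.40), math.log(0.01)),
        "wreck a nice beach": (math.log(0.45), math.log(0.0001)),
    },
    {
        "in real time": (math.log(0.60), math.log(0.02)),
        "in reel thyme": (math.log(0.60), math.log(0.00001)),
    },
]

for partial in stream_transcript(CHUNKS):
    print(partial)
```

In the first chunk the acoustic score alone slightly favors "wreck a nice beach"; adding the language-model score tips the choice to "recognize speech", which is the kind of correction that context-based prediction provides. Yielding after every chunk, instead of waiting for the speaker to finish, is what lets text appear while speech continues.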
ASR Providers Compared
StreamTranslate uses Deepgram for ASR. Here's how leading ASR providers compare for live streaming use cases:
- Deepgram Nova-2: Best accuracy for streaming, ~0.3s latency (fastest listed)
- Google Speech-to-Text: Good accuracy, ~0.8s latency
- AWS Transcribe Streaming: Reliable, ~1-1.5s latency
- Azure Speech: Strong multilingual support, ~0.5s latency
- Whisper (OpenAI): Excellent accuracy, but not built for real-time streaming
Pricing
- Stream Pass — $9.99: One full stream session, all languages
- Starter — $14.99/mo: 25 hours/month, single language
- Pro — $34.99/mo: 40 hours/month, dual language
- Unlimited — $79.99/mo: Unlimited hours, dual language