What is ASR (Automatic Speech Recognition)?

Q: What does ASR stand for?

ASR stands for Automatic Speech Recognition — the technology that converts spoken audio into text in real time.

ASR Explained: Converting Speech to Text in Real Time

Automatic Speech Recognition (ASR) is a branch of artificial intelligence that converts the spoken words in an audio signal into written text, often in real time. It is the core technology behind live captions, voice assistants, dictation software, and real-time translation tools like StreamTranslate.

Modern ASR systems use deep neural networks trained on very large collections of audio data. These models learn to recognize phonemes (the distinct sound units of a language), words, and full sentences across different accents, speaking speeds, and acoustic environments. Fast streaming systems, such as Deepgram Nova-2, which StreamTranslate uses, can process audio and return transcribed text in under 300 milliseconds.

For live streamers, ASR quality directly determines caption accuracy. A low-quality ASR engine will produce garbled captions full of errors, which frustrates viewers and undermines accessibility. A high-quality ASR model like Nova-2 produces clean, readable captions even through background game audio, microphone noise, and rapid speech.
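Caption accuracy is usually quantified as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the ASR output into a reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration, assuming whitespace tokenization and case-insensitive matching. The function name and the example sentences are hypothetical and are not part of StreamTranslate or Deepgram.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the engine splits "payload" into "pay load",
# which counts as one substitution plus one insertion.
wer = word_error_rate("push the payload to the point",
                      "push the pay load to the point")
print(round(wer, 3))  # 2 errors / 6 reference words -> 0.333
```

A lower WER means cleaner captions. Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.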

How ASR Powers StreamTranslate

StreamTranslate connects directly to your stream via an OBS browser source or API integration. Your microphone audio is sent to Deepgram's Nova-2 ASR engine, which returns transcribed text in under 300ms. That text then flows through StreamTranslate's translation pipeline to appear in 125+ languages as a live overlay on your stream.

The key technical advantage here is streaming ASR — rather than waiting for a complete sentence before transcribing, Nova-2 processes audio in small chunks as you speak. This keeps latency low enough that captions appear to viewers almost as the words leave your mouth, rather than several seconds behind.
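The chunked processing described above can be sketched as follows. This is a minimal illustration, assuming raw 16-bit mono PCM audio at 16 kHz and 100 ms chunks. The actual chunk size and audio format used by StreamTranslate or Nova-2 are not stated here, and the function name is hypothetical.

```python
from typing import Iterator


def iter_audio_chunks(pcm: bytes,
                      sample_rate: int = 16_000,
                      sample_width: int = 2,
                      chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a mono PCM byte buffer."""
    # Bytes per chunk = samples per second * bytes per sample * seconds per chunk.
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be at least one byte")
    for start in range(0, len(pcm), chunk_bytes):
        # The final chunk may be shorter than the others.
        yield pcm[start:start + chunk_bytes]


# Hypothetical usage: 1.05 seconds of silence split into 100 ms chunks.
silence = bytes(16_000 * 2 * 105 // 100)
chunks = list(iter_audio_chunks(silence))
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 11 3200 1600
```

In a live setting each chunk would be sent to the ASR service as soon as it is captured, so the engine can start returning text before the speaker finishes the sentence.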

StreamTranslate also supports custom vocabulary, which helps the ASR engine correctly recognize game names, streamer handles, community-specific terms, and product names that general-purpose models might mishear. This is especially important for gaming streams, where general-purpose vocabularies often miss such terms.
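One common way to pass vocabulary hints to a streaming ASR service is as request parameters. The sketch below builds a WebSocket URL for Deepgram's streaming endpoint using its `keywords` query parameter, which takes a term and a boost value. How StreamTranslate actually configures custom vocabulary is not described here, so the helper name, the boost value, and the example terms are all assumptions.

```python
from urllib.parse import urlencode

# Deepgram's live transcription endpoint.
DEEPGRAM_STREAMING_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(terms, boost=2, model="nova-2"):
    """Build a streaming URL that boosts recognition of the given terms.

    `boost` is an assumed value; a suitable number depends on the term
    and should be tuned against real audio.
    """
    params = [("model", model), ("interim_results", "true")]
    # One `keywords` entry per term, in the form "term:boost".
    params += [("keywords", f"{term}:{boost}") for term in terms]
    return f"{DEEPGRAM_STREAMING_URL}?{urlencode(params)}"


# Hypothetical gaming terms a general-purpose model might mishear.
print(build_listen_url(["Valorant", "Jett"]))
```

Opening the connection would also require an API key and an audio source, both of which are left out of this sketch.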

Deepgram Nova-2

StreamTranslate uses Deepgram's Nova-2, a streaming ASR model chosen for its low word error rate on live audio.

Sub-300ms Latency

Streaming ASR delivers words back fast enough for live captions to feel synchronous with your speech, not delayed by seconds.

125+ Languages

After ASR transcribes your speech in English, StreamTranslate's neural machine translation (NMT) layer translates it into over 125 languages for global audiences.

Frequently Asked Questions

What does ASR stand for?

ASR stands for Automatic Speech Recognition — the AI technology that converts spoken audio into text in real time.

How accurate is modern ASR for live streams?

On clear speech, modern ASR models like Deepgram Nova-2 can be very accurate, which makes them suitable for professional live streaming captions. Accuracy drops with heavy background noise, overlapping speakers, or unfamiliar vocabulary.

How does StreamTranslate use ASR?

StreamTranslate routes your stream audio through Deepgram Nova-2 ASR to generate live captions in under 300ms, then optionally translates them into 125+ languages.

Does ASR work with gaming vocabulary?

Yes. StreamTranslate supports custom vocabulary so gaming terms, streamer names, and niche phrases are recognized correctly.

Is ASR better than manual captioning for live streams?

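For most live streams, ASR is the more practical option. Professional human captioners can caption live events, but they are costly, must be booked in advance, and usually run a few seconds behind the speaker. For clear, standard speech, modern ASR accuracy approaches that of human transcription.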