Technical Explainer

How Real-Time Speech Recognition Works (For Non-Technical Streamers)

By StreamTranslate Team · March 23, 2026 · 9 min read

You speak, and 1.4 seconds later, the words appear in Spanish at the bottom of your stream. What actually happens in those 1.4 seconds? This guide explains the full pipeline, from microphone input to translated subtitle, in plain English, with no computer science degree required.

The Big Picture

Real-time speech-to-subtitle translation is a 5-step pipeline. Each step adds a small amount of time; the total is what you experience as "subtitle delay." Modern systems like StreamTranslate optimize each step to keep the total under 2 seconds.

1. Audio Capture

Your microphone audio is captured by the browser source in OBS and streamed to the cloud in real time.

2. Speech Recognition (ASR)

Deepgram's AI converts the audio stream to text. This happens in near-real-time, outputting words as they're spoken.

3. Language Translation

The transcribed English text is passed to a neural machine translation model, which converts it to Spanish, Japanese, or whichever language you've selected.

4. Text Rendering

The translated text is pushed to your browser source overlay, where it's formatted and displayed as a subtitle bar.

5. Video Composite

OBS composites the subtitle overlay onto your video output before encoding it and sending it to Twitch, YouTube, X, or TikTok.

Step 1: Audio Capture

The browser source in OBS is a web page that runs inside your streaming software. This page accesses your microphone input through the browser's audio APIs (such as the Web Audio API) and streams the audio over a WebSocket connection to StreamTranslate's servers. The audio is sent as compressed chunks every 250–500 milliseconds: small enough to feel real-time, large enough for the AI to process meaningfully.

This step adds: ~50–100ms latency (network transmission time).
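The chunking idea can be sketched in a few lines of Python. This is only an illustration: the real capture runs in the browser and sends compressed audio, while the sketch below slices uncompressed samples. The 16 kHz sample rate and the `chunk_audio` helper are assumptions for the example, not details from StreamTranslate.

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000  # assumed sample rate; the article does not state one
CHUNK_MS = 250           # lower end of the 250–500 ms range described above


def chunk_audio(samples: List[int],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                chunk_ms: int = CHUNK_MS) -> Iterator[List[int]]:
    """Yield consecutive slices of `samples`, each covering `chunk_ms` of audio."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    for start in range(0, len(samples), samples_per_chunk):
        yield samples[start:start + samples_per_chunk]


# One second of silent audio splits into four 250 ms chunks of 4000 samples.
one_second = [0] * SAMPLE_RATE_HZ
chunks = list(chunk_audio(one_second))
print(len(chunks), len(chunks[0]))  # prints "4 4000"
```

Smaller chunks reach the server sooner but give the recognizer less audio to work with per message, which is the trade-off behind the 250–500 ms range.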

Step 2: Automatic Speech Recognition (ASR) with Deepgram

Deepgram is an AI company specializing in fast, accurate speech-to-text. Their Nova-2 model is specifically optimized for low-latency transcription — meaning it's designed to produce text output as quickly as possible, rather than waiting for a full sentence before outputting.

Deepgram processes the audio stream using a deep neural network. The model has been trained on hundreds of thousands of hours of speech — including gaming content, streams, podcasts, and everyday conversation. It produces a probability-weighted prediction of what was said, outputting words as it becomes confident in them.

This is why you sometimes see the subtitle text "flicker": the model outputs an initial guess (an interim result), then revises it as more context arrives. Settling on the final wording is called "word finalization," and it is a normal part of real-time ASR.

This step adds: ~200–400ms latency.
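The flicker described above comes from how interim guesses are displayed. Here is a minimal sketch, assuming each recognition message carries the text so far plus a flag saying whether it is final. The `AsrResult` shape and the `SubtitleLine` class are hypothetical stand-ins for illustration, not StreamTranslate's or Deepgram's actual message format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AsrResult:
    text: str       # transcript for the current stretch of audio
    is_final: bool  # hypothetical flag: True once the model stops revising it


class SubtitleLine:
    """Keeps finalized text separate from the latest interim guess."""

    def __init__(self) -> None:
        self.finalized: List[str] = []
        self.interim = ""

    def update(self, result: AsrResult) -> str:
        if result.is_final:
            self.finalized.append(result.text)
            self.interim = ""
        else:
            # An interim guess replaces the previous guess instead of appending.
            self.interim = result.text
        pending = [self.interim] if self.interim else []
        return " ".join(self.finalized + pending)


line = SubtitleLine()
frames = [line.update(r) for r in [
    AsrResult("I was", False),
    AsrResult("I was going", False),
    AsrResult("I was going to push mid", True),
    AsrResult("but they", False),
]]
for frame in frames:
    print(frame)
```

Each interim message overwrites the last one on screen, so the viewer sees the text change until a final result locks it in.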

🤖 How the AI "thinks": The model doesn't know the full sentence when it starts transcribing. It predicts the most likely word given the audio it's heard so far. "I was..." could be followed by "going," "thinking," or hundreds of other words — the model assigns probabilities and picks the most likely one, updating as more audio arrives.
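As a toy illustration of that idea, the sketch below picks the highest-probability word from a set of candidates, then picks again after more audio has shifted the probabilities. The words and numbers are made up for the example; a real model scores far more candidates and works on audio features rather than a small dictionary.

```python
from typing import Dict


def most_likely(candidates: Dict[str, float]) -> str:
    """Return the candidate word with the highest probability."""
    return max(candidates, key=candidates.get)


# Made-up probabilities for the word after "I was...", early in the audio.
early = {"going": 0.40, "thinking": 0.35, "growing": 0.25}
# The same candidates after more audio has arrived.
later = {"going": 0.15, "thinking": 0.80, "growing": 0.05}

first_guess = most_likely(early)
revised_guess = most_likely(later)
print(first_guess, "->", revised_guess)  # prints "going -> thinking"
```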

Step 3: Neural Machine Translation

Once a phrase or sentence is transcribed, it's passed to a translation model. Modern NMT (neural machine translation) systems have replaced the older "phrase substitution" approach with transformer-based models that understand context. "I'm going to smoke them" means something very different in a gaming context than a general context — modern NMT handles this better than earlier systems, though not perfectly.

Translation takes ~100–200ms for common language pairs (English to Spanish, French, Portuguese). Less common pairs (English to Japanese or Korean) may take slightly longer due to structural differences in the target language.

This step adds: ~100–300ms latency.
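One plausible way to decide when a phrase is ready for translation is to wait for sentence-ending punctuation and hold back anything unfinished. The article does not say how StreamTranslate makes this decision, so treat the sketch below as an assumption. `fake_translate` is a placeholder that only tags the text; it is not a real translation call.

```python
import re
from typing import Callable, List, Tuple


def split_complete_sentences(buffer: str) -> Tuple[List[str], str]:
    """Return (complete sentences, leftover text still waiting for its ending)."""
    parts = re.split(r"(?<=[.!?])\s+", buffer.strip())
    if parts and not re.search(r"[.!?]$", parts[-1]):
        return parts[:-1], parts[-1]
    return [p for p in parts if p], ""


def translate_ready_text(buffer: str,
                         translate: Callable[[str], str]) -> Tuple[List[str], str]:
    """Translate only the finished sentences; keep the rest for the next round."""
    sentences, leftover = split_complete_sentences(buffer)
    return [translate(s) for s in sentences], leftover


def fake_translate(sentence: str) -> str:
    # Placeholder for a real translation model: just marks the target language.
    return f"[es] {sentence}"


translated, leftover = translate_ready_text(
    "Nice shot. Push mid now! I was", fake_translate)
print(translated)
print(leftover)
```

Holding back the unfinished fragment gives the translation model a whole sentence of context, at the cost of a short wait.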

Steps 4 and 5: Display and Composite

The translated text is pushed via WebSocket to your browser source overlay. JavaScript renders it as styled HTML text on a semi-transparent background. OBS composites this layer onto your video output during encoding. This happens in real time and adds ~50–100ms total.
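The layout part of rendering can be sketched as wrapping the text to a fixed width and showing only the most recent lines. The actual overlay does this with JavaScript and HTML; the Python version below only illustrates the logic, and the `max_chars` and `max_lines` limits are hypothetical values chosen for the example.

```python
import textwrap
from typing import List


def layout_subtitle(text: str, max_chars: int = 32, max_lines: int = 2) -> List[str]:
    """Wrap text to `max_chars` per line and keep only the last `max_lines` lines."""
    lines = textwrap.wrap(text, width=max_chars)
    return lines[-max_lines:]


subtitle = layout_subtitle(
    "Voy a rotar por la jungla y luego empujamos la linea central todos juntos")
for row in subtitle:
    print(row)
```

Keeping only the newest lines stops a long sentence from covering the gameplay, which is why subtitle bars usually cap themselves at two rows.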

Total Latency Breakdown

Where Your 1.4 Seconds Goes

Audio capture + network: ~100ms
Deepgram ASR processing: ~300ms
Neural machine translation: ~200ms
Text render + OBS composite: ~100ms
Word finalization buffer: ~700ms
Total average delay: ~1.4s
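The figures in the breakdown above can be checked with a few lines of arithmetic. The sketch below uses the article's own averages, adds them up, and shows each stage's share of the total.

```python
# Average delay per stage in milliseconds, taken from the breakdown above.
budget_ms = {
    "Audio capture + network": 100,
    "Deepgram ASR processing": 300,
    "Neural machine translation": 200,
    "Text render + OBS composite": 100,
    "Word finalization buffer": 700,
}

total_ms = sum(budget_ms.values())

for stage, ms in budget_ms.items():
    print(f"{stage:<30}{ms:>5} ms  {ms / total_ms:>4.0%}")
print(f"{'Total':<30}{total_ms:>5} ms")
```

The total comes to 1400 ms, and the word finalization buffer accounts for half of it, so waiting for the transcript to settle costs as much as all the other stages combined.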

Why It Runs in the Cloud

Running speech recognition and translation on your local PC would require GPU-accelerated hardware most streamers don't have, and would compete with your game for resources. Cloud processing offloads the compute entirely — your PC only handles the lightweight audio capture and text rendering. This is why StreamTranslate has essentially zero CPU impact on your streaming machine.

For more on latency in live streaming specifically, see: latency in live stream captions — what's acceptable?

Frequently Asked Questions

What AI model powers StreamTranslate's speech recognition?

StreamTranslate uses Deepgram's Nova-2 model for speech-to-text, combined with neural machine translation for language conversion. Deepgram is optimized for low-latency real-time transcription.

Does speech recognition run on my computer?

No. StreamTranslate processes audio in the cloud. Your microphone audio is streamed to the cloud, transcribed by Deepgram, and passed to translation servers, and the resulting text is sent back to your browser source overlay. This means essentially zero CPU impact on your PC, since your machine only captures audio and renders text.

Can the system recognize gaming-specific vocabulary?

Modern ASR systems have been trained on gaming content and handle common gaming terms well. Very niche game-specific terms, streamer-invented slang, or newly coined phrases may not be recognized. The model improves continuously as more gaming audio is processed.
