Real-time speech-to-text has transformed how we interact with meetings. Whether you're using it for accessibility, note-taking, or AI-powered assistance, understanding how it works will help you choose the right solution. This guide covers the entire pipeline from audio input to text output.

The Audio Input Problem

The first challenge in meeting transcription is getting clean audio. Unlike a podcast or voice memo, meeting audio involves multiple speakers, background noise, echo from speakers playing into microphones, and the compression artifacts introduced by video conferencing codecs.

System Audio Capture

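To transcribe what others are saying in a meeting, you need to capture the system audio output: the audio coming from Zoom, Teams, or Meet. This differs from microphone capture because you record from a loopback device that mirrors an output device, rather than from a physical input. On Windows, WASAPI exposes such devices: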

# Simplified WASAPI loopback capture on Windows
import pyaudiowpatch as pyaudio

p = pyaudio.PyAudio()
wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)

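# Find the first WASAPI loopback device (it mirrors an output device as an input)
loopback_device = None
for i in range(wasapi_info["deviceCount"]):
    device = p.get_device_info_by_host_api_device_index(
        wasapi_info["index"], i
    )
    if device.get("isLoopbackDevice"):
        loopback_device = device
        break

# Fail early instead of hitting a NameError when opening the stream
if loopback_device is None:
    raise RuntimeError("No WASAPI loopback device found")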

stream = p.open(
    format=pyaudio.paInt16,
    channels=loopback_device["maxInputChannels"],
    rate=int(loopback_device["defaultSampleRate"]),
    input=True,
    input_device_index=loopback_device["index"],
    frames_per_buffer=1024
)
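Loopback streams arrive as interleaved PCM with however many channels the output device uses, while many speech recognizers are fed a single channel. The helper below is a minimal sketch of a downmix step, assuming 16-bit samples as in the `paInt16` stream opened above; `downmix_to_mono` is an illustrative name, not part of any library.

```python
from array import array


def downmix_to_mono(frames: bytes, channels: int) -> bytes:
    """Average interleaved 16-bit PCM channels into one mono channel."""
    if channels == 1:
        return frames
    samples = array("h")  # signed 16-bit, native byte order
    samples.frombytes(frames)
    mono = array(
        "h",
        (
            sum(samples[i:i + channels]) // channels
            for i in range(0, len(samples) - channels + 1, channels)
        ),
    )
    return mono.tobytes()
```

With the stream from the capture code, a call might look like `downmix_to_mono(stream.read(1024), loopback_device["maxInputChannels"])`. Resampling to the rate a recognizer expects is a separate step that this sketch does not cover.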

Echo Cancellation

When you're in a meeting, your microphone picks up the audio from your speakers, so remote speech can show up a second time in the transcription as echo. Acoustic Echo Cancellation (AEC) algorithms remove this echo by estimating how the known speaker output arrives at the microphone and subtracting that estimate from the microphone input:

Echo suppression: 30 dB
Processing delay: <10 ms
Echo removal rate: 95%
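One common way to perform this subtraction is a normalized least-mean-squares (NLMS) adaptive filter, which learns how the speaker signal is delayed and attenuated on its way to the microphone. The article does not say which algorithm is used, so the following is only a minimal sketch; the function name and the `taps`, `mu`, and `eps` parameters are illustrative assumptions.

```python
def nlms_echo_cancel(mic, ref, taps=32, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic`.

    mic: microphone samples (echo plus any local speech)
    ref: speaker (loopback) samples, time-aligned with mic
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    cleaned = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate  # residual after echo removal
        # Normalize the step size by the reference energy in the window
        energy = sum(h * h for h in history) + eps
        step = mu * error / energy
        weights = [w + step * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned
```

Production echo cancellers add pieces this sketch leaves out, such as delay estimation between the two streams and pausing adaptation while both sides are talking.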

Streaming Transcription Architecture

Once you have clean audio, the next step is streaming it to a speech recognition service. The standard approach uses WebSockets for bidirectional communication:

// Browser-based WebSocket streaming
const socket = new WebSocket('wss://api.deepgram.com/v1/listen?model=nova-2', [
  'token', 'YOUR_DEEPGRAM_API_KEY'
]);

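socket.onopen = () => {
  // getUserMedia captures the microphone; system audio needs a separate loopback source
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
      recorder.ondataavailable = (e) => {
        // Skip empty chunks and avoid sending after the socket has closed
        if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(e.data);
        }
      };
      recorder.start(250); // Emit a chunk every 250 ms
    })
    .catch(err => console.error('Microphone access failed:', err));
};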

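socket.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  // Guard every level: not every message carries a transcript
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (transcript && data.is_final) {
    // Append only finalized text so captions are never rewritten
    document.getElementById('captions').textContent += transcript + ' ';
  }
};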
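The handler above shows only finalized text. When a provider also sends interim hypotheses, captions feel more responsive if the latest interim text is displayed and then replaced once the final version arrives. The sketch below illustrates that bookkeeping in Python, assuming messages shaped like the ones parsed above (`is_final` and `channel.alternatives[0].transcript`); `CaptionBuffer` is a hypothetical helper, not part of any SDK.

```python
class CaptionBuffer:
    """Holds committed (final) text plus the latest interim hypothesis."""

    def __init__(self):
        self.final_parts = []
        self.interim = ""

    def update(self, message: dict) -> str:
        """Apply one parsed message and return the caption line to display."""
        alternatives = message.get("channel", {}).get("alternatives", [])
        transcript = alternatives[0].get("transcript", "") if alternatives else ""
        if message.get("is_final"):
            if transcript:
                self.final_parts.append(transcript)
            self.interim = ""  # the final result supersedes the interim text
        else:
            self.interim = transcript
        visible = self.final_parts + ([self.interim] if self.interim else [])
        return " ".join(visible)
```

Each call to `update` returns the full caption line, so the display can simply be overwritten rather than appended to.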

Choosing a Transcription Provider

Provider                  Best for                 Latency    Pricing
Deepgram                  Conversational speech    ~300 ms    $0.0043/min
Google Cloud STT          Multi-language           ~400 ms    $0.006/min
AWS Transcribe            AWS ecosystem            ~500 ms    $0.024/min
AssemblyAI                Speaker diarization      ~350 ms    $0.0065/min
faster-whisper (local)    Privacy                  ~500 ms    Free

Recommendation: For real-time meeting transcription, Deepgram offers the best balance of speed, accuracy, and cost. Voxclar uses Deepgram as its primary cloud ASR provider while offering faster-whisper as a local alternative.
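Per-minute prices are easier to compare as a monthly bill. The sketch below multiplies the listed rates by a given number of meeting hours; the rates are copied from the table above, and `monthly_cost` is an illustrative helper. It ignores free tiers, volume discounts, and the hardware cost of running a local model.

```python
PRICE_PER_MINUTE = {
    "Deepgram": 0.0043,
    "Google Cloud STT": 0.006,
    "AWS Transcribe": 0.024,
    "AssemblyAI": 0.0065,
    "faster-whisper (local)": 0.0,
}


def monthly_cost(hours_per_month: float) -> dict:
    """Return the estimated monthly cost in USD for each provider."""
    minutes = hours_per_month * 60
    return {
        name: round(minutes * price, 2)
        for name, price in PRICE_PER_MINUTE.items()
    }


for name, cost in monthly_cost(40).items():
    print(f"{name}: ${cost:.2f}")
```

At 40 hours of meetings a month (2,400 minutes), the listed rates work out to about $10.32 for Deepgram and $57.60 for AWS Transcribe.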

Handling Multiple Speakers

Speaker diarization (identifying who said what) adds another layer of complexity. Cloud providers handle this with models trained on multi-speaker audio, typically tagging each word or segment with a speaker label. For meeting transcription, accurate diarization is essential for creating useful notes and understanding the flow of conversation.
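Diarized output is most useful for notes once it is grouped into speaker turns. The sketch below assumes word-level results in which each word carries a `speaker` index along with `start` and `end` times. That shape is an assumption for illustration, and the exact field names vary by provider.

```python
def group_into_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["text"] += " " + word["word"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed: start a new turn
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": word["word"],
            })
    return turns


words = [
    {"word": "Any", "speaker": 0, "start": 0.0, "end": 0.2},
    {"word": "updates?", "speaker": 0, "start": 0.2, "end": 0.6},
    {"word": "Yes,", "speaker": 1, "start": 0.9, "end": 1.1},
    {"word": "shipped.", "speaker": 1, "start": 1.1, "end": 1.5},
]
for turn in group_into_turns(words):
    print(f"Speaker {turn['speaker']}: {turn['text']}")
```

Mapping speaker indexes to participant names is a separate problem that this grouping step does not solve.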

Building vs. Buying

Building a real-time transcription pipeline from scratch is a significant engineering effort. Between audio capture, echo cancellation, streaming infrastructure, and ASR integration, you're looking at months of development time. For most use cases, a purpose-built tool like Voxclar provides everything you need out of the box, with the added benefit of AI-powered features like answer generation and floating captions.

For a deeper comparison of cloud vs. local ASR, read our dedicated comparison guide. If you're specifically interested in interview scenarios, check out our technical guide to AI interview assistants.