Real-time speech-to-text has transformed how we interact with meetings. Whether you're using it for accessibility, note-taking, or AI-powered assistance, understanding how it works will help you choose the right solution. This guide covers the entire pipeline from audio input to text output.

The Audio Input Problem

The first challenge in meeting transcription is getting clean audio. Unlike a podcast or voice memo, meeting audio involves multiple speakers, background noise, echo from speakers playing into microphones, and the compression artifacts introduced by video conferencing codecs.

System Audio Capture

To transcribe what others are saying in a meeting, you need to capture the system audio output — the audio coming from Zoom, Teams, or Meet. This is fundamentally different from microphone capture:

macOS: Audio process taps via Core Audio allow you to intercept audio from specific applications without affecting playback
Windows: WASAPI loopback capture records the mixed audio output of the system
Linux: PulseAudio monitor sources serve a similar purpose

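# Simplified WASAPI loopback capture on Windows (uses the PyAudioWPatch package)
import pyaudiowpatch as pyaudio

p = pyaudio.PyAudio()
wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)

# Find the first loopback device exposed by the WASAPI host API
loopback_device = None
for i in range(wasapi_info["deviceCount"]):
    device = p.get_device_info_by_host_api_device_index(
        wasapi_info["index"], i
    )
    if device.get("isLoopbackDevice"):
        loopback_device = device
        break

# Fail clearly instead of hitting a NameError further down
if loopback_device is None:
    p.terminate()
    raise RuntimeError("No WASAPI loopback device found")

# Open an input stream on the loopback device using its native format
stream = p.open(
    format=pyaudio.paInt16,
    channels=loopback_device["maxInputChannels"],
    rate=int(loopback_device["defaultSampleRate"]),
    input=True,
    input_device_index=loopback_device["index"],
    frames_per_buffer=1024
)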
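
On Linux, the PulseAudio monitor sources mentioned above can be listed with the command "pactl list short sources", where their names end in ".monitor". The following is a minimal sketch of picking them out. It assumes the tab-separated layout that command prints (index, name, driver, sample spec, state); the helper names find_monitor_sources and list_system_monitors are illustrative and not part of any library.

```python
import subprocess


def find_monitor_sources(pactl_output: str) -> list[str]:
    """Return monitor source names from `pactl list short sources` output."""
    monitors = []
    for line in pactl_output.splitlines():
        fields = line.split("\t")
        # Assumed columns: index, name, driver module, sample spec, state
        if len(fields) >= 2 and fields[1].endswith(".monitor"):
            monitors.append(fields[1])
    return monitors


def list_system_monitors() -> list[str]:
    """Ask the running sound server for its monitor sources (needs pactl)."""
    result = subprocess.run(
        ["pactl", "list", "short", "sources"],
        capture_output=True, text=True, check=True,
    )
    return find_monitor_sources(result.stdout)
```

Recording from one of the returned names captures whatever is being played to the matching output device, which is the Linux counterpart of the loopback stream opened above.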

Echo Cancellation

When you're in a meeting, your microphone picks up the audio from your speakers, so remote participants' speech can show up a second time in the transcript. Acoustic Echo Cancellation (AEC) algorithms remove this echo by estimating how the known speaker output arrives at the microphone and subtracting that estimate from the microphone input:

Echo suppression: 30 dB
Processing delay: <10 ms
Echo removal rate: 95%
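
The subtraction described above is usually done with an adaptive filter, because the echo reaching the microphone is a delayed and attenuated copy of the speaker signal rather than the signal itself. The sketch below uses a normalised LMS (NLMS) filter, one common textbook choice; it is not taken from any particular product, and production AEC also has to handle double-talk, delay estimation, and residual suppression. The function name and the parameters taps and step are illustrative.

```python
import random


def nlms_echo_cancel(mic, reference, taps=16, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    mic, reference: equal-length sequences of float samples.
    Returns the residual signal (mic minus the estimated echo).
    """
    weights = [0.0] * taps   # current estimate of the echo path (FIR filter)
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, reference):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate
        # NLMS update: the step is scaled by the recent reference energy
        energy = sum(x * x for x in history) + eps
        gain = step * error / energy
        weights = [w + gain * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual


# Synthetic check: the mic hears the far-end signal 3 samples late at 60% volume.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.0] * 3 + [0.6 * s for s in far_end[:-3]]
cleaned = nlms_echo_cancel(mic, far_end)
```

In this idealised setup there is no local speech, so the residual should shrink toward zero as the filter converges. With a real talker at the microphone, the residual is what gets sent to the recogniser.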

Streaming Transcription Architecture

Once you have clean audio, the next step is streaming it to a speech recognition service. A common approach uses a WebSocket for bidirectional communication: audio chunks go up the connection while transcript messages come back down it.

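// Browser-based WebSocket streaming (getUserMedia captures the microphone)
// Avoid shipping a long-lived API key in client-side code
const socket = new WebSocket('wss://api.deepgram.com/v1/listen?model=nova-2', [
  'token', 'YOUR_DEEPGRAM_API_KEY'
]);

socket.onopen = () => {
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
      recorder.ondataavailable = (e) => {
        // Skip empty chunks and avoid sending after the socket has closed
        if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(e.data);
        }
      };
      recorder.start(250); // Send chunks every 250ms
    })
    .catch(err => console.error('Microphone access failed:', err));
};

socket.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  // Not every message carries a transcript, so guard each level
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (transcript && data.is_final) {
    document.getElementById('captions').textContent += transcript + ' ';
  }
};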
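
The handler above only displays results flagged as final. If the provider is also configured to send interim results, captions can feel more responsive when the latest interim text is shown and then replaced once the final version arrives. Here is a minimal sketch of that bookkeeping. It assumes messages shaped like the JSON read in the handler above (a channel.alternatives list plus an is_final flag); the class name CaptionBuffer is hypothetical.

```python
class CaptionBuffer:
    """Keeps finalized text plus one replaceable interim segment."""

    def __init__(self):
        self.final_segments = []
        self.interim = ""

    def handle_message(self, message: dict) -> None:
        alternatives = message.get("channel", {}).get("alternatives", [])
        if not alternatives:
            return  # metadata or another message without a transcript
        transcript = alternatives[0].get("transcript", "")
        if message.get("is_final"):
            # A final result supersedes whatever interim text was showing
            if transcript:
                self.final_segments.append(transcript)
            self.interim = ""
        else:
            self.interim = transcript

    def text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Calling text() after each message gives the string to render: committed segments never change, and only the trailing interim part is rewritten.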

Choosing a Transcription Provider

Provider	Best For	Latency	Pricing
Deepgram	Conversational speech	~300ms	$0.0043/min
Google Cloud STT	Multi-language	~400ms	$0.006/min
AWS Transcribe	AWS ecosystem	~500ms	$0.024/min
AssemblyAI	Speaker diarization	~350ms	$0.0065/min
faster-whisper (local)	Privacy	~500ms	Free
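
To put the per-minute prices in context, the sketch below turns them into a rough monthly figure. The prices are copied from the table above; the 40 hours per month workload is an assumption chosen for illustration, and the calculation ignores free tiers, volume discounts, and the hardware cost of running a local model.

```python
# Per-minute prices in USD, copied from the table above.
PRICE_PER_MINUTE = {
    "Deepgram": 0.0043,
    "Google Cloud STT": 0.006,
    "AWS Transcribe": 0.024,
    "AssemblyAI": 0.0065,
    "faster-whisper (local)": 0.0,
}


def monthly_cost(provider: str, meeting_hours_per_month: float) -> float:
    """Estimated monthly spend in USD for the given hours of audio."""
    minutes = meeting_hours_per_month * 60
    return round(PRICE_PER_MINUTE[provider] * minutes, 2)


# Assumed workload: 40 hours of meetings per month.
estimates = {name: monthly_cost(name, 40) for name in PRICE_PER_MINUTE}
```

At this volume the gap between the cheapest and most expensive cloud option is tens of dollars per month per user, so latency and accuracy may matter more than price until usage grows.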

Recommendation: For real-time meeting transcription, Deepgram offers the best balance of speed, accuracy, and cost. Voxclar uses Deepgram as its primary cloud ASR provider while offering faster-whisper as a local alternative.

Handling Multiple Speakers

Speaker diarization, identifying who said what, adds another layer of complexity. Cloud providers handle this with models trained on multi-speaker audio and typically return a speaker label alongside each word or utterance. For meeting transcription, accurate diarization is essential for creating useful notes and following the flow of conversation.
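
Diarized output usually needs one more step before it reads like meeting notes: consecutive words from the same speaker have to be merged into turns. The sketch below assumes each word arrives as a dict with "word" and "speaker" keys. That shape is an assumption for illustration, since field names differ between providers, and the function name group_by_speaker is hypothetical.

```python
def group_by_speaker(words):
    """Merge consecutive words from the same speaker into turns.

    words: list of dicts with "word" and "speaker" keys (assumed shape).
    Returns a list of (speaker, text) tuples in spoken order.
    """
    turns = []
    for item in words:
        speaker, word = item["speaker"], item["word"]
        if turns and turns[-1][0] == speaker:
            # Same speaker is still talking: extend the current turn
            turns[-1][1].append(word)
        else:
            # Speaker changed: start a new turn
            turns.append((speaker, [word]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]
```

Each returned tuple can then be rendered as one line of the transcript, prefixed with the speaker label.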

Building vs. Buying

Building a real-time transcription pipeline from scratch is a significant engineering effort. Between audio capture, echo cancellation, streaming infrastructure, and ASR integration, you're looking at months of development time. For most use cases, a purpose-built tool like Voxclar provides everything you need out of the box, with the added benefit of AI-powered features like answer generation and floating captions.

For a deeper comparison of cloud vs. local ASR, read our dedicated comparison guide. If you're specifically interested in interview scenarios, check out our technical guide to AI interview assistants.

Real-Time Speech to Text for Meetings: The Complete Guide

