Real-time speech-to-text has transformed how we interact with meetings. Whether you're using it for accessibility, note-taking, or AI-powered assistance, understanding how it works will help you choose the right solution. This guide covers the entire pipeline from audio input to text output.
The Audio Input Problem
The first challenge in meeting transcription is getting clean audio. Unlike a podcast or voice memo, meeting audio involves multiple speakers, background noise, echo from speakers playing into microphones, and the compression artifacts introduced by video conferencing codecs.
System Audio Capture
To transcribe what others are saying in a meeting, you need to capture the system audio output — the audio coming from Zoom, Teams, or Meet. This is fundamentally different from microphone capture:
- macOS: Audio process taps via Core Audio (available since macOS 14.2) allow you to intercept audio from specific applications without affecting playback
- Windows: WASAPI loopback capture records the mixed audio output of a playback device, covering every application routed to it
- Linux: PulseAudio monitor sources serve a similar purpose, exposing each output sink as a recordable input
# Simplified WASAPI loopback capture on Windows (requires the PyAudioWPatch package)
import pyaudiowpatch as pyaudio

p = pyaudio.PyAudio()
wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)

# Find the first loopback device exposed by the WASAPI host API
loopback_device = None
for i in range(wasapi_info["deviceCount"]):
    device = p.get_device_info_by_host_api_device_index(
        wasapi_info["index"], i
    )
    if device.get("isLoopbackDevice"):
        loopback_device = device
        break

if loopback_device is None:
    raise RuntimeError("No WASAPI loopback device found")

# Open an input stream on the loopback device using its native format
stream = p.open(
    format=pyaudio.paInt16,
    channels=loopback_device["maxInputChannels"],
    rate=int(loopback_device["defaultSampleRate"]),
    input=True,
    input_device_index=loopback_device["index"],
    frames_per_buffer=1024,
)
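Loopback streams arrive in the output device's native format, often stereo at 44.1 or 48 kHz, while many speech recognition models (Whisper among them) expect 16 kHz mono. The sketch below shows one way to convert a captured buffer using only the standard library. The function name `to_mono_16k` and its parameters are hypothetical, and the linear interpolation has no anti-aliasing filter, so treat it as an illustration rather than a production resampler.

```python
from array import array


def to_mono_16k(pcm_bytes, channels, source_rate, target_rate=16000):
    """Downmix interleaved 16-bit PCM to mono and resample it.

    Hypothetical helper: assumes native-endian int16 samples, as produced
    by a paInt16 stream, and a buffer holding whole frames only.
    """
    samples = array("h")
    samples.frombytes(pcm_bytes)
    frame_count = len(samples) // channels
    if frame_count == 0:
        return b""

    # Average the channels of each frame to get one mono sample per frame.
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) / channels
        for i in range(frame_count)
    ]

    # Linear interpolation between neighbouring samples (no low-pass filter).
    out_count = frame_count * target_rate // source_rate
    ratio = source_rate / target_rate
    out = array("h")
    for n in range(out_count):
        pos = n * ratio
        left = int(pos)
        right = min(left + 1, frame_count - 1)
        frac = pos - left
        value = mono[left] * (1 - frac) + mono[right] * frac
        out.append(int(round(value)))
    return out.tobytes()
```

With the capture stream above, a call might look like `to_mono_16k(stream.read(1024), loopback_device["maxInputChannels"], int(loopback_device["defaultSampleRate"]))`. A dedicated resampling library would be a better fit when audio quality matters.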
Echo Cancellation
When you're in a meeting, your microphone picks up the audio from your speakers, creating an echo in the transcription. Acoustic Echo Cancellation (AEC) algorithms remove this echo by estimating how the known speaker output appears in the microphone signal, typically with an adaptive filter, and subtracting that estimate from the microphone input.
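To make the subtraction idea concrete, here is a minimal sketch of a normalized least-mean-squares (NLMS) adaptive filter, one common building block for echo cancellation. It is not the method of any particular product: the function name `nlms_echo_cancel` and the parameters `taps`, `mu`, and `eps` are illustrative, and the sketch assumes the microphone and speaker signals are already time-aligned and sampled at the same rate.

```python
def nlms_echo_cancel(mic, far_end, taps=32, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the far-end echo from the mic signal.

    mic and far_end are equal-rate sequences of float samples.
    Returns the residual (mic minus estimated echo) as a list.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    cleaned = []
    for mic_sample, far_sample in zip(mic, far_end):
        history.insert(0, far_sample)
        history.pop()

        # Predict the echo as a weighted sum of recent speaker output.
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate

        # NLMS update: step size is normalized by the input energy.
        norm = sum(x * x for x in history) + eps
        step = mu * error / norm
        weights = [w + step * x for w, x in zip(weights, history)]

        cleaned.append(error)
    return cleaned
```

Real echo cancellers also have to handle delay estimation, double-talk (both sides speaking at once), and nonlinear speaker distortion, which is why production systems usually rely on an established implementation such as the one in WebRTC's audio processing module.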
Streaming Transcription Architecture
Once you have clean audio, the next step is streaming it to a speech recognition service. The standard approach uses WebSockets for bidirectional communication:
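// Browser-based WebSocket streaming
const socket = new WebSocket('wss://api.deepgram.com/v1/listen?model=nova-2', [
  'token', 'YOUR_DEEPGRAM_API_KEY'
]);

socket.onopen = () => {
  // Captures the microphone; system audio needs one of the capture methods above
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
      recorder.ondataavailable = (e) => {
        // Skip empty chunks and avoid sending after the socket has closed
        if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(e.data);
        }
      };
      recorder.start(250); // Send chunks every 250ms
    })
    .catch(err => console.error('Microphone access failed:', err));
};

socket.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  // Append only finalized segments; interim results are ignored here
  if (transcript && data.is_final) {
    document.getElementById('captions').textContent += transcript + ' ';
  }
};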
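The handler above only displays finalized segments. Live captions usually feel more responsive when interim results are shown and then replaced once the final version arrives. The following sketch illustrates that bookkeeping. The class name `CaptionBuffer` is hypothetical, and the message shape (`channel.alternatives[0].transcript` plus `is_final`) is assumed to match the fields read in the handler above.

```python
class CaptionBuffer:
    """Tracks finalized segments plus the latest interim hypothesis."""

    def __init__(self):
        self.final_segments = []
        self.interim = ""

    def update(self, message):
        """Apply one parsed result message and return the caption text."""
        alternatives = message.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""

        if message.get("is_final"):
            # A final result replaces whatever interim text was showing.
            if text:
                self.final_segments.append(text)
            self.interim = ""
        else:
            # Interim results overwrite each other until finalized.
            self.interim = text
        return self.render()

    def render(self):
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Each call to `update` returns the full caption string, so the UI can replace its text rather than append to it, which avoids showing the same phrase twice.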
Choosing a Transcription Provider
| Provider | Best For | Latency | Pricing |
|---|---|---|---|
| Deepgram | Conversational speech | ~300ms | $0.0043/min |
| Google Cloud STT | Multi-language | ~400ms | $0.006/min |
| AWS Transcribe | AWS ecosystem | ~500ms | $0.024/min |
| AssemblyAI | Speaker diarization | ~350ms | $0.0065/min |
| faster-whisper (local) | Privacy | ~500ms | Free |
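Per-minute prices are easier to compare as a monthly figure. This small sketch uses the prices from the table above; the function name `monthly_cost` and the meeting-hours input are hypothetical, and the local option's figure excludes the cost of the hardware it runs on.

```python
# Per-minute prices in USD, copied from the table above.
PRICE_PER_MINUTE_USD = {
    "Deepgram": 0.0043,
    "Google Cloud STT": 0.006,
    "AWS Transcribe": 0.024,
    "AssemblyAI": 0.0065,
    "faster-whisper (local)": 0.0,
}


def monthly_cost(provider, meeting_hours_per_month):
    """Estimate monthly transcription spend in USD for one user."""
    minutes = meeting_hours_per_month * 60
    return round(minutes * PRICE_PER_MINUTE_USD[provider], 2)


if __name__ == "__main__":
    for name in PRICE_PER_MINUTE_USD:
        print(f"{name}: ${monthly_cost(name, 20):.2f} for 20 hours")
```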
Handling Multiple Speakers
Speaker diarization — identifying who said what — adds another layer of complexity. Cloud providers handle this with models trained on multi-speaker audio. For meeting transcription, accurate diarization is essential for creating useful notes and understanding the flow of conversation.
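Diarized results typically arrive as a list of words, each tagged with a speaker index. Turning that into readable notes means merging consecutive words from the same speaker into turns. The sketch below shows one way to do it. The function name `group_speaker_turns` and the `word` and `speaker` keys are assumptions for illustration, so check your provider's response format for the actual field names.

```python
def group_speaker_turns(words):
    """Merge consecutive words from the same speaker into turns.

    words: list of dicts with hypothetical keys "word" and "speaker".
    Returns a list of {"speaker": ..., "text": ...} dicts in spoken order.
    """
    turns = []
    for item in words:
        if turns and turns[-1]["speaker"] == item["speaker"]:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + item["word"]
        else:
            # Speaker changed: start a new turn.
            turns.append({"speaker": item["speaker"], "text": item["word"]})
    return turns
```

Rendering each turn as `Speaker 0: ...` gives a transcript that reads like meeting minutes. Mapping speaker indices to real names needs extra information, such as participant metadata from the meeting platform.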
Building vs. Buying
Building a real-time transcription pipeline from scratch is a significant engineering effort. Between audio capture, echo cancellation, streaming infrastructure, and ASR integration, you're looking at months of development time. For most use cases, a purpose-built tool like Voxclar provides everything you need out of the box, with the added benefit of AI-powered features like answer generation and floating captions.
For a deeper comparison of cloud vs. local ASR, read our dedicated comparison guide. If you're specifically interested in interview scenarios, check out our technical guide to AI interview assistants.