For Windows-based meeting transcription tools, WASAPI (Windows Audio Session API) loopback capture is the standard mechanism for capturing system audio. Unlike microphone capture, loopback recording captures the audio output — everything you hear through your speakers or headphones. Here's how it works at a technical level.
## WASAPI Architecture Overview
WASAPI sits between applications and the Windows audio engine, which mixes all streams before they reach the hardware:

```
Application (Zoom/Teams) → WASAPI → Audio Engine → Audio Hardware
                                        ↓
                            Loopback Capture (your tool)
```
In shared mode, WASAPI allows multiple applications to share the audio endpoint. Loopback capture taps into the mixed output of all applications playing through a given audio device.
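The "mixed output" that loopback capture sees is conceptually just the sum of every application's stream, clipped to the sample range. A minimal numpy illustration of that idea (the real engine mixes in floating point internally; this int16 version is only a conceptual sketch):

```python
import numpy as np

def mix_streams(streams):
    """Sum several int16 PCM byte buffers the way a shared-mode mixer
    conceptually would, clipping to the int16 range instead of wrapping."""
    acc = np.zeros(len(streams[0]) // 2, dtype=np.int32)  # wide accumulator
    for s in streams:
        acc += np.frombuffer(s, dtype=np.int16)
    return np.clip(acc, -32768, 32767).astype(np.int16).tobytes()

# Two apps playing at once: the loopback tap sees their sum
zoom = (np.ones(4, dtype=np.int16) * 1000).tobytes()
chime = (np.ones(4, dtype=np.int16) * 500).tobytes()
mixed = np.frombuffer(mix_streams([zoom, chime]), dtype=np.int16)  # 1500 each
```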
## Implementation in Python
Using pyaudiowpatch (a fork of PyAudio that adds WASAPI loopback support):
```python
import pyaudiowpatch as pyaudio
import numpy as np

def find_loopback_device(p: pyaudio.PyAudio):
    # Find the loopback device that mirrors the default WASAPI output
    wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)
    default_speakers = p.get_device_info_by_index(
        wasapi_info["defaultOutputDevice"]
    )
    for i in range(p.get_device_count()):
        device = p.get_device_info_by_index(i)
        if (device.get("isLoopbackDevice")
                and device["name"].startswith(default_speakers["name"])):
            return device
    raise RuntimeError("No loopback device found")

def capture_audio():
    p = pyaudio.PyAudio()
    device = find_loopback_device(p)
    stream = p.open(
        format=pyaudio.paInt16,
        channels=device["maxInputChannels"],
        rate=int(device["defaultSampleRate"]),
        input=True,
        input_device_index=device["index"],
        frames_per_buffer=512,  # low-latency buffer
    )
    print(f"Capturing from: {device['name']}")
    print(f"Sample rate: {device['defaultSampleRate']} Hz")
    print(f"Channels: {device['maxInputChannels']}")
    try:
        while True:
            data = stream.read(512, exception_on_overflow=False)
            audio_array = np.frombuffer(data, dtype=np.int16)
            # Process or forward the audio...
            yield data
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
```
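In practice you will usually spool the yielded chunks somewhere. A sketch that writes them to a WAV file with the stdlib `wave` module — the 2-channel/48 kHz defaults here are assumptions, and in real use you would take the values from the device dict above:

```python
import wave

def write_chunks_to_wav(path, chunks, channels=2, rate=48000):
    # chunks: iterable of raw int16 PCM byte buffers,
    # e.g. the generator returned by capture_audio()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)   # int16 -> 2 bytes per sample
        wf.setframerate(rate)
        for chunk in chunks:
            wf.writeframes(chunk)
```

Usage would look like `write_chunks_to_wav("meeting.wav", capture_audio())`, stopping capture by closing the generator.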
## Buffer Size and Latency Trade-offs

Latency figures below assume 44.1 kHz playback:

| Buffer Size | Latency | CPU Usage | Reliability |
|---|---|---|---|
| 128 frames | ~3ms | High | May overflow |
| 512 frames | ~12ms | Moderate | Good |
| 1024 frames | ~23ms | Low | Excellent |
| 4096 frames | ~93ms | Very low | Excellent |
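The latency column is just buffer size divided by sample rate. A one-liner for checking a configuration:

```python
def buffer_latency_ms(frames, sample_rate=44100):
    # One buffer's worth of audio, in milliseconds
    return frames / sample_rate * 1000

print(round(buffer_latency_ms(512)))   # 12 (ms, at 44.1 kHz)
print(round(buffer_latency_ms(1024)))  # 23
```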
## Handling Common Issues

### Sample Rate Mismatch
The loopback device's sample rate matches the system's audio output format, which is often 48kHz. If your ASR provider expects 16kHz, you'll need to resample:
```python
import numpy as np
import librosa

def resample_audio(audio_data, original_rate=48000, target_rate=16000):
    # Convert int16 PCM to float in [-1.0, 1.0) for librosa
    audio_float = audio_data.astype(np.float32) / 32768.0
    resampled = librosa.resample(audio_float, orig_sr=original_rate, target_sr=target_rate)
    # Clip before converting back so a full-scale sample can't overflow int16
    return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16)
```
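If pulling in librosa feels heavy for a capture tool, linear interpolation with plain numpy is often adequate for speech at 48 kHz → 16 kHz — an assumed trade-off, since librosa's filtered resampling suppresses aliasing better:

```python
import numpy as np

def resample_linear(audio, original_rate=48000, target_rate=16000):
    # Map each target sample position onto the source timeline and interpolate
    n_out = int(len(audio) * target_rate / original_rate)
    src_positions = np.linspace(0, len(audio) - 1, n_out)
    out = np.interp(src_positions, np.arange(len(audio)), audio.astype(np.float32))
    return out.astype(np.int16)
```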
### Channel Downmixing
System audio is often stereo (2 channels), but ASR providers typically expect mono. Downmix by averaging the channels:
```python
import numpy as np

def stereo_to_mono(stereo_data):
    # Interleaved int16 PCM: L, R, L, R, ...
    stereo = np.frombuffer(stereo_data, dtype=np.int16)
    left = stereo[0::2]
    right = stereo[1::2]
    # Average in int32 to avoid overflow, then narrow back to int16
    mono = ((left.astype(np.int32) + right.astype(np.int32)) // 2).astype(np.int16)
    return mono.tobytes()
```
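Putting the two conversions together, a per-chunk pipeline that turns raw stereo 48 kHz bytes from the loopback stream into mono 16 kHz bytes ready for an ASR provider — a sketch under the same assumptions (interleaved int16 input, linear-interpolation resampling rather than librosa, to keep it self-contained):

```python
import numpy as np

def prepare_chunk_for_asr(stereo_bytes, in_rate=48000, out_rate=16000):
    samples = np.frombuffer(stereo_bytes, dtype=np.int16)
    # Downmix: average interleaved L/R in int32 to avoid overflow
    mono = (samples[0::2].astype(np.int32) + samples[1::2].astype(np.int32)) // 2
    # Resample by linear interpolation onto the target rate's timeline
    n_out = int(len(mono) * out_rate / in_rate)
    positions = np.linspace(0, len(mono) - 1, n_out)
    out = np.interp(positions, np.arange(len(mono)), mono.astype(np.float32))
    return out.astype(np.int16).tobytes()
```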
## Exclusive vs. Shared Mode
WASAPI supports two modes:
- Shared mode: Multiple applications can use the audio device simultaneously. This is what you want for meeting transcription — the user hears the audio normally while your tool captures it.
- Exclusive mode: Your application gets exclusive access to the audio device. Avoid this for meeting tools — it would prevent the user from hearing the meeting audio.
## macOS Equivalent: Core Audio Taps
On macOS, the equivalent technology is Core Audio process taps (available since macOS 14). While the API is different, the concept is the same — tapping into the audio output of specific applications without affecting playback. Read more about the cross-platform challenges in our Electron development guide.
> "WASAPI loopback is one of those APIs that's simple in concept but tricky in practice. The buffer management and sample rate handling are where most developers get stuck." — Audio Engineer, Voxclar
For more on the complete audio pipeline, see our guide to real-time speech-to-text for meetings and our AI interview assistant technical guide.