For Windows-based meeting transcription tools, WASAPI (Windows Audio Session API) loopback capture is the standard mechanism for capturing system audio. Unlike microphone capture, loopback recording captures the audio output — everything you hear through your speakers or headphones. Here's how it works at a technical level.
## WASAPI Architecture Overview
WASAPI sits between applications and the Windows audio engine, which mixes all streams before they reach the hardware:

```
Application (Zoom/Teams) → WASAPI → Audio Engine → Audio Hardware
                                        ↓
                            Loopback Capture (your tool)
```
In shared mode, WASAPI allows multiple applications to share the audio endpoint. Loopback capture taps into the mixed output of all applications playing through a given audio device.
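The "mixed output" that loopback capture sees is conceptually just the sum of every application's stream, clipped to the sample range. A minimal numpy illustration of that idea (the real engine mixes in floating point internally; this int16 version is only a conceptual sketch):

```python
import numpy as np

def mix_streams(streams):
    """Sum several int16 PCM byte buffers the way a shared-mode mixer
    conceptually would, clipping to the int16 range instead of wrapping."""
    acc = np.zeros(len(streams[0]) // 2, dtype=np.int32)  # wide accumulator
    for s in streams:
        acc += np.frombuffer(s, dtype=np.int16)
    return np.clip(acc, -32768, 32767).astype(np.int16).tobytes()

# Two apps playing at once: the loopback tap sees their sum
zoom = (np.ones(4, dtype=np.int16) * 1000).tobytes()
chime = (np.ones(4, dtype=np.int16) * 500).tobytes()
mixed = np.frombuffer(mix_streams([zoom, chime]), dtype=np.int16)  # 1500 each
```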
## Implementation in Python
Using pyaudiowpatch (a fork of PyAudio that adds WASAPI loopback support):
```python
import pyaudiowpatch as pyaudio
import numpy as np

def find_loopback_device(p: pyaudio.PyAudio):
    # Find the loopback device that mirrors the default WASAPI output
    wasapi_info = p.get_host_api_info_by_type(pyaudio.paWASAPI)
    default_speakers = p.get_device_info_by_index(
        wasapi_info["defaultOutputDevice"]
    )
    for i in range(p.get_device_count()):
        device = p.get_device_info_by_index(i)
        if (device.get("isLoopbackDevice")
                and device["name"].startswith(default_speakers["name"])):
            return device
    raise RuntimeError("No loopback device found")

def capture_audio():
    p = pyaudio.PyAudio()
    device = find_loopback_device(p)
    stream = p.open(
        format=pyaudio.paInt16,
        channels=device["maxInputChannels"],
        rate=int(device["defaultSampleRate"]),
        input=True,
        input_device_index=device["index"],
        frames_per_buffer=512,  # low-latency buffer
    )
    print(f"Capturing from: {device['name']}")
    print(f"Sample rate: {device['defaultSampleRate']} Hz")
    print(f"Channels: {device['maxInputChannels']}")
    try:
        while True:
            data = stream.read(512, exception_on_overflow=False)
            audio_array = np.frombuffer(data, dtype=np.int16)
            # Process or forward the audio...
            yield data
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
```
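In practice you will usually spool the yielded chunks somewhere. A sketch that writes them to a WAV file with the stdlib `wave` module — the 2-channel/48 kHz defaults here are assumptions, and in real use you would take the values from the device dict above:

```python
import wave

def write_chunks_to_wav(path, chunks, channels=2, rate=48000):
    # chunks: iterable of raw int16 PCM byte buffers,
    # e.g. the generator returned by capture_audio()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)   # int16 -> 2 bytes per sample
        wf.setframerate(rate)
        for chunk in chunks:
            wf.writeframes(chunk)
```

Usage would look like `write_chunks_to_wav("meeting.wav", capture_audio())`, stopping capture by closing the generator.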
## Buffer Size and Latency Trade-offs

Latency figures below assume 44.1 kHz playback:

| Buffer Size | Latency | CPU Usage | Reliability |
|---|---|---|---|
| 128 frames | ~3ms | High | May overflow |
| 512 frames | ~12ms | Moderate | Good |
| 1024 frames | ~23ms | Low | Excellent |
| 4096 frames | ~93ms | Very low | Excellent |
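The latency column is just buffer size divided by sample rate. A one-liner for checking a configuration:

```python
def buffer_latency_ms(frames, sample_rate=44100):
    # One buffer's worth of audio, in milliseconds
    return frames / sample_rate * 1000

print(round(buffer_latency_ms(512)))   # 12 (ms, at 44.1 kHz)
print(round(buffer_latency_ms(1024)))  # 23
```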
## Handling Common Issues

### Sample Rate Mismatch
The loopback device's sample rate matches the system's audio output format, which is often 48kHz. If your ASR provider expects 16kHz, you'll need to resample:
```python
import numpy as np
import librosa

def resample_audio(audio_data, original_rate=48000, target_rate=16000):
    # Convert int16 PCM to float in [-1.0, 1.0) for librosa
    audio_float = audio_data.astype(np.float32) / 32768.0
    resampled = librosa.resample(audio_float, orig_sr=original_rate, target_sr=target_rate)
    # Clip before converting back so a full-scale sample can't overflow int16
    return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16)
```
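If pulling in librosa feels heavy for a capture tool, linear interpolation with plain numpy is often adequate for speech at 48 kHz → 16 kHz — an assumed trade-off, since librosa's filtered resampling suppresses aliasing better:

```python
import numpy as np

def resample_linear(audio, original_rate=48000, target_rate=16000):
    # Map each target sample position onto the source timeline and interpolate
    n_out = int(len(audio) * target_rate / original_rate)
    src_positions = np.linspace(0, len(audio) - 1, n_out)
    out = np.interp(src_positions, np.arange(len(audio)), audio.astype(np.float32))
    return out.astype(np.int16)
```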
### Channel Downmixing
System audio is often stereo (2 channels), but ASR providers typically expect mono. Downmix by averaging the channels:
```python
import numpy as np

def stereo_to_mono(stereo_data):
    # Interleaved int16 PCM: L, R, L, R, ...
    stereo = np.frombuffer(stereo_data, dtype=np.int16)
    left = stereo[0::2]
    right = stereo[1::2]
    # Average in int32 to avoid overflow, then narrow back to int16
    mono = ((left.astype(np.int32) + right.astype(np.int32)) // 2).astype(np.int16)
    return mono.tobytes()
```
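Putting the two conversions together, a per-chunk pipeline that turns raw stereo 48 kHz bytes from the loopback stream into mono 16 kHz bytes ready for an ASR provider — a sketch under the same assumptions (interleaved int16 input, linear-interpolation resampling rather than librosa, to keep it self-contained):

```python
import numpy as np

def prepare_chunk_for_asr(stereo_bytes, in_rate=48000, out_rate=16000):
    samples = np.frombuffer(stereo_bytes, dtype=np.int16)
    # Downmix: average interleaved L/R in int32 to avoid overflow
    mono = (samples[0::2].astype(np.int32) + samples[1::2].astype(np.int32)) // 2
    # Resample by linear interpolation onto the target rate's timeline
    n_out = int(len(mono) * out_rate / in_rate)
    positions = np.linspace(0, len(mono) - 1, n_out)
    out = np.interp(positions, np.arange(len(mono)), mono.astype(np.float32))
    return out.astype(np.int16).tobytes()
```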
## Exclusive vs. Shared Mode
WASAPI supports two modes:
- Shared mode: Multiple applications can use the audio device simultaneously. This is what you want for meeting transcription — the user hears the audio normally while your tool captures it.
- Exclusive mode: Your application gets exclusive access to the audio device. Avoid this for meeting tools — it would prevent the user from hearing the meeting audio.
## macOS Equivalent: Core Audio Taps
On macOS, the equivalent technology is Core Audio process taps (available since macOS 14). While the API is different, the concept is the same — tapping into the audio output of specific applications without affecting playback. Read more about the cross-platform challenges in our Electron development guide.
> "WASAPI loopback is one of those APIs that's simple in concept but tricky in practice. The buffer management and sample rate handling are where most developers get stuck." — Audio Engineer, Voxclar
For more on the complete audio pipeline, see our guide to real-time speech-to-text for meetings and our AI interview assistant technical guide.