Building a real-time speech-to-text application in Python is more accessible than ever, thanks to mature libraries and APIs. In this tutorial, we'll build a working real-time transcription system using both cloud (Deepgram) and local (faster-whisper) approaches.

Prerequisites

pip install deepgram-sdk pyaudio websockets faster-whisper numpy

PyAudio wraps the PortAudio C library, so the pip install can fail until PortAudio is installed on your system (for example, brew install portaudio on macOS or apt install portaudio19-dev on Debian/Ubuntu).

Part 1: Cloud ASR with Deepgram

Deepgram's streaming API uses WebSockets to receive audio and return transcripts in real time. Here's a complete working example:

import asyncio
import os

import pyaudio
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

# Read the key from the environment rather than hardcoding it in source
DEEPGRAM_API_KEY = os.environ.get("DEEPGRAM_API_KEY", "your-api-key-here")
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 4096

async def main():
    deepgram = DeepgramClient(DEEPGRAM_API_KEY)
    # Note: newer 3.x releases of the SDK rename this accessor to
    # deepgram.listen.asyncwebsocket.v("1")
    connection = deepgram.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Transcript: {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-2",
        language="en",
        smart_format=True,
        encoding="linear16",
        channels=CHANNELS,
        sample_rate=RATE,
    )

    await connection.start(options)

    # Stream microphone audio
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True,
        frames_per_buffer=CHUNK
    )

    print("Listening... Press Ctrl+C to stop.")
    try:
        while True:
            data = stream.read(CHUNK, exception_on_overflow=False)
            await connection.send(data)
            await asyncio.sleep(0.01)
    except KeyboardInterrupt:
        pass
    finally:
        await connection.finish()
        stream.stop_stream()
        stream.close()
        audio.terminate()

if __name__ == "__main__":
    asyncio.run(main())

Part 2: Local ASR with faster-whisper

For privacy-sensitive applications or offline use, faster-whisper provides excellent local transcription:

import numpy as np
import pyaudio
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
# For GPU: WhisperModel("large-v3", device="cuda", compute_type="float16")

FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 16000
CHUNK = RATE * 3  # 3-second chunks

audio = pyaudio.PyAudio()
stream = audio.open(
    format=FORMAT, channels=CHANNELS,
    rate=RATE, input=True,
    frames_per_buffer=CHUNK
)

print("Listening... Press Ctrl+C to stop.")
try:
    while True:
        data = stream.read(CHUNK, exception_on_overflow=False)
        audio_np = np.frombuffer(data, dtype=np.float32)

        segments, info = model.transcribe(audio_np, beam_size=5)
        for segment in segments:
            print(f"[{segment.start:.1f}s-{segment.end:.1f}s] {segment.text}")
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
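
Note that each chunk above is transcribed in isolation, so a word spoken across a chunk boundary can be clipped. One mitigation (a sketch with assumed sizes, not a full solution; overlapping windows also duplicate words at the seams, which you would still need to deduplicate) is to slide a window with a small overlap:

```python
import numpy as np

RATE = 16000
CHUNK_SECONDS = 3.0
OVERLAP_SECONDS = 0.5  # carry the tail of each window into the next

def chunks_with_overlap(samples, rate=RATE,
                        chunk_s=CHUNK_SECONDS, overlap_s=OVERLAP_SECONDS):
    """Yield successive windows that overlap by overlap_s seconds,
    so a word cut at one boundary appears whole in the next window."""
    chunk = int(rate * chunk_s)
    step = chunk - int(rate * overlap_s)
    for start in range(0, max(len(samples) - chunk + step, 1), step):
        yield samples[start:start + chunk]

# Ten seconds of synthetic audio stands in for the microphone here.
audio = np.zeros(RATE * 10, dtype=np.float32)
windows = list(chunks_with_overlap(audio))
print(len(windows), len(windows[0]))  # 4 48000
```

Each window would then be passed to model.transcribe as in the loop above; with a 0.5 s overlap, consecutive windows share 8,000 samples.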

Part 3: Adding WebSocket Output

To share transcriptions with a frontend or other services, add a WebSocket server:

import asyncio
import websockets
import json

connected_clients = set()

async def handler(websocket):
    connected_clients.add(websocket)
    try:
        async for _ in websocket:
            pass  # clients only receive transcripts; ignore anything they send
    finally:
        connected_clients.discard(websocket)

async def broadcast(transcript: str):
    if connected_clients:
        message = json.dumps({"transcript": transcript, "is_final": True})
        await asyncio.gather(
            *[client.send(message) for client in connected_clients],
            return_exceptions=True,  # one dropped client shouldn't abort the broadcast
        )

async def start_server():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # Run forever
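
Clients connect with any standard WebSocket client (from Python, websockets.connect("ws://localhost:8765")) and decode the JSON frames emitted by broadcast above. A minimal sketch of the receiving side, assuming only the transcript/is_final fields defined there:

```python
import json

def handle_message(raw):
    """Decode one broadcast frame and return the transcript text,
    ignoring non-final results."""
    payload = json.loads(raw)
    if payload.get("is_final"):
        return payload["transcript"]
    return None

# Simulate one frame exactly as broadcast() would serialize it.
frame = json.dumps({"transcript": "hello world", "is_final": True})
print(handle_message(frame))  # hello world
```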

Performance Comparison

| Metric             | Deepgram (Cloud) | faster-whisper (Local CPU) | faster-whisper (Local GPU) |
|--------------------|------------------|----------------------------|----------------------------|
| Latency            | 200-400 ms       | 1-3 s per chunk            | 200-500 ms                 |
| Accuracy (English) | 95-97%           | 90-94%                     | 93-96%                     |
| Memory usage       | Minimal          | 1-2 GB                     | 2-6 GB VRAM                |
| Cost               | $0.0043/min      | Free                       | Free                       |
Production tip: In a production application like Voxclar, you'd add buffering, voice activity detection, and error recovery. The examples above are simplified for learning. Check out our complete guide to real-time speech-to-text for meetings for production considerations.
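
As a first taste of voice activity detection, a crude RMS energy gate can keep silent chunks from ever reaching the transcriber. This is only a sketch: the 0.01 threshold is an assumption you would tune per microphone, and production systems use a real VAD model (such as webrtcvad or Silero) instead:

```python
import numpy as np

def is_speech(chunk, threshold=0.01):
    """Crude energy gate: treat a chunk as speech only if its
    root-mean-square amplitude clears the threshold."""
    rms = float(np.sqrt(np.mean(np.square(chunk))))
    return rms > threshold

# A silent chunk vs. a 440 Hz tone at modest amplitude (one second at 16 kHz).
silence = np.zeros(16000, dtype=np.float32)
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(is_speech(silence), is_speech(tone))  # False True
```

In the faster-whisper loop above, you would call is_speech(audio_np) and skip model.transcribe for chunks that fail the gate.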

Next Steps

Once you have basic transcription working, you can extend it with:

  1. Speaker diarization to identify who is speaking
  2. Keyword detection to trigger actions on specific phrases
  3. Integration with LLMs for intelligent response generation
  4. A floating overlay window for displaying captions
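
To illustrate item 2, keyword detection over final transcripts can start as simple substring matching on normalized text (a sketch; the trigger phrases here are made up, and real systems typically fuzzy-match or use word-level timestamps):

```python
import re

TRIGGERS = {"stop recording", "new paragraph"}  # hypothetical trigger phrases

def detect_keywords(transcript, triggers=TRIGGERS):
    """Return the trigger phrases found in the transcript,
    ignoring case and punctuation."""
    normalized = re.sub(r"[^a-z0-9 ]", "", transcript.lower())
    return [t for t in sorted(triggers) if t in normalized]

print(detect_keywords("Okay, new paragraph. Let's continue."))  # ['new paragraph']
```

Calling this on each final transcript (e.g. inside on_message) lets you dispatch actions when a phrase fires.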

For a deeper understanding of the full pipeline, read our technical guide to how AI interview assistants work and our cloud vs local ASR comparison.