Building a real-time speech-to-text application in Python is more accessible than ever, thanks to mature libraries and APIs. In this tutorial, we'll build a working real-time transcription system using both cloud (Deepgram) and local (faster-whisper) approaches.
Prerequisites
- Python 3.10 or later
- A Deepgram API key (free tier available at deepgram.com)
- For local ASR: CUDA-capable GPU (optional but recommended)
Install the dependencies:

```bash
pip install deepgram-sdk pyaudio websockets faster-whisper numpy
```
Part 1: Cloud ASR with Deepgram
Deepgram's streaming API uses WebSockets to receive audio and return transcripts in real time. Here's a complete working example:
```python
import asyncio

import pyaudio
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

DEEPGRAM_API_KEY = "your-api-key-here"

# Microphone capture settings: 16 kHz mono 16-bit PCM, matching the
# encoding/sample_rate passed to Deepgram in LiveOptions below.
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 4096


async def main():
    deepgram = DeepgramClient(DEEPGRAM_API_KEY)
    connection = deepgram.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Transcript: {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-2",
        language="en",
        smart_format=True,
        encoding="linear16",
        channels=CHANNELS,
        sample_rate=RATE,
    )
    await connection.start(options)

    # Stream microphone audio
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=FORMAT, channels=CHANNELS,
        rate=RATE, input=True,
        frames_per_buffer=CHUNK,
    )

    print("Listening... Press Ctrl+C to stop.")
    try:
        while True:
            data = stream.read(CHUNK, exception_on_overflow=False)
            await connection.send(data)
            await asyncio.sleep(0.01)  # Yield so the receive task can process results
    except KeyboardInterrupt:
        pass
    finally:
        await connection.finish()
        stream.stop_stream()
        stream.close()
        audio.terminate()


asyncio.run(main())
```
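Deepgram streams both interim hypotheses and final results, and each message carries an `is_final` flag so you can decide which to display. The sketch below parses a raw message in the documented streaming response shape (the sample payload is fabricated and abbreviated for illustration; real messages carry more fields):

```python
import json


def extract_transcript(raw: str) -> tuple[str, bool]:
    """Pull the top alternative's transcript and the is_final flag
    from a Deepgram streaming result message."""
    msg = json.loads(raw)
    transcript = msg["channel"]["alternatives"][0]["transcript"]
    return transcript, msg.get("is_final", False)


# Abbreviated example payload for illustration
sample = '{"channel": {"alternatives": [{"transcript": "hello world"}]}, "is_final": true}'
text, final = extract_transcript(sample)
print(text, final)  # hello world True
```

In the callback above, the same flag is available as `result.is_final`; a common pattern is to overwrite the current line for interim results and commit it when a final one arrives.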
Part 2: Local ASR with faster-whisper
For privacy-sensitive applications or offline use, faster-whisper provides excellent local transcription:
```python
import numpy as np
import pyaudio
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
# For GPU: WhisperModel("large-v3", device="cuda", compute_type="float16")

FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 16000
CHUNK = RATE * 3  # 3-second chunks

audio = pyaudio.PyAudio()
stream = audio.open(
    format=FORMAT, channels=CHANNELS,
    rate=RATE, input=True,
    frames_per_buffer=CHUNK,
)

print("Listening... Press Ctrl+C to stop.")
try:
    while True:
        data = stream.read(CHUNK, exception_on_overflow=False)
        audio_np = np.frombuffer(data, dtype=np.float32)
        segments, info = model.transcribe(audio_np, beam_size=5)
        for segment in segments:
            print(f"[{segment.start:.1f}s-{segment.end:.1f}s] {segment.text}")
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()
```
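One practical refinement: transcribing silence wastes CPU. A simple energy gate, sketched below in plain Python (the 0.01 threshold is an assumption you would tune for your microphone and room), skips chunks whose RMS level is too low before they ever reach `model.transcribe`:

```python
import math


def is_speech(samples, threshold: float = 0.01) -> bool:
    """Return True if the chunk's RMS energy exceeds the threshold.

    `samples` is an iterable of float32 values in [-1.0, 1.0], as produced
    by the paFloat32 stream above. The 0.01 default is a placeholder to tune.
    """
    samples = list(samples)
    if not samples:
        return False
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return rms >= threshold


print(is_speech([0.0] * 1600))       # False: pure silence
print(is_speech([0.5, -0.5] * 800))  # True: loud signal
```

In the loop above, check `is_speech(audio_np)` and `continue` when it returns False. Since numpy is already loaded there, `np.sqrt(np.mean(audio_np ** 2))` computes the same value faster.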
Part 3: Adding WebSocket Output
To share transcriptions with a frontend or other services, add a WebSocket server:
```python
import asyncio
import json

import websockets

connected_clients = set()


async def handler(websocket):
    connected_clients.add(websocket)
    try:
        async for _ in websocket:
            pass  # We only broadcast; incoming messages are ignored
    finally:
        connected_clients.discard(websocket)


async def broadcast(transcript: str):
    if connected_clients:
        message = json.dumps({"transcript": transcript, "is_final": True})
        await asyncio.gather(
            *[client.send(message) for client in connected_clients]
        )


async def start_server():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # Run forever
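The faster-whisper loop from Part 2 blocks the main thread, while `start_server` needs a running asyncio event loop. One way to bridge the two (a sketch; the `broadcast` stub here stands in for the real one above) is to run the event loop in a background thread and hand transcripts over with `asyncio.run_coroutine_threadsafe`:

```python
import asyncio
import threading

received = []


async def broadcast(transcript: str):
    # Stand-in for the real broadcast() defined above
    received.append(transcript)


# Run an event loop in a daemon thread
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

# From the blocking transcription loop, schedule the coroutine thread-safely
future = asyncio.run_coroutine_threadsafe(broadcast("hello from whisper"), loop)
future.result(timeout=5)  # Optionally block until delivery completes

print(received)  # ['hello from whisper']
loop.call_soon_threadsafe(loop.stop)
```

In the full application you would also schedule `start_server()` onto that same background loop, then call the bridge once per transcribed segment.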
Performance Comparison
| Metric | Deepgram (Cloud) | faster-whisper (Local CPU) | faster-whisper (Local GPU) |
|---|---|---|---|
| Latency | 200-400ms | 1-3s per chunk | 200-500ms |
| Accuracy (English) | 95-97% | 90-94% | 93-96% |
| Memory usage | Minimal | 1-2GB | 2-6GB VRAM |
| Cost | $0.0043/min | Free | Free |
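To make the cost row concrete: at Deepgram's listed $0.0043/min, a hypothetical workload of 8 hours of audio per day over 22 working days a month (assumed figures, purely for illustration) comes out to roughly $45/month:

```python
PRICE_PER_MIN = 0.0043  # USD per minute, from the table above
hours_per_day = 8       # assumed workload
days_per_month = 22     # assumed workload

monthly_cost = PRICE_PER_MIN * 60 * hours_per_day * days_per_month
print(f"${monthly_cost:.2f}/month")  # $45.41/month
```

Against that, local inference trades a recurring bill for upfront GPU hardware and electricity, which is the usual break-even comparison.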
Next Steps
Once you have basic transcription working, you can extend it with:
- Speaker diarization to identify who is speaking
- Keyword detection to trigger actions on specific phrases
- Integration with LLMs for intelligent response generation
- A floating overlay window for displaying captions
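As a small taste of the keyword-detection idea, a minimal case-insensitive matcher run over each incoming transcript might look like this (the phrase list is a placeholder; a real system would trigger an action instead of just returning matches):

```python
KEYWORDS = {"stop recording", "new paragraph"}  # placeholder phrases


def detect_keywords(transcript: str, keywords=KEYWORDS) -> list[str]:
    """Return every keyword phrase that appears in the transcript."""
    lowered = transcript.lower()
    return [kw for kw in sorted(keywords) if kw in lowered]


print(detect_keywords("Okay, new paragraph please"))  # ['new paragraph']
print(detect_keywords("nothing to see here"))         # []
```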
For a deeper understanding of the full pipeline, read our technical guide to how AI interview assistants work and our cloud vs local ASR comparison.