Speech recognition accuracy has improved dramatically year over year, but benchmark numbers can be misleading without context. In this article, we present real-world accuracy benchmarks across leading ASR providers, tested in conditions that matter: varied accents, background noise, technical vocabulary, and multi-speaker scenarios.
Methodology
We tested each provider against five datasets:
- Clean conversational English — Standard American English, quiet environment
- Accented English — Indian, Chinese, British, and Nigerian English accents
- Noisy environment — Cafe, office, and street noise backgrounds
- Technical vocabulary — Software engineering, medical, and legal terminology
- Multi-speaker — 3-4 speakers with overlapping speech
Overall Results
| Provider / Model | Clean | Accented | Noisy | Technical | Multi-Speaker |
|---|---|---|---|---|---|
| Deepgram Nova-2 | 96.8% | 93.2% | 91.5% | 94.1% | 88.3% |
| Google Cloud V2 | 95.4% | 92.8% | 90.1% | 92.7% | 87.1% |
| AWS Transcribe | 94.2% | 91.1% | 88.7% | 91.3% | 85.6% |
| AssemblyAI | 95.9% | 92.5% | 90.4% | 93.0% | 89.2% |
| Whisper large-v3 | 95.1% | 93.7% | 87.2% | 91.8% | 83.4% |
| faster-whisper large-v3 | 95.0% | 93.5% | 87.0% | 91.6% | 83.1% |
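To compare providers at a glance, the per-condition scores from the table above can be averaged. This is a simple unweighted mean across the five test conditions, using exactly the numbers reported in the table:

```python
# Mean accuracy per provider across the five test conditions,
# using the numbers from the results table above.
scores = {
    "Deepgram Nova-2":         [96.8, 93.2, 91.5, 94.1, 88.3],
    "Google Cloud V2":         [95.4, 92.8, 90.1, 92.7, 87.1],
    "AWS Transcribe":          [94.2, 91.1, 88.7, 91.3, 85.6],
    "AssemblyAI":              [95.9, 92.5, 90.4, 93.0, 89.2],
    "Whisper large-v3":        [95.1, 93.7, 87.2, 91.8, 83.4],
    "faster-whisper large-v3": [95.0, 93.5, 87.0, 91.6, 83.1],
}

means = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}
for name, m in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {m}%")
```

Averaged this way, Deepgram Nova-2 leads overall (92.8%), with AssemblyAI close behind (92.2%); note that an unweighted mean treats all five conditions as equally important, which may not match your workload.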
Key Findings
1. Cloud Providers Lead in Noisy Environments
Cloud providers hold a clear edge in noisy environments, likely because their training data includes large amounts of ambient-noise audio. Deepgram's Nova-2 and AssemblyAI both handle cafe and office noise well, staying above 90% accuracy.
2. Whisper Excels at Accented Speech
Interestingly, Whisper (and its faster-whisper port) outperforms most cloud providers on accented speech. This is likely due to Whisper's training data, which includes a diverse mix of global English accents.
3. Technical Vocabulary Remains a Challenge
All providers struggle with highly specialized technical terms. Custom vocabulary features (available in Deepgram and Google Cloud) can improve accuracy by 3-5 percentage points for specific use cases.
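The idea behind domain-term correction can also be applied after transcription. The sketch below is a minimal, hypothetical post-processing pass that snaps near-miss words to a custom vocabulary using fuzzy matching; it is not how Deepgram's or Google Cloud's built-in custom-vocabulary features work internally (those bias the recognizer itself), and the vocabulary list is illustrative only:

```python
import difflib

# Hypothetical post-processing sketch: snap misrecognized words to a
# custom technical vocabulary via fuzzy string matching. Illustrative
# only -- real provider custom-vocabulary features bias the decoder.
CUSTOM_VOCAB = ["kubernetes", "idempotent", "tachycardia", "estoppel"]

def correct(transcript: str, vocab=CUSTOM_VOCAB, cutoff=0.8) -> str:
    corrected = []
    for word in transcript.split():
        # Keep the original word unless a vocab term is a close match.
        match = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

print(correct("kubernetis pods restarted"))  # -> kubernetes pods restarted
```

A high `cutoff` keeps the pass conservative; lowering it catches more misrecognitions but risks rewriting correct words.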
Implications for Interview Transcription
Interview scenarios typically involve clean-to-moderate audio quality with a mix of conversational and technical speech. Based on our benchmarks, Deepgram Nova-2 provides the best overall accuracy for this use case, which is why Voxclar uses it as the primary cloud ASR provider.
Word Error Rate vs. Sentence Accuracy
Industry benchmarks typically report Word Error Rate (WER), but for practical applications, sentence-level accuracy often matters more. A 5% WER could mean a small error in nearly every sentence, or it could mean 95% of sentences are perfect while the remaining 5% are garbled. We recommend testing with your specific audio conditions before committing to a provider.
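The distinction is easy to demonstrate. Below, two hypothetical transcripts (made-up sentences, for illustration only) have identical corpus WER, yet very different sentence-level accuracy depending on whether the errors are spread out or concentrated:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

refs = ["the quick brown fox jumps",
        "over the lazy sleeping dog",
        "pack my box with jugs",
        "five dozen liquor jugs now"]

# A: two substitutions spread across two different sentences.
hyp_a = ["the quick brown fox jumped",
         "over a lazy sleeping dog",
         "pack my box with jugs",
         "five dozen liquor jugs now"]

# B: the same two substitutions concentrated in one sentence.
hyp_b = ["a quick brown fox jumped",
         "over the lazy sleeping dog",
         "pack my box with jugs",
         "five dozen liquor jugs now"]

def corpus_wer(refs, hyps):
    return wer(" ".join(refs), " ".join(hyps))

def sentence_accuracy(refs, hyps):
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

print(corpus_wer(refs, hyp_a), sentence_accuracy(refs, hyp_a))  # 0.1, 0.5
print(corpus_wer(refs, hyp_b), sentence_accuracy(refs, hyp_b))  # 0.1, 0.75
```

Both hypotheses score 10% WER, but hypothesis A leaves only half the sentences intact while B leaves three quarters intact; which profile is preferable depends on the application.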
The Local ASR Trade-Off
Local ASR (faster-whisper) sacrifices 2-4% accuracy compared to the best cloud providers but gains complete privacy and zero per-minute cost. For users concerned about sending interview audio to cloud servers, Voxclar's local ASR option provides a compelling alternative. See our detailed comparison for more.
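The cost side of the trade-off can be estimated with simple break-even arithmetic. The rates below are hypothetical placeholders, not actual Voxclar or provider pricing; substitute your own numbers:

```python
# Back-of-the-envelope trade-off: per-minute cloud cost vs. a one-time
# local setup. These rates are HYPOTHETICAL placeholders, not real
# provider pricing -- plug in your own figures.
CLOUD_RATE_PER_MIN = 0.0050   # hypothetical $ per audio minute
LOCAL_SETUP_COST   = 300.00   # hypothetical one-time hardware/setup cost

def break_even_minutes(cloud_rate=CLOUD_RATE_PER_MIN, setup=LOCAL_SETUP_COST):
    """Minutes of transcribed audio after which local ASR is cheaper."""
    return setup / cloud_rate

minutes = break_even_minutes()
print(f"Local ASR breaks even after {minutes:,.0f} minutes (~{minutes/60:,.0f} hours)")
```

Under these placeholder numbers, local ASR pays for itself after roughly 1,000 hours of audio; for low volumes, cloud ASR's higher accuracy and zero setup cost usually win, and privacy rather than cost is the stronger argument for going local.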
"We benchmark our ASR pipeline weekly against the latest models. The accuracy improvements we've seen over the past year are remarkable — what was state-of-the-art in 2025 is now baseline." — Voxclar Audio Engineering
Explore the technology further with our Python speech-to-text tutorial and technical guide to AI interview assistants.