STT / Speech-to-Text

Comparison of speech recognition solutions for conversational pipelines (2025–2026). Benchmarks, strategic stakes, and decision questions.


Strategic Framing — STT: Much More Than a WER Number

The Real Question

WER doesn't tell the whole story. The real question is: where is your voice data processed? Who has access to it? What is your strategy if the provider raises prices or gets acquired?
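That said, the WER you should care about is the one measured on your own domain, not a vendor benchmark. It is easy to compute on a sample of your transcripts; a minimal sketch using word-level Levenshtein distance (example sentences chosen for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the call is recorded", "the call is recorded"))   # → 0.0
print(wer("the call is recorded", "the call was recorded"))  # → 0.25
```

Run it over a few hundred representative utterances per provider before trusting any published number.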

Infrastructure Spectrum

Cloud-only (Google, AssemblyAI) → Cloud+VPC (Deepgram, Azure) → On-premise (Azure containers, Inworld) → Self-hosted open-source (Whisper, Voxtral). Each step increases sovereignty and reduces lock-in.

2026 Market Signal

Deepgram ($1.3B) and AssemblyAI ($158M) are acquisition targets. Voxtral and Whisper are closing the quality gap. Inworld STT raised prices 400%+. The market is consolidating fast.

Questions to ask before choosing: Is your users' voice data subject to GDPR or Swiss nLPD? Do you need diarization or PII redaction? Does your architecture allow switching to a self-hosted model without major rework? What is your acceptable WER threshold for your specific domain?
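On the switching question: one way to keep a future migration cheap is to hide every vendor behind a small interface at the composition root, so the rest of the pipeline never imports an SDK directly. A minimal sketch, assuming illustrative adapter names (not real SDK wrappers):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class STTProvider(Protocol):
    """Pipeline code depends only on this interface, never on a vendor SDK."""
    def transcribe(self, audio: bytes, language: str = "en") -> str: ...

class DeepgramSTT:
    """Cloud adapter (sketch): would wrap the vendor's streaming API."""
    def transcribe(self, audio: bytes, language: str = "en") -> str:
        raise NotImplementedError("call the cloud STT API here")

class WhisperLocalSTT:
    """Self-hosted adapter (sketch): would call a local Whisper inference server."""
    def transcribe(self, audio: bytes, language: str = "en") -> str:
        raise NotImplementedError("call your local Whisper endpoint here")

def handle_turn(stt: STTProvider, audio: bytes) -> str:
    # The pipeline only sees STTProvider, so switching cloud → self-hosted
    # is a one-line change where the provider is constructed.
    return stt.transcribe(audio)
```

If this boundary exists from day one, the "major rework" question above reduces to writing one new adapter.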

Real-time cloud STT APIs — Deepgram Nova-3 is the latency reference (75ms TTFA). Whisper large-v3 is the open-source quality standard. AssemblyAI Universal-3 Pro leads the multilingual WER benchmark; its Voice Agent API (launched Apr 29, 2026) runs at $4.50/hr.


Solution                     Role                                              TTFA    WER    Languages   Price/hr
Deepgram Nova-3              Phase 1 MVP — streaming ASR                       75ms    7.2%   36          $0.216/hr
Inworld STT                  Phase 1 MVP — emotional STT + Axis 2 Avatar Behavior  92ms    5%     100         $0.36/hr
AssemblyAI Universal-3 Pro   Voice Agent Pipeline — accuracy reference         150ms   4.9%   99          $0.372/hr
Azure Speech (Microsoft)     Swiss sovereignty — institutional                 180ms   5.9%   100         $1/hr
Google Speech-to-Text v2     Multilingual — secondary option                   200ms   2.7%   125         $0.36/hr
Deepgram Nova-3
Phase 1 MVP — streaming ASR
Latency: 75ms (score 10/10) · Accuracy: 7.2% WER (score 8/10) · Price/access: 7/10

Deepgram Nova-3 is the industry reference for real-time voice agent ASR. 75ms P90 streaming latency, 36 languages, built-in VAD and endpointing. Voice Agent API at $4.50/hr provides a complete STT+LLM+TTS pipeline in a single WebSocket, with BYO LLM (GPT-4, Claude, Gemini) and function calling for RAG integration. No native voice cloning — Aura-2 TTS offers 36 preset voices; external TTS (ElevenLabs, Cartesia) can be integrated. On-premise deployment available for Enterprise (partial sovereignty). EU endpoint available. Used by Tavus, Simli, and most commercial avatar platforms.
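The streaming connection is parameterized through the WebSocket URL. A minimal sketch of building it — parameter names follow Deepgram's live-transcription docs, but treat them as assumptions and verify against the current API reference:

```python
from urllib.parse import urlencode

def deepgram_listen_url(model: str = "nova-3", sample_rate: int = 16000) -> str:
    """Build the live-transcription WebSocket URL for streaming raw PCM audio."""
    params = {
        "model": model,
        "encoding": "linear16",       # 16-bit little-endian PCM
        "sample_rate": sample_rate,
        "interim_results": "true",    # partial hypotheses for low perceived latency
        "endpointing": 300,           # ms of trailing silence before finalizing a turn
        "vad_events": "true",         # emit speech-start events (useful for barge-in)
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

# A client opens this URL with an `Authorization: Token <DEEPGRAM_API_KEY>`
# header and streams audio frames over the socket.
print(deepgram_listen_url())
```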

Inworld STT
Phase 1 MVP — emotional STT + Axis 2 Avatar Behavior
Latency: 92ms (score 10/10) · Accuracy: 5% WER (score 9/10) · Price/access: 7/10

Inworld STT (2025–2026) is the most feature-rich cloud STT API for interactive voice agents. Sub-100ms documented latency, 100+ languages via multi-provider routing (Whisper large-v3 + AssemblyAI). Unique real-time voice profiling extracts emotion (happy/calm/angry/frustrated), accent, age, pitch, and vocal style on every streaming chunk. Realtime API (full pipeline STT+LLM+TTS) from $0.015/min — 4x cheaper than OpenAI Realtime ($0.06/min). Native voice cloning: built-in + cloned + custom voices (up to 3,000 custom voices on Growth plan). RAG via function calling (tool calling mid-conversation). ZDR support. On-premise available on Enterprise. Drop-in compatible with OpenAI Realtime API.
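Drop-in compatibility with the OpenAI Realtime API means the event payloads stay the same and only the endpoint and key change. A sketch, assuming an OpenAI Realtime-style `session.update` event; the Inworld URL below is a placeholder, not a documented endpoint:

```python
import json

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime"
INWORLD_REALTIME_URL = "wss://<inworld-realtime-endpoint>"  # placeholder — check Inworld's docs

def session_update(voice: str, instructions: str) -> str:
    """OpenAI Realtime-style session.update event, serialized for the socket.
    Reused unchanged when pointing the client at the other provider."""
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "instructions": instructions},
    })

# The only migration step is opening the WebSocket against the other URL
# with that provider's API key; the event stream itself is identical.
print(session_update("default", "Answer in one short sentence."))
```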

AssemblyAI Universal-3 Pro
Voice Agent Pipeline — accuracy reference
Latency: 150ms (score 7/10) · Accuracy: 4.9% WER (score 10/10) · Price/access: 5/10

AssemblyAI launched its Voice Agent API on April 29, 2026: a complete STT+LLM+TTS pipeline over a single WebSocket connection at a flat $4.50/hr. Universal-3 Pro Streaming (u3-rt-pro) is the real-time STT model, with semantic+acoustic turn detection, native barge-in, and session resumption. Tool calling (JSON Schema) lets you plug in a custom RAG backend (Pinecone, LlamaIndex, etc.) via function calling — no native RAG, but full external integration. No native voice cloning: the available voices are preset (18+ EN/multilingual voices), but an external TTS (ElevenLabs, Cartesia) can be integrated for a custom voice clone. LeMUR AI features (summarization, Q&A, sentiment) are available on transcriptions. 99 languages, diarization, word-level timestamps.
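The external RAG hookup amounts to a JSON Schema tool definition plus a dispatcher on your side. A sketch — `search_docs`, its schema, and the handler are illustrative, not part of AssemblyAI's API surface:

```python
import json

# Tool definition in the JSON Schema style used for agent function calling.
SEARCH_TOOL = {
    "name": "search_docs",
    "description": "Retrieve passages from the knowledge base to ground the answer.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "User question rephrased for retrieval"},
            "top_k": {"type": "integer", "default": 3},
        },
        "required": ["query"],
    },
}

def handle_tool_call(name: str, arguments: str) -> str:
    """Dispatch a tool call emitted mid-conversation to the external RAG backend."""
    args = json.loads(arguments)
    if name == "search_docs":
        # Here you would query Pinecone / LlamaIndex and return the passages
        # as the tool result; a stub stands in for the retrieval call.
        return f"(top-{args.get('top_k', 3)} passages for: {args['query']})"
    raise ValueError(f"unknown tool: {name}")

print(handle_tool_call("search_docs", '{"query": "refund policy"}'))
# → (top-3 passages for: refund policy)
```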

Azure Speech (Microsoft)
Swiss sovereignty — institutional
Latency: 180ms (score 7/10) · Accuracy: 5.9% WER (score 9/10) · Price/access: 3/10

Azure Speech offers enterprise-grade STT with 100+ languages, custom model training, and Swiss data center (Zurich). 5.9% WER on English. Disconnected container deployment for partial sovereignty. Most expensive option but strongest enterprise compliance. Swiss German custom model available.
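The container deployment can be sketched as follows; the image path and the `Eula`/`Billing`/`ApiKey` arguments follow Azure's container documentation, but verify image tags and licensing — fully disconnected operation requires a separate disconnected-container commitment:

```shell
# Run the speech-to-text container on your own host; transcription then
# targets http://localhost:5000 instead of the cloud endpoint.
docker run --rm -p 5000:5000 --memory 8g --cpus 4 \
  mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
  Eula=accept \
  Billing=https://<your-resource>.cognitiveservices.azure.com/ \
  ApiKey=<your-key>
```

In the standard (connected) mode the container still phones home for metered billing, which is why the sovereignty above is labeled partial.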

Google Speech-to-Text v2
Multilingual — secondary option
Latency: 200ms (score 6/10) · Accuracy: 2.7% WER (score 8/10) · Price/access: 6/10

Google Speech-to-Text v2 with Chirp 3 (GA May 2026): 2.7% WER on English (Artificial Analysis), 125+ languages, bidirectional gRPC streaming. MedASR open-sourced in late 2025 (medical domain). Major offline expansion (April 2026). No native voice cloning (available via separate Google TTS). RAG via tool calling + Vertex AI Search. Cloud only, no on-premise.
