Overview

Destined Voice includes tools to evaluate Speech-to-Text (STT) providers. Test multiple providers against the same audio and analyze accuracy with Word Error Rate (WER) and Character Error Rate (CER) metrics.

Supported Providers

Provider            Model              Description
Deepgram Nova-3     deepgram-nova-3    Latest Deepgram model
Deepgram Flux       deepgram-flux      Real-time optimized
AssemblyAI          assemblyai         High accuracy
OpenAI Whisper      openai             Multilingual
Google Speech       google             Google Cloud STT
Azure Speech        azure              Microsoft Azure
Amazon Transcribe   amazon             AWS Transcribe
Soniox              soniox             Low latency
Play.ht             playht             Specialized model

Transcribing Audio

Send audio to multiple providers:
const results = await client.sttTesting.transcribeV1SttTranscribePost({
  audioUrl: "https://example.com/audio.wav",
  providers: ["deepgram-nova-3", "assemblyai", "openai"],
});

console.log(results);
// {
//   "deepgram-nova-3": {
//     transcript: "Hello, this is a test.",
//     latency_ms: 450,
//     confidence: 0.98
//   },
//   "assemblyai": {
//     transcript: "Hello, this is a test.",
//     latency_ms: 620,
//     confidence: 0.97
//   },
//   ...
// }
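
Once you have the per-provider results, you can compare them directly in code. The sketch below relies only on the fields shown in the example response (transcript, latency_ms, confidence); the ranking logic is illustrative, not part of the SDK:

// Rank providers by the latency reported in the response above.
const ranked = Object.entries(results)
  .map(([provider, r]: [string, any]) => ({
    provider,
    latencyMs: r.latency_ms,
    confidence: r.confidence,
  }))
  .sort((a, b) => a.latencyMs - b.latencyMs);

for (const { provider, latencyMs, confidence } of ranked) {
  console.log(`${provider}: ${latencyMs} ms (confidence ${confidence})`);
}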

Calculating Accuracy

Compare transcriptions against ground truth:
const accuracy = await client.sttTesting.calculateAccuracyV1SttCalculateAccuracyPost({
  reference: "Hello, this is a test.",
  hypothesis: "Hello, this is the test.",
});

console.log(accuracy);
// {
//   wer: 0.20,        // Word Error Rate (20%)
//   cer: 0.14,        // Character Error Rate (~14%)
//   substitutions: 1,
//   insertions: 0,
//   deletions: 0
// }
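
The two endpoints combine naturally: score every provider's transcript from the earlier transcription call against one ground-truth reference. This sketch assumes the results object from the example above and uses only the two documented methods; the loop and sorting are illustrative:

// Score each provider's transcript against the same ground-truth reference.
const reference = "Hello, this is a test.";

const scores = await Promise.all(
  Object.entries(results).map(async ([provider, r]: [string, any]) => {
    const accuracy = await client.sttTesting.calculateAccuracyV1SttCalculateAccuracyPost({
      reference,
      hypothesis: r.transcript,
    });
    return { provider, wer: accuracy.wer, cer: accuracy.cer };
  }),
);

// Lowest WER first.
scores.sort((a, b) => a.wer - b.wer);
console.table(scores);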

Metrics Explained

Word Error Rate (WER)

Measures word-level accuracy:
WER = (Substitutions + Insertions + Deletions) / Total Reference Words
  • Lower is better (0.0 = perfect, 1.0 = completely wrong)
  • Industry standard for STT evaluation

Character Error Rate (CER)

Measures character-level accuracy:
CER = (Substitutions + Insertions + Deletions) / Total Reference Characters
  • More granular than WER
  • Useful for detecting minor transcription errors
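
Both metrics reduce to an edit (Levenshtein) distance over the reference, computed on words for WER and on characters for CER. If you want to sanity-check the values returned by the API, a minimal local sketch (not the service's implementation) looks like this:

// Minimal Levenshtein distance over arbitrary token sequences.
function editDistance<T>(ref: T[], hyp: T[]): number {
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return dp[ref.length][hyp.length];
}

// WER: edits over reference words; CER: edits over reference characters.
const wer = (ref: string, hyp: string) =>
  editDistance(ref.split(/\s+/), hyp.split(/\s+/)) / ref.split(/\s+/).length;
const cer = (ref: string, hyp: string) =>
  editDistance(ref.split(""), hyp.split("")) / ref.length;

console.log(wer("Hello, this is a test.", "Hello, this is the test.")); // 0.2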

Demographic Bias Analysis

Analyze STT accuracy across demographics:
// Generate test audio with different speakers
const speakers = await client.speakers.listSpeakersV1SpeakersGet({
  limit: 100,
});

// Group by demographic and compare WER
// Enterprise feature - contact sales
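
The grouping and comparison step is part of the enterprise offering, so the sketch below is only an illustration of the idea. It assumes you already have a per-speaker WER (from the accuracy endpoint) and that each speaker record carries a demographic label; the accent field and SpeakerResult shape are assumptions, not SDK types:

// Hypothetical shape: per-speaker WER plus a demographic label (assumed field names).
type SpeakerResult = { speakerId: string; accent: string; wer: number };

function werByGroup(results: SpeakerResult[]): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const r of results) {
    const bucket = groups.get(r.accent) ?? [];
    bucket.push(r.wer);
    groups.set(r.accent, bucket);
  }
  // Average WER per demographic group.
  return new Map(
    [...groups].map(([accent, wers]) => [accent, wers.reduce((a, b) => a + b, 0) / wers.length]),
  );
}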

Best Practices

  • Test with audio sampled at 16 kHz or higher; low-quality audio degrades every provider's accuracy.
  • Normalize text before scoring: remove punctuation, lowercase, and expand numbers for a fair WER comparison (see the sketch after this list).
  • STT accuracy varies by accent, so test with speakers that match your user base.
  • Some providers trade accuracy for speed; choose based on your use case.
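
A minimal normalization helper along those lines might look like the following; it is a sketch, and number expansion is deliberately left out because it needs a locale-aware library:

// Normalize reference and hypothesis the same way before computing WER.
function normalizeForWer(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // strip punctuation
    .replace(/\s+/g, " ")             // collapse whitespace
    .trim();
  // Number expansion ("42" -> "forty two") is intentionally omitted;
  // use a locale-aware library for that step.
}

console.log(normalizeForWer("Hello, this is a test.")); // "hello this is a test"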

Provider Comparison (Typical Performance)

Provider           Avg WER   Avg Latency   Best For
Deepgram Nova-3    5-8%      400 ms        General use
AssemblyAI         4-7%      600 ms        High accuracy
OpenAI Whisper     5-10%     800 ms        Multilingual
Google             6-10%     500 ms        Integration
Azure              6-10%     550 ms        Enterprise
Actual performance varies by audio quality, accent, and domain vocabulary.