Overview

Destined Voice includes tools to evaluate Speech-to-Text (STT) providers. Test multiple providers against the same audio and analyze accuracy with Word Error Rate (WER) and Character Error Rate (CER) metrics.

Supported Providers

Provider            Model              Description
Deepgram Nova-3     deepgram-nova-3    Latest Deepgram model
Deepgram Flux       deepgram-flux      Real-time optimized
AssemblyAI          assemblyai         High accuracy
OpenAI Whisper      openai             Multilingual
Google Speech       google             Google Cloud STT
Azure Speech        azure              Microsoft Azure
Amazon Transcribe   amazon             AWS Transcribe
Soniox              soniox             Low latency
Play.ht             playht             Specialized model

Transcribing Audio

Send audio to multiple providers:
const results = await client.sttTesting.transcribeV1SttTranscribePost({
  audioUrl: "https://example.com/audio.wav",
  providers: ["deepgram-nova-3", "assemblyai", "openai"],
});

console.log(results);
// {
//   "deepgram-nova-3": {
//     transcript: "Hello, this is a test.",
//     latency_ms: 450,
//     confidence: 0.98
//   },
//   "assemblyai": {
//     transcript: "Hello, this is a test.",
//     latency_ms: 620,
//     confidence: 0.97
//   },
//   ...
// }
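
Once you have the per-provider results, you can compare them directly in code. The sketch below relies only on the fields shown in the example response (transcript, latency_ms, confidence); the ranking logic is illustrative, not part of the SDK:

// Rank providers by the latency reported in the response above.
const ranked = Object.entries(results)
  .map(([provider, r]: [string, any]) => ({
    provider,
    latencyMs: r.latency_ms,
    confidence: r.confidence,
  }))
  .sort((a, b) => a.latencyMs - b.latencyMs);

for (const { provider, latencyMs, confidence } of ranked) {
  console.log(`${provider}: ${latencyMs} ms (confidence ${confidence})`);
}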

Calculating Accuracy

Compare transcriptions against ground truth:
const accuracy = await client.sttTesting.calculateAccuracyV1SttCalculateAccuracyPost({
  reference: "Hello, this is a test.",
  hypothesis: "Hello, this is the test.",
});

console.log(accuracy);
// {
//   wer: 0.20,        // Word Error Rate (20%)
//   cer: 0.14,        // Character Error Rate (~14%)
//   substitutions: 1,
//   insertions: 0,
//   deletions: 0
// }
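
The two endpoints combine naturally: score every provider's transcript from the earlier transcription call against one ground-truth reference. This sketch assumes the results object from the example above and uses only the two documented methods; the loop and sorting are illustrative:

// Score each provider's transcript against the same ground-truth reference.
const reference = "Hello, this is a test.";

const scores = await Promise.all(
  Object.entries(results).map(async ([provider, r]: [string, any]) => {
    const accuracy = await client.sttTesting.calculateAccuracyV1SttCalculateAccuracyPost({
      reference,
      hypothesis: r.transcript,
    });
    return { provider, wer: accuracy.wer, cer: accuracy.cer };
  }),
);

// Lowest WER first.
scores.sort((a, b) => a.wer - b.wer);
console.table(scores);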

Metrics Explained

Word Error Rate (WER)

Measures word-level accuracy:
WER = (Substitutions + Insertions + Deletions) / Total Reference Words
  • Lower is better (0.0 = perfect, 1.0 = completely wrong)
  • Industry standard for STT evaluation

Character Error Rate (CER)

Measures character-level accuracy:
CER = (Substitutions + Insertions + Deletions) / Total Reference Characters
  • More granular than WER
  • Useful for detecting minor transcription errors
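
Both metrics reduce to an edit (Levenshtein) distance over the reference, computed on words for WER and on characters for CER. If you want to sanity-check the values returned by the API, a minimal local sketch (not the service's implementation) looks like this:

// Minimal Levenshtein distance over arbitrary token sequences.
function editDistance<T>(ref: T[], hyp: T[]): number {
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return dp[ref.length][hyp.length];
}

// WER: edits over reference words; CER: edits over reference characters.
const wer = (ref: string, hyp: string) =>
  editDistance(ref.split(/\s+/), hyp.split(/\s+/)) / ref.split(/\s+/).length;
const cer = (ref: string, hyp: string) =>
  editDistance(ref.split(""), hyp.split("")) / ref.length;

console.log(wer("Hello, this is a test.", "Hello, this is the test.")); // 0.2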

Demographic Bias Analysis

Analyze STT accuracy across demographics:
// Generate test audio with different speakers
const speakers = await client.speakers.listSpeakersV1SpeakersGet({
  limit: 100,
});

// Group by demographic and compare WER
// Enterprise feature - contact sales
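
The grouping and comparison step is part of the enterprise offering, so the sketch below is only an illustration of the idea. It assumes you already have a per-speaker WER (from the accuracy endpoint) and that each speaker record carries a demographic label; the accent field and SpeakerResult shape are assumptions, not SDK types:

// Hypothetical shape: per-speaker WER plus a demographic label (assumed field names).
type SpeakerResult = { speakerId: string; accent: string; wer: number };

function werByGroup(results: SpeakerResult[]): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const r of results) {
    const bucket = groups.get(r.accent) ?? [];
    bucket.push(r.wer);
    groups.set(r.accent, bucket);
  }
  // Average WER per demographic group.
  return new Map(
    [...groups].map(([accent, wers]) => [accent, wers.reduce((a, b) => a + b, 0) / wers.length]),
  );
}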

Best Practices

  • Test with audio sampled at 16 kHz or higher; low-quality audio degrades every provider's accuracy.
  • Normalize text before scoring: remove punctuation, lowercase, and expand numbers for a fair WER comparison (see the sketch after this list).
  • STT accuracy varies by accent, so test with speakers that match your user base.
  • Some providers trade accuracy for speed; choose based on your use case.
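
A minimal normalization helper along those lines might look like the following; it is a sketch, and number expansion is deliberately left out because it needs a locale-aware library:

// Normalize reference and hypothesis the same way before computing WER.
function normalizeForWer(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // strip punctuation
    .replace(/\s+/g, " ")             // collapse whitespace
    .trim();
  // Number expansion ("42" -> "forty two") is intentionally omitted;
  // use a locale-aware library for that step.
}

console.log(normalizeForWer("Hello, this is a test.")); // "hello this is a test"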

Provider Comparison (Typical Performance)

Provider           Avg WER   Avg Latency   Best For
Deepgram Nova-3    5-8%      400 ms        General use
AssemblyAI         4-7%      600 ms        High accuracy
OpenAI Whisper     5-10%     800 ms        Multilingual
Google             6-10%     500 ms        Integration
Azure              6-10%     550 ms        Enterprise
Actual performance varies by audio quality, accent, and domain vocabulary.