
Overview

RCLI provides comprehensive benchmarking tools to measure:
  • STT: Transcription latency and word error rate (WER)
  • LLM: Token generation speed, TTFT, context usage
  • TTS: Synthesis time and real-time factor
  • E2E: End-to-end pipeline latency
  • RAG: Embedding, retrieval, and query latency
  • Memory: RAM usage across subsystems

Simple Benchmark

rcli_benchmark

Run N iterations of the full pipeline on a test WAV file.
int rcli_benchmark(
    RCLIHandle handle,
    const char* test_wav,
    int iterations,
    RCLIEventCallback callback,
    void* user_data
);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • test_wav (const char*, required): Path to test WAV file (16 kHz mono recommended)
  • iterations (int, required): Number of benchmark runs (3-10 recommended for stable averages)
  • callback (RCLIEventCallback, required): Callback for progress and results. Can be NULL to skip callbacks. Events fired:
      • "benchmark_progress": Progress update (e.g., "3/10")
      • "benchmark_run": Single run result (JSON)
      • "benchmark_result": Aggregate results (JSON)
  • user_data (void*): User data passed to callback

Returns (int):
  • 0: Benchmark completed successfully
  • Non-zero: Failed

Example

#include <stdio.h>
#include <string.h>

void on_benchmark_event(const char* event, const char* data, void* user_data) {
    if (strcmp(event, "benchmark_progress") == 0) {
        printf("\rProgress: %s", data);
        fflush(stdout);
    } else if (strcmp(event, "benchmark_run") == 0) {
        printf("\n%s", data);
    } else if (strcmp(event, "benchmark_result") == 0) {
        printf("\n\nFinal Results:\n%s\n", data);
    }
}

rcli_benchmark(
    handle,
    "/path/to/test.wav",
    5,  // 5 iterations
    on_benchmark_event,
    NULL
);

Output Example

// Per-run ("benchmark_run")
{"run":1,"stt_ms":234.5,"llm_ttft_ms":89.2,"llm_total_ms":456.7,"tts_first_ms":123.4,"e2e_ms":678.9,"total_ms":901.2}

// Aggregate ("benchmark_result")
{
  "iterations": 5,
  "stt_ms": {"min": 230.1, "avg": 235.6, "max": 241.3},
  "llm_ttft_ms": {"min": 85.4, "avg": 89.7, "max": 94.2},
  "llm_total_ms": {"min": 450.2, "avg": 458.3, "max": 467.1},
  "tts_first_ms": {"min": 120.5, "avg": 124.8, "max": 129.3},
  "e2e_ms": {"min": 670.3, "avg": 680.5, "max": 690.8},
  "total_ms": {"min": 895.7, "avg": 905.3, "max": 915.9}
}

Comprehensive Benchmark Suite

rcli_run_full_benchmark

Run comprehensive benchmarks across all subsystems.
int rcli_run_full_benchmark(
    RCLIHandle handle,
    const char* suite,
    int runs,
    const char* output_json
);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • suite (const char*, required): Benchmark suite to run:
      • "all": Run all benchmarks
      • "stt": STT latency + WER accuracy
      • "llm": LLM generation + tool calling
      • "tts": TTS synthesis + RTF
      • "e2e": End-to-end pipeline
      • "tools" or "actions": Action info
      • "rag": RAG retrieval + query
      • "memory": RAM usage
    Multiple suites can be combined, comma-separated: "stt,llm,tts"
  • runs (int, required): Number of measured runs per test (3 is typical)
  • output_json (const char*, optional): Path to save JSON results. Pass NULL to skip export.

Returns (int):
  • 0: Success
  • Non-zero: Failed

Example: Full Benchmark

// Run all benchmarks, 3 runs each, save to file
rcli_run_full_benchmark(
    handle,
    "all",
    3,
    "/tmp/benchmark_results.json"
);

Example: Selective Benchmarks

// Only STT and LLM
rcli_run_full_benchmark(handle, "stt,llm", 5, NULL);

// Only E2E pipeline
rcli_run_full_benchmark(handle, "e2e", 10, "/tmp/e2e_results.json");

Benchmark Categories

STT Benchmark

Measures:
  • Latency: Time to transcribe audio
  • WER: Word error rate across sample utterances
Sample categories:
  • Short commands (“Open Safari”)
  • Questions (“What’s the weather?”)
  • Long commands (multi-sentence)
  • Factual queries
  • Multi-action commands
┌─ STT Benchmark ────┐
│ Latency: 234ms     │
│                    │
│ WER Accuracy:      │
│ short_command:  0% │
│ question:       2% │
│ long_command:   5% │
└────────────────────┘

LLM Benchmark

Measures:
  • TTFT: Time to first token (prompt processing)
  • Token/s: Generation throughput
  • Context usage: Prompt tokens vs. context window
  • Tool calling: Accuracy and latency
┌─ LLM Benchmark ───────────┐
│ TTFT:      89ms           │
│ Tok/s:     42.3           │
│ Context:   512/4096 (12%) │
│ Tool calls: 98% accuracy  │
└───────────────────────────┘

TTS Benchmark

Measures:
  • Synthesis time: Time to generate audio
  • RTF: Real-time factor (< 1.0 is faster than real-time)
  • Samples generated: Output audio length
┌─ TTS Benchmark ──┐
│ Synthesis: 123ms │
│ RTF:       0.45  │
│ Samples:   22050 │
│ (1 second audio) │
└──────────────────┘

E2E Pipeline Benchmark

Measures:
  • E2E latency: Speech input → first audio output
  • Total latency: Complete pipeline (STT → LLM → TTS)
  • Long-form: Multi-sentence responses
┌─ E2E Benchmark ─┐
│ E2E:     678ms  │
│ Total:   901ms  │
│                 │
│ Breakdown:      │
│   STT:    234ms │
│   LLM:    457ms │
│   TTS:    210ms │
└─────────────────┘

RAG Benchmark

Measures:
  • Embedding latency: Query → vector
  • Retrieval latency: Vector + BM25 search
  • Full RAG query: Embedding + retrieval + LLM
┌─ RAG Benchmark ───┐
│ Embedding:  5.2ms │
│ Retrieval:  4.1ms │
│ Full query: 510ms │
│ (5 results)       │
└───────────────────┘
Note: the RAG benchmark only runs if an index has been loaded via rcli_rag_load_index().

Memory Benchmark

Measures:
  • LLM: Model + KV cache
  • Embedding: RAG embedding model
  • STT: Zipformer + Whisper
  • TTS: Piper/Kokoro
  • Total: Peak RAM usage
┌─ Memory Usage ────┐
│ LLM:       512 MB │
│ Embedding: 128 MB │
│ STT:        64 MB │
│ TTS:        96 MB │
│ Total:     800 MB │
└───────────────────┘

JSON Export Format

{
  "timestamp": "2025-03-07T14:23:45Z",
  "device": {
    "model": "MacBookPro18,1",
    "chip": "Apple M1 Max",
    "memory_gb": 64
  },
  "models": {
    "llm": "Qwen3 0.6B Q4_K_M",
    "stt": "Whisper base.en",
    "tts": "Piper Lessac"
  },
  "results": {
    "stt": {
      "latency_ms": 234.5,
      "wer_avg": 2.3
    },
    "llm": {
      "ttft_ms": 89.2,
      "tok_per_sec": 42.3,
      "context_usage": 0.12
    },
    "tts": {
      "synthesis_ms": 123.4,
      "rtf": 0.45
    },
    "e2e": {
      "latency_ms": 678.9,
      "total_ms": 901.2
    },
    "rag": {
      "embedding_ms": 5.2,
      "retrieval_ms": 4.1,
      "full_query_ms": 510.3
    },
    "memory": {
      "llm_mb": 512,
      "total_mb": 800
    }
  }
}

Complete Example: Benchmark Runner

#include "api/rcli_api.h"
#include <stdio.h>
#include <time.h>

int main() {
    RCLIHandle handle = rcli_create(NULL);
    if (!handle) {
        fprintf(stderr, "Failed to create engine\n");
        return 1;
    }

    if (rcli_init(handle, "/path/to/models", 99) != 0) {
        fprintf(stderr, "Initialization failed\n");
        rcli_destroy(handle);
        return 1;
    }

    // Optional: Load RAG index for RAG benchmarks
    rcli_rag_load_index(handle, "/path/to/rag_index");

    // Generate timestamped output file
    time_t now = time(NULL);
    struct tm* tm_info = localtime(&now);
    char filename[256];
    strftime(filename, sizeof(filename), "benchmark_%Y%m%d_%H%M%S.json", tm_info);

    printf("Running comprehensive benchmark suite...\n\n");
    
    int result = rcli_run_full_benchmark(
        handle,
        "all",     // All benchmarks
        3,         // 3 runs each
        filename   // Save results
    );

    if (result == 0) {
        printf("\n\nBenchmark complete! Results saved to: %s\n", filename);
    } else {
        fprintf(stderr, "Benchmark failed\n");
    }

    rcli_destroy(handle);
    return result;
}
Compile and run:
clang -o bench bench.c -L./build -lrcli
./bench

# Output:
# Running comprehensive benchmark suite...
#
# ┌─ STT Benchmark ─┐
# ...
# Benchmark complete! Results saved to: benchmark_20250307_142345.json

Performance Targets (M1/M2/M3)

Metric          Target    Good      Excellent
STT latency     < 300ms   < 200ms   < 150ms
LLM TTFT        < 150ms   < 100ms   < 80ms
LLM tok/s       > 30      > 40      > 50
TTS RTF         < 1.0     < 0.5     < 0.3
E2E latency     < 800ms   < 600ms   < 500ms
RAG retrieval   < 10ms    < 5ms     < 3ms

See Also

  • State Management - Query performance metrics
  • RAG - RAG system details
  • RCLI CLI: rcli bench for interactive benchmarks
