
Overview

RCLI provides comprehensive benchmarking tools to measure:
  • STT: Transcription latency and word error rate (WER)
  • LLM: Token generation speed, TTFT, context usage
  • TTS: Synthesis time and real-time factor
  • E2E: End-to-end pipeline latency
  • RAG: Embedding, retrieval, and query latency
  • Memory: RAM usage across subsystems

Simple Benchmark

rcli_benchmark

Run N iterations of the full pipeline on a test WAV file.
int rcli_benchmark(
    RCLIHandle handle,
    const char* test_wav,
    int iterations,
    RCLIEventCallback callback,
    void* user_data
);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • test_wav (const char*, required): Path to test WAV file (16 kHz mono recommended)
  • iterations (int, required): Number of benchmark runs (3-10 recommended for stable averages)
  • callback (RCLIEventCallback, required): Callback for progress and results. Can be NULL to skip callbacks. Events fired:
      • "benchmark_progress": Progress update (e.g., "3/10")
      • "benchmark_run": Single run result (JSON)
      • "benchmark_result": Aggregate results (JSON)
  • user_data (void*): User data passed to callback

Returns (int):
  • 0: Benchmark completed successfully
  • Non-zero: Failed

Example

#include <stdio.h>
#include <string.h>

void on_benchmark_event(const char* event, const char* data, void* user_data) {
    if (strcmp(event, "benchmark_progress") == 0) {
        printf("\rProgress: %s", data);
        fflush(stdout);
    } else if (strcmp(event, "benchmark_run") == 0) {
        printf("\n%s", data);
    } else if (strcmp(event, "benchmark_result") == 0) {
        printf("\n\nFinal Results:\n%s\n", data);
    }
}

rcli_benchmark(
    handle,
    "/path/to/test.wav",
    5,  // 5 iterations
    on_benchmark_event,
    NULL
);

Output Example

// Per-run ("benchmark_run")
{"run":1,"stt_ms":234.5,"llm_ttft_ms":89.2,"llm_total_ms":456.7,"tts_first_ms":123.4,"e2e_ms":678.9,"total_ms":901.2}

// Aggregate ("benchmark_result")
{
  "iterations": 5,
  "stt_ms": {"min": 230.1, "avg": 235.6, "max": 241.3},
  "llm_ttft_ms": {"min": 85.4, "avg": 89.7, "max": 94.2},
  "llm_total_ms": {"min": 450.2, "avg": 458.3, "max": 467.1},
  "tts_first_ms": {"min": 120.5, "avg": 124.8, "max": 129.3},
  "e2e_ms": {"min": 670.3, "avg": 680.5, "max": 690.8},
  "total_ms": {"min": 895.7, "avg": 905.3, "max": 915.9}
}

Comprehensive Benchmark Suite

rcli_run_full_benchmark

Run comprehensive benchmarks across all subsystems.
int rcli_run_full_benchmark(
    RCLIHandle handle,
    const char* suite,
    int runs,
    const char* output_json
);
  • handle (RCLIHandle, required): Engine handle (must be initialized)
  • suite (const char*, required): Benchmark suite to run:
      • "all": Run all benchmarks
      • "stt": STT latency + WER accuracy
      • "llm": LLM generation + tool calling
      • "tts": TTS synthesis + RTF
      • "e2e": End-to-end pipeline
      • "tools" or "actions": Action info
      • "rag": RAG retrieval + query
      • "memory": RAM usage
    Multiple suites can be combined, comma-separated: "stt,llm,tts"
  • runs (int, required): Number of measured runs per test (3 is typical)
  • output_json (const char*, optional): Path to save JSON results. Pass NULL to skip export.

Returns (int):
  • 0: Success
  • Non-zero: Failed

Example: Full Benchmark

// Run all benchmarks, 3 runs each, save to file
rcli_run_full_benchmark(
    handle,
    "all",
    3,
    "/tmp/benchmark_results.json"
);

Example: Selective Benchmarks

// Only STT and LLM
rcli_run_full_benchmark(handle, "stt,llm", 5, NULL);

// Only E2E pipeline
rcli_run_full_benchmark(handle, "e2e", 10, "/tmp/e2e_results.json");

Benchmark Categories

STT Benchmark

Measures:
  • Latency: Time to transcribe audio
  • WER: Word error rate across sample utterances
Sample categories:
  • Short commands (“Open Safari”)
  • Questions (“What’s the weather?”)
  • Long commands (multi-sentence)
  • Factual queries
  • Multi-action commands
┌─ STT Benchmark ────┐
│ Latency: 234ms     │
│                    │
│ WER Accuracy:      │
│ short_command:  0% │
│ question:       2% │
│ long_command:   5% │
└────────────────────┘

LLM Benchmark

Measures:
  • TTFT: Time to first token (prompt processing)
  • Token/s: Generation throughput
  • Context usage: Prompt tokens vs. context window
  • Tool calling: Accuracy and latency
┌─ LLM Benchmark ───────────┐
│ TTFT:      89ms           │
│ Tok/s:     42.3           │
│ Context:   512/4096 (12%) │
│ Tool calls: 98% accuracy  │
└───────────────────────────┘

TTS Benchmark

Measures:
  • Synthesis time: Time to generate audio
  • RTF: Real-time factor (< 1.0 is faster than real-time)
  • Samples generated: Output audio length
┌─ TTS Benchmark ──┐
│ Synthesis: 123ms │
│ RTF:       0.45  │
│ Samples:   22050 │
│ (1 second audio) │
└──────────────────┘

E2E Pipeline Benchmark

Measures:
  • E2E latency: Speech input → first audio output
  • Total latency: Complete pipeline (STT → LLM → TTS)
  • Long-form: Multi-sentence responses
┌─ E2E Benchmark ─┐
│ E2E:     678ms  │
│ Total:   901ms  │
│                 │
│ Breakdown:      │
│   STT:    234ms │
│   LLM:    457ms │
│   TTS:    210ms │
└─────────────────┘

RAG Benchmark

Measures:
  • Embedding latency: Query → vector
  • Retrieval latency: Vector + BM25 search
  • Full RAG query: Embedding + retrieval + LLM
┌─ RAG Benchmark ───┐
│ Embedding:  5.2ms │
│ Retrieval:  4.1ms │
│ Full query: 510ms │
│ (5 results)       │
└───────────────────┘
Note: the RAG benchmark only runs if an index has been loaded via rcli_rag_load_index().

Memory Benchmark

Measures:
  • LLM: Model + KV cache
  • Embedding: RAG embedding model
  • STT: Zipformer + Whisper
  • TTS: Piper/Kokoro
  • Total: Peak RAM usage
┌─ Memory Usage ────┐
│ LLM:       512 MB │
│ Embedding: 128 MB │
│ STT:        64 MB │
│ TTS:        96 MB │
│ Total:     800 MB │
└───────────────────┘

JSON Export Format

{
  "timestamp": "2025-03-07T14:23:45Z",
  "device": {
    "model": "MacBookPro18,1",
    "chip": "Apple M1 Max",
    "memory_gb": 64
  },
  "models": {
    "llm": "Qwen3 0.6B Q4_K_M",
    "stt": "Whisper base.en",
    "tts": "Piper Lessac"
  },
  "results": {
    "stt": {
      "latency_ms": 234.5,
      "wer_avg": 2.3
    },
    "llm": {
      "ttft_ms": 89.2,
      "tok_per_sec": 42.3,
      "context_usage": 0.12
    },
    "tts": {
      "synthesis_ms": 123.4,
      "rtf": 0.45
    },
    "e2e": {
      "latency_ms": 678.9,
      "total_ms": 901.2
    },
    "rag": {
      "embedding_ms": 5.2,
      "retrieval_ms": 4.1,
      "full_query_ms": 510.3
    },
    "memory": {
      "llm_mb": 512,
      "total_mb": 800
    }
  }
}

Complete Example: Benchmark Runner

#include "api/rcli_api.h"
#include <stdio.h>
#include <time.h>

int main() {
    RCLIHandle handle = rcli_create(NULL);
    if (!handle) {
        fprintf(stderr, "Failed to create engine\n");
        return 1;
    }

    if (rcli_init(handle, "/path/to/models", 99) != 0) {
        fprintf(stderr, "Initialization failed\n");
        rcli_destroy(handle);
        return 1;
    }

    // Optional: Load RAG index for RAG benchmarks
    rcli_rag_load_index(handle, "/path/to/rag_index");

    // Generate timestamped output file
    time_t now = time(NULL);
    struct tm* tm_info = localtime(&now);
    char filename[256];
    strftime(filename, sizeof(filename), "benchmark_%Y%m%d_%H%M%S.json", tm_info);

    printf("Running comprehensive benchmark suite...\n\n");
    
    int result = rcli_run_full_benchmark(
        handle,
        "all",     // All benchmarks
        3,         // 3 runs each
        filename   // Save results
    );

    if (result == 0) {
        printf("\n\nBenchmark complete! Results saved to: %s\n", filename);
    } else {
        fprintf(stderr, "Benchmark failed\n");
    }

    rcli_destroy(handle);
    return result;
}
Compile and run:
clang -o bench bench.c -L./build -lrcli
./bench

# Output:
# Running comprehensive benchmark suite...
#
# ┌─ STT Benchmark ─┐
# ...
# Benchmark complete! Results saved to: benchmark_20250307_142345.json

Performance Targets (M1/M2/M3)

Metric          Target    Good      Excellent
STT latency     < 300ms   < 200ms   < 150ms
LLM TTFT        < 150ms   < 100ms   < 80ms
LLM tok/s       > 30      > 40      > 50
TTS RTF         < 1.0     < 0.5     < 0.3
E2E latency     < 800ms   < 600ms   < 500ms
RAG retrieval   < 10ms    < 5ms     < 3ms

See Also

  • State Management - Query performance metrics
  • RAG - RAG system details
  • RCLI CLI: rcli bench for interactive benchmarks
