
Overview

The CTC (Connectionist Temporal Classification) models provide high-speed parallel generation for speech recognition across 1,600+ languages. Built on the Wav2Vec2 encoder architecture, these models project audio embeddings directly to vocabulary logits using a simple linear projection layer, enabling non-autoregressive parallel prediction.
CTC models are ideal for on-device transcription tasks due to their parallel generation capabilities, which yield significantly higher throughput than autoregressive LLM-based models (16x-96x faster).

Architecture

The CTC model follows a straightforward encoder-only architecture:
[Audio 16kHz] → Wav2Vec2 Feature Extractor → Wav2Vec2 Encoder → Linear Projection → [Vocab Logits]
                (CNN downsampling ~320x)       (Transformer)     (1024/1280/2048-dim)

Key Components

  • Wav2Vec2 Feature Extractor: CNN-based module that downsamples raw 16kHz audio by ~320x
  • Wav2Vec2 Encoder: Transformer encoder producing contextualized audio embeddings
  • Linear Projection: Simple projection layer mapping embeddings to vocabulary logits
  • CTC Alignment: Parallel prediction with CTC decoding for frame-to-token alignment
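The components above can be sketched as a shape walkthrough in PyTorch. This is a toy illustration of the data flow only (the real encoder is a full Wav2Vec2 stack, not the random placeholder used here); the dimensions are taken from the 300M variant listed below:

```python
import torch
import torch.nn as nn

DOWNSAMPLE = 320      # CNN feature extractor stride (~320x, per the diagram)
EMBED_DIM = 1024      # embedding dimension of the 300M variant
VOCAB_SIZE = 10288    # v2 vocabulary size

# Stand-in for the final linear projection layer
projection = nn.Linear(EMBED_DIM, VOCAB_SIZE)

# 5 seconds of 16 kHz audio -> ~250 encoder frames
waveform = torch.randn(1, 16000 * 5)
num_frames = waveform.shape[1] // DOWNSAMPLE        # 250

# Placeholder for the encoder output (contextualized frame embeddings)
embeddings = torch.randn(1, num_frames, EMBED_DIM)
logits = projection(embeddings)                     # (1, 250, 10288)
```

Every frame is projected independently, which is what makes the final prediction step fully parallel.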

Model Variants

Omnilingual ASR offers CTC models in four sizes, each with v1 and v2 versions:
omniASR_CTC_300M / omniASR_CTC_300M_v2
  • Parameters: 325,494,996
  • Download Size: 1.3 GiB (FP32)
  • Inference VRAM: ~2 GiB (BF16, batch=1, 30s audio)
  • Speed: RTF 0.001 (~96x faster than the LLM 7B baseline; roughly 1000x real-time)
  • Embedding Dimension: 1024
  • Vocabulary Size: 9,812 (v1) / 10,288 (v2)
  • Use Case: Lightweight deployment, resource-constrained environments
v2 Models: Released in December 2025 with improved accuracy (CER) compared to v1 models. The v2 models use an expanded vocabulary (10,288 tokens) and updated training data.

Usage

Basic Transcription

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize pipeline with CTC model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_3B_v2")

# Transcribe audio files
audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file_path, text in zip(audio_files, transcriptions):
    print(f"{file_path}: {text}")

Batch Processing for High Throughput

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Use larger batch sizes to maximize parallel processing
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_7B_v2",
    device="cuda",
    dtype=torch.bfloat16
)

# Process large dataset efficiently
audio_dataset = [...] # List of audio paths
batch_size = 8  # Adjust based on available VRAM

transcriptions = pipeline.transcribe(audio_dataset, batch_size=batch_size)

Using Pre-decoded Audio

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

# Pass pre-decoded audio dictionaries
audio_data = [
    {"waveform": torch.randn(16000 * 5), "sample_rate": 16000},
    {"waveform": torch.randn(16000 * 10), "sample_rate": 16000}
]

transcriptions = pipeline.transcribe(audio_data, batch_size=2)

Parallel Generation

Unlike autoregressive models that generate tokens sequentially, CTC models predict all tokens in parallel:
  1. Audio Encoding: The Wav2Vec2 encoder processes the entire audio sequence, producing frame-level embeddings
  2. Parallel Projection: Each frame embedding is independently projected to vocabulary logits
  3. CTC Decoding: The CTC algorithm handles frame-to-token alignment, removing duplicates and blank tokens
# Internal CTC decoding (simplified from pipeline.py:310-331)
pred_ids = torch.argmax(logits, dim=-1)  # Parallel argmax over all frames

# Collapse consecutive duplicates, then drop blanks (CTC decoding)
seq = pred_ids[0]  # one sequence from the batch
mask = torch.ones(seq.shape[0], dtype=torch.bool)
mask[1:] = seq[1:] != seq[:-1]
decoded_ids = seq[mask]
decoded_ids = decoded_ids[decoded_ids != blank_id]  # blank_id: CTC blank token index

transcription = token_decoder(decoded_ids)
This parallel approach enables the dramatic speed improvements (16x-96x faster than the autoregressive LLM baseline).

Input Requirements

Audio Length Limit: CTC models currently accept only audio files shorter than 40 seconds. For longer audio, split into segments or use the LLM Unlimited variants.
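A simple workaround for the 40-second cap is to slice the waveform into fixed-length segments before calling the pipeline. This is a sketch only: hard cuts can split words at segment boundaries, so overlapping windows or silence-based splitting are often preferable in practice.

```python
import torch

SAMPLE_RATE = 16000
MAX_SECONDS = 40

def split_waveform(waveform: torch.Tensor, max_seconds: int = MAX_SECONDS):
    """Split a 1-D 16 kHz waveform into segments under the CTC length limit."""
    max_samples = max_seconds * SAMPLE_RATE
    return [
        {"waveform": waveform[start:start + max_samples], "sample_rate": SAMPLE_RATE}
        for start in range(0, waveform.shape[0], max_samples)
    ]

# 95 seconds of audio -> three segments of 40s, 40s, and 15s
segments = split_waveform(torch.randn(SAMPLE_RATE * 95))
```

The resulting dictionaries can be passed directly to `pipeline.transcribe(...)` as pre-decoded audio, and the per-segment transcriptions joined afterwards.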

Audio Format Support

The pipeline automatically handles:
  • Resampling: Any sample rate → 16kHz
  • Channel Conversion: Stereo/multi-channel → Mono
  • Normalization: Audio amplitude normalization
  • Format Decoding: WAV, FLAC, MP3, and other common formats

Input Types

# File paths (strings or Path objects)
audio = ["/path/to/file1.wav", "/path/to/file2.flac"]

# Raw bytes
with open("audio.wav", "rb") as f:
    audio = [f.read()]

# NumPy arrays (int8/uint8)
import numpy as np
audio = [np.frombuffer(audio_bytes, dtype=np.int8)]

# Pre-decoded dictionaries
audio = [{"waveform": tensor, "sample_rate": 16000}]
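If your audio arrives as raw PCM samples rather than an encoded file, you can build the pre-decoded dictionary form yourself. The helper below is a hypothetical convenience (not part of the library's API) assuming 16-bit PCM already at 16 kHz; the pipeline would otherwise handle decoding and normalization for you:

```python
import numpy as np
import torch

def pcm16_to_input(pcm: np.ndarray, sample_rate: int = 16000) -> dict:
    """Convert int16 PCM samples to a pre-decoded input dict.

    Scales to float32 in [-1, 1]; pass the true sample rate so the
    pipeline can resample if it is not already 16 kHz.
    """
    waveform = torch.from_numpy(pcm.astype(np.float32) / 32768.0)
    return {"waveform": waveform, "sample_rate": sample_rate}

# One second of silence as int16 PCM
audio = [pcm16_to_input(np.zeros(16000, dtype=np.int16))]
```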

Limitations

  • No Language Conditioning: CTC models do not support language ID conditioning (the lang parameter is ignored)
  • No Context Examples: Zero-shot learning with context examples is not available
  • Fixed Vocabulary: Cannot adapt to new tokens or writing systems without retraining
  • No Punctuation: Models output spoken-form text without punctuation or capitalization
  • 40-Second Limit: Maximum audio length capped at 40 seconds per sample

Performance Comparison

Model    | Speed (RTF) | Relative to LLM | VRAM (30s) | Best For
---------|-------------|-----------------|------------|----------------------
CTC 300M | 0.001       | 96x faster      | ~2 GiB     | Edge devices, mobile
CTC 1B   | 0.002       | 48x faster      | ~3 GiB     | Balanced deployments
CTC 3B   | 0.003       | 32x faster      | ~8 GiB     | Production servers
CTC 7B   | 0.006       | 16x faster      | ~15 GiB    | Maximum accuracy
LLM 7B   | 0.092       | 1x (baseline)   | ~17 GiB    | Language conditioning
RTF (Real-Time Factor): A value of 0.001 means the model processes 1 second of audio in 0.001 seconds (1000x real-time).
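The RTF definition above makes throughput estimates a one-line calculation: compute time is simply audio duration multiplied by RTF (ignoring batching, model load time, and I/O). For example, an hour of audio through CTC 300M takes about 3.6 seconds of compute, versus roughly 331 seconds for LLM 7B:

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time: audio duration multiplied by the real-time factor."""
    return audio_seconds * rtf

# One hour of audio: CTC 300M (RTF 0.001) vs. LLM 7B (RTF 0.092)
print(round(processing_seconds(3600, 0.001), 1))  # 3.6
print(round(processing_seconds(3600, 0.092), 1))  # 331.2
```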

Advanced Configuration

Custom Device and Dtype

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_7B_v2",
    device="cuda:0",  # Specific GPU
    dtype=torch.float16  # FP16 for faster inference
)

Memory Optimization

For limited VRAM environments:
# Use a smaller model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M_v2")

# Reduce batch size
transcriptions = pipeline.transcribe(audio_files, batch_size=1)

# Process in chunks (simple helper that slices the file list)
def chunks(items, chunk_size):
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

for chunk in chunks(audio_files, chunk_size=10):
    results = pipeline.transcribe(chunk, batch_size=2)

Model Selection Guide

Choose 300M if...

  • Deploying on edge devices
  • VRAM is limited (under 4 GiB)
  • Maximum speed is critical
  • Accuracy requirements are moderate

Choose 1B if...

  • Balancing speed and accuracy
  • Running on consumer GPUs
  • Processing large volumes
  • VRAM available: 4-8 GiB

Choose 3B if...

  • Production deployments
  • High accuracy needed
  • Server-grade GPUs available
  • VRAM available: 8-16 GiB

Choose 7B if...

  • Maximum accuracy required
  • Research applications
  • Large GPU available (A100/H100)
  • VRAM available: 16+ GiB
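The selection guide above can be condensed into a small helper keyed on available VRAM. This is a hypothetical convenience function (the `pick_ctc_model` name and thresholds are ours, encoding the VRAM brackets listed above):

```python
def pick_ctc_model(vram_gib: float) -> str:
    """Pick a CTC model card by available VRAM, per the selection guide."""
    if vram_gib < 4:
        return "omniASR_CTC_300M_v2"
    if vram_gib < 8:
        return "omniASR_CTC_1B_v2"
    if vram_gib < 16:
        return "omniASR_CTC_3B_v2"
    return "omniASR_CTC_7B_v2"

print(pick_ctc_model(6))   # omniASR_CTC_1B_v2
print(pick_ctc_model(24))  # omniASR_CTC_7B_v2
```

Other factors from the guide (accuracy requirements, edge vs. server deployment) may of course override a purely VRAM-based choice.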

Next Steps

LLM Models

Explore autoregressive models with language conditioning

Model Specifications

Compare all models with detailed specifications

Zero-Shot

Learn about in-context learning for new languages

Inference Guide

Comprehensive guide to running inference
