Overview
The CTC (Connectionist Temporal Classification) models provide high-speed parallel generation for speech recognition across 1,600+ languages. Built on the Wav2Vec2 encoder architecture, these models project audio embeddings directly to vocabulary logits through a simple linear projection layer, enabling non-autoregressive parallel prediction.

CTC models are ideal for on-device transcription because their parallel generation yields significantly higher throughput than the autoregressive LLM models (16x-96x faster).
Architecture
The CTC model follows a straightforward encoder-only architecture:
Key Components
- Wav2Vec2 Feature Extractor: CNN-based module that downsamples raw 16kHz audio by ~320x
- Wav2Vec2 Encoder: Transformer encoder producing contextualized audio embeddings
- Linear Projection: Simple projection layer mapping embeddings to vocabulary logits
- CTC Alignment: Parallel prediction with CTC decoding for frame-to-token alignment
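The components above can be sanity-checked with back-of-the-envelope shape arithmetic. The numbers below are taken from this page (16kHz input, ~320x downsampling, 1024-dim embeddings for the 300M variant, 10,288-token v2 vocabulary); they are illustrative, not an exact trace of the model:

```python
# Back-of-the-envelope shape check for the CTC forward pass.
sample_rate = 16_000        # Hz, required input rate
audio_seconds = 30
downsample_factor = 320     # approximate CNN feature-extractor stride
embed_dim = 1024            # encoder embedding dim (300M variant)
vocab_size = 10_288         # v2 vocabulary

num_samples = sample_rate * audio_seconds       # 480,000 raw samples
num_frames = num_samples // downsample_factor   # ~1,500 encoder frames

# Encoder output is (num_frames, embed_dim); the linear projection maps
# each frame independently to (num_frames, vocab_size) logits.
encoder_shape = (num_frames, embed_dim)
logits_shape = (num_frames, vocab_size)
print(encoder_shape, logits_shape)
```

Because every frame is projected independently, the projection step is trivially parallel across the time axis.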
Model Variants
Omnilingual ASR offers CTC models in four sizes, each with v1 and v2 versions:
- 300M
- 1B
- 3B
- 7B
omniASR_CTC_300M / omniASR_CTC_300M_v2
- Parameters: 325,494,996
- Download Size: 1.3 GiB (FP32)
- Inference VRAM: ~2 GiB (BF16, batch=1, 30s audio)
- Speed: ~1000x real-time (RTF: 0.001; 96x faster than the LLM baseline)
- Embedding Dimension: 1024
- Vocabulary Size: 9,812 (v1) / 10,288 (v2)
- Use Case: Lightweight deployment, resource-constrained environments
v2 Models: Released in December 2025 with lower character error rate (CER) than the v1 models. The v2 models use an expanded vocabulary (10,288 tokens) and updated training data.
Usage
Basic Transcription
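A minimal usage sketch. The `ASRInferencePipeline` import path and `transcribe` call follow the Omnilingual ASR README, but treat the exact names as assumptions and verify them against your installed version:

```python
# Hedged sketch: import path and argument names are assumptions based on
# the Omnilingual ASR README; verify against your installed version.

def transcribe_file(path: str, model_card: str = "omniASR_CTC_1B") -> str:
    # Deferred import so the sketch can be read without the package installed.
    from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

    pipeline = ASRInferencePipeline(model_card=model_card)
    # CTC models ignore the `lang` parameter, so none is passed here.
    return pipeline.transcribe([path])[0]
```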
Batch Processing for High Throughput
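Passing many files in one call amortizes model loading and keeps the GPU saturated. The `batch_size` keyword below is an assumed parameter name from the Omnilingual ASR README; check your installed version:

```python
# Hedged sketch: batching amortizes model setup across many files;
# `batch_size` is an assumed parameter name -- verify before use.

def transcribe_many(paths: list[str],
                    model_card: str = "omniASR_CTC_1B",
                    batch_size: int = 8) -> list[str]:
    from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

    pipeline = ASRInferencePipeline(model_card=model_card)
    return pipeline.transcribe(paths, batch_size=batch_size)
```

Larger batches increase throughput at the cost of VRAM; reduce `batch_size` if you hit out-of-memory errors.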
Using Pre-decoded Audio
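If the audio is already decoded in memory (e.g. from a streaming source), it can be handed to the pipeline directly instead of a file path. The exact in-memory input type is an assumption in this sketch; consult the pipeline documentation for the supported form:

```python
# Hedged sketch: assumes the pipeline accepts in-memory waveforms; the
# (waveform, sample_rate) tuple form is hypothetical -- verify against
# the pipeline's documented input types.

def transcribe_array(waveform, sample_rate: int,
                     model_card: str = "omniASR_CTC_1B") -> str:
    from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

    pipeline = ASRInferencePipeline(model_card=model_card)
    # Hypothetical: pass decoded samples plus their rate instead of a path.
    return pipeline.transcribe([(waveform, sample_rate)])[0]
```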
Parallel Generation
Unlike autoregressive models that generate tokens sequentially, CTC models predict all tokens in parallel:
- Audio Encoding: The Wav2Vec2 encoder processes the entire audio sequence, producing frame-level embeddings
- Parallel Projection: Each frame embedding is independently projected to vocabulary logits
- CTC Decoding: The CTC algorithm handles frame-to-token alignment, removing duplicates and blank tokens
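The decoding step above can be illustrated with a self-contained greedy CTC collapse: take the argmax token per frame, merge consecutive repeats, and drop blanks. (The blank index 0 is a common convention, not necessarily what these models use; real decoders may also use beam search.)

```python
# Greedy CTC collapse: merge repeated frame predictions, drop blanks.
BLANK = 0  # blank-token index (convention; the actual index may differ)

def ctc_greedy_collapse(frame_ids: list[int], blank: int = BLANK) -> list[int]:
    out = []
    prev = None
    for tok in frame_ids:
        if tok != prev and tok != blank:  # new non-blank token
            out.append(tok)
        prev = tok
    return out

# Frames: c c _ a a _ t  ->  c a t
print(ctc_greedy_collapse([3, 3, 0, 1, 1, 0, 20]))  # [3, 1, 20]
```

Note that a blank between two identical tokens preserves a genuine repeat (e.g. frames `[2, 0, 2]` decode to `[2, 2]`), which is exactly why CTC inserts blanks.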
Input Requirements
Audio Format Support
The pipeline automatically handles:
- Resampling: Any sample rate → 16kHz
- Channel Conversion: Stereo/multi-channel → Mono
- Normalization: Audio amplitude normalization
- Format Decoding: WAV, FLAC, MP3, and other common formats
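Two of these steps are simple enough to show inline. The sketch below implements a stereo-to-mono downmix and peak normalization in pure Python for clarity; the real pipeline additionally resamples and decodes compressed formats, and its exact normalization scheme is not specified here:

```python
# Illustrative preprocessing: stereo downmix and peak normalization.

def to_mono(stereo: list[tuple[float, float]]) -> list[float]:
    # Average left and right channels per sample.
    return [(left + right) / 2.0 for left, right in stereo]

def peak_normalize(mono: list[float], target: float = 1.0) -> list[float]:
    # Scale so the loudest sample reaches `target`.
    peak = max((abs(x) for x in mono), default=0.0)
    if peak == 0.0:
        return mono
    return [x * target / peak for x in mono]

samples = to_mono([(0.2, 0.4), (-0.5, -0.3)])  # ~[0.3, -0.4]
print(peak_normalize(samples))                  # ~[0.75, -1.0]
```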
Input Types
Limitations
CTC Model Limitations
- No Language Conditioning: CTC models do not support language ID conditioning (the `lang` parameter is ignored)
- No Context Examples: Zero-shot learning with context examples is not available
- Fixed Vocabulary: Cannot adapt to new tokens or writing systems without retraining
- No Punctuation: Models output spoken-form text without punctuation or capitalization
- 40-Second Limit: Maximum audio length capped at 40 seconds per sample
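To work around the 40-second cap, longer recordings must be split before transcription. The sketch below computes fixed-size chunk boundaries in samples; it is a simplification, since cutting at arbitrary offsets can split words, and production systems usually segment on silence instead:

```python
# Split long audio into <=40 s chunks to fit the CTC length cap.
MAX_SECONDS = 40
SAMPLE_RATE = 16_000

def chunk_samples(num_samples: int,
                  max_seconds: int = MAX_SECONDS,
                  sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample offsets for consecutive chunks."""
    step = max_seconds * sample_rate
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]

# 100 s of 16 kHz audio -> 40 s + 40 s + 20 s chunks
print(chunk_samples(100 * SAMPLE_RATE))
```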
Performance Comparison
| Model | Speed (RTF) | Relative to LLM | VRAM (30s) | Best For |
|---|---|---|---|---|
| CTC 300M | 0.001 | 96x faster | ~2 GiB | Edge devices, mobile |
| CTC 1B | 0.002 | 48x faster | ~3 GiB | Balanced deployments |
| CTC 3B | 0.003 | 32x faster | ~8 GiB | Production servers |
| CTC 7B | 0.006 | 16x faster | ~15 GiB | Maximum accuracy |
| LLM 7B | 0.092 | 1x (baseline) | ~17 GiB | Language conditioning |
RTF (Real-Time Factor): A value of 0.001 means the model processes 1 second of audio in 0.001 seconds (1000x real-time).
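The RTF arithmetic is worth making explicit: processing time is RTF times audio duration, and throughput relative to real time is its reciprocal. A quick check using the CTC 300M figure from the table:

```python
# RTF arithmetic: processing time = RTF * audio length; 1/RTF is the
# real-time multiple.

def processing_time(rtf: float, audio_seconds: float) -> float:
    return rtf * audio_seconds

def realtime_multiple(rtf: float) -> float:
    return 1.0 / rtf

print(processing_time(0.001, 60))  # ~0.06 s to transcribe a minute of audio
print(realtime_multiple(0.001))    # ~1000x real time
```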
Advanced Configuration
Custom Device and Dtype
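The pipeline can typically be pinned to a specific device and precision at construction time. The `device` and `dtype` keyword names below are assumptions; check the `ASRInferencePipeline` signature in your installed version:

```python
# Hedged sketch: `device` and `dtype` keyword names are assumptions --
# verify against the ASRInferencePipeline signature before relying on them.

def build_pipeline(model_card: str = "omniASR_CTC_1B",
                   device: str = "cuda:0",
                   dtype: str = "bfloat16"):
    from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

    # Hypothetical keyword arguments for device placement and precision;
    # BF16 matches the VRAM figures quoted for these models.
    return ASRInferencePipeline(model_card=model_card,
                                device=device,
                                dtype=dtype)
```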
Memory Optimization
For limited VRAM environments, choose a smaller model variant, run in BF16, and reduce the batch size.
Model Selection Guide
Choose 300M if...
- Deploying on edge devices
- VRAM is limited (under 4 GiB)
- Maximum speed is critical
- Accuracy requirements are moderate
Choose 1B if...
- Balancing speed and accuracy
- Running on consumer GPUs
- Processing large volumes
- VRAM available: 4-8 GiB
Choose 3B if...
- Production deployments
- High accuracy needed
- Server-grade GPUs available
- VRAM available: 8-16 GiB
Choose 7B if...
- Maximum accuracy required
- Research applications
- Large GPU available (A100/H100)
- VRAM available: 16+ GiB
Next Steps
LLM Models
Explore autoregressive models with language conditioning
Model Specifications
Compare all models with detailed specifications
Zero-Shot
Learn about in-context learning for new languages
Inference Guide
Comprehensive guide to running inference