
Overview

The CTC (Connectionist Temporal Classification) models provide high-speed parallel generation for speech recognition across 1,600+ languages. Built on the Wav2Vec2 encoder architecture, these models project audio embeddings directly to vocabulary logits using a simple linear projection layer, enabling non-autoregressive parallel prediction.
CTC models are ideal for on-device transcription tasks due to their parallel generation capabilities, which yield significantly higher throughput than autoregressive LLM-based models (16x-96x faster).

Architecture

The CTC model follows a straightforward encoder-only architecture:
[Audio 16kHz] → Wav2Vec2 Feature Extractor → Wav2Vec2 Encoder → Linear Projection → [Vocab Logits]
                (CNN downsampling ~320x)       (Transformer)     (1024/1280/2048-dim)

Key Components

  • Wav2Vec2 Feature Extractor: CNN-based module that downsamples raw 16kHz audio by ~320x
  • Wav2Vec2 Encoder: Transformer encoder producing contextualized audio embeddings
  • Linear Projection: Simple projection layer mapping embeddings to vocabulary logits
  • CTC Alignment: Parallel prediction with CTC decoding for frame-to-token alignment
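The components above can be sketched as a shape walkthrough in PyTorch. This is a toy illustration of the data flow only (the real encoder is a full Wav2Vec2 stack, not the random placeholder used here); the dimensions are taken from the 300M variant listed below:

```python
import torch
import torch.nn as nn

DOWNSAMPLE = 320      # CNN feature extractor stride (~320x, per the diagram)
EMBED_DIM = 1024      # embedding dimension of the 300M variant
VOCAB_SIZE = 10288    # v2 vocabulary size

# Stand-in for the final linear projection layer
projection = nn.Linear(EMBED_DIM, VOCAB_SIZE)

# 5 seconds of 16 kHz audio -> ~250 encoder frames
waveform = torch.randn(1, 16000 * 5)
num_frames = waveform.shape[1] // DOWNSAMPLE        # 250

# Placeholder for the encoder output (contextualized frame embeddings)
embeddings = torch.randn(1, num_frames, EMBED_DIM)
logits = projection(embeddings)                     # (1, 250, 10288)
```

Every frame is projected independently, which is what makes the final prediction step fully parallel.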

Model Variants

Omnilingual ASR offers CTC models in four sizes, each with v1 and v2 versions:
omniASR_CTC_300M / omniASR_CTC_300M_v2
  • Parameters: 325,494,996
  • Download Size: 1.3 GiB (FP32)
  • Inference VRAM: ~2 GiB (BF16, batch=1, 30s audio)
  • Speed: RTF 0.001 (~96x faster than the LLM 7B baseline; roughly 1000x real-time)
  • Embedding Dimension: 1024
  • Vocabulary Size: 9,812 (v1) / 10,288 (v2)
  • Use Case: Lightweight deployment, resource-constrained environments
v2 Models: Released in December 2025 with improved accuracy (CER) compared to v1 models. The v2 models use an expanded vocabulary (10,288 tokens) and updated training data.

Usage

Basic Transcription

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Initialize pipeline with CTC model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_3B_v2")

# Transcribe audio files
audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

for file_path, text in zip(audio_files, transcriptions):
    print(f"{file_path}: {text}")

Batch Processing for High Throughput

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Use larger batch sizes to maximize parallel processing
pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_7B_v2",
    device="cuda",
    dtype=torch.bfloat16
)

# Process large dataset efficiently
audio_dataset = [...] # List of audio paths
batch_size = 8  # Adjust based on available VRAM

transcriptions = pipeline.transcribe(audio_dataset, batch_size=batch_size)

Using Pre-decoded Audio

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_1B_v2")

# Pass pre-decoded audio dictionaries
audio_data = [
    {"waveform": torch.randn(16000 * 5), "sample_rate": 16000},
    {"waveform": torch.randn(16000 * 10), "sample_rate": 16000}
]

transcriptions = pipeline.transcribe(audio_data, batch_size=2)

Parallel Generation

Unlike autoregressive models that generate tokens sequentially, CTC models predict all tokens in parallel:
  1. Audio Encoding: The Wav2Vec2 encoder processes the entire audio sequence, producing frame-level embeddings
  2. Parallel Projection: Each frame embedding is independently projected to vocabulary logits
  3. CTC Decoding: The CTC algorithm handles frame-to-token alignment, removing duplicates and blank tokens
# Internal CTC decoding (simplified from pipeline.py:310-331)
pred_ids = torch.argmax(logits, dim=-1)  # Parallel argmax over all frames

# Collapse consecutive duplicates, then drop blanks (CTC decoding)
seq = pred_ids[0]  # one sequence from the batch
mask = torch.ones(seq.shape[0], dtype=torch.bool)
mask[1:] = seq[1:] != seq[:-1]
decoded_ids = seq[mask]
decoded_ids = decoded_ids[decoded_ids != blank_id]  # blank_id: CTC blank token index

transcription = token_decoder(decoded_ids)
This parallel approach enables the dramatic speed improvements (16x-96x faster than the autoregressive LLM baseline).

Input Requirements

Audio Length Limit: CTC models currently accept only audio files shorter than 40 seconds. For longer audio, split into segments or use the LLM Unlimited variants.
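A simple workaround for the 40-second cap is to slice the waveform into fixed-length segments before calling the pipeline. This is a sketch only: hard cuts can split words at segment boundaries, so overlapping windows or silence-based splitting are often preferable in practice.

```python
import torch

SAMPLE_RATE = 16000
MAX_SECONDS = 40

def split_waveform(waveform: torch.Tensor, max_seconds: int = MAX_SECONDS):
    """Split a 1-D 16 kHz waveform into segments under the CTC length limit."""
    max_samples = max_seconds * SAMPLE_RATE
    return [
        {"waveform": waveform[start:start + max_samples], "sample_rate": SAMPLE_RATE}
        for start in range(0, waveform.shape[0], max_samples)
    ]

# 95 seconds of audio -> three segments of 40s, 40s, and 15s
segments = split_waveform(torch.randn(SAMPLE_RATE * 95))
```

The resulting dictionaries can be passed directly to `pipeline.transcribe(...)` as pre-decoded audio, and the per-segment transcriptions joined afterwards.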

Audio Format Support

The pipeline automatically handles:
  • Resampling: Any sample rate → 16kHz
  • Channel Conversion: Stereo/multi-channel → Mono
  • Normalization: Audio amplitude normalization
  • Format Decoding: WAV, FLAC, MP3, and other common formats

Input Types

# File paths (strings or Path objects)
audio = ["/path/to/file1.wav", "/path/to/file2.flac"]

# Raw bytes
with open("audio.wav", "rb") as f:
    audio = [f.read()]

# NumPy arrays (int8/uint8)
import numpy as np
audio = [np.frombuffer(audio_bytes, dtype=np.int8)]

# Pre-decoded dictionaries
audio = [{"waveform": tensor, "sample_rate": 16000}]
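If your audio arrives as raw PCM samples rather than an encoded file, you can build the pre-decoded dictionary form yourself. The helper below is a hypothetical convenience (not part of the library's API) assuming 16-bit PCM already at 16 kHz; the pipeline would otherwise handle decoding and normalization for you:

```python
import numpy as np
import torch

def pcm16_to_input(pcm: np.ndarray, sample_rate: int = 16000) -> dict:
    """Convert int16 PCM samples to a pre-decoded input dict.

    Scales to float32 in [-1, 1]; pass the true sample rate so the
    pipeline can resample if it is not already 16 kHz.
    """
    waveform = torch.from_numpy(pcm.astype(np.float32) / 32768.0)
    return {"waveform": waveform, "sample_rate": sample_rate}

# One second of silence as int16 PCM
audio = [pcm16_to_input(np.zeros(16000, dtype=np.int16))]
```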

Limitations

  • No Language Conditioning: CTC models do not support language ID conditioning (the lang parameter is ignored)
  • No Context Examples: Zero-shot learning with context examples is not available
  • Fixed Vocabulary: Cannot adapt to new tokens or writing systems without retraining
  • No Punctuation: Models output spoken-form text without punctuation or capitalization
  • 40-Second Limit: Maximum audio length capped at 40 seconds per sample

Performance Comparison

Model    | Speed (RTF) | Relative to LLM | VRAM (30s) | Best For
---------|-------------|-----------------|------------|----------------------
CTC 300M | 0.001       | 96x faster      | ~2 GiB     | Edge devices, mobile
CTC 1B   | 0.002       | 48x faster      | ~3 GiB     | Balanced deployments
CTC 3B   | 0.003       | 32x faster      | ~8 GiB     | Production servers
CTC 7B   | 0.006       | 16x faster      | ~15 GiB    | Maximum accuracy
LLM 7B   | 0.092       | 1x (baseline)   | ~17 GiB    | Language conditioning
RTF (Real-Time Factor): A value of 0.001 means the model processes 1 second of audio in 0.001 seconds (1000x real-time).
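The RTF definition above makes throughput estimates a one-line calculation: compute time is simply audio duration multiplied by RTF (ignoring batching, model load time, and I/O). For example, an hour of audio through CTC 300M takes about 3.6 seconds of compute, versus roughly 331 seconds for LLM 7B:

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time: audio duration multiplied by the real-time factor."""
    return audio_seconds * rtf

# One hour of audio: CTC 300M (RTF 0.001) vs. LLM 7B (RTF 0.092)
print(round(processing_seconds(3600, 0.001), 1))  # 3.6
print(round(processing_seconds(3600, 0.092), 1))  # 331.2
```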

Advanced Configuration

Custom Device and Dtype

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(
    model_card="omniASR_CTC_7B_v2",
    device="cuda:0",  # Specific GPU
    dtype=torch.float16  # FP16 for faster inference
)

Memory Optimization

For limited VRAM environments:
# Use a smaller model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M_v2")

# Reduce batch size
transcriptions = pipeline.transcribe(audio_files, batch_size=1)

# Process in chunks (simple helper that slices the file list)
def chunks(items, chunk_size):
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

for chunk in chunks(audio_files, chunk_size=10):
    results = pipeline.transcribe(chunk, batch_size=2)

Model Selection Guide

Choose 300M if...

  • Deploying on edge devices
  • VRAM is limited (under 4 GiB)
  • Maximum speed is critical
  • Accuracy requirements are moderate

Choose 1B if...

  • Balancing speed and accuracy
  • Running on consumer GPUs
  • Processing large volumes
  • VRAM available: 4-8 GiB

Choose 3B if...

  • Production deployments
  • High accuracy needed
  • Server-grade GPUs available
  • VRAM available: 8-16 GiB

Choose 7B if...

  • Maximum accuracy required
  • Research applications
  • Large GPU available (A100/H100)
  • VRAM available: 16+ GiB
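The selection guide above can be condensed into a small helper keyed on available VRAM. This is a hypothetical convenience function (the `pick_ctc_model` name and thresholds are ours, encoding the VRAM brackets listed above):

```python
def pick_ctc_model(vram_gib: float) -> str:
    """Pick a CTC model card by available VRAM, per the selection guide."""
    if vram_gib < 4:
        return "omniASR_CTC_300M_v2"
    if vram_gib < 8:
        return "omniASR_CTC_1B_v2"
    if vram_gib < 16:
        return "omniASR_CTC_3B_v2"
    return "omniASR_CTC_7B_v2"

print(pick_ctc_model(6))   # omniASR_CTC_1B_v2
print(pick_ctc_model(24))  # omniASR_CTC_7B_v2
```

Other factors from the guide (accuracy requirements, edge vs. server deployment) may of course override a purely VRAM-based choice.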

Next Steps

LLM Models

Explore autoregressive models with language conditioning

Model Specifications

Compare all models with detailed specifications

Zero-Shot

Learn about in-context learning for new languages

Inference Guide

Comprehensive guide to running inference
