Language ID Format
Languages in Omnilingual ASR follow a standardized format made up of two components.
Format Components
- Three-letter ISO 639-3 language code (e.g., eng for English, spa for Spanish, cmn for Mandarin)
- Four-letter ISO 15924 script code (e.g., Latn for Latin, Arab for Arabic, Hans for Simplified Chinese)
Examples
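As a minimal sketch of how the two components fit together, the snippet below splits an ID into its language and script parts and checks its shape. It assumes the components are joined by an underscore (for example eng_Latn). The names LangId and parse_lang_id are hypothetical helpers for illustration and are not part of the Omnilingual ASR package.

```python
import re
from typing import NamedTuple

# Assumption: the two components are joined by an underscore, e.g. "eng_Latn".
# Three lowercase letters (ISO 639-3), then a capitalised four-letter
# script code (ISO 15924).
LANG_ID_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


class LangId(NamedTuple):
    language: str  # ISO 639-3 code, e.g. "eng"
    script: str    # ISO 15924 code, e.g. "Latn"


def parse_lang_id(lang_id: str) -> LangId:
    """Split a language ID into its language and script components.

    This checks the shape only. It does not confirm that the codes are
    registered in ISO 639-3 / ISO 15924 or supported by any model.
    """
    match = LANG_ID_PATTERN.fullmatch(lang_id)
    if match is None:
        raise ValueError(f"not a well-formed language ID: {lang_id!r}")
    return LangId(language=match.group(1), script=match.group(2))


for example in ["eng_Latn", "spa_Latn", "cmn_Hans"]:
    print(parse_lang_id(example))
```

A shape check like this catches common slips such as passing a language name or a two-letter code before the ID reaches the model.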
Accessing the Language List
The full list of supported languages can be accessed programmatically.
Script Coverage
Omnilingual ASR supports multiple writing systems:
- Latin Scripts
- Cyrillic Scripts
- Arabic Scripts
- Asian Scripts
- Other Scripts
Latin Script (Latn) - Most widely used
The majority of supported languages use Latin script, including:
- European languages (English, Spanish, French, German, etc.)
- African languages (Swahili, Yoruba, Hausa, etc.)
- Southeast Asian languages (Indonesian, Vietnamese, Filipino, etc.)
- Indigenous American languages (Quechua, Guarani, Nahuatl, etc.)
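To see how a set of language IDs breaks down by writing system, the following sketch groups IDs by their script code. The sample_ids list is a small hand-written stand-in, not the real supported list, and the underscore separator is an assumption about the ID format.

```python
from collections import defaultdict

# Hypothetical sample for illustration; the real list is far longer and
# should be taken from the package itself.
sample_ids = ["eng_Latn", "spa_Latn", "swh_Latn", "rus_Cyrl", "arb_Arab", "cmn_Hans"]


def group_by_script(lang_ids):
    """Map each script code to the sorted language codes written in it."""
    groups = defaultdict(list)
    for lang_id in lang_ids:
        # Assumption: IDs look like "<language>_<script>".
        language, script = lang_id.split("_", 1)
        groups[script].append(language)
    return {script: sorted(langs) for script, langs in groups.items()}


by_script = group_by_script(sample_ids)

# List scripts from most to least represented in the sample.
for script, langs in sorted(by_script.items(), key=lambda item: -len(item[1])):
    print(f"{script}: {len(langs)} ({', '.join(langs)})")
```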
Language Statistics
- Total Languages: 1,682 language-script combinations
- High Performance: 78% with CER below 10%
- New Coverage: Hundreds of previously uncovered languages
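CER (character error rate) is the number of character edits needed to turn the model output into the reference transcript, divided by the reference length. The sketch below uses the textbook definition based on Levenshtein distance. How the project normalizes text before scoring is not stated here, so treat this as an illustration of the metric rather than the evaluation code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of eleven.
print(f"{cer('hello world', 'hallo world'):.1%}")  # prints 9.1%
```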
Using Language IDs in Code
Language IDs can be used with the inference pipeline.
Performance by Language
For detailed per-language performance metrics, see the complete results in the per_language_results_table_7B_llm_asr.csv file.
Language Families
Omnilingual ASR covers languages from diverse language families:
- Indo-European: Romance, Germanic, Slavic, Indo-Aryan, Iranian languages
- Sino-Tibetan: Chinese varieties, Tibetan, Burmese
- Niger-Congo: Bantu, Atlantic, Gur, Kwa languages
- Austronesian: Indonesian, Filipino, Polynesian languages
- Afro-Asiatic: Arabic, Amharic, Hausa, Hebrew
- Dravidian: Tamil, Telugu, Kannada, Malayalam
- Turkic: Turkish, Uzbek, Kazakh, Uyghur
- Uralic: Finnish, Hungarian, Estonian
- Indigenous American: Quechua, Guarani, Nahuatl, Aymara
- And many more…
Finding Your Language
To find the correct language ID:
Identify ISO 639-3 Code
Look up your language’s three-letter code at ISO 639-3 or Ethnologue.
Example: English = eng, Spanish = spa
Determine Script
Identify which script your text uses:
- Latin alphabet → Latn
- Cyrillic → Cyrl
- Arabic → Arab
- See ISO 15924 for complete list
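The two lookup steps above can be sketched as a small helper that combines a three-letter code with a script name. The build_lang_id function and the SCRIPT_CODES table are hypothetical and cover only the three scripts listed here. Joining the parts with an underscore is an assumption about the ID format.

```python
# Script names from the list above mapped to their ISO 15924 codes.
SCRIPT_CODES = {
    "latin": "Latn",
    "cyrillic": "Cyrl",
    "arabic": "Arab",
}


def build_lang_id(iso639_3: str, script_name: str) -> str:
    """Combine an ISO 639-3 code and a script name into a language ID."""
    code = iso639_3.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {iso639_3!r}")
    try:
        script = SCRIPT_CODES[script_name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown script name: {script_name!r}") from None
    # Assumption: components are joined with an underscore.
    return f"{code}_{script}"


print(build_lang_id("eng", "Latin"))     # eng_Latn
print(build_lang_id("rus", "Cyrillic"))  # rus_Cyrl
```

A well-formed ID is not necessarily a supported one, so the result should still be checked against the supported language list.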
Complete Language List
The complete list of 1,682 supported language-script combinations is available in the source code.
Next Steps
- Quick Start: Start transcribing with language codes
- Language Conditioning: Improve accuracy with language hints
- Model Selection: Choose the right model for your languages
- Zero-Shot Learning: Add new languages with examples