Omnilingual ASR supports over 1,600 languages across multiple scripts, including hundreds of languages never before covered by any ASR system.

Language ID Format

Languages in Omnilingual ASR follow a standardized format:
{language_code}_{script}

Format Components

language_code (string, required)
Three-letter ISO 639-3 language code (e.g., eng for English, spa for Spanish, cmn for Mandarin)

script (string, required)
Four-letter ISO 15924 script code (e.g., Latn for Latin, Arab for Arabic, Hans for Simplified Chinese)

Examples

"eng_Latn"  # English (Latin script)
"spa_Latn"  # Spanish (Latin script)
"fra_Latn"  # French (Latin script)
"deu_Latn"  # German (Latin script)
"cmn_Hans"  # Mandarin Chinese (Simplified)
"cmn_Hant"  # Mandarin Chinese (Traditional)
"jpn_Jpan"  # Japanese
"kor_Hang"  # Korean (Hangul)
"arb_Arab"  # Standard Arabic
"rus_Cyrl"  # Russian (Cyrillic)
"hin_Deva"  # Hindi (Devanagari)

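The format above can be checked mechanically before an ID is passed to the model. The following is a minimal sketch, not part of the library: `parse_lang_id` and `LangId` are hypothetical names, and the pattern assumes the casing seen in the examples (lowercase language code, capitalized script code). It checks the shape of the ID only, not whether the language is supported.

```python
import re
from typing import NamedTuple

# Shape check only: three lowercase letters, an underscore, then a
# capitalized four-letter script code. It does not verify that the codes
# are real ISO entries or that the model supports them.
_LANG_ID_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


class LangId(NamedTuple):
    language_code: str  # ISO 639-3, e.g. "eng"
    script: str         # ISO 15924, e.g. "Latn"


def parse_lang_id(lang_id: str) -> LangId:
    """Split a language ID into its parts, or raise ValueError if malformed."""
    match = _LANG_ID_PATTERN.fullmatch(lang_id)
    if match is None:
        raise ValueError(f"Malformed language ID: {lang_id!r}")
    return LangId(match.group(1), match.group(2))


print(parse_lang_id("cmn_Hans"))  # LangId(language_code='cmn', script='Hans')
```
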
Accessing the Language List

You can programmatically access the full list of supported languages:
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

# Print total count
print(f"Total supported languages: {len(supported_langs)}")

# Print all languages
for lang in supported_langs:
    print(lang)

Script Coverage

Omnilingual ASR supports multiple writing systems:
Latin Script (Latn) - Most widely used
The majority of supported languages use Latin script, including:
  • European languages (English, Spanish, French, German, etc.)
  • African languages (Swahili, Yoruba, Hausa, etc.)
  • Southeast Asian languages (Indonesian, Vietnamese, Filipino, etc.)
  • Indigenous American languages (Quechua, Guarani, Nahuatl, etc.)
"eng_Latn", "spa_Latn", "fra_Latn", "deu_Latn", "ita_Latn",
"por_Latn", "pol_Latn", "nld_Latn", "ind_Latn", "vie_Latn"

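Because the script is the part of each ID after the underscore, script coverage can be summarized with a simple count. This is a sketch under that assumption; `count_by_script` is a hypothetical helper, and the short list below is a stand-in so the example runs without the package installed.

```python
from collections import Counter
from typing import Iterable


def count_by_script(lang_ids: Iterable[str]) -> Counter:
    """Count language IDs per script code (the part after the underscore).

    Assumes every ID is well formed, i.e. contains an underscore.
    """
    return Counter(lang_id.rsplit("_", 1)[1] for lang_id in lang_ids)


# Small stand-in list taken from the examples on this page; pass
# supported_langs instead to summarize the full set.
sample_ids = [
    "eng_Latn", "spa_Latn", "cmn_Hans", "cmn_Hant",
    "rus_Cyrl", "hin_Deva", "arb_Arab", "vie_Latn",
]

# Print scripts from most to least common.
for script, count in count_by_script(sample_ids).most_common():
    print(f"{script}: {count}")
```
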
Language Statistics

Total Languages

1,682 language-script combinations

High Performance

78% of supported languages with CER below 10% (7B LLM-ASR model)

New Coverage

Hundreds of previously uncovered languages

Using Language IDs in Code

Here’s how to use language IDs with the inference pipeline:
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_7B_v2")

# Transcribe English audio
transcriptions = pipeline.transcribe(
    ["/path/to/english_audio.wav"],
    lang=["eng_Latn"],
    batch_size=1
)
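
When a batch mixes languages, each audio file needs its own language ID. The sketch below assumes that `lang` entries are matched to audio files by position, as the single-item call above suggests; `build_batch` is a hypothetical helper and the paths are placeholders.

```python
from typing import Iterable, List, Tuple


def build_batch(items: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Turn (audio_path, lang_id) pairs into two parallel lists."""
    audio_paths: List[str] = []
    lang_ids: List[str] = []
    for audio_path, lang_id in items:
        audio_paths.append(audio_path)
        lang_ids.append(lang_id)
    return audio_paths, lang_ids


# Placeholder paths, for illustration only.
audio_paths, lang_ids = build_batch([
    ("/path/to/english_audio.wav", "eng_Latn"),
    ("/path/to/german_audio.wav", "deu_Latn"),
])

# With a pipeline created as in the example above (assumed usage):
# transcriptions = pipeline.transcribe(audio_paths, lang=lang_ids, batch_size=2)
```

Keeping the pairs together until the last moment makes it harder for the two lists to drift out of alignment.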

Performance by Language

For detailed per-language performance metrics, see the complete results in the per_language_results_table_7B_llm_asr.csv file.
The 7B LLM-ASR model achieves character error rates (CER) below 10% for 78% of supported languages.
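
A summary like the 78% figure can be recomputed from a per-language table. The sketch below only illustrates the arithmetic: the column names `lang` and `cer` are hypothetical (check the header of the actual CSV), CER is assumed to be stored in percent, and the rows are made-up placeholders rather than real results.

```python
import csv
import io
from typing import Iterable, Mapping


def share_below(rows: Iterable[Mapping[str, str]], cer_column: str, threshold: float) -> float:
    """Return the fraction of rows whose CER is strictly below threshold."""
    values = [float(row[cer_column]) for row in rows]
    if not values:
        raise ValueError("no rows to summarize")
    return sum(value < threshold for value in values) / len(values)


# Made-up rows, for illustration only; these are NOT real results.
# The column names "lang" and "cer" are hypothetical.
toy_csv = "lang,cer\nxxx_Latn,4.0\nyyy_Latn,12.5\nzzz_Cyrl,8.0\nwww_Arab,9.5\n"
rows = list(csv.DictReader(io.StringIO(toy_csv)))

# Three of the four toy rows are below 10, so this prints "75% below 10% CER".
print(f"{share_below(rows, 'cer', 10.0):.0%} below 10% CER")
```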

Language Families

Omnilingual ASR covers languages from diverse language families:
  • Indo-European: Romance, Germanic, Slavic, Indo-Aryan, Iranian languages
  • Sino-Tibetan: Chinese varieties, Tibetan, Burmese
  • Niger-Congo: Bantu, Atlantic, Gur, Kwa languages
  • Austronesian: Indonesian, Filipino, Polynesian languages
  • Afro-Asiatic: Arabic, Amharic, Hausa, Hebrew
  • Dravidian: Tamil, Telugu, Kannada, Malayalam
  • Turkic: Turkish, Uzbek, Kazakh, Uyghur
  • Uralic: Finnish, Hungarian, Estonian
  • Indigenous American: Quechua, Guarani, Nahuatl, Aymara
  • And many more…

Finding Your Language

To find the correct language ID:
1. Identify ISO 639-3 Code

Look up your language’s three-letter code in the ISO 639-3 registry or on Ethnologue.
Example: English = eng, Spanish = spa

2. Determine Script

Identify which script your text uses:
  • Latin alphabet → Latn
  • Cyrillic → Cyrl
  • Arabic → Arab
  • See ISO 15924 for the complete list

3. Combine and Validate

Combine as {language_code}_{script} and check whether it is in the supported list:
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

lang_id = "eng_Latn"
if lang_id in supported_langs:
    print(f"✓ {lang_id} is supported")
else:
    print(f"✗ {lang_id} is not in the supported list")
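
The three steps can be wrapped in one function that also reports other scripts available for the same language when the requested combination is missing. This is a sketch only: `resolve_lang_id` is a hypothetical helper, and the small set below stands in for `supported_langs` so the example runs without the package.

```python
from typing import Collection, List


def resolve_lang_id(language_code: str, script: str, supported: Collection[str]) -> str:
    """Combine the two codes and return the ID if it is in `supported`.

    Raises ValueError otherwise, listing other supported scripts for the
    same language code when there are any.
    """
    lang_id = f"{language_code}_{script}"
    if lang_id in supported:
        return lang_id
    alternatives: List[str] = sorted(
        candidate for candidate in supported
        if candidate.startswith(f"{language_code}_")
    )
    hint = f" Supported for this language: {alternatives}" if alternatives else ""
    raise ValueError(f"{lang_id} is not supported.{hint}")


# Stand-in for supported_langs so the sketch runs without the package.
sample_supported = {"eng_Latn", "cmn_Hans", "cmn_Hant", "rus_Cyrl"}

print(resolve_lang_id("eng", "Latn", sample_supported))  # eng_Latn
```

Calling it with an unsupported combination such as `("cmn", "Latn")` raises a `ValueError` whose message names `cmn_Hans` and `cmn_Hant`.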

Complete Language List

The complete list of 1,682 supported language-script combinations is available in the source code at:
src/omnilingual_asr/models/wav2vec2_llama/lang_ids.py
A few example languages from the complete list:
# Sample of supported languages
"aae_Latn",  # Arbëreshë Albanian
"aka_Latn",  # Akan
"amh_Ethi",  # Amharic
"arb_Arab",  # Standard Arabic
"asm_Beng",  # Assamese
"bam_Latn",  # Bambara
"ben_Beng",  # Bengali
"bul_Cyrl",  # Bulgarian
"cat_Latn",  # Catalan
"ces_Latn",  # Czech
"cmn_Hans",  # Mandarin Chinese (Simplified)
"cmn_Hant",  # Mandarin Chinese (Traditional)
"dan_Latn",  # Danish
"deu_Latn",  # German
"ell_Grek",  # Greek
"eng_Latn",  # English
"epo_Latn",  # Esperanto
"eus_Latn",  # Basque
"fas_Arab",  # Persian
"fin_Latn",  # Finnish
"fra_Latn",  # French
"gle_Latn",  # Irish
"glg_Latn",  # Galician
"grn_Latn",  # Guarani
"guj_Gujr",  # Gujarati
"hat_Latn",  # Haitian Creole
"hau_Latn",  # Hausa
"heb_Hebr",  # Hebrew
"hin_Deva",  # Hindi
"hrv_Latn",  # Croatian
"hun_Latn",  # Hungarian
"hye_Armn",  # Armenian
"ibo_Latn",  # Igbo
"ind_Latn",  # Indonesian
"isl_Latn",  # Icelandic
"ita_Latn",  # Italian
"jav_Latn",  # Javanese
"jpn_Jpan",  # Japanese
"kan_Knda",  # Kannada
"kat_Geor",  # Georgian
"kaz_Cyrl",  # Kazakh
"khm_Khmr",  # Khmer
"kin_Latn",  # Kinyarwanda
"kor_Hang",  # Korean
"lao_Laoo",  # Lao
"lat_Latn",  # Latin
"lav_Latn",  # Latvian
"lit_Latn",  # Lithuanian
"lug_Latn",  # Luganda
"mal_Mlym",  # Malayalam
"mar_Deva",  # Marathi
"mkd_Cyrl",  # Macedonian
"mlg_Latn",  # Malagasy
"mon_Cyrl",  # Mongolian
"mri_Latn",  # Maori
"mya_Mymr",  # Burmese
"nep_Deva",  # Nepali
"nld_Latn",  # Dutch
"nob_Latn",  # Norwegian Bokmål
"nya_Latn",  # Chichewa
"ori_Orya",  # Odia
"pan_Guru",  # Punjabi
"pol_Latn",  # Polish
"por_Latn",  # Portuguese
"pus_Arab",  # Pashto
"ron_Latn",  # Romanian
"rus_Cyrl",  # Russian
"sin_Sinh",  # Sinhala
"slk_Latn",  # Slovak
"slv_Latn",  # Slovenian
"sna_Latn",  # Shona
"som_Latn",  # Somali
"spa_Latn",  # Spanish
"sqi_Latn",  # Albanian
"srp_Cyrl",  # Serbian
"sun_Latn",  # Sundanese
"swe_Latn",  # Swedish
"swh_Latn",  # Swahili
"tam_Taml",  # Tamil
"tel_Telu",  # Telugu
"tgk_Cyrl",  # Tajik
"tha_Thai",  # Thai
"tir_Ethi",  # Tigrinya
"tuk_Latn",  # Turkmen
"tur_Latn",  # Turkish
"ukr_Cyrl",  # Ukrainian
"urd_Arab",  # Urdu
"uzb_Latn",  # Uzbek
"vie_Latn",  # Vietnamese
"wol_Latn",  # Wolof
"xho_Latn",  # Xhosa
"ydd_Hebr",  # Yiddish
"yor_Latn",  # Yoruba
"zho_Hans",  # Chinese (Simplified)
"zul_Latn",  # Zulu
# ... and more than 1,500 others

Next Steps

Quick Start

Start transcribing with language codes

Language Conditioning

Improve accuracy with language hints

Model Selection

Choose the right model for your languages

Zero-Shot Learning

Add new languages with examples
