CLD: Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Overview

Multilingual ASR systems such as Whisper rely on a language token to steer the decoder, and MMS-style systems route audio through language-specific adapters. In both cases, an upstream language identification (LID) step largely determines transcription quality. Off-the-shelf LID modules degrade sharply on accented speech and low-resource languages, where labeled training data is scarce and dialectal variation is high. Standard neural detection heads can close this gap, but require extensive data, long training, and careful hyperparameter tuning.

We propose Convex Language Detection (CLD): a tiny, convex two-layer detection head that operates on pooled ASR encoder features and is trained globally and reproducibly via an ADMM-based convex reformulation, implemented in JAX. CLD plugs into existing ASR pipelines with a few lines of code and produces a calibrated language prediction in a single forward pass, which is then used as the Whisper language token or the MMS adapter id.

The CLD Model

CLD slots a tiny convex two-layer network between a frozen ASR encoder and its decoder. At training time, we mean-pool the encoder's hidden states into one utterance vector per clip and fit a convex reformulation of the detection head with ADMM in JAX. Our formulation ensures a single batched, multi-GPU solve that converges to a global optimum without backprop, learning-rate schedules, or hyperparameter sweeps.
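The pooling step described above can be sketched in a few lines. This is only an illustration with NumPy, not the package's code: the function name `mean_pool`, the array shapes, and the choice to mask out padded frames are all assumptions made for the example.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, frame_mask: np.ndarray) -> np.ndarray:
    """Average encoder frames into one vector per clip, ignoring padding.

    hidden_states: (batch, frames, dim) encoder outputs (shape is an assumption).
    frame_mask:    (batch, frames), 1 for real frames and 0 for padded frames.
    """
    mask = frame_mask[..., None].astype(hidden_states.dtype)  # (batch, frames, 1)
    summed = (hidden_states * mask).sum(axis=1)               # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)             # guard against empty clips
    return summed / counts

# Toy batch: 2 clips, 3 frames, 2 features; the second clip has one padded frame.
states = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(states, mask)  # one utterance vector per clip, shape (2, 2)
```

Each row of `pooled` is the fixed-size utterance vector that a detection head would consume, regardless of clip length.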

At inference, the same pooling step followed by one lightweight forward pass turns each waveform into a predicted language token in well under 500 ms. For Whisper, that token becomes the decoder's start-of-transcript language tag; for MMS, it selects the language-specific adapter. The downstream ASR pipeline is otherwise untouched, so CLD drops in as a few lines of code while making transcription noticeably more robust to accent and low-resource conditions.
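To make the routing step concrete, here is a minimal sketch of how a predicted language code could be turned into a backbone-specific identifier. The function `route` and both lookup tables are hypothetical. The Whisper entries follow the `<|xx|>` language-token convention; the MMS adapter ids are assumed ISO 639-3 style codes and should be checked against the adapter list of the MMS checkpoint in use.

```python
# Hypothetical lookup tables for illustration only.
LANGUAGES = ["en", "hi", "id", "ms", "zh"]

# Whisper marks the language with a special token of the form <|xx|>.
WHISPER_LANG_TOKEN = {lang: f"<|{lang}|>" for lang in LANGUAGES}

# Assumed MMS adapter ids; verify these against the checkpoint's adapter list.
MMS_ADAPTER_ID = {
    "en": "eng",
    "hi": "hin",
    "id": "ind",
    "ms": "zlm",
    "zh": "cmn-script_simplified",
}

def route(pred_lang: str, backbone: str) -> str:
    """Map a predicted language code to the identifier the backbone expects."""
    if backbone == "whisper":
        return WHISPER_LANG_TOKEN[pred_lang]
    if backbone == "mms":
        return MMS_ADAPTER_ID[pred_lang]
    raise ValueError(f"unknown backbone: {backbone!r}")

whisper_tag = route("zh", "whisper")
mms_adapter = route("hi", "mms")
```

Keeping the mapping in plain dictionaries means the detection head itself stays backbone-agnostic: only this last lookup changes between Whisper and MMS.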

Further Experiments

Sample efficiency in low-resource regimes

We compare CLD against vanilla neural network (NN) heads and a fine-tuned whisper-small baseline as the training budget grows from 100 to 10,000 utterances per class, evaluating on a fixed held-out test set. The NN and fine-tuned whisper-small baselines improve only as the training budget grows, reflecting the data demands of traditional LID tuning. CLD's accuracy, by contrast, is nearly invariant to dataset size: detection accuracy ranges from 96.94% at 10,000 samples to 99.14% at 1,000 samples, and CLD already reaches 98.33% accuracy and 35.95 WER with only 100 samples per class. These results indicate that CLD recovers near-saturation performance with two orders of magnitude less labeled audio than the neural baselines.
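A per-class training budget of this kind can be built by subsampling each language independently while leaving the test set alone. The sketch below shows one way to do that; the function `subsample_per_class`, the `(utterance_id, lang)` record layout, and the toy pool are assumptions for illustration and are not taken from the experiments' actual data pipeline.

```python
import random
from collections import defaultdict

def subsample_per_class(examples, budget, seed=0):
    """Return at most `budget` (utterance_id, lang) pairs per language.

    examples: list of (utterance_id, lang) pairs (hypothetical layout).
    """
    by_lang = defaultdict(list)
    for utt_id, lang in examples:
        by_lang[lang].append(utt_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for lang in sorted(by_lang):
        ids = by_lang[lang]
        chosen = rng.sample(ids, min(budget, len(ids)))
        subset.extend((utt_id, lang) for utt_id in chosen)
    return subset

# Toy pool: 50 English and 30 Chinese utterance ids.
pool = [(f"en_{i}", "en") for i in range(50)] + [(f"zh_{i}", "zh") for i in range(30)]
train_subset = subsample_per_class(pool, budget=20)
```

Repeating the call with budgets of 100, 500, 1,000, and 10,000 on the real training pool would give the nested budgets compared in this section.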

Training efficiency under JAX-ADMM

CLD's convex reformulation admits a unique global optimum and is solved with a single ADMM run in JAX, removing the need for learning-rate scheduling or hyperparameter search. End-to-end training completes in approximately 64 seconds and consumes roughly 13× fewer TFLOPs than the vanilla NN baseline (not including hyperparameter search), while attaining superior accuracy. This reproducibility and low compute footprint make CLD practical for rapid iteration on new language pairs and datasets.
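For readers unfamiliar with ADMM, the sketch below runs it on a small lasso problem. It is deliberately not the CLD solver: the convex reformulation of the two-layer head and its JAX implementation are not reproduced here, and this example uses NumPy on a generic problem. It is included only to show why an ADMM solve needs no learning rate: each iteration is a closed-form linear solve, a soft-thresholding step, and a dual update, governed by a single penalty parameter `rho`.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||z||_1 subject to x = z."""
    n = A.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    # The x-update solves the same linear system every iteration, so factor once.
    chol = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))  # x-update
        z = soft_threshold(x + u, lam / rho)                      # z-update
        u = u + x - z                                             # dual update
    return z

# With A = I the lasso solution is soft_threshold(b, lam), which gives an easy check.
A = np.eye(3)
b = np.array([3.0, -0.5, 1.5])
solution = admm_lasso(A, b, lam=1.0)
```

Because the objective is convex, the iterates approach the same minimizer from any starting point, which is the property the section relies on when it describes training as reproducible.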

Robustness across dialects and accents

At a constrained 500-sample training budget, the NN head exhibits a strong English bias inherited from the Whisper feature space: it attains 100% on selected English accents but misclassifies 88.88% of Chinese samples as English, with per-dialect accuracy collapsing to 8.78% on Mainland Chinese and 8.84% on Taiwanese Mandarin. This illustrates a well-known failure mode of fine-tuning on imbalanced pretraining representations.

CLD exhibits substantially flatter performance across dialects. It exceeds 94% accuracy on every tested accent and reaches 88.73% on Min Dong Chinese, a dialect on which vanilla Whisper and the fine-tuned NN achieve only 9.86% and 25.35% respectively. The uniformity and low variance across dialects suggest that CLD is a robust choice for deploying ASR in accent-diverse, low-resource conditions.
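Per-dialect figures like these come from grouping test predictions by accent or dialect before scoring. A minimal sketch of that bookkeeping follows; the function `per_group_accuracy`, the record layout, and the toy group names are hypothetical and carry no real results.

```python
from collections import defaultdict
from statistics import pstdev

def per_group_accuracy(records):
    """records: iterable of (group, true_lang, pred_lang); returns {group: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, true_lang, pred_lang in records:
        totals[group] += 1
        hits[group] += int(true_lang == pred_lang)
    return {group: hits[group] / totals[group] for group in totals}

# Toy predictions, not real data.
records = [
    ("dialect_a", "zh", "zh"), ("dialect_a", "zh", "en"),
    ("dialect_b", "zh", "zh"), ("dialect_b", "zh", "zh"),
    ("accent_c", "en", "en"), ("accent_c", "en", "en"),
]
acc = per_group_accuracy(records)
worst_group = min(acc, key=acc.get)     # group with the lowest accuracy
spread = pstdev(list(acc.values()))     # dispersion of accuracy across groups
```

Reporting the worst group and the spread alongside the overall accuracy is what exposes the kind of English bias described above, which a single aggregate number would hide.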

Quickstart

Install the package from PyPI:

pip install jaxcld

Attach a CLD head to a Whisper model and run inference:

import numpy as np
from jaxcld import ASRModel, CVXNNLangDetectHead

languages = ["en", "hi", "id", "ms", "zh"]
asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})

head = CVXNNLangDetectHead.load("path/to/whisper-small_trained_cvx_mlp.pkl", asr)
asr.set_lang_detect_head(head)

audio_16k_mono: np.ndarray = ...  # shape (T,), 16 kHz mono
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print(pred_langs[0], pred_texts[0])
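The snippet above expects a one-dimensional float waveform at 16 kHz. If your audio arrives as integer PCM or with several channels, a small helper like the one below can bring it into that shape first. The helper `to_mono_float32` is hypothetical and not part of the package; it assumes a `(T, channels)` layout for multi-channel input and does not resample, so audio at other sampling rates still needs to be converted to 16 kHz separately.

```python
import numpy as np

def to_mono_float32(wave) -> np.ndarray:
    """Convert PCM or float audio to a 1-D float32 waveform.

    Assumes multi-channel input is laid out as (T, channels). No resampling is done.
    """
    wave = np.asarray(wave)
    if np.issubdtype(wave.dtype, np.integer):
        scale = float(np.iinfo(wave.dtype).max)  # e.g. 32767 for int16
        wave = wave.astype(np.float32) / scale   # scale to roughly [-1, 1]
    else:
        wave = wave.astype(np.float32)
    if wave.ndim == 2:
        wave = wave.mean(axis=1)                 # downmix channels to mono
    if wave.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {wave.shape}")
    return wave

# Toy stereo int16 clip with two frames.
stereo = np.array([[32767, 0], [0, 0]], dtype=np.int16)
mono = to_mono_float32(stereo)
```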

Pre-trained Models on Hugging Face

Trained convex heads are published on the Hugging Face Hub. Each checkpoint contains a CVXNNLangDetectHead (model.pkl) that pairs with its corresponding frozen ASR backbone and can be fetched with a single hf_hub_download call.

Model	Backbone	Languages	Det. Acc	WER ↓	CER ↓	Hub
`cld-whisper-small-5lang`	whisper-small	en, hi, id, ms, zh	0.98	48.23	27.47	🤗 Hub
`cld-whisper-large-v3-5lang`	whisper-large-v3	en, hi, id, ms, zh	0.98	31.11	19.81	🤗 Hub
`cld-mms-1b-5lang`	mms-1b-all	en, hi, id, ms, zh	0.96	48.10	23.47	🤗 Hub
`cld-whisper-small-enzh` (100–10,000 samples/class)	whisper-small	en, zh	0.99–1.00	—	—	🤗 Hub

Loading from the Hub

For the five-language models, use the following snippet:

import numpy as np
from huggingface_hub import hf_hub_download
from jaxcld import ASRModel, CVXNNLangDetectHead

languages = ["en", "hi", "id", "ms", "zh"]

# 1) Load the frozen base ASR model
asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})

# 2) Download the convex head from the Hub and load it
head_path = hf_hub_download("williamhtan/cld-whisper-small-5lang", "model.pkl")
head = CVXNNLangDetectHead.load(head_path, asr)

# 3) Attach the head
asr.set_lang_detect_head(head)

# 4) Grab one clip from the dataset's test split
from datasets import load_dataset

sample = next(iter(load_dataset("williamhtan/cld-multi-dataset", split="test", streaming=True)))
audio_16k_mono = sample["audio"]["array"]   # (T,) float waveform, 16 kHz mono

# 5) Predict
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print("true:", sample["lang"], "| pred:", pred_langs[0], "| text:", pred_texts[0])

For the low-resource binary model, specify the samples-per-class subfolder corresponding to your training budget:

# Example: load the 1,000-samples-per-class checkpoint
head_path = hf_hub_download("williamhtan/cld-whisper-small-enzh", "1000/model.pkl")

Available subfolder options are 100, 500, 1000, and 10000, corresponding to the per-class training budgets evaluated in the sample-efficiency experiments.
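A small helper can guard against requesting a budget that has no checkpoint. The function `enzh_checkpoint_filename` below is hypothetical, and it assumes every budget follows the same `<budget>/model.pkl` layout as the 1,000-sample example above.

```python
VALID_BUDGETS = (100, 500, 1000, 10000)

def enzh_checkpoint_filename(samples_per_class: int) -> str:
    """Build the in-repo filename for a given per-class budget.

    Assumes every budget uses the "<budget>/model.pkl" layout.
    """
    if samples_per_class not in VALID_BUDGETS:
        raise ValueError(
            f"no checkpoint for {samples_per_class} samples/class; "
            f"choose one of {VALID_BUDGETS}"
        )
    return f"{samples_per_class}/model.pkl"

filename = enzh_checkpoint_filename(1000)

# The result can then be passed to the download call shown earlier, e.g.:
# head_path = hf_hub_download("williamhtan/cld-whisper-small-enzh", filename)
```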

BibTeX

@inproceedings{feng2026cld,
  title     = {Convex Low-resource Accent-Robust Language Detection in Speech Recognition},
  author    = {Feng, Miria and Tan, William and Pilanci, Mert},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  url       = {https://icml.cc/virtual/2026/poster/64615}
}