ICML 2026

CLD: Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Miria Feng1, William Tan2, Mert Pilanci1
1Department of Electrical Engineering, Stanford University, 2Department of Computer Science, Stanford University
CLD headline results (Table 3)

CLD attaches a lightweight convex language-detection head to a frozen ASR encoder (Whisper / MMS-1B), predicting the input language before decoding. Despite training in seconds on pooled encoder features, it consistently matches or beats standard baselines for both language identification and downstream WER, and is especially robust in low-resource and accented regimes.

Overview

Multilingual ASR systems such as Whisper rely on a language token to steer the decoder, and MMS-style systems route audio through language-specific adapters. In both cases, an upstream language identification (LID) step largely determines transcription quality. Off-the-shelf LID modules degrade sharply on accented speech and low-resource languages, where labeled training data is scarce and dialectal variation is high. Standard neural detection heads can close this gap, but require extensive data, long training, and careful hyperparameter tuning.

We propose Convex Language Detection (CLD): a tiny, convex two-layer detection head that operates on pooled ASR encoder features and is trained globally and reproducibly via an ADMM-based convex reformulation, implemented in JAX. CLD plugs into existing ASR pipelines with a few lines of code and produces a calibrated language prediction in a single forward pass, which is then used as the Whisper language token or the MMS adapter id.

The CLD Model

CLD model diagram

CLD slots a tiny convex two-layer network between a frozen ASR encoder and its decoder. At training time, we mean-pool the encoder's hidden states into one utterance vector per clip and fit a convex reformulation of the detection head with ADMM in JAX. Our formulation ensures a single batched, multi-GPU solve that converges to a global optimum without backprop, learning-rate schedules, or hyperparameter sweeps.
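The pooling step can be illustrated with a minimal NumPy sketch. The function name `mean_pool`, the `lengths` argument, and the choice to mask out padded frames are assumptions made for illustration; the text above only states that encoder hidden states are mean-pooled into one vector per clip.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Average encoder frames into one vector per utterance, ignoring padding.

    hidden_states: (batch, frames, dim) encoder outputs.
    lengths: (batch,) number of valid (non-padded) frames per clip (assumed input).
    """
    _, frames, _ = hidden_states.shape
    # mask[b, t] is 1.0 for valid frames and 0.0 for padded frames.
    mask = (np.arange(frames)[None, :] < lengths[:, None]).astype(hidden_states.dtype)
    summed = (hidden_states * mask[:, :, None]).sum(axis=1)
    # Guard against division by zero for empty clips.
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    return summed / counts

# Toy batch: 2 clips, 5 frames, 4 feature dimensions; the second clip has 3 valid frames.
rng = np.random.default_rng(0)
states = rng.normal(size=(2, 5, 4))
pooled = mean_pool(states, np.array([5, 3]))
print(pooled.shape)  # (2, 4)
```

Each row of `pooled` is the fixed-size utterance vector that a detection head would consume.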

At inference, the same pooling step followed by one lightweight forward pass turns each waveform into a predicted language token in well under 500 ms. For Whisper, that token becomes the decoder's start-of-transcript language tag; for MMS, it selects the language-specific adapter. The downstream ASR pipeline is otherwise untouched, so CLD drops in as a few lines of code while making transcription noticeably more robust to accent and low-resource conditions.
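The routing step described above might look roughly like the following sketch. The `<|xx|>` form is the format Whisper uses for its language tokens; the `route` function, the `MMS_ADAPTER_IDS` table, and its ISO 639-3 style values are hypothetical and only illustrate the idea, since the actual adapter ids depend on the MMS checkpoint in use.

```python
import numpy as np

LANGUAGES = ["en", "hi", "id", "ms", "zh"]

# Hypothetical adapter ids for illustration only; check the MMS checkpoint
# for the identifiers it actually exposes.
MMS_ADAPTER_IDS = {"en": "eng", "hi": "hin", "id": "ind", "ms": "zlm", "zh": "cmn"}

def route(scores: np.ndarray) -> tuple[str, str, str]:
    """Map per-language scores for one utterance to routing identifiers."""
    lang = LANGUAGES[int(np.argmax(scores))]
    whisper_tag = f"<|{lang}|>"       # Whisper language-token format
    mms_id = MMS_ADAPTER_IDS[lang]    # assumed adapter lookup
    return lang, whisper_tag, mms_id

# Toy scores for one utterance; the second entry ("hi") is the largest.
scores = np.array([0.1, 2.3, -0.4, 0.0, 1.1])
lang, whisper_tag, mms_id = route(scores)
print(lang, whisper_tag, mms_id)  # hi <|hi|> hin
```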

Further Experiments

Sample efficiency in low-resource regimes

Sample efficiency (Figures 1 & 2)

We evaluate CLD against standard neural network (NN) heads as the training budget grows from 100 to 10,000 utterances per class, with a fixed held-out test set. The NN and fine-tuned whisper-small baselines improve only as the training budget grows, reflecting the data demands of traditional LID tuning. CLD's accuracy, by contrast, is nearly invariant to dataset size: detection accuracy ranges from 96.94% at 10,000 samples to 99.14% at 1,000 samples, and CLD achieves 98.33% accuracy and 35.95 WER at only 100 samples per class. These results indicate that CLD recovers near-saturation performance with two orders of magnitude less labeled audio than the neural baselines.
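A per-class training budget of this kind can be drawn with a simple stratified subsample. The sketch below is an assumption about how such a split could be built, not the authors' data pipeline; `subsample_per_class` and the toy label array are hypothetical.

```python
import numpy as np

def subsample_per_class(labels: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """Return sorted indices holding at most `budget` examples of each class."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        take = min(budget, idx.size)  # never request more than the class has
        keep.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(keep))

# Toy pool: 5 languages with 2,000 utterances each.
labels = np.repeat(np.arange(5), 2000)
sizes = {b: subsample_per_class(labels, b).size for b in (100, 1000)}
print(sizes)  # {100: 500, 1000: 5000}
```

Keeping the test set fixed while only the training indices change makes the accuracy curves comparable across budgets.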


Training efficiency under JAX-ADMM

Training efficiency (Table 1)

CLD's convex reformulation admits a unique global optimum and is solved with a single ADMM run in JAX, removing the need for learning-rate scheduling or hyperparameter search. End-to-end training completes in approximately 64 seconds and consumes roughly 13× fewer TFLOPs than the vanilla NN baseline (not including hyperparameter search), while attaining superior accuracy. This reproducibility and low compute footprint make CLD practical for rapid iteration on new language pairs and datasets.
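To give a feel for why no learning-rate schedule is needed, here is a rough NumPy sketch of ADMM applied to one well-known convex two-layer formulation: a gated-ReLU model trained with a squared loss and a group-lasso penalty. This is not the paper's JAX solver or its exact objective. The randomly sampled gates, the one-hot regression targets, the function names, and the values of `beta`, `rho`, and `iters` are all assumptions chosen for illustration.

```python
import numpy as np

def gate_features(X, gates):
    """Stack the gated copies D_i X side by side, giving shape (n, P * d)."""
    masks = (X @ gates >= 0).astype(X.dtype)  # (n, P): diagonal of each D_i
    return (masks[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)

def objective(A, Y, V, beta, P):
    """0.5 * ||A V - Y||_F^2 + beta * sum_i ||V_i||_F, convex in V."""
    blocks = V.reshape(P, -1)
    return 0.5 * np.sum((A @ V - Y) ** 2) + beta * np.sum(np.linalg.norm(blocks, axis=1))

def fit_admm(A, Y, P, beta=1e-2, rho=1.0, iters=200):
    """Scaled-form ADMM on the split V = Z."""
    m, c = A.shape[1], Y.shape[1]
    M = A.T @ A + rho * np.eye(m)  # fixed system matrix, reused every iteration
    AtY = A.T @ Y
    Z = np.zeros((m, c))
    U = np.zeros((m, c))
    for _ in range(iters):
        # V-update: ridge-like linear solve.
        V = np.linalg.solve(M, AtY + rho * (Z - U))
        # Z-update: block soft-thresholding, one block per gate.
        W = (V + U).reshape(P, -1)
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        scale = np.maximum(1.0 - (beta / rho) / np.maximum(norms, 1e-12), 0.0)
        Z = (scale * W).reshape(m, c)
        # Dual update.
        U = U + V - Z
    return Z

# Toy data: three well-separated 2-D clusters standing in for pooled features.
rng = np.random.default_rng(0)
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
centers = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
y = np.repeat(np.arange(3), 100)
X = centers[y] + 0.3 * rng.normal(size=(300, 2))
Y = np.eye(3)[y]  # one-hot targets

G = rng.normal(size=(2, 4))
gates = np.concatenate([G, -G], axis=1)  # paired gates so every input activates some gate
P = gates.shape[1]

A = gate_features(X, gates)
Z = fit_admm(A, Y, P)
pred = np.argmax(A @ Z, axis=1)
print("train accuracy:", np.mean(pred == y))
```

Because the objective is convex in the stacked weights, the iterations involve only a linear solve, a closed-form shrinkage, and a dual update, with no gradient steps to tune.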


Robustness across dialects and accents

Dialectal robustness (Table 2)

At a constrained 500-sample training budget, the NN head exhibits a strong English bias inherited from the Whisper feature space: it attains 100% accuracy on selected English accents but misclassifies 88.88% of Chinese samples as English, with per-dialect accuracy collapsing to 8.78% on Mainland Chinese and 8.84% on Taiwanese Mandarin. This illustrates a well-known failure mode of fine-tuning on imbalanced pretraining representations.

CLD exhibits substantially flatter performance across dialects. It exceeds 94% accuracy on every tested accent and reaches 88.73% on Min Dong Chinese, a dialect on which vanilla Whisper and the fine-tuned NN achieve only 9.86% and 25.35%, respectively. The uniformity and low variance across dialects suggest that CLD is a robust choice for deploying ASR in accent-diverse, low-resource conditions.
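Per-dialect accuracy and the rate at which one language is mistaken for another are simple to compute once predictions are grouped. The sketch below uses made-up toy labels, not the paper's data, and the helper names and group labels are hypothetical.

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each dialect or accent group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): float(np.mean(y_pred[groups == g] == y_true[groups == g]))
        for g in np.unique(groups)
    }

def confusion_rate(y_true, y_pred, true_label, pred_label):
    """Fraction of `true_label` samples that were predicted as `pred_label`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mask = y_true == true_label
    return float(np.mean(y_pred[mask] == pred_label))

# Toy predictions (illustrative only): an English-biased head on six clips.
y_true = ["zh", "zh", "zh", "zh", "en", "en"]
y_pred = ["en", "en", "zh", "en", "en", "en"]
groups = ["mainland", "mainland", "taiwan", "taiwan", "us", "uk"]

acc = per_group_accuracy(y_true, y_pred, groups)
print(acc)  # {'mainland': 0.0, 'taiwan': 0.5, 'uk': 1.0, 'us': 1.0}
print(confusion_rate(y_true, y_pred, "zh", "en"))  # 0.75
```

Reporting the spread of the per-group values, rather than a single pooled accuracy, is what exposes the kind of English bias described above.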

Quickstart

Install the package from PyPI:

pip install jaxcld

Attach a CLD head to a Whisper model and run inference:

import numpy as np
from cld import ASRModel, CVXNNLangDetectHead

# Candidate languages for the detection head.
languages = ["en", "hi", "id", "ms", "zh"]
asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})

# Load a previously trained CLD head and attach it to the ASR model.
head = CVXNNLangDetectHead.load("path/to/whisper-small_trained_cvx_mlp.pkl", asr)
asr.set_lang_detect_head(head)

# Replace the placeholder with your own waveform before calling predict.
audio_16k_mono: np.ndarray = ...  # shape (T,), 16 kHz mono
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print(pred_langs[0], pred_texts[0])  # predicted language and transcript

BibTeX

@inproceedings{feng2026cld,
  title     = {Convex Low-resource Accent-Robust Language Detection in Speech Recognition},
  author    = {Feng, Miria and Tan, William and Pilanci, Mert},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  url       = {https://icml.cc/virtual/2026/poster/64615}
}