ICML 2026

COALA: Convex Optimization for Alignment and Preference Learning on a Single GPU

Department of Electrical Engineering, Stanford University
Figure. COALA reward margins increase monotonically; DPO is noisy and unstable.

COALA recasts preference fine-tuning as a convex program over the last layer of a two-layer ReLU network stacked on a frozen backbone, solved reference-free via an ADMM reformulation (CRONOS). Empirically its reward margins rise steadily and stably (above) — consistent with the convex formulation and its convergence guarantees — whereas DPO's are noisier and more hyperparameter-sensitive. COALA drops the reference model, trains end-to-end on a single RTX 4090, and matches or beats DPO and ORPO using as little as ~17.6% of DPO's TFLOPs.

Overview

Aligning large language models with human preferences underpins systems such as ChatGPT and Gemini, but the dominant approaches are heavy. Reinforcement Learning from Human Feedback (RLHF) is computationally expensive and complex, while Direct Preference Optimization (DPO) — though simpler — suffers from inconsistent ranking accuracy, high GPU requirements (it keeps a frozen reference model in memory), and expensive hyperparameter tuning.

We propose COALA (Convex Optimization for Alignment and preference Learning Algorithm): a lightweight, reference-free strategy with strong theoretical guarantees. By leveraging the convex reformulation of two-layer ReLU networks, COALA removes the reference model and obtains a significant reduction in both training time and VRAM, enabling efficient end-to-end training on a single consumer GPU. Across four datasets — including a 26,621-sample synthetic Educational Feedback set — and six models up to Llama-3.1-8B, COALA matches or beats DPO and ORPO while using as little as ~17.6% of DPO's total TFLOPs, and exhibits stable, monotonically increasing rewards. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.

Method

COALA stacks a convex two-layer ReLU network (cvxNN) on the frozen features of a pre-trained backbone $f_{\theta_\mathrm{pre}}$ and trains it in two convex stages: cvxNN training (Phase I), followed by reference-free convex preference fine-tuning (Phase II).

Phase I — CRONOS: cvxNN training

We solve the convex reformulation of the two-layer ReLU network (Pilanci & Ergen, 2020) with CRONOS, an ADMM solver using preconditioned conjugate gradients (Feng, Frangella & Pilanci, 2024). Once the two subproblems are solved, each ADMM iteration reduces to a vector addition and a single matrix–vector product, which JAX parallelizes efficiently on one GPU. This phase outputs convex-layer weights $(\Theta_1, \theta_2)$ with an $\mathcal{O}(1/K)$ ergodic convergence guarantee.
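
For reference, the convex reformulation of Pilanci & Ergen (2020) is usually stated, for squared loss, as the group-lasso program below. This is the standard form from that line of work and may differ in detail from the objective CRONOS solves inside COALA. The symbols $X$ (data matrix, here standing in for the frozen backbone features), $y$ (targets), $D_i$, $v_i$, $w_i$, $P$ and the regularization weight $\lambda$ are notation introduced for this sketch.

$$ \min_{\{v_i, w_i\}_{i=1}^{P}} \;\; \frac{1}{2}\Big\| \sum_{i=1}^{P} D_i X (v_i - w_i) - y \Big\|_2^2 + \lambda \sum_{i=1}^{P} \big(\|v_i\|_2 + \|w_i\|_2\big) \quad \text{s.t.} \quad (2D_i - I) X v_i \ge 0,\;\; (2D_i - I) X w_i \ge 0. $$

Each $D_i = \mathrm{diag}(\mathbb{1}[X g_i \ge 0])$ is a fixed ReLU activation pattern generated by some vector $g_i$, so the nonlinearity is absorbed into constant diagonal matrices and the program is convex in $(v_i, w_i)$. That structure is what makes an ADMM splitting applicable.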

Phase II — convex preference fine-tuning

We freeze the base model $f_{\theta_\mathrm{pre}}$ and the first-layer weights $\Theta_1$. The reference-free COALA loss is then convex in $\theta_2$:

$$ \min_{\theta_2}\;\; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\, \log\!\Big(1 + \exp\!\big(-\beta\, y_w\, \theta_2^\top (\Theta_1 f_{\theta_\mathrm{pre}}(x))_+ + \gamma\big)\Big). $$

This is a logistic regression in $\theta_2$ — smooth, convex, with a Lipschitz-continuous gradient — so accelerated gradient descent attains an $\mathcal{O}(1/k^2)$ convergence rate to the global optimum; in practice we minimize it with AdamW. Because only the last layer is trained, COALA fits on a single RTX 4090 (24 GB), whereas DPO and ORPO in our experiments require an A100 (40 GB).
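
The convexity argument above can be checked numerically with a small NumPy sketch. It makes several assumptions that are not from the paper: $y$ is read as a label ($+1$ for chosen features, $-1$ for rejected), the data are random stand-ins, and plain gradient descent with step size $1/L$ replaces the AdamW optimizer used in practice. The function name `coala_loss_and_grad` and all shapes are hypothetical. Since the Hessian is bounded by $\beta^2 Z^\top Z / (4n)$, taking $L = \beta^2 \sigma_{\max}(Z)^2 / (4n)$ gives a monotone decrease of the loss.

```python
import numpy as np

def coala_loss_and_grad(theta2, Z, y, beta=1.0, gamma=0.0):
    """Mean of log(1 + exp(-beta * y * Z @ theta2 + gamma)) and its gradient.

    Z : (n, m) frozen hidden activations (Theta1 f(x))_+
    y : (n,) labels, assumed here to be +1 (chosen) or -1 (rejected)
    """
    u = -beta * y * (Z @ theta2) + gamma
    loss = np.mean(np.logaddexp(0.0, u))        # stable log(1 + e^u)
    sig = 0.5 * (1.0 + np.tanh(0.5 * u))        # stable sigmoid(u) = d/du log(1 + e^u)
    grad = Z.T @ (-beta * y * sig) / len(y)
    return loss, grad

# Toy stand-ins (assumptions): random features and random frozen weights.
rng = np.random.default_rng(0)
n, d, m = 64, 8, 16
feats = rng.normal(size=(n, d))                 # plays the role of f_theta_pre(x)
Theta1 = rng.normal(size=(m, d))                # plays the role of frozen Phase I weights
Z = np.maximum(feats @ Theta1.T, 0.0)           # (Theta1 f(x))_+, computed once
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)    # toy labels

beta = 1.0
# Smoothness constant: the Hessian is bounded by beta^2 * Z^T Z / (4 n).
L = beta**2 * np.linalg.norm(Z, 2) ** 2 / (4 * n)

theta2 = np.zeros(m)
losses = []
for _ in range(200):
    loss, grad = coala_loss_and_grad(theta2, Z, y, beta=beta)
    losses.append(loss)
    theta2 = theta2 - grad / L                  # gradient step with step size 1/L
```

Because `Z` is computed once from frozen weights, each step costs only a pair of matrix–vector products in the last-layer dimension, which is the reason the memory footprint stays small.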

Results

Stable, monotonic reward margins

For the three models in the figure at the top of this page, each evaluated with and without SFT initialization, COALA empirically shows stable, steadily increasing reward margins, consistent with its convex formulation and convergence guarantees; DPO trajectories are noisier and more hyperparameter-sensitive, frequently plateauing or decaying.


AlpacaEval2 across 3 models × 3 datasets

COALA matches or beats DPO and ORPO on AlpacaEval2 length-controlled (LC) win rate across all three backbones and datasets.

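| Model | Method | LC Win Rate % (Edu) | LC Win Rate % (IMDb) | LC Win Rate % (Ultra) | Win Rate % (Edu) | Win Rate % (IMDb) | Win Rate % (Ultra) | Avg Length (Edu) | Avg Length (IMDb) | Avg Length (Ultra) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B | COALA | 24.61±0.30 | 24.88±1.46 | 20.84±1.35 | 23.82±1.38 | 23.11±1.96 | 20.91±1.65 | 592±37 | 418±30 | 459±35 |
| Mistral-7B | ORPO | 17.01±2.82 | 17.58±2.32 | 16.04±2.30 | 14.67±2.87 | 15.62±2.48 | 12.28±2.34 | 561±30 | 368±39 | 350±31 |
| Mistral-7B | DPO | 24.19±1.19 | 24.30±1.26 | 17.68±1.67 | 22.45±1.50 | 22.74±1.26 | 15.56±2.11 | 492±37 | 502±55 | 453±34 |
| Mistral-7B | SFT | 6.80±0.51 | 8.42±1.16 | 6.18±1.31 | 10.30±1.03 | 1.77±1.73 | 6.11±1.59 | 515±20 | 428±31 | 463±34 |
| Dolphin-7B | COALA | 40.81±0.22 | 39.72±0.36 | 31.58±0.34 | 39.05±0.29 | 38.46±0.41 | 30.21±0.38 | 439±38 | 454±33 | 445±37 |
| Dolphin-7B | ORPO | 25.06±1.93 | 24.90±1.23 | 22.94±1.26 | 23.59±1.18 | 22.92±1.77 | 25.93±2.51 | 526±9 | 448±37 | 452±45 |
| Dolphin-7B | DPO | 34.73±0.80 | 33.86±0.72 | 26.41±0.68 | 32.46±1.09 | 31.58±1.52 | 24.86±1.02 | 494±34 | 511±49 | 476±35 |
| Dolphin-7B | SFT | 17.36±0.38 | 16.21±0.23 | 14.88±0.46 | 15.30±0.10 | 15.47±0.44 | 12.76±1.08 | 404±39 | 423±25 | 459±24 |
| Llama-3.1-8B | COALA | 40.90±0.09 | 27.64±0.27 | 20.64±0.08 | 38.20±0.11 | 25.68±0.36 | 18.32±0.23 | 562±12 | 415±33 | 552±20 |
| Llama-3.1-8B | ORPO | 23.87±0.60 | 12.10±0.71 | 12.91±0.50 | 20.58±0.68 | 12.05±0.55 | 10.95±0.55 | 599±70 | 610±33 | 354±41 |
| Llama-3.1-8B | DPO | 40.68±0.10 | 21.79±0.29 | 18.89±0.31 | 38.53±0.47 | 20.18±0.49 | 15.81±0.30 | 539±24 | 449±98 | 503±39 |
| Llama-3.1-8B | SFT | 10.92±0.20 | 8.16±0.11 | 7.41±0.30 | 10.75±0.39 | 8.11±0.66 | 5.62±0.78 | 384±55 | 435±49 | 546±15 |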

Table 1. AlpacaEval2 metrics (mean ± std over 3 runs) by alignment method for three models across three datasets.


Real human evaluation (107 participants)

In a double-blind study, 107 volunteers (50 from a graduate deep-learning course and 57 from the technology sector) selected the most helpful answer (EduFeedback) or the most positive review (IMDb). COALA achieves the highest win rate on both datasets. Its 95% Wald intervals lie entirely above those of SFT and ORPO on both datasets, indicating statistically significant gains over those two baselines; against DPO, the intervals separate on IMDb but overlap on EduFeedback.

| Dataset | COALA | ORPO | DPO | SFT |
|---|---|---|---|---|
| EduFeedback | 39.1% ± 9.2% | 15.5% ± 6.8% | 28.8% ± 8.6% | 16.6% ± 7.1% |
| IMDb | 42.7% ± 9.4% | 20.1% ± 7.6% | 24.8% ± 8.2% | 12.4% ± 6.2% |

Table 2. Human-evaluation win rates (mean ± 95% CI) from a double-blind study with 107 evaluators across two datasets.
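
As a rough illustration of how a 95% Wald interval is formed, the sketch below applies the textbook formula $\hat p \pm z\sqrt{\hat p (1 - \hat p)/n}$ to two EduFeedback win rates from Table 2. The sample size `n_raters = 107` (one judgment per evaluator) is an assumption made here, not a detail confirmed by the source, and `wald_interval` is a hypothetical helper name.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Return (low, high) of the Wald interval for a proportion p_hat from n trials."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

n_raters = 107  # assumption: one judgment per evaluator per dataset

coala_lo, coala_hi = wald_interval(0.391, n_raters)  # COALA on EduFeedback
orpo_lo, orpo_hi = wald_interval(0.155, n_raters)    # ORPO on EduFeedback

# The two intervals are disjoint when COALA's lower end clears ORPO's upper end.
separated = coala_lo > orpo_hi
coala_half_width = (coala_hi - coala_lo) / 2
```

Under that assumed sample size, the COALA half-width comes out at roughly 9.2 percentage points, in line with the value reported in Table 2.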


Compute efficiency

On Llama-3.1-8B, COALA uses roughly 17.6% of DPO's TFLOPs — a 5.7× reduction — while attaining competitive or superior alignment quality.

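| Model | COALA | DPO | ORPO | SFT |
|---|---|---|---|---|
| DistilGPT2 | 152.56 | 537.12 | 643.33 | 271 |
| GPT-2 | 379.89 | 1,087.45 | 1,305.27 | 522 |
| Mistral-7B | 1,580.45 | 9,284.71 | 11,241.89 | 2,492 |
| Llama-3.1-8B | 1,805.39 | 10,253.37 | 12,352.98 | 2,851 |
| Dolphin-7B | 1,794.66 | 10,091.25 | 12,116.50 | 2,804 |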

Table 3. TFLOPs measurements per method on the EduFeedback dataset.
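
The headline ratio can be recomputed directly from the COALA and DPO columns of Table 3. The snippet below is only arithmetic on those published numbers; the dictionary layout and variable names are illustrative.

```python
# COALA and DPO TFLOPs on EduFeedback, copied from Table 3.
tflops = {
    "DistilGPT2":   {"COALA": 152.56,  "DPO": 537.12},
    "GPT-2":        {"COALA": 379.89,  "DPO": 1087.45},
    "Mistral-7B":   {"COALA": 1580.45, "DPO": 9284.71},
    "Llama-3.1-8B": {"COALA": 1805.39, "DPO": 10253.37},
    "Dolphin-7B":   {"COALA": 1794.66, "DPO": 10091.25},
}

# Fraction of DPO's compute that COALA uses, per model.
ratio = {name: row["COALA"] / row["DPO"] for name, row in tflops.items()}

# Reduction factor for the model quoted in the text.
llama_reduction = 1.0 / ratio["Llama-3.1-8B"]

for name, r in ratio.items():
    print(f"{name}: COALA uses {100 * r:.1f}% of DPO's TFLOPs")
```

The ratio differs from model to model, so the quoted 17.6% and 5.7× figures should be read as specific to Llama-3.1-8B.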

EduFeedback dataset

We release EduFeedback: 26,621 GPT-4o-generated student–tutor conversations spanning eleven fields of study, such as science and philosophy. Using the novel Alternating Population Strategy, we expand these into 65,606 preference triplets without any external reward or re-ranking model. For each prompt, the tutor's immediate response becomes the "chosen" answer, and a later, topically related but less direct response from the same conversation becomes the "rejected" answer. This exploits the natural drift from direct answers to tangential follow-ups in objective-domain dialogs.
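
The pairing rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the released pipeline: the function `build_triplets`, the `(student, tutor)` turn format, and the fixed `lag` used to pick the later response are all assumptions, and the real strategy may select the rejected response differently.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (student message, tutor reply); hypothetical format

def build_triplets(conversation: List[Turn], lag: int = 1) -> List[Dict[str, str]]:
    """Pair each prompt with its immediate reply (chosen) and a later reply (rejected)."""
    triplets = []
    for i, (prompt, chosen) in enumerate(conversation):
        j = i + lag
        if j < len(conversation):
            # A later tutor reply from the same conversation: on topic,
            # but not a direct answer to this prompt.
            rejected = conversation[j][1]
            triplets.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return triplets

# Toy conversation (invented for illustration only).
toy_conversation = [
    ("What is photosynthesis?", "It is how plants turn light into chemical energy."),
    ("Where does it happen?", "Mostly in the chloroplasts of leaf cells."),
    ("Why are leaves green?", "Chlorophyll reflects green light."),
]
toy_triplets = build_triplets(toy_conversation, lag=1)
```

No reward model or re-ranker appears anywhere in this construction; the preference signal comes entirely from the position of a response within the conversation.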

COALA trained on these pairs achieves a 39.1% win rate in human evaluation, suggesting the strategy captures genuine preference signal at scale. We note this is a domain-specific directness heuristic for objective, succinct dialogs rather than a general preference-labeling method, so only EduFeedback uses it; the three other datasets validate generalization.

Quickstart

The three-stage pipeline reproduces COALA on a single dataset / backbone pair:

# Stage 1 — extract frozen features (chosen / rejected)
python extract.py \
    --model_path  <hf-model-or-sft-ckpt> \
    --data_path   datasets/<your-dataset>/ \
    --pool        attn \
    --output_base extracted_features/

# Stage 2 — Phase I: train cvxNN via CRONOS (ADMM)
python cronos_trainer.py --model_name <model_name>

# Stage 3 — Phase II: convex preference fine-tune θ_2
python finetune_coala.py \
    --model_path  cvxNN_trained_<model_name>/ \
    --output_dir  Finetuned_cvxmlp_<model_name>/

BibTeX

@inproceedings{feng2026coala,
  title     = {Convex Optimization for Alignment and Preference Learning on a Single GPU},
  author    = {Feng, Miria and Pilanci, Mert},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  address   = {Seoul, South Korea}
}

Acknowledgements

This work was supported in part by the NSF CAREER Award (CCF-2236829), the National Institutes of Health (1R01AG08950901A1), the Office of Naval Research (N00014-24-1-2164), and DARPA (HR00112490441). Miria Feng was supported in part by the Stanford Graduate Fellowship. The authors thank Lucy Woof, Zachary Frangella, Kevin Nam, and Zhongwei Dang for valuable feedback and discussions, and all human volunteers at FCS Solutions for participating in the preference-alignment feedback survey.