COALA stacks a convex two-layer ReLU network (cvxNN) on the frozen features of a pre-trained backbone
$f_{\theta_\mathrm{pre}}$ and trains it in two convex stages: cvxNN training (Phase I),
followed by reference-free convex preference fine-tuning (Phase II).
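The head described above is small enough to write out. The following is a minimal JAX sketch, assuming the backbone features have already been computed and cached; the function name `cvxnn_score`, the array shapes, and the random stand-in data are illustrative assumptions, not the authors' code.

```python
import jax
import jax.numpy as jnp

def cvxnn_score(features, Theta1, theta2):
    """Two-layer ReLU head on frozen backbone features.

    features: (n, d) array standing in for f_pre(x); Theta1: (m, d); theta2: (m,).
    Returns the (n,) scores theta2^T (Theta1 f_pre(x))_+.
    """
    hidden = jax.nn.relu(features @ Theta1.T)  # (n, m) hidden activations
    return hidden @ theta2                     # linear readout

# Hypothetical shapes: 4 examples, feature dimension 8, 16 hidden units.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
features = jax.random.normal(k1, (4, 8))
Theta1 = jax.random.normal(k2, (16, 8))
theta2 = jax.random.normal(k3, (16,))
scores = cvxnn_score(features, Theta1, theta2)
```

With `Theta1` held fixed, the score is linear in `theta2`, which is the property the second training stage relies on.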
Phase I — CRONOS: cvxNN training
We solve the convex reformulation of the two-layer ReLU network
(Pilanci & Ergen, 2020)
with CRONOS, an ADMM solver using preconditioned conjugate gradients
(Feng, Frangella & Pilanci, 2024).
Beyond the two subproblem solves, each ADMM iteration needs only a vector addition and a single
matrix–vector product, which JAX parallelizes efficiently on one GPU. This phase outputs convex-layer
weights $(\Theta_1, \theta_2)$ with an $\mathcal{O}(1/K)$ ergodic convergence guarantee, where $K$ is the number of ADMM iterations.
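To give a feel for the kind of linear solve that preconditioned conjugate gradients handles inside an ADMM loop, here is a generic JAX sketch. It is not CRONOS itself: the system $(X^\top X + \rho I)u = b$, the Jacobi (diagonal) preconditioner, and the names `pcg_solve` and `rho` are assumptions chosen for illustration, since this section does not specify the subproblems or the preconditioner.

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def pcg_solve(X, b, rho=1.0, tol=1e-6, maxiter=200):
    """Solve (X^T X + rho I) u = b with Jacobi-preconditioned CG (illustrative only)."""
    # Matrix-free operator: X^T X is never formed explicitly.
    matvec = lambda u: X.T @ (X @ u) + rho * u
    # Diagonal of X^T X + rho I: squared column norms of X plus rho.
    diag = jnp.sum(X * X, axis=0) + rho
    precond = lambda r: r / diag  # applies the inverse of the diagonal
    u, _ = cg(matvec, b, tol=tol, maxiter=maxiter, M=precond)
    return u

# Random stand-in data, not taken from the paper.
key_X, key_b = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_X, (64, 16))
b = jax.random.normal(key_b, (16,))
u = pcg_solve(X, b)
residual = jnp.linalg.norm(X.T @ (X @ u) + 1.0 * u - b)
```

Each CG step costs two matrix–vector products with `X`, so the whole solve maps onto the same GPU-friendly primitives as the rest of the iteration.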
Phase II — convex preference fine-tuning
We freeze the base model $f_{\theta_\mathrm{pre}}$ and the first-layer weights $\Theta_1$. The
reference-free COALA loss is then convex in $\theta_2$:
$$
\min_{\theta_2}\;\;
\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\,
\log\!\Big(1 + \exp\!\big(-\beta\, y_w\, \theta_2^\top
(\Theta_1 f_{\theta_\mathrm{pre}}(x))_+ + \gamma\big)\Big).
$$
This is a logistic regression in $\theta_2$: the objective is smooth and convex with a Lipschitz-continuous
gradient, so accelerated gradient descent converges to the global optimum at an $\mathcal{O}(1/k^2)$ rate
in the iteration count $k$; in practice we minimize it with AdamW. Because only the last layer is trained, COALA fits on a
single RTX 4090 (24 GB), whereas DPO and ORPO in our experiments require an A100 (40 GB).
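The Phase II objective can be sketched in a few lines of JAX. This is a minimal illustration under stated assumptions: the hidden activations $(\Theta_1 f_{\theta_\mathrm{pre}}(x))_+$ are precomputed into a matrix `H`, $y_w$ is treated as a scalar in $\{-1, +1\}$ per example, the data are synthetic, and Nesterov's accelerated gradient method stands in for the AdamW optimizer used in practice. The names `coala_loss` and `nesterov_fit` are hypothetical.

```python
import jax
import jax.numpy as jnp

def coala_loss(theta2, H, y_w, beta=1.0, gamma=0.0):
    """Mean of log(1 + exp(-beta * y_w * (H @ theta2) + gamma))."""
    z = -beta * y_w * (H @ theta2) + gamma
    return jnp.mean(jax.nn.softplus(z))  # softplus(z) = log(1 + exp(z)), computed stably

def nesterov_fit(H, y_w, beta=1.0, gamma=0.0, steps=200):
    n, m = H.shape
    # Gradient Lipschitz bound beta^2 * lambda_max(H^T H) / (4n), assuming y_w in {-1, +1}.
    L = beta**2 * jnp.linalg.norm(H, ord=2) ** 2 / (4.0 * n)
    grad = jax.jit(jax.grad(coala_loss))
    theta = jnp.zeros(m)
    v = theta  # extrapolated point
    for k in range(steps):
        theta_next = v - grad(v, H, y_w, beta, gamma) / L         # gradient step at v
        v = theta_next + (k / (k + 3.0)) * (theta_next - theta)   # momentum extrapolation
        theta = theta_next
    return theta

# Synthetic stand-ins for the frozen features and first-layer weights.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
features = jax.random.normal(k1, (256, 8))
Theta1 = jax.random.normal(k2, (16, 8))
H = jax.nn.relu(features @ Theta1.T)               # frozen hidden activations
y_w = jnp.sign(H @ jax.random.normal(k3, (16,)))   # hypothetical +/-1 labels
theta2 = nesterov_fit(H, y_w)
loss_start = coala_loss(jnp.zeros(16), H, y_w)
loss_end = coala_loss(theta2, H, y_w)
```

Only the `m`-dimensional vector `theta2` receives gradients here, which is why the memory footprint stays small compared with fine-tuning the full model.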