Learning from Surrogate Data: Weak-to-Strong Generalization Through the Lens of High-Dimensional Regression

Title: Learning from Surrogate Data: Weak-to-Strong Generalization Through the Lens of High-Dimensional Regression

Conference: IMPMS 2026

Tags: Deterministic equivalent, High-dimensional regression, Knowledge distillation, Random feature ridge regression, Weak-to-strong generalization

Abstract:

It is increasingly common in machine learning to label data with learned models and then use that data to train more capable models. Weak-to-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels produced by a weak teacher, and yet the student outperforms the teacher. In this talk, I will start with ridgeless, high-dimensional regression and provide a sharp characterization of the risk of the target model when the surrogate model is either arbitrary or obtained via empirical risk minimization. This characterization shows that weak-to-strong training, with the surrogate as the weak model, provably outperforms training with strong labels under the same data budget, but it cannot improve the scaling law. Next, I will show that the scaling law can improve when both the student and the teacher are trained via random feature ridge regression. I will derive a dimension-free deterministic equivalent for the risk of the student trained on teacher labels and use it to identify regimes in which the scaling law of the student improves upon that of the teacher. The improvement can be achieved in both bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher, in fact even when the risk of the teacher does not decay with the sample size.
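
The two-stage procedure in the abstract can be illustrated with a small numerical sketch. This is not the construction analyzed in the talk: the linear target, Gaussian inputs, ReLU random features, all sizes, the ridge penalty, and the function names (`random_features`, `fit_ridge`, `run_experiment`) are assumptions chosen only for illustration. A teacher is fit by random feature ridge regression on noisy true labels, a student with more features is fit on the teacher's labels, and both are scored against the noiseless target.

```python
import numpy as np


def random_features(X, W):
    # ReLU random features, scaled by 1/sqrt(p) where p is the number of features.
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])


def fit_ridge(Phi, y, lam):
    # Closed-form ridge solution: (Phi^T Phi + lam * I)^{-1} Phi^T y.
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)


def run_experiment(d=20, n_teacher=200, n_student=2000,
                   p_teacher=50, p_student=500,
                   lam=1e-3, noise=0.1, n_test=5000, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical ground truth: a linear target on Gaussian inputs.
    beta = rng.standard_normal(d) / np.sqrt(d)

    def target(X):
        return X @ beta

    # Fixed random first-layer weights for teacher (few features) and student (many).
    W_t = rng.standard_normal((p_teacher, d)) / np.sqrt(d)
    W_s = rng.standard_normal((p_student, d)) / np.sqrt(d)

    # Stage 1: the weak teacher is trained on noisy true labels.
    X_t = rng.standard_normal((n_teacher, d))
    y_t = target(X_t) + noise * rng.standard_normal(n_teacher)
    a_t = fit_ridge(random_features(X_t, W_t), y_t, lam)

    def teacher(X):
        return random_features(X, W_t) @ a_t

    # Stage 2: the student sees fresh inputs labeled only by the teacher.
    X_s = rng.standard_normal((n_student, d))
    a_s = fit_ridge(random_features(X_s, W_s), teacher(X_s), lam)

    def student(X):
        return random_features(X, W_s) @ a_s

    # Both models are evaluated against the noiseless target on held-out inputs.
    X_test = rng.standard_normal((n_test, d))
    y_test = target(X_test)
    risk_teacher = float(np.mean((teacher(X_test) - y_test) ** 2))
    risk_student = float(np.mean((student(X_test) - y_test) ** 2))
    return risk_teacher, risk_student


if __name__ == "__main__":
    risk_teacher, risk_student = run_experiment()
    print(f"teacher risk: {risk_teacher:.4f}")
    print(f"student risk: {risk_student:.4f}")
```

Whether the student's risk falls below the teacher's in this toy setup depends on the chosen sizes, penalty, and noise level; the sketch only shows the mechanics of training on surrogate labels, not the regimes identified by the deterministic equivalent.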