IOS 2024: INFORMS OPTIMIZATION SOCIETY CONFERENCE 2024
PROGRAM FOR SUNDAY, MARCH 24TH

08:30-10:00 Session SunA 1
08:30
Private Online Prediction from Experts: Separations and Faster Rates

ABSTRACT. Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $O(\sqrt{T \log d} + \log d/\epsilon)$ for the stochastic setting and $O(\sqrt{T \log d} + T^{1/3} \log d/\epsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.

08:52
Advancements in Differentially Private Stochastic Convex Optimization: Towards Near-Linear Gradient Query Complexity

ABSTRACT. Differentially Private Stochastic Convex Optimization (DP-SCO) is a fundamental problem in private optimization, focusing on minimizing population loss under differential privacy constraints with i.i.d. samples from an underlying distribution over convex functions. Historically, this problem relied on DP Stochastic Gradient Descent (DP-SGD) variants to achieve optimal population loss. Despite these advances, reducing gradient query complexity, particularly for non-smooth convex functions, remains a formidable challenge. Prior approaches have attained linear gradient query complexity for smooth functions, yet this efficiency eludes non-smooth functions, typically demanding super-linear queries. We progress towards addressing this important open problem in private optimization: achieving near-linear gradient query complexity in non-smooth DP-SCO. We present a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density, which significantly reduces the gradient query complexity. Particularly for large datasets, we use ReSQue to design an algorithm with near-linear gradient query complexity.

Based on joint work with Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Aaron Sidford, and Kevin Tian.

09:14
Adversarial Robustness in the Streaming Model

ABSTRACT. There has been substantial interest in designing robust algorithms that provide utility guarantees even when their inputs are chosen adaptively. In this talk, we will overview recent algorithmic and analytic techniques for adversarial robustness in the streaming model. In particular, we will discuss exciting connections to differential privacy and cryptography.

09:36
Nearly Optimal Algorithms for Privately Learning Thresholds and Approximating Median

ABSTRACT. The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of O(log(1/\beta) / \alpha) (for generalization error \alpha with confidence 1−\beta). The private version of the problem, however, is more challenging. In particular, the sample complexity must depend on the size |X| of the domain. Progress on quantifying this dependence via lower and upper bounds was made in a line of work over the past decade.

In this talk, I will present a new algorithm for privately learning thresholds with sample complexity O~(log^*|X|). This matches a lower bound of \Omega(log^*|X|) by Alon et al. I will also briefly survey the history of this problem and its connections to other basic problems in differential privacy, including (private) approximate median, distribution learning, and releasing threshold queries. Finally, some recent work on online variants of the problem will be discussed.

08:30-10:00 Session SunA 2
08:30
First-order methods for quantum entanglement detection

ABSTRACT. This talk will present a novel algorithmic approach to tackle the following central problem in quantum information science: given a bipartite quantum state, detect if the state is entangled by constructing a suitable "entanglement witness". The construction of an entanglement witness can be cast as a challenging convex feasibility problem. Our approach to this convex feasibility problem combines modern convex optimization machinery and the widely known positive partial transpose (PPT) symmetric extension criterion for entanglement detection.

08:52
Accelerating Continuous Variable Coherent Ising Machines via Momentum

ABSTRACT. The Coherent Ising Machine (CIM) is a non-conventional architecture that takes inspiration from physical annealing processes to solve Ising problems heuristically. Its dynamics are naturally continuous and described by a set of ordinary differential equations that have proven useful for solving continuous-variable non-convex quadratic optimization problems. The dynamics of such Continuous Variable CIMs (CV-CIM) encourage optimization via optical pulses whose amplitudes are determined by the negative gradient of the objective; however, standard gradient descent is known to be trapped by local minima and hampered by poor problem conditioning. In this work, we propose to modify the CV-CIM dynamics using more sophisticated pulse injections based on tried-and-true optimization techniques such as momentum and Adam. Through numerical experiments, we show that the momentum and Adam updates can significantly speed up the CV-CIM’s convergence and improve sample diversity over the original CV-CIM dynamics. We also find that the Adam-CV-CIM’s performance is more stable as a function of feedback strength, especially on poorly conditioned instances, resulting in an algorithm that is more robust, reliable, and easily tunable. More broadly, we identify the CIM dynamical framework as a fertile opportunity for exploring the intersection of classical optimization and modern analog computing.
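To illustrate the kind of pulse-injection modification described above, the sketch below augments discretized gradient-style CIM dynamics with a heavy-ball momentum term on a quadratic Ising objective. It is a minimal illustration under assumed dynamics and hypothetical parameter names (J, h, feedback, beta), not the authors' implementation.

import numpy as np

def cv_cim_momentum(J, h, steps=2000, dt=0.01, beta=0.9, feedback=0.5, seed=0):
    # Illustrative momentum-augmented CV-CIM-style dynamics for the quadratic
    # Ising objective x^T J x + h^T x; parameter names are hypothetical.
    rng = np.random.default_rng(seed)
    n = len(h)
    x = 0.01 * rng.standard_normal(n)          # optical pulse amplitudes
    v = np.zeros(n)                            # momentum (velocity) buffer
    for _ in range(steps):
        grad = 2.0 * J @ x + h                 # gradient of the quadratic objective
        v = beta * v - dt * feedback * grad    # heavy-ball momentum update
        x = x + v + dt * (x - x**3)            # saturating, CIM-like nonlinearity
    return np.sign(x)                          # round amplitudes to spin values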

09:14
Constrained local Hamiltonians: quantum generalizations of Vertex Cover

ABSTRACT. Given the diversity of combinatorial optimization problems and their impact on the development of new algorithms, it is natural to ask if we can find quantum generalizations of some of the constrained combinatorial problems, which, in turn, can inspire new quantum algorithms. In this work, we propose a quantum generalization of the vertex cover problem, which we call 'Transverse Vertex Cover' (TVC). We prove that TVC is StoqMA-hard and develop an approximation algorithm for it based on a quantum generalization of the classical local ratio method. Additionally, we demonstrate our quantum local ratio method on a traditional unconstrained quantum local Hamiltonian version of the Vertex Cover, which is equivalent to the well-studied anti-ferromagnetic transverse field Ising model in physics.

09:36
A Quantum Inspired Bi-level Optimization Algorithm for the First Responder Network Design Problem

ABSTRACT. In the aftermath of a sudden catastrophe, First Responders (FR) strive to promptly reach and rescue immobile victims. Simultaneously, mobile individuals take roads to evacuate the disaster zone. The escalated traffic congestion significantly hinders critical FR operations if they are sharing some of the same roads. The objective of the First Responder Network Design Problem (FRNDP) is to allocate a subset of road segments for use by FRs only, mark them clearly, and pre-communicate them to the citizens. We develop a mixed integer non-linear programming formulation for the FRNDP, with the underlying evacuee traffic flow modeled on the selfish routing principle. We solve FRNDP using a hybrid quantum-classical heuristic for solving binary constrained, non-linear optimization problems, the Graver Augmented Multi-Seed Algorithm (GAMA). Using the flow-balance constraints for the FR and evacuee paths, we derive a Quadratic Unconstrained Binary Optimization (QUBO) model to obtain the feasible paths and compute partial Graver bases to move between the feasible solutions of FRNDP. To efficiently explore the solution space for high-quality solutions, we develop a novel bi-level nested GAMA within GAMA: GAGA. We compare GAGA with a state-of-the-art traditional approach on random instances and on graphs generated based on Turkish road network scenarios.

08:30-10:00 Session SunA 3
08:30
Dealing with disjunctive constraints in convex mixed-integer nonlinear programming: Efficient formulations and strong cuts

ABSTRACT. Disjunctive constraints, which require that at least one of several alternative sets of constraints be satisfied, form one of the key modeling techniques for a variety of problems ranging from process design to applications in AI/ML. It is well known that disjunctive constraints pose a significant computational challenge, as they often result in either a weak continuous relaxation or a numerically challenging problem.

This talk focuses on disjunctive constraints appearing in convex mixed-integer nonlinear programs (MINLPs), but the techniques are also applicable in a linear setting. In the talk, we will briefly cover some modeling alternatives for integrating disjunctive constraints into MINLPs and how we can utilize the knowledge of disjunctive structures to derive strong cuts.

08:52
Sparse multi-term disjunctive cuts for the epigraph of a function of binary variables

ABSTRACT. We propose a new method for separating valid inequalities for the epigraph of a function of binary variables. The proposed inequalities are disjunctive cuts defined by disjunctive terms obtained by enumerating a subset I of the binary variables. We show that by restricting the support of the cut to the same set of variables, a cut can be obtained by solving a linear program with 2^|I| constraints. While this limits the size of the set I used to define the multi-term disjunction, the procedure enables generation of multi-term disjunctive cuts using far more terms than existing approaches. We present two approaches for choosing the subset of variables. Experience on three MILP problems with block diagonal structure using |I| up to size 10 indicates the sparse cuts can often close nearly as much gap as the multi-term disjunctive cuts without this restriction and in a fraction of the time. We also find that including these cuts within a cut-and-branch solution method for these MILP problems leads to significant reductions in solution time or ending optimality gap for instances that were not solved within the time limit.

09:14
Building Formulations for Combinatorial Disjunctive Constraints via Biclique Covers

ABSTRACT. In this work, we study the independent branching (IB) framework of combinatorial disjunctive constraints (CDCs) and identify a class of pairwise IB-representable CDCs: combinatorial disjunctive constraints admitting junction trees. For this class of constraints, the existence of junction trees can be recognized in polynomial time, and we also present a polynomial-time heuristic algorithm for the minimum biclique cover problem on the associated conflict graphs to build small and strong mixed-integer programming formulations. Then, we study mixed-integer programming formulations for the piecewise linear lower and upper bounds (in other words, piecewise linear relaxations) of nonlinear functions that can be modeled by a new class of CDCs, generalized nD-ordered CDCs. We propose logarithmically sized ideal non-extended formulations to model the piecewise linear relaxations of univariate and higher-dimensional nonlinear functions under the independent branching framework.

Note that most of this work was done when the first author was at Rice University.

09:36
Aggregated Limited-Memory Rank-1 Cuts

ABSTRACT. Chvátal-Gomory rank-1 cuts are widely recognized for their significant effectiveness in enhancing the lower bound of linear relaxation of the set partitioning formulation. However, the lack of robustness presents notable challenges in the context of the labeling algorithm. The development of limited-memory rank-1 cuts represents a pioneering effort in substantially mitigating this drawback. Notably, these cuts have successfully reduced computational time by a significant factor. Nevertheless, limited-memory cuts encounter limitations in scenarios where labeling is particularly challenging. This paper introduces a novel approach to memory design by aggregating the memory of cuts, thereby considerably accelerating the process of dominance rule evaluation. The techniques presented in this paper hold promise for further improving the efficiency in solving challenging pricing problems.

08:30-10:00 Session SunA 4
08:30
Stochastic FISTA with fully adaptive step sizes

ABSTRACT. We will present proximal gradient methods that are variants of the ISTA and FISTA algorithms, with adaptive step sizes, where gradients are not exact but based on stochastic oracles. We will utilize a stochastic step-search framework, such as in Cartis and Scheinberg (2017). Using this framework for the analysis of proximal methods, and also of accelerated first-order methods, is novel and nontrivial. Accelerated methods typically do not allow increasing the step size parameter, which makes them inapplicable in this stochastic framework. We will use the variant of FISTA with fully adaptive step sizes from Scheinberg, Goldfarb, and Bai (2010) and modify it for the stochastic case. Another difficulty is that, unlike for standard gradient methods, for accelerated methods the function value does not have martingale-like behavior, so the analysis used in Cartis and Scheinberg (2017) does not readily apply. We introduce a new quantity that exhibits martingale-like behavior, allowing us to bound the expected complexity for both stochastic ISTA and FISTA methods.
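As a rough illustration of the step-search idea, the sketch below runs a FISTA-type loop in which a noisy sufficient-decrease test decides whether to accept a step and grow the step size or reject it and shrink it. It is a minimal sketch under assumed oracle interfaces (f_est, grad_est, prox are hypothetical names), not the method analyzed in the talk.

import numpy as np

def stochastic_fista(f_est, grad_est, prox, x0, step0=1.0, iters=100,
                     gamma_inc=1.2, gamma_dec=0.5):
    # Illustrative FISTA-type loop with a stochastic step search: f_est and
    # grad_est are stochastic zeroth-/first-order oracles, prox(v, a) is the
    # proximal map of the nonsmooth term with parameter a.
    x = x0.copy()
    x_prev = x0.copy()
    t_prev, t = 1.0, 1.0
    alpha = step0
    for _ in range(iters):
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)      # FISTA extrapolation
        g = grad_est(y)
        x_trial = prox(y - alpha * g, alpha)
        model = f_est(y) + g @ (x_trial - y) + np.sum((x_trial - y) ** 2) / (2.0 * alpha)
        if f_est(x_trial) <= model:                      # noisy sufficient-decrease test
            x_prev, x = x, x_trial
            alpha *= gamma_inc                           # accept: allow the step size to grow
        else:
            alpha *= gamma_dec                           # reject: shrink the step size
        t_prev, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    return x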

08:52
Stochastic Adaptive Regularization Method with Cubics: A High Probability Complexity Bound

ABSTRACT. We present a high probability complexity bound for a stochastic adaptive regularization method with cubics, also known as regularized Newton method. The method makes use of stochastic zeroth-, first- and second-order oracles that satisfy certain accuracy and reliability assumptions. Such oracles have been used in the literature by other stochastic adaptive methods, such as trust region and line search. These oracles capture many settings, such as expected risk minimization, stochastic zeroth-order optimization, and others. In this paper, we give the first high probability iteration bound for stochastic cubic regularization, and show that just as in the deterministic case, it is superior to other stochastic adaptive methods.

09:14
Exploiting Negative Curvature in Conjunction with Adaptive Sampling: Theoretical Results and a Practical Algorithm

ABSTRACT. With the emergence of deep learning in recent years, nonconvex optimization has gained more interest and focus, but nonconvexity brings great challenges in algorithm design. Our goal is to develop algorithms that have second-order convergence and affordable complexity so that they can be used for large-scale problems. To this end, we exploit negative curvature together with an adaptive sampling strategy. First, we provide a convergence result for the deterministic setting where the gradients and Hessians are inexact. In the second part, I will show our result in the stochastic setting where the gradients are constructed from random samples. Beyond the theory, I will also present our practical algorithm, a Newton-CG method with negative curvature detection and adaptive sampling, along with experimental results on several nonconvex problems.

09:36
Efficient Gradient Tracking Algorithmic Framework for Decentralized Optimization

ABSTRACT. Gradient tracking optimization algorithms have received significant attention over the past years for distributed optimization over networks due to their ability to converge to the solution under a constant step size. At every iteration, these algorithms require a computation step of calculating gradients at each node and a communication step to share the iterate and a gradient estimate among nodes. The complexity of these two steps varies significantly across different applications of distributed optimization. In this talk, we present an algorithmic framework that decomposes these two steps, provides flexibility among these steps at each iteration, and is robust to the stochasticity inherent in these steps. We provide optimal theoretical convergence and complexity guarantees, and illustrate its performance on quadratic and classification problems.
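The computation/communication decomposition discussed above can be grounded in the standard gradient-tracking iteration, sketched below; W (a doubly stochastic mixing matrix), grads, and the step size alpha are illustrative assumptions, not the speakers' specific framework.

import numpy as np

def gradient_tracking(W, grads, x0, alpha=0.1, iters=500):
    # Illustrative decentralized gradient-tracking iteration: W is a doubly
    # stochastic mixing matrix, grads[i] returns node i's local gradient.
    # A minimal textbook sketch of the scheme, not the decomposed framework.
    n = W.shape[0]
    x = np.array([x0.copy() for _ in range(n)])          # local iterates, one row per node
    g = np.array([grads[i](x[i]) for i in range(n)])     # local gradients
    y = g.copy()                                         # gradient-tracking variables
    for _ in range(iters):
        x_new = W @ x - alpha * y                        # communicate, then descend
        g_new = np.array([grads[i](x_new[i]) for i in range(n)])
        y = W @ y + g_new - g                            # track the network-average gradient
        x, g = x_new, g_new
    return x.mean(axis=0)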

08:30-10:00 Session SunA 5
08:30
On the number of pivots of Dantzig's simplex methods for linear and convex quadratic programs

ABSTRACT. Our recent work presents new results on the number of pivots of simplex-type methods for solving linear programs of the Leontief kind, certain linear complementarity problems of the P kind, and nonnegative constrained convex quadratic programs. Our results contribute to the further understanding of the complexity and efficiency of simplex-type methods for solving these problems.

08:52
Linear convergence of proximal point methods under Polyak-Łojasiewicz and quadratic growth conditions

ABSTRACT. Many machine learning problems lack strong convexity properties. Fortunately, recent studies have revealed that first-order algorithms also enjoy linear convergence under various weaker regularity conditions. While the relationship among different conditions for convex and smooth functions is well understood, this is not the case for the nonsmooth setting. In this talk, we go beyond convexity and smoothness, and clarify the connections among common regularity conditions (including strong convexity, restricted secant inequality, subdifferential error bound, Polyak-Łojasiewicz inequality, and quadratic growth) in the class of weakly convex functions. In addition, we present a simple and modular proof for the linear convergence of the proximal point method (PPM) for convex (possibly nonsmooth) optimization using these regularity conditions. Linear convergence also holds when the subproblems of PPM are solved inexactly with a proper control of inexactness. Our convergence results require weaker conditions while having simpler proofs than many existing characterizations. We expect practical applications of this fast linear convergence of PPM to large-scale, sparse conic optimization.
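For reference, the proximal point update and two of the regularity conditions mentioned above take the following standard textbook forms (for a function f with optimal value $f^*$ and solution set $X^*$; this notation is not taken from the talk):

\[
x_{k+1} = \operatorname{prox}_{\lambda f}(x_k) = \arg\min_{x}\Big\{ f(x) + \tfrac{1}{2\lambda}\|x - x_k\|^2 \Big\},
\]
\[
\text{(PL)}\quad \tfrac{1}{2}\,\operatorname{dist}\big(0, \partial f(x)\big)^2 \ \ge\ \mu\,\big(f(x) - f^*\big),
\qquad
\text{(QG)}\quad f(x) - f^* \ \ge\ \tfrac{\mu}{2}\,\operatorname{dist}(x, X^*)^2 .
\]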

09:14
An Outer Approximation Method for Solving Mixed-Integer Convex Quadratic Programs with Indicators

ABSTRACT. Mixed-integer convex quadratic programs with indicator variables (MIQP) encompass a wide range of applications, from statistical learning to energy, finance, and logistics. The outer approximation (OA) algorithm has been proven efficient in solving MIQP, and the key to the success of an OA algorithm is the strength of the cutting planes employed. In this paper, we propose a new technique for deriving cutting planes for MIQP from various convex relaxations, and, as a result, we develop new OA algorithms for solving MIQP at scale. The contributions of our work are two-fold: (1) we bridge the work on the convexification of MIQP and the algorithm design to solve large-scale problems, and (2) we demonstrate through a computational study on the sparse portfolio selection problem that our algorithms give rise to significant speedups compared with the state-of-the-art methods in the literature.

09:36
On Integer Programs with Irrational Data

ABSTRACT. An integer program (IP) with a finite number of feasible solutions may have an unbounded linear programming relaxation if it contains irrational parameters, due to implicit constraints enforced by the irrational numbers. We show that those constraints can be obtained if the irrational parameters are polynomials of roots of integers over the field of rational numbers, leading to an equivalent rational formulation. We also establish a weaker result for IPs involving the general class of algebraic irrational parameters, which extends to IPs with a particular form of transcendental numbers.
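A standard toy example (not taken from the talk) illustrates the phenomenon: maximize $x_1$ subject to $x_1 - \sqrt{2}\,x_2 = 0$ with $x_1, x_2$ nonnegative integers. The only integer-feasible point is $(0,0)$, yet the LP relaxation is unbounded along the ray $(\sqrt{2}\,t,\, t)$. The implicit constraint enforced by the irrational coefficient is the rational constraint $x_1 = x_2 = 0$, which yields an equivalent, bounded rational formulation.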

08:30-10:00 Session SunA 6
08:30
Symbolic Learning to Optimize

ABSTRACT. Recent studies on Learning to Optimize (L2O) suggest a promising path to automating and accelerating the optimization procedure for complicated tasks. Existing L2O models parameterize optimization rules by neural networks, and learn those numerical rules via meta-training. However, they face two common pitfalls: (1) scalability: the numerical rules represented by neural networks create extra memory overhead for applying L2O models and limit their applicability to larger optimization tasks; (2) interpretability: it is unclear what each L2O model has learned in its black-box optimization rule, nor is it straightforward to compare different L2O models in an explainable way. To avoid both pitfalls, this paper demonstrates that we can kill two birds with one stone by introducing the powerful tool of symbolic regression to L2O. In this paper, we establish a holistic symbolic representation and analysis framework for L2O, which yields a series of insights for learnable optimizers. Leveraging our findings, we further propose a lightweight L2O model that can be meta-trained on large-scale problems and can outperform human-designed and tuned optimizers.

08:52
Differentiable Cutting-plane Layers for Mixed-integer Linear Optimization

ABSTRACT. We consider the problem of solving a family of parametric mixed-integer linear optimization problems where some entries in the input data change. We introduce the concept of cutting-plane layer (CPL), i.e., a differentiable cutting-plane generator mapping the problem data and previous iterates to cutting planes. We propose a CPL implementation to generate split cuts, and by combining several CPLs, we devise a differentiable cutting-plane algorithm that exploits the repeated nature of parametric instances. In an offline phase, we train our algorithm by updating the internal parameters controlling the CPLs, thus altering cut generation. Once trained, our algorithm computes, with predictable execution times and a fixed number of cuts, solutions with low integrality gaps. Preliminary computational tests show that our algorithm generalizes on unseen instances and captures underlying parametric structures.

09:14
Accelerating Non-Convex Optimization via Learned Sequential Convexifications

ABSTRACT. We present a data-driven convexification method for parametric non-convex optimization. Central to our approach is the prox-linear method, which solves a sequence of regularized convex subproblems to compute a candidate solution. Our framework consists of a neural network mapping the problem parameters to an initial point followed by a given number of prox-linear algorithm steps returning a candidate solution. We train our architecture to maximize solution quality by learning the weights of the neural network along with the regularization coefficient. To ensure good performance on unseen data, we provide strong generalization guarantees using the PAC-Bayes methodology. We illustrate the effectiveness of our framework in reducing the solve time to reach high-quality solutions in examples from robotics, signal processing, and machine learning.

09:36
Neuromancer: Differentiable Programming Library for Learning to Optimize and Control

ABSTRACT. Neural Modules with Adaptive Nonlinear Constraints and Efficient Regularizations (NeuroMANCER) is an open-source differentiable programming library for solving parametric constrained optimization problems, physics-informed system identification, and parametric model-based optimal control. NeuroMANCER is written in PyTorch and allows for systematically integrating machine learning with scientific computing to create end-to-end differentiable models and algorithms embedded with prior knowledge and physics. The library is focused on research, rapid prototyping, and streamlined deployment. Strong emphasis is given to extensibility and interoperability with the PyTorch ecosystem and quick adaptability to custom domain problems. Our open-source repository contains numerous tutorial examples demonstrating the use of physics-informed neural networks for solution and parameter estimation of differential equations, learning to optimize with feasibility restoration layers, and differentiable control algorithms for learning constrained control policies for nonlinear systems.

08:30-10:00 Session SunA 7
08:30
New Sample Complexity Bounds for (Regularized) Sample Average Approximation in Several Heavy-Tailed, Non-Lipschitzian, and High-Dimensional Cases

ABSTRACT. We study the sample complexity of the sample average approximation (SAA) and its simple variations, referred to as the regularized SAA (RSAA), in solving convex and strongly convex stochastic programming (SP) problems. We focus on scenarios characterized by heavy-tailedness, non-Lipschitzness, and/or high dimensionality. The presence of such irregularities underscores a critical vacuum in the literature. In response, this paper presents three sets of results for these scenarios; these new results deviate from SAA's typical sample complexities, which grow polynomially with dimensionality.

08:52
A Distribution Optimization Framework for Confidence Bounds of Risk Measures

ABSTRACT. We present a distribution optimization framework that significantly improves confidence bounds for various risk measures compared to previous methods. Our framework encompasses popular risk measures such as the entropic risk measure, conditional value at risk (CVaR), spectral risk measure, distortion risk measure, certainty equivalent, and rank-dependent expected utility, which are well established in the risk-sensitive decision-making literature. To achieve this, we introduce two estimation schemes based on concentration bounds derived from the empirical distribution, specifically using either the Wasserstein distance or the supremum distance. Unlike traditional approaches that add or subtract a confidence radius from the empirical risk measures, our proposed schemes evaluate a specific transformation of the empirical distribution based on the distance. Consequently, our confidence bounds consistently yield tighter results compared to previous methods. We further verify the efficacy of the proposed framework by providing a tighter problem-dependent regret bound for the CVaR bandit.

09:14
Self-Structured Importance Sampling for Chance-Constrained Optimization

ABSTRACT. Chance constraints naturally emerge in systems where reducing risk is crucial, particularly those exposed to high-impact but rare events such as natural disasters, infrastructure failures, and financial crashes. However, the rarity of such events renders the solution of chance-constrained models impractical using naive Monte Carlo methods. Importance sampling can be a viable solution method in such cases, as it employs an alternate probability distribution to generate samples more frequently from rare-event regions. Nonetheless, the challenge of selecting an appropriate alternate distribution that is effective for all decision variables across various problems remains unresolved. Most existing strategies are problem-specific, necessitating an in-depth understanding of the system's dynamics. Our work adapts the self-structured importance sampling (SSIS) method, recently proposed by Deo & Murthy, to develop a new algorithm for solving chance-constrained problems. This method transforms samples generated through naive Monte Carlo to samples that more accurately represent rare region behavior and can be universally applied across a wide range of constraint functions and probability distributions. Our results demonstrate that the proposed algorithm can substantially outperform Monte Carlo sampling in terms of both feasibility and optimality, even with small sample sizes.
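The basic idea of sampling from an alternate distribution and reweighting can be seen in the generic sketch below, which estimates a rare-event probability under a Gaussian model by shifting the sampling mean. This is a textbook illustration with hypothetical names (g, mu_shift), not the SSIS algorithm described in the talk.

import numpy as np

def chance_estimates(g, mu_shift, n=100000, seed=0):
    # Compare naive Monte Carlo with mean-shift importance sampling for the
    # rare-event probability P[g(Z) > 0], Z ~ N(0, I).
    rng = np.random.default_rng(seed)
    d = len(mu_shift)
    z = rng.standard_normal((n, d))                      # naive Monte Carlo samples
    p_mc = np.mean([g(zi) > 0 for zi in z])
    z_is = rng.standard_normal((n, d)) + mu_shift        # sample from N(mu_shift, I) instead
    lr = np.exp(-z_is @ mu_shift + 0.5 * mu_shift @ mu_shift)   # likelihood ratio weights
    p_is = np.mean(lr * np.array([g(zi) > 0 for zi in z_is]))
    return p_mc, p_is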

09:36
Assessing Feasibility With Respect to Stochastic Constraints

ABSTRACT. Stochastic constraints, which constrain an expectation in the context of simulation optimization, can be hard to conceptualize and harder still to assess. As with a deterministic constraint, a solution is considered either feasible or infeasible with respect to a stochastic constraint. This perspective belies the subjective nature of stochastic constraints, which often arise when attempting to avoid alternative optimization formulations with multiple objectives or an aggregate objective with weights. Moreover, a solution's feasibility with respect to a stochastic constraint cannot, in general, be ascertained based on only a finite number of simulation replications. We introduce different means of estimating how "close" the expected performance of a given solution is to being feasible with respect to one or more stochastic constraints. We explore how these metrics and their bootstrapped error estimates can be incorporated into plots showing a solver's progress over time when solving a stochastically constrained problem.

08:30-10:00 Session SunA 8
08:30
Policy learning without overlap: Pessimism and generalized Bernstein's inequality

ABSTRACT. We propose a pessimism-based algorithm for offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn the optimal individualized decision rule. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded. As one has no control over the data collection process, this assumption can be unrealistic in many situations.

Our new algorithm optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed by quantifying the uncertainty of AIPW-type estimators via a new self-normalized concentration inequality, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. Without assuming any uniform overlap condition, we show that the suboptimality of our algorithm depends only on (i) the overlap for the optimal policy and (ii) the complexity of the policy class. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast.

08:52
First-order Methods for Robust Markov Decision Process

ABSTRACT. We consider the problem of solving robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses the standard MDP planning as a special case. For (s,a)-rectangular uncertainty sets, we develop a policy-based first-order method, namely the robust policy mirror descent (RPMD), and establish linear convergence when the transition is known. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, when the first-order information is only available through online interactions with the nominal environment, which attains an $O(1/\epsilon^2)$ sample complexity for finding an $\epsilon$-optimal policy.
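For reference, the robust planning objective over an $(s,a)$-rectangular uncertainty set $\mathcal{P} = \bigotimes_{s,a} \mathcal{P}_{s,a}$ has the standard max-min form (textbook notation, not specific to the talk):

\[
\max_{\pi}\ \min_{P \in \mathcal{P}}\ \mathbb{E}^{\pi}_{P}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big],
\]

which reduces to standard MDP planning when $\mathcal{P}$ is a singleton.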

09:14
Provably learning a multi-head attention layer

ABSTRACT. Despite the widespread empirical success of transformers, little is known about their learnability from a computational perspective. In practice these models are trained with SGD on a certain next-token prediction objective, but in theory it remains a mystery even to prove that such functions can be learned efficiently at all. In this work, we give the first nontrivial provable algorithms and computational lower bounds for this problem. Our results apply in a realizable setting where one is given random sequence-to-sequence pairs that are generated by some unknown multi-head attention layer. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning multi-layer perceptrons, which predominantly exploit fine-grained algebraic and rotation invariance properties of the input distribution.

09:36
Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

ABSTRACT. Diffusion models, rivaling GANs for high-quality sample generation, leverage score matching to learn the score function. Despite empirical success, the provable accuracy of gradient-based algorithms for this task remains unclear. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. The analysis covers both the optimization and the generalization aspects of the training procedure. We propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to standard supervised learning, the score-matching problem introduces distinct challenges, including unbounded inputs, vector-valued outputs, and an additional time variable, preventing existing techniques from being applied directly. With a properly designed neural network architecture, we demonstrate an accurate approximation of the score function using a reproducing kernel Hilbert space induced by neural tangent kernels. We establish the first generalization error bound for learning the score function by applying early stopping and coupling arguments.
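The regression-with-noisy-labels view mentioned above corresponds to the standard denoising score-matching objective, written here in generic notation (not necessarily the exact parametric form proposed in the paper):

\[
\min_{\theta}\ \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\| s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t \mid x_0)\big\|^2\Big],
\qquad x_t = \alpha_t x_0 + \sigma_t \epsilon,\quad \nabla_{x_t}\log p_t(x_t \mid x_0) = -\tfrac{\epsilon}{\sigma_t},
\]

so the network $s_\theta$ is regressed onto the noisy targets $-\epsilon/\sigma_t$.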

08:30-10:00 Session SunA 9
08:30
Decision-making with Side Information: A Causal Transport Robust Approach

ABSTRACT. We consider stochastic optimization with side information where, prior to decision-making, covariate data are available to inform better decisions. In particular, we propose to consider a distributionally robust formulation based on causal transport distance. Compared with divergence and the Wasserstein metric, the causal transport distance is better at capturing the information structure revealed from the conditional distribution of random problem parameters given the covariate values. We derive a dual reformulation for evaluating the worst-case expected cost and show that the worst-case distribution in a causal transport distance ball has a similar conditional information structure as the nominal distribution. When optimizing over affine decision rules, we identify cases where the overall problem can be solved by convex programming. When optimizing over all (non-parametric) decision rules, we identify a new class of robust optimal decision rules when the cost function is convex with respect to a one-dimensional decision variable.

08:52
Integrated Conditional Estimation-Optimization

ABSTRACT. Big data provides new opportunities to tackle one of the main difficulties in decision-making systems—uncertain behavior. Standard data-driven approaches usually consist of two steps. The first step involves predicting or estimating the uncertain behavior using data. Then the second step requires finding decisions that optimize an objective function that depends on the output of the first step. In contrast to the standard two-step approach, we propose an integrated conditional estimation-optimization (ICEO) framework that estimates the underlying conditional distribution of the random parameter while considering the structure of the optimization problem. We directly model the relationship between the conditional distribution of the random parameter and the contextual features, and then estimate the probabilistic model with an objective that aligns with the downstream optimization problem. We show that our ICEO approach is asymptotically consistent under moderate regularity conditions and further provides finite-sample performance guarantees in the form of generalization bounds. Computationally, performing estimation with the ICEO approach is a non-convex and often non-differentiable optimization problem. We propose a general methodology for approximating the potentially non-differentiable mapping from the estimated conditional distribution to the optimal decision by a differentiable function, which greatly improves the performance of gradient-based algorithms applied to the non-convex problem. We also provide a polynomial optimization solution approach in the semi-algebraic case. Numerical experiments are also conducted to show the empirical success of our approach in different situations, including limited data samples and model mismatches.

09:14
Contextual Stochastic Bilevel Optimization

ABSTRACT. We introduce contextual stochastic bilevel optimization (CSBO) -- a stochastic bilevel optimization framework with the lower-level problem minimizing an expectation conditioned on some contextual information and the upper-level decision variable. This framework extends classical stochastic bilevel optimization when the lower-level decision maker responds optimally not only to the decision of the upper-level decision maker but also to some side information and when there are multiple or even infinitely many followers. It captures important applications such as meta-learning, personalized federated learning, end-to-end learning, and Wasserstein distributionally robust optimization with side information (WDRO-SI). Due to the presence of contextual information, existing single-loop methods for classical stochastic bilevel optimization are unable to converge. To overcome this challenge, we introduce an efficient double-loop gradient method based on the Multilevel Monte-Carlo (MLMC) technique and establish its sample and computational complexities. When specialized to stochastic nonconvex optimization, our method matches existing lower bounds. For meta-learning, the complexity of our method does not depend on the number of tasks.

09:36
Sequential Counterfactual Decisions

ABSTRACT. Given a probabilistic classifier in a binary classification problem, a counterfactual solution to a record is a feasible solution, close to the record and having a high probability of being labeled in the positive class by the classifier. Finding a counterfactual solution for a given record amounts to finding a Pareto-optimal solution to the problem of simultaneous maximization of closeness as well as the probability of classification in the positive class.

In this talk, we address the problem where, instead of one counterfactual solution, we seek a counterfactual solution plan, i.e., a sequence of T solutions with increasing probabilities of positive classification that minimizes (a function of) the overall distances. The resulting optimization problems will be analyzed for different choices of the classifier and of metrics induced by norms or, in their asymmetric version, gauges.

08:30-10:00 Session SunA10
08:30
Adaptive Pricing in Unit Commitment Under Load and Capacity Uncertainty

ABSTRACT. In this work, we introduce adaptive marginal pricing in energy markets using adaptive robust optimization (ARO) with linear decision rules. The increase of renewables in the grid and the volatility of the load create uncertainties in the day-ahead prices of electricity markets. ARO and stochastic optimization have been used to make commitment and dispatch decisions that adapt to the load and capacity uncertainty. These approaches have been successfully applied in practice but current pricing approaches have two major disadvantages: a) they are deterministic in nature, that is they do not adapt to the load and capacity uncertainty, and b) require uplift payments to the generators that are typically determined by ad hoc procedures and create inefficiencies that motivate self-scheduling. We extend marginal pricing mechanisms to propose the first adaptive pricing method in electricity markets that adapts to the load and capacity uncertainty, eliminates post-market uplifts and deters self-scheduling, addressing both disadvantages.

08:52
An exact method for a class of robust nonlinear optimization problems

ABSTRACT. We propose a novel exact approach to solve a wide class of optimization problems with robust nonlinear constraints. We assume that the constraints are composed of sums of products of linear times concave (SLC) functions with respect to the uncertain parameters. Our approach synergizes a cutting set method with reformulation-perspectification techniques and branch and bound. We demonstrate that our approach is also applicable to robust convex optimization and adjustable robust linear optimization, which can be reformulated as a problem involving sum of linear times linear functions in the uncertain parameters, thus broadening the scope of the existing literature. Numerical experiments on a robust convex geometric optimization problem, a robust linear optimization problem with data uncertainty and implementation error, as well as a lot-sizing problem, confirm the efficacy of our proposed approach.

09:14
Cone product reformulation for global optimization

ABSTRACT. In this paper, we study nonconvex optimization problems involving sums of linear times convex (SLC) functions as well as conic constraints belonging to one of the five basic cones, that is, the linear cone, second-order cone, power cone, exponential cone, and semidefinite cone. By using the Reformulation-Perspectification Technique, we can obtain a convex relaxation by forming the perspective of each convex function and linearizing all product terms with newly introduced variables. To further tighten the approximation, we can pairwise multiply the conic constraints. In this paper, we analyze all possibilities of multiplying conic constraints; the results for the cases in which a power cone or an exponential cone is involved are new. Moreover, in the cases of an exponential cone and of a power cone, we generate additional valid inequalities that can be used to further strengthen the approximation. Numerical experiments on a quadratic optimization problem over exponential cone constraints and on a robust palatable diet problem over power cone constraints demonstrate that including the additional inequalities generated from the proposed pairwise multiplications improves the approximation. Moreover, when incorporated in branch and bound, the global optimal solution of the original nonconvex optimization problem can often be obtained faster than with BARON.

08:30-10:00 Session SunA11
08:30
Random block coordinate descent methods for computing optimal transport and convergence analysis

ABSTRACT. The optimal transport (OT) problem is a classic mathematical problem with a history of more than 200 years. It is closely related to the Wasserstein distance for measuring the distance between probability distributions, and is thus applicable to generative AI, distributionally robust optimization, and image processing. OT can be reduced to a linear programming (LP) problem via discretization. This LP is potentially large-scale, and its resolution remains challenging. In this talk, we will introduce random block coordinate descent (RBCD) methods to directly solve this LP problem without additional regularization. Our approach involves restricting the potentially large-scale optimization problem to small LP subproblems constructed via randomly chosen working sets. By using a random Gauss-Southwell-q rule to select these working sets, we equip the vanilla method, RBCD0, with almost sure convergence and a linear convergence rate for solving general standard LP problems. To further improve the efficiency of the RBCD0 method, we exploit the special structure of the constraints in OT problems and leverage the theory of linear systems to propose several approaches for refining the random working set selection and accelerating the vanilla method. Our preliminary numerical experiments illustrate several merits of the accelerated random block coordinate descent (ARBCD) method via comparison with Sinkhorn’s algorithm and a recent interior point-inspired algorithm.
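For reference, the discretized OT problem mentioned above is the standard LP over couplings of two discrete marginals $a$ and $b$ with cost matrix $C$ (textbook notation, not specific to the talk):

\[
\min_{P \ge 0}\ \langle C, P \rangle \quad \text{s.t.}\quad P\mathbf{1} = a,\ \ P^{\top}\mathbf{1} = b .
\]

Its equality constraints carry the block structure that the refined working-set selection exploits.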

08:52
Generalizations of R0 and SSM properties for Extended Horizontal Linear Complementarity Problem

ABSTRACT. In this paper, we first introduce the column R0-W and column SSM-W properties for a set of matrices, which generalize the R0 and strictly semimonotone matrix properties. We then prove some existence results for the extended horizontal linear complementarity problem when the involved matrices have these properties. With an additional condition on the set of matrices, we prove that the column SSM-W property is equivalent to the uniqueness of the solution of the corresponding extended horizontal linear complementarity problems. Finally, we give a necessary and sufficient condition for the connectedness of the solution set of the extended horizontal linear complementarity problems.

09:14
Solving Large-Scale Non-convex Optimization Problems in Neural Network Verification

ABSTRACT. Formal verification methods for deep neural networks aim to rigorously guarantee their trustworthiness, ensuring properties like safety, security, robustness, and correctness. These guarantees are essential for applying neural networks in mission-critical applications, such as autonomous systems and medical devices. In this talk, I will first introduce the concept of neural network verification and its problem formulation from an optimization perspective. Solving this emerging optimization problem using traditional methods, such as mixed integer programming, is often intractable for practical neural network sizes. To address this challenge, I will discuss a novel framework called “linear bound propagation methods”, which exploits the structure of this problem by efficiently propagating inequalities on neural networks. It is scalable to large neural networks with millions of neurons and achieves up to three orders of magnitude speedup compared to traditional algorithms. Finally, I will introduce the award-winning toolbox, α,β-CROWN, built on this framework and its novel application to non-convex optimization problems beyond canonical neural network verification.
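As background for the bound-propagation idea, the sketch below shows the simplest member of this family, interval bound propagation through a ReLU network. The linear bound propagation methods discussed in the talk (and the α,β-CROWN toolbox) are considerably tighter; this is only an illustrative, assumed simplification.

import numpy as np

def interval_bound_propagation(weights, biases, x_lo, x_hi):
    # Propagate elementwise lower/upper bounds layer by layer through a
    # feedforward ReLU network; a simple illustration of bound propagation.
    lo, hi = x_lo.copy(), x_hi.copy()
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lo = W_pos @ lo + W_neg @ hi + b
        new_hi = W_pos @ hi + W_neg @ lo + b
        if i < len(weights) - 1:                  # ReLU on hidden layers only
            new_lo, new_hi = np.maximum(new_lo, 0.0), np.maximum(new_hi, 0.0)
        lo, hi = new_lo, new_hi
    return lo, hi                                 # certified output bounds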

13:00-14:30 Session SunB 1
13:00
Calibrated Stackelberg Games: Learning Optimal Commitments Against Calibrated Agents

ABSTRACT. In this paper, we introduce a generalization of the standard Stackelberg Games (SGs) framework: Calibrated Stackelberg Games (CSGs). In CSGs, a principal repeatedly interacts with an agent who (contrary to standard SGs) does not have direct access to the principal's action but instead best-responds to calibrated forecasts about it. CSG is a powerful modeling tool that goes beyond assuming that agents use ad hoc and highly specified algorithms for interacting in strategic settings and thus more robustly addresses real-life applications that SGs were originally intended to capture. Along with CSGs, we also introduce a stronger notion of calibration, termed adaptive calibration, that provides fine-grained any-time calibration guarantees against adversarial sequences. We give a general approach for obtaining adaptive calibration algorithms and specialize them for finite CSGs. In our main technical result, we show that in CSGs, the principal can achieve utility that converges to the optimum Stackelberg value of the game both in finite and continuous settings, and that no higher utility is achievable. Two prominent and immediate applications of our results are the settings of learning in Stackelberg Security Games and strategic classification, both against calibrated agents.

13:22
Content Creator Incentives in Recommender Systems

ABSTRACT. When designing and evaluating content recommender systems such as YouTube or Netflix, it is common to treat the landscape of content available on the platform as static. However, in reality, the recommendation algorithm shapes the incentives of content creators and thus the content landscape, which can lead to unintended societal impacts. In this talk, we develop a high-dimensional, game-theoretic framework to analyze creator incentives, and we theoretically and empirically characterize downstream impacts on content diversity and clickbait prevalence.

In our model, content creators strategically choose what D-dimensional content to create when competing for user consumption, while the platform optimizes for engagement. First, we characterize how content diversity depends on user preferences, production costs, and the recommendation algorithm hyperparameters, under the simplifying assumption that engagement aligns with user value. Then, we turn to misaligned engagement metrics which reward both clickbait and quality: we characterize clickbait prevalence and downstream performance impacts. More broadly, our results highlight the need to carefully account for the endogeneity of the content landscape when evaluating recommendation algorithms.

Based on joint work with Nikhil Garg, Nicole Immorlica, Brendan Lucier, and Jacob Steinhardt.

13:44
Convex Optimization-based Algorithms for Strategic Classification

ABSTRACT. We consider an online strategic classification problem, where manipulated data points are reported by agents that aim to obtain a positive prediction under a limited budget for the cost of manipulation. The learner, with access only to the manipulated data, seeks to predict the agents’ true labels. In our model, the cost of manipulation is given as the distance between the true and manipulated data, where the distance function is a general norm. We first analyze a strategic perceptron algorithm from the perspective of projected gradient descent. We provide a necessary and sufficient condition for its convergence and also establish its finite mistake bound. Unfortunately, the strategic perceptron does not offer any margin guarantees. To address this, we propose a new margin maximization algorithm and establish its convergence to the optimal classifier in the stochastic setting for ℓ2 manipulation cost. Despite its strong theoretical guarantee, our margin maximization algorithm is expensive, as it requires resolving a classification problem with a growing dataset in each iteration. Thus, motivated by the joint estimation-optimization approach, we further propose an efficient variant of this algorithm with a significantly lower per-iteration cost and show that it still converges to the best margin classifier under appropriate assumptions.

14:06
Enhancing the Diversity in Language Model Assisted Fundraising

ABSTRACT. Large Language Models (LLMs) have increasingly been used to assist in writing. We explore a setting where users of a centralized platform compete for a common resource, primarily through their postings (e.g., fundraising). These language models can assist users in composing fundraising posts. However, a lack of diversity in these posts can lead to boredom and fatigue, resulting in overall suboptimal outcomes. To combat this, we have developed a general optimization framework to enhance the diversity of language model-generated posts. This framework introduces methods to elicit various writing styles and provides approximation algorithms to solve the optimization problem efficiently. We demonstrate the efficacy of our method using the DonorsChoose dataset.

13:00-14:30 Session SunB 2
13:00
Towards Constituting Mathematical Structures for Learning to Optimize

ABSTRACT. Learning to Optimize (L2O), a technique that utilizes machine learning to learn an optimization algorithm automatically from data, has gained increasing attention in recent years. A generic L2O approach parameterizes the iterative update rule and learns the update direction as a black-box network. While the generic approach is widely applicable, the learned model can overfit and may not generalize well to out-of-distribution test sets. In this paper, we derive the basic mathematical conditions that successful update rules commonly satisfy. Consequently, we propose a novel L2O model with a mathematics-inspired structure that is broadly applicable and generalizes well to out-of-distribution problems. Numerical simulations validate our theoretical findings and demonstrate the superior empirical performance of the proposed L2O model.

13:22
Transformers Meet Image Denoising: Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

ABSTRACT. Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model’s depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.

13:44
Advancing Frontiers in DEQs with Applications in OOD Generalization and Diffusion Models

ABSTRACT. Deep equilibrium models (DEQs) are weight-tied networks that solve for the infinite-depth fixed point in the forward pass and use implicit gradients for the backward pass. In this talk, we will explore some recent developments in applications of DEQs within two important domains: out-of-distribution (OOD) generalization and diffusion models. It has been observed that DEQs demonstrate strong upward generalization on harder OOD tasks compared to non-weight-tied architectures. In the initial segment of this talk, we will look into a property of equilibrium models called path independence, and its effect on the OOD upward generalization performance of equilibrium models at test time. We will also discuss a metric to quantify path independence, and explore some experimental interventions to enhance path independence. Following that, we will shift our focus to applications of DEQs in diffusion models. Diffusion models have enabled remarkable advances in image and video synthesis. However, the iterative inference process of diffusion models is slow and makes it difficult to differentiate through their sampling chains. We will look into ways to leverage DEQs for parallel sampling of diffusion models. Further, we will explore the use of implicit gradients to efficiently differentiate through diffusion chains.

14:06
Learning to Predict Equilibria in Convex Games

ABSTRACT. Systems of competing agents can often be modeled as games. Assuming rationality, the most likely outcomes are given by an equilibrium (e.g. a Nash equilibrium). In many practical settings, games are influenced by context, i.e. additional data beyond the control of any agent. Often the exact game mechanics are unknown, yet vast amounts of historical data consisting of (context, equilibrium) pairs are available, raising the possibility of learning a solver which predicts the equilibria given only the context. We introduce Nash Fixed Point Networks (N-FPNs), a class of neural networks that naturally output equilibria. Crucially, N-FPNs employ a constraint decoupling scheme to handle complicated agent action sets while avoiding expensive projections. Our experiments show N-FPNs are capable of scaling to problems orders of magnitude larger than existing learned game solvers.
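The equilibrium-as-output idea can be sketched generically as a projected fixed-point iteration over the agents' joint action, as below. F, project, and the parameter names are assumptions for illustration and do not reproduce the N-FPN architecture or its constraint decoupling scheme.

import numpy as np

def fixed_point_forward(F, project, context, x0, iters=200, tol=1e-6):
    # Illustrative forward pass of an equilibrium-predicting network: iterate
    # x <- project(x - F(x, context)) until (approximate) convergence, where
    # F maps actions and context to a pseudo-gradient of the game.
    x = x0.copy()
    for _ in range(iters):
        x_new = project(x - F(x, context))
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new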

13:00-14:30 Session SunB 3
13:00
Neural Embedded Optimization over Sets

ABSTRACT. We present Neural Embedded Optimization over Sets, a novel framework for solving optimization problems whose objectives are set functions. The framework is particularly suited for set functions that are either computationally expensive to evaluate, or only available via black-box function calls, or only observable via a finite number of prior evaluations, such as historical data. Notably, these functions are not required to be submodular and their domains may be finite or even infinite sets. Central to our approach is the embedding of a permutationally invariant sparse neural network that serves as an efficient surrogate for such functions and its subsequent representation and optimization using mixed-integer programming techniques. This allows for the computationally efficient solution of an otherwise intractable problem while also accommodating additional decision variables and side constraints, thus further broadening its applicability. Computational experiments on a data-driven problem in last-mile delivery and a continuous location-allocation problem confirm the efficiency, versatility, and robustness of the framework across different problem complexities and structures.

13:22
A Bilevel Mixed-Integer Program for the Define-Combine Redistricting Procedure

ABSTRACT. Political redistricting, the process of redrawing electoral districts after each U.S. Census, may be modeled as a graph partitioning problem with contiguity (connectivity) and population balance constraints. Palmer, Schneer, and DeLuca propose a two-player redistricting game, the Define-Combine Procedure (DCP), and present heuristic results suggesting DCP produces fairer district plans than the one-player-decides baseline, which may involve gerrymandering. In this talk, we model DCP as a bilevel mixed-integer program, DCP-MIP, where the leader problem is balanced, connected graph partitioning and the follower problem is perfect matching. We discuss techniques for simplifying and solving DCP-MIP, including variable-fixing methods from the computational redistricting literature, and summarize progress toward solving real-world instances.

13:44
Solving an Advanced Available-to-Promise (AATP) Problem by Optimization and Meta-heuristics Techniques

ABSTRACT. Advanced Available-to-Promise (AATP) represents a strategic supply chain planning approach employed by companies to establish accurate delivery promise dates for customers. This method entails a dynamic assessment of available inventory and production capacity, enabling the determination of realistic delivery timelines across a rolling planning horizon. The problem is formulated as a mixed-integer programming model. To overcome the complexities introduced by some constraints in the original formulation, Lagrangian Relaxation (LR) is employed to relax the challenging constraints. The iterative process of narrowing the gap between the relaxed problem and the primal problem is undertaken to approach the optimal solution. The solution procedure comprises three key modules: dual variable updating, Particle Swarm Optimization (PSO) meta-heuristics, and primal feasibility checking. The dual variable updating is addressed using a sub-gradient method, ensuring an efficient adjustment of these variables. Simultaneously, PSO is leveraged to solve the LR problem, and the resulting solution is then transmitted to the primal feasibility checking module to obtain a primal feasible solution attained by employing goal programming techniques. This multi-faceted approach combines mathematical precision with heuristic optimization methods, enhancing the overall effectiveness of the Advanced Available-to-Promise methodology in supply chain planning.
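For orientation, the Lagrangian-relaxation/subgradient skeleton underlying the procedure described above can be sketched, in generic notation (the actual AATP model has many more details), as
\[
z_{LR}(\lambda) \;=\; \min_{x \in X}\; c^\top x + \lambda^\top (A x - b), \qquad \lambda \ge 0,
\qquad
\lambda^{k+1} \;=\; \max\big\{0,\; \lambda^{k} + s_k\,(A x^{k} - b)\big\},
\]
where $x^{k}$ approximately solves the relaxed problem at $\lambda^{k}$ (here via PSO), $s_k$ is a subgradient step size, and the gap between the dual bound $z_{LR}(\lambda^{k})$ and the best primal-feasible value is used to monitor progress.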

14:06
Inverse of the Gomory Corner Relaxation of Integer Programs

ABSTRACT. The inverse-feasible region of an integer programming (IP) problem--the set of objective vectors for which a given feasible solution is optimal--is extremely hard to characterize. Therefore, it can be advantageous to explore the inverse-feasible region of relaxations of the IP problem. The Gomory corner relaxation (GCR) relaxes the nonnegativity of variables in an optimal basis of the linear programming (LP) relaxation. We obtain a polyhedral representation of the inverse-feasible region of the GCR and provide an LP formulation of the inverse problem of the GCR, which finds an inverse-feasible vector with minimum distance from a given target vector.

13:00-14:30 Session SunB 4
13:00
On the Convergence of a Decomposition Algorithm for Nonlinear Two-Stage Optimization

ABSTRACT. We consider the theoretical convergence properties of an algorithm for nonlinear continuous two-stage optimization problems that is based on a novel log-barrier-based smoothing technique. We demonstrate how naive smoothing may cause spurious nonconvexity and how to avoid this. We then explore how global convergence guarantees can be obtained for nonconvex instances by using a path-following SQP algorithm, and how fast local convergence can be obtained with extrapolation steps.

13:22
A Trust Region SQP Algorithm for Equality Constrained Optimization

ABSTRACT. This work introduces a step-decomposition trust region Sequential Quadratic Programming (SQP) algorithm for solving equality-constrained optimization problems with errors in both the objective function and the constraints. Through a careful adaptation of classical trust region methods, the algorithm requires only knowledge of the noise level in function and constraint evaluations and incurs no additional computational overhead. We show that, with bounded errors in function and gradient evaluations, the algorithm converges to a stationarity neighborhood under very weak assumptions. More specifically, for non-singular problems, it achieves an optimal neighborhood; for singular instances, it reaches a stationarity neighborhood up to the noise level. This research enhances the practical applicability of trust region SQP algorithms in noisy optimization scenarios. We provide numerical experiments to demonstrate that the algorithm handles noise effectively.

13:44
Zeroth-order Riemannian Averaging Stochastic Approximation Algorithms

ABSTRACT. We present Zeroth-order Riemannian Averaging Stochastic Approximation (Zo-RASA) algorithms for stochastic optimization on Riemannian manifolds. We show that Zo-RASA achieves optimal sample complexities for generating ϵ-approximate first-order stationary solutions using only one-sample or constant-order batches in each iteration. Our approach employs Riemannian moving-average stochastic gradient estimators and a novel Riemannian-Lyapunov analysis technique for convergence analysis. We improve the algorithm's practicality by using retractions and vector transport, instead of exponential mappings and parallel transports, thereby reducing per-iteration complexity. Additionally, we introduce a novel geometric condition, satisfied by manifolds with bounded second fundamental form, which enables new error bounds for approximating parallel transport with vector transport.

14:06
Parsimonious Trajectories in Gradient Descent: Emerging Low-dimensionality in Over-parameterized Models

ABSTRACT. Recent studies have shown that gradient descent (GD) and its variants inherently favor “parsimonious” solutions, tending towards optimal solutions in a low-dimensional region even within overparameterized spaces. This paper explores the underlying mechanics of this phenomenon, establishing minimal requisite conditions under both convex and nonconvex settings. We demonstrate that certain GD variants, under these conditions, efficiently target parsimonious solutions. Intriguingly, we find that Perturbed Gradient Descent (PGD), which reaches second-order stationary points efficiently in classic nonconvex settings, underperforms on standard overparameterized nonconvex problems like matrix factorization. To address this, we introduce Two-Sided Perturbed Gradient Descent (TSP-GD), an algorithm tailored for complex nonconvex optimization. TSP-GD achieves rapid convergence to second-order stationary points in the overparameterized regime, akin to PGD in low-dimensional spaces. This innovation in TSP-GD could be of independent interest to the nonconvex optimization community.

13:00-14:30 Session SunB 5
13:00
Riemannian Stochastic Variance-Reduced Cubic Regularized Newton Method

ABSTRACT. The talk presents a stochastic variance-reduced cubic regularized Newton algorithm to optimize the finite-sum problem over Riemannian manifolds. The proposed double-loop algorithm requires a full gradient and Hessian update at the beginning of the outer loop while it performs stochastic variance-reduced updates in the inner iterations. The complexity analysis of the proposed algorithm and its variant to obtain a second-order stationary solution will be presented. Finally, the proposed method is evaluated and compared numerically with other Riemannian second-order methods.

13:22
Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein Space

ABSTRACT. Variational inference (VI) seeks to approximate a target distribution π by an element of a tractable family of distributions. Of key interest in statistics and machine learning is Gaussian VI, which approximates π by minimizing the Kullback–Leibler (KL) divergence to π over the space of Gaussians, which forms a Riemannian manifold. In this work, we develop the (Stochastic) Forward-Backward Gaussian Variational Inference (FB–GVI) algorithm to solve Gaussian VI. Our approach exploits the composite structure of the KL divergence, which can be written as the sum of a smooth term (the potential) and a non-smooth term (the entropy) over the Bures–Wasserstein (BW) space of Gaussians endowed with the Wasserstein distance. For our proposed algorithm, we obtain state-of-the-art convergence guarantees when the target is log-smooth and log-concave, as well as the first convergence guarantees to first-order stationary solutions when the target is only log-smooth.
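For orientation, the composite structure referred to above can be written, in generic notation with target $\pi \propto e^{-V}$, as
\[
\mathrm{KL}(\mu \,\|\, \pi) \;=\; \int V \, d\mu \;+\; \int \mu \log \mu \, dx \;+\; \mathrm{const},
\]
where the first (potential) term is smooth and the second (entropy) term is non-smooth; FB–GVI alternates a forward gradient step on the potential with a backward JKO/proximal step on the entropy over the BW space.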

13:44
Tangent Subspace Descent on Quotient Manifolds

ABSTRACT. The tangent subspace descent method (TSD) extends the coordinate descent algorithm to manifold domains. The key insight underlying TSD is to draw an analogy between coordinate blocks in Euclidean space and tangent subspaces of a manifold. The core principle behind ensuring convergence of TSD for smooth functions is the appropriate choice of subspace at each iteration. In this talk, we will show that it is always possible to appropriately pick such subspaces on the broad class of quotient manifolds known as reductive homogeneous spaces. This class includes Grassmannians, flag manifolds, and the positive definite manifold when endowed with an appropriate geometry. We then exhibit the first instance of TSD on a quotient manifold that is not a reductive homogeneous space, the manifold of fixed-rank positive semidefinite matrices. As a result of our developments, we derive new and efficient methods for large-scale optimization on these domains.

14:06
Proximal Oracles for Optimization and Sampling

ABSTRACT. We consider convex optimization with non-smooth objective function and log-concave sampling with non-smooth potential (negative log density). In particular, we study two specific settings where the convex objective/potential function is either semi-smooth or in composite form as the finite sum of semi-smooth components. To overcome the challenges caused by non-smoothness, our algorithms employ two powerful proximal frameworks in optimization and sampling, namely, the proximal point framework and the alternating sampling framework (ASF), which uses Gibbs sampling on an augmented distribution. A key component of both optimization and sampling algorithms is the efficient implementation of the proximal map by the cutting-plane method. We establish the iteration-complexity of the proximal map in both semi-smooth and composite settings. We then propose an adaptive proximal bundle method for non-smooth optimization. The proposed method is universal, as it does not need any problem parameters as input. Furthermore, we develop a proximal sampling oracle that resembles the proximal map in optimization and establish its complexity using a novel technique (a modified Gaussian integral). Finally, we combine this proximal sampling oracle and ASF to obtain a Markov chain Monte Carlo method with non-asymptotic complexity bounds for sampling in semi-smooth and composite settings.
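As a reference point, the standard definitions of the proximal map and its sampling analogue (not specific to this talk) are
\[
\mathrm{prox}_{\eta f}(y) \;=\; \arg\min_{x}\Big\{ f(x) + \tfrac{1}{2\eta}\|x-y\|^{2} \Big\},
\qquad
X \sim \pi^{y}(x) \;\propto\; \exp\!\Big(-f(x) - \tfrac{1}{2\eta}\|x-y\|^{2}\Big),
\]
so the cutting-plane machinery used to implement the optimization proximal map has a natural counterpart in the proximal sampling oracle appearing in the alternating sampling framework.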

13:00-14:30 Session SunB 6
13:00
Maximizing and Satisficing in Multi-armed Bandits with Graph Information

ABSTRACT. Pure exploration in multi-armed bandits has emerged as an important framework for modeling decision making and search under uncertainty. In modern applications, however, one is often faced with a tremendously large number of options, and even obtaining one observation per option may be too costly, rendering traditional pure exploration algorithms ineffective. Fortunately, one often has access to similarity relationships amongst the options that can be leveraged. In this paper, we consider the pure exploration problem in stochastic multi-armed bandits where the similarities between the arms are captured by a graph and the rewards may be represented as a smooth signal on this graph. In particular, we consider the problem of finding the arm with the maximum reward (i.e., the maximizing problem) or one that has sufficiently high reward (i.e., the satisficing problem) under this model. We propose novel algorithms GRUB (GRaph based UcB) and zeta-GRUB for these problems and provide theoretical characterization of their performance which specifically elicits the benefit of the graph side information. We also prove a lower bound on the data requirement that shows a large class of problems where these algorithms are near-optimal. We complement our theory with experimental results that show the benefit of capitalizing on such side information.

13:22
Branching Adaptive Surrogate Search Method for High-dimensional Optimization

ABSTRACT. We propose an adaptive search algorithm, Branching Adaptive Surrogate Search for Optimization (BASSO), for high-dimensional black-box optimization problems. BASSO combines branching and surrogate modeling for continuous and mixed integer optimization. The proposed algorithm partitions the feasible region sequentially into smaller sub-regions, and embeds an optimization strategy to balance exploration with exploitation. We propose to use Gaussian processes as a surrogate model to guide the optimization within a sub-region. We address scalability by incorporating an adaptation of sampled fictitious play. We present an initial finite-time performance analysis in terms of the number of function evaluations until reaching an epsilon-optimum. Numerical experiments on benchmark black-box problems are presented. We also demonstrate the applicability of BASSO on the coverage monitoring and falsification for Cyber-Physical Systems (CPS) simulation.

13:44
Package Queries for Massive-scale In-Database Decision Support

ABSTRACT. A package query over a table in a relational database system extends traditional database querying by returning a "package" (multiset of rows) that optimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Package queries are equivalent to integer linear programs (ILPs) where the number of decision variables equals the number of rows in the table, which can run to the billions. We describe Progressive Shading, a novel algorithm for approximately solving package queries that can scale efficiently to billions of rows and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original rows into homogeneous groups until the original relation is obtained. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes (columns) and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as the kd-tree. We further optimize our system by replacing off-the-shelf optimization software with customized ILP and LP solvers that exploit the special ILP structure to achieve high accuracy and orders-of-magnitude speedup.

14:06
Potential Games on Cubic Splines for Multi-Agent Motion Planning

ABSTRACT. For multi-agent systems with self-interested agents, motion planning methods must provide solutions that are stable for best-response dynamics. If this stability is violated, some agent can unilaterally improve their payoff while making the joint payoff arbitrarily worse. Many practically interesting multi-agent systems satisfy sufficient assumptions to model the system as a potential game, which guarantees existence of a Nash Equilibrium (NE) and provides simple algorithms to compute one. Existing work that solves for NE motion plans in potential games iteratively solves non-convex optimization problems to compute best responses for each agent, posing significant computational challenges in scaling to large numbers of agents or long horizons. We prove that simultaneous gradient descent using independent per-agent step sizes converges to (local) NE trajectories, enabling a simpler per-iteration algorithm, and we further abstract the problem by parameterizing agent trajectories with natural cubic splines. The higher abstraction enables us to solve for local NE in experiments with up to 200 agents or a 50-minute horizon in the practically interesting domains of urban driving, cooperative UAV search and surveillance, and autonomous electric vertical take-off and landing vehicles (eVTOL).

13:00-14:30 Session SunB 7
13:00
Recycling Valid Inequalities for Robust Optimization

ABSTRACT. Valid inequalities play an important role in speeding up MIP solvers, in particular for solving combinatorial optimization problems. In robust combinatorial optimization, the objective coefficients are subject to uncertainty, and the performance of MIP solvers typically degrades significantly compared to the deterministic setting. For the case of budgeted uncertainty, we introduce the idea of recycling valid inequalities to obtain valid inequalities for the robust setting. The idea is based on the compact reformulation obtained by dualization, which causes a weak linear relaxation. By considering a bilinear relaxation instead, we derive strong valid linear inequalities. These inequalities might be facet-defining even if the original valid inequality was not defining a facet of the convex hull. Ideas for recycling further valid inequalities that cannot be recycled by the original idea are presented as well. Computational experiments show the effect of the new inequalities on the performance of MIP solvers.

13:22
A Coordinated Parallelization Algorithm for Compromise Stochastic Programming

ABSTRACT. The Stochastic Decomposition (SD) algorithm has a long history of addressing large-scale applications of stochastic programming (SP) problems. However, unlike many SP algorithms such as Stochastic Gradient Descent (SGD), the use of distributed clusters of computers in SD has not received much attention in the past. This is primarily due to the dynamic nature of the sample size. In this paper, we introduce a coordinated parallelization algorithm which is designed to integrate the idea of compromise decisions within the iterative framework of SD. This algorithm comprises multiple ingredients: adaptive sequential sampling, a stochastic conjugate subgradient method for direction-finding, and a line-search technique to update decision variables. This approach is fundamentally different from traditional SD because the latter was designed in a manner similar to trust-region approaches, whereas the new algorithm studied in this paper is based on a direction-finding approach. Through experimental analysis, we aim to demonstrate that our proposed algorithm not only retains but may surpass the scalability of the parallelized SGD algorithm. Furthermore, due to its use of compromise decisions in each iteration, it is expected to significantly reduce the variance of the decision variables and substantially improve both the speed and precision of the optimization process.

13:44
Branching in the Dual Decomposition Method for Stochastic Integer Programs

ABSTRACT. When solving the Dual Decomposition of a Mixed Integer Stochastic Programming Problem, most of the time we are not able to solve the original problem by just optimizing the root node, so branching is required to close the optimality gap. The choice of branching entity can have a huge impact on the size of the enumeration tree in MIP. We studied the impact of branching entity selection in dual decomposition for SMIP. We present a new branching method and two families of SMIP problem instances. We implement our new branching method and several other methods, and present extensive computational results showing that the new method reduces the number of MIPs and nodes solved when solving SMIP.

14:06
A Unifying Theory for the Reliability of Stochastic Programming Solutions using Compromise Decisions

ABSTRACT. This paper studies the reliability of stochastic programming solutions using compromise decisions. A compromise decision is obtained by minimizing the aggregation of objective function approximations across the replications while regularizing the candidate decisions of all the replications, which we refer to as the Compromise Decision problem. The Rademacher average of families of functions is used to bound the sample complexity of the compromise decisions.

13:00-14:30 Session SunB 8
13:00
Accelerated Gradient Descent via Long Steps

ABSTRACT. This talk will discuss recent work establishing provably faster convergence rates for gradient descent in smooth convex optimization via semidefinite programming and computer-assisted analysis techniques. We do this by allowing nonconstant stepsize policies with frequent long steps potentially violating descent. This is managed by analyzing the overall effect of many iterations at once rather than the typical one-iteration inductions used in most first-order method analyses. We prove an O(1/T^{1.0564}) convergence rate, beating the classic O(1/T) rate simply by periodically including longer steps (no momentum needed!).
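For illustration only, the sketch below runs gradient descent with a periodic stepsize pattern that includes a long step; the particular pattern is hypothetical, whereas the certified patterns in the talk come from the computer-assisted analysis.

# Gradient descent with a periodic "long step" (illustrative sketch).
import numpy as np

def gd_long_steps(grad, x0, L, pattern=(1.0, 1.0, 1.0, 3.0), iters=200):
    # Stepsizes h_t / L cycle through `pattern`; h_t may exceed 2, so individual
    # iterations need not be descent steps.
    x = np.array(x0, dtype=float)
    for t in range(iters):
        h = pattern[t % len(pattern)]
        x = x - (h / L) * grad(x)
    return x

# Example on a smooth convex quadratic f(x) = 0.5 x^T A x with L = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M.T @ M / 20
L = np.linalg.eigvalsh(A).max()
x = gd_long_steps(lambda v: A @ v, x0=rng.standard_normal(20), L=L)
print(0.5 * x @ A @ x)   # final objective value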

13:22
Beyond Discretization: Learning the Optimal Solution Path

ABSTRACT. We develop and analyze several different first-order algorithms for computing an approximately optimal solution path of a parameterized convex optimization problem. Example applications include regularized learning, weighted binary classification, multi-task learning algorithms like data-pooling, and multi-objective optimization. Our approach employs a prescribed set of basis functions and stochastic gradient descent (SGD) with randomization over parameters to fit a linear combination of basis functions that approximately minimizes the worst-case suboptimality with respect to the parametric objective function. We bound the total iteration complexity of our method when the target optimization (for a fixed parameter) is strongly convex and smooth. Loosely, when exact gradients for the target optimization are available, we fit an $\epsilon$-approximate solution path with $O(\log(1/\epsilon))$ iterations, whereas traditional discretization methods require at least $\tilde O(\epsilon^{-1/2})$ iterations. When only unbiased, stochastic gradients are available, we require $O(\epsilon^{-1})$ iterations whereas a traditional discretization requires $O(\epsilon^{-3/2})$ iterations. These complexity improvements are based on a novel extension of the analysis of SGD for interpolation problems to problems that are only slightly non-interpolable.
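In generic notation (symbols chosen here only to illustrate the setup described above), writing the parameterized problem as $\min_x f(x;\theta)$ for $\theta$ in a parameter set $\Theta$, the method fits coefficients $\beta$ of a path approximation
\[
x_\beta(\theta) \;=\; \sum_{j=1}^{m} \beta_j\, \phi_j(\theta),
\qquad
\min_{\beta}\;\sup_{\theta \in \Theta}\;\Big[ f\big(x_\beta(\theta);\theta\big) \;-\; \min_{x} f(x;\theta) \Big],
\]
approximately, by running SGD over randomly sampled parameters $\theta$.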

13:44
Advancing SGD: Exploring the Interplay of Stochastic Polyak Step-size and Momentum

ABSTRACT. Stochastic Polyak Step-size (SPS) and its variants emerge as new step-size selections for stochastic gradient descent (SGD) with favorable properties. The key benefits of SPS include its easy-to-tune step size without the need for unknown quantities such as the smoothness and strong convexity constants, its universality, and its strong theoretical convergence guarantees.

On the other hand, momentum is a commonly used tool that practically improves the performance of SGD in the training of machine learning models. This talk will explore the relationships between SPS variants and momentum. We'll investigate whether using momentum in a straightforward manner is effective, the potential for an adaptive momentum parameter, and whether there are any theoretical or practical benefits to applying momentum alongside SPS.
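For reference, one commonly used form of the stochastic Polyak step-size (the exact variants examined in the talk may differ), combined here with a heavy-ball-style momentum update, is
\[
\gamma_t \;=\; \min\!\left\{ \frac{f_{i_t}(x_t) - f_{i_t}^{*}}{c\,\|\nabla f_{i_t}(x_t)\|^{2}},\; \gamma_b \right\},
\qquad
x_{t+1} \;=\; x_t - \gamma_t \nabla f_{i_t}(x_t) + \beta\,(x_t - x_{t-1}),
\]
where $i_t$ is the sampled index, $f_{i_t}^{*}$ is (a lower bound on) the minimum of $f_{i_t}$, $c>0$ and $\gamma_b$ are constants, and $\beta \in [0,1)$ is the momentum parameter.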

14:06
Acceleration of Level Adjustment for the Polyak Stepsize: Applications to Convex and Non-Convex Problems

ABSTRACT. The Polyak stepsize rule has been widely used for convex problems. However, the stepsize requires a generally unknown optimal value. To guarantee convergence, we develop a novel level adjustment approach to successively estimate the optimal value with provable convergence. This is achieved by developing a decision-based procedure through a novel easy-to-solve "Polyak Stepsize Violation Detector" (PSVD) linear constraint satisfaction problem. Once a violation is detected, the level value is recalculated. Furthermore, because subgradients may be computationally costly, we develop an approximate subgradient method. Through a series of empirical tests of convex functions with diverse characteristics, we illustrate the practical advantages of our approach as compared to existing methods. The method has also been successfully utilized to efficiently optimize Lagrangian dual problems arising when solving discrete programs from OR, power systems, transportation, and manufacturing, thereby leading to speedups of 2-3 orders of magnitude as compared to standard MIP methods.

13:00-14:30 Session SunB 9
13:00
On graph sequences with finite-time consensus

ABSTRACT. In this talk, we study the finite-time consensus property of a sequence of graphs. Graph sequences with the finite-time consensus property have the desirable feature that iterating through the entire graph sequence is equivalent to performing global or exact averaging. We present several classes of graph sequences for which this seemingly restrictive property holds, including the static de Bruijn graphs, one-peer exponential graphs, one-peer hyper-cubes, p-peer hyper-cuboids, and base k+1 graphs. For each graph sequence, we give an explicit representation of the weight matrices and discuss their potential usefulness in decentralized averaging.
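As a small illustration of the finite-time consensus property (a sketch of the one-peer hyper-cube case only; the talk covers several more families and their weight matrices):

# One-peer hyper-cube sequence on n = 2^k agents: after k rounds, in each of
# which every agent averages with a single peer, all agents hold the exact
# global average. Illustrative sketch only.
import numpy as np

def one_peer_hypercube_round(x, bit):
    # agent i averages with agent i XOR 2^bit
    n = x.shape[0]
    y = np.empty_like(x)
    for i in range(n):
        y[i] = 0.5 * (x[i] + x[i ^ (1 << bit)])
    return y

rng = np.random.default_rng(0)
k = 4
x = rng.standard_normal((2**k, 3))     # each agent holds a vector in R^3
z = x.copy()
for bit in range(k):                   # iterate through the graph sequence
    z = one_peer_hypercube_round(z, bit)
print(np.allclose(z, x.mean(axis=0)))  # True: exact averaging after k rounds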

13:22
On the use of the finite-time consensus property in decentralized optimization

ABSTRACT. In this talk, we consider the incorporation of sequences of graphs satisfying the finite-time consensus property in decentralized optimization. We first discuss existing strategies to solve the decentralized optimization problem and highlight their limitations in balancing the trade-off between computation and communication. Then, we detail why current analyses of certain decentralized algorithms fail to encapsulate the finite-time consensus property. Finally, we conclude by demonstrating how to leverage the finite-time consensus property in two existing decentralized optimization algorithms: Decentralized Gradient Descent (DGD) and Time-Varying Gradient Tracking (TV-GT). We give the convergence rate of DGD and TV-GT when using the finite-time consensus property in the case where agents' objective functions are non-convex and agents only have access to stochastic estimates of the gradient of their local objective function. Our convergence analysis indicates that the use of finite-time consensus graphs can lead to communication cost reduction compared to traditional strategies.

13:44
Dual distributed optimization algorithms for potentially disconnected time-varying networks

ABSTRACT. We present new distributed optimization algorithms for both the centralized (i.e., federated learning) and decentralized settings. The proposed methods use a dual oracle that has access to the gradient of the Fenchel conjugate of the individual loss functions. Previous work on dual distributed methods either considers a static topology or undirected, connected time-varying networks (in the decentralized setting), and cannot encapsulate client sampling in federated learning or the recently studied finite-time consensus graph sequences in decentralized optimization. The presented algorithms, however, can handle these two situations, and accelerated versions are proposed to further improve practical performance. Convergence guarantees are provided when the individual loss functions are strongly convex, and numerical results are presented to validate the theoretical findings.

14:06
Understanding the Influence of Directed Networks on Decentralized Optimization: Effective Metrics, Lower Bound, and Optimal Algorithm

ABSTRACT. In this talk, we will investigate the influence of directed networks on the convergence of smooth non-convex stochastic decentralized optimization associated with column-stochastic mixing matrices. We find that the canonical spectral gap, a widely used metric in undirected networks, alone fails to adequately characterize the impact of directed networks. Through a new analysis of the Push-Sum strategy, a fundamental building block for decentralized algorithms over directed graphs, we identify another novel metric called equilibrium skewness. Next, we establish the first convergence lower bound for non-convex stochastic decentralized algorithms over directed networks, which explicitly manifests the impact of both the spectral gap and equilibrium skewness and justifies the imperative need for both metrics in analysis. Moreover, by jointly considering the spectral gap and equilibrium skewness, we present the state-of-the-art convergence rate for the Push-DIGing algorithm. Our findings reveal that these two metrics exert a more pronounced negative impact on Push-DIGing than suggested by our lower bound. We further integrate the technique of multi-round gossip into Push-DIGing to obtain MG-Push-DIGing, which nearly achieves the established lower bound, demonstrating its convergence optimality, best-possible resilience to directed networks, and the tightness of our lower bound. Experiments verify our theoretical findings.

13:00-14:30 Session SunB10
13:00
Dimension-Free Zeroth-Order Algorithms for Stochastic Programming Under Dimension-Independent Budget Conditions

ABSTRACT. This paper concerns a stochastic zeroth-order optimization (S-ZOO) problem under the assumption of a stochastic zeroth-order oracle. In the modern literature, the best known rate is O(d/\epsilon^2). While this complexity is provably optimal, it undesirably grows polynomially (linearly) in the problem dimensionality d (the number of decision variables). With the rapidly inflating dimensionality of modern optimization problems, problems with millions of (and more) dimensions are no longer uncommon. As a result, a polynomial dependence on d becomes insufficient for many modern applications. In this paper, we consider the question of whether a less dimension-sensitive complexity can be achieved under reasonable structural conditions. This would allow a much wider class of S-ZOO problems with massive dimensionality to be solvable at reasonable computational effort.

13:22
Efficient and Scalable Reinforcement Learning via Hypermodel

ABSTRACT. Efficiency and scalability in complex environments remain pressing challenges in reinforcement learning (RL). Traditionally, Thompson Sampling has offered principled approaches to deep exploration in RL but falls short in terms of computational feasibility in complex environments due to its reliance on maintaining posterior distributions. The posterior is only tractable in simple environments with restrictive assumptions. A crucial aspect of modern RL is thus to develop an algorithm that is both data and computationally efficient, and scalable to complex environments. HyperFQI, our proposed framework, addresses these challenges head-on. It can be regarded as approximate Thompson Sampling, employing a hypermodel for efficient uncertainty estimation in the action-value function and thus enabling deep exploration in complex environments. HyperFQI also incorporates an incremental update mechanism, allowing for computationally efficient model updates, essential for compatibility with the demands of modern deep RL and scalability to complex and high-dimensional environments. Empirically, HyperFQI demonstrates robust performance. It efficiently solves the DeepSea benchmark, illustrating its capability for deep exploration and scalability to large state spaces. Additionally, in the Atari benchmark, HyperFQI achieves super-human performance with just 2M interactions, highlighting its computational efficiency and scalability to high-dimensional environments. Theoretically, HyperFQI undergoes a rigorous regret performance analysis, affirming its computational and data efficiency. This positions HyperFQI as the first RL algorithm with provable and practical efficiency in large-scale complex environments. This work bridges the theoretical and practical realms of RL, establishing a new benchmark for efficient and scalable RL algorithms.

13:44
On First-Order Meta-Reinforcement Learning with Moreau Envelopes

ABSTRACT. Meta-Reinforcement Learning (MRL) is a promising framework for training agents that can quickly adapt to new environments and tasks. In this work, we study the MRL problem under the policy gradient formulation, where we propose a novel algorithm that uses Moreau envelope surrogate regularizers to jointly learn a meta-policy that is adjustable to the environment of each individual task. Our algorithm, called Moreau Envelope Meta-Reinforcement Learning (MEMRL), learns a meta-policy that can adapt to a distribution of tasks by efficiently updating the policy parameters using a combination of gradient-based optimization and Moreau Envelope regularization. Moreau Envelopes provide a smooth approximation of the policy optimization problem, which enables us to apply standard optimization techniques and converge to an appropriate stationary point. We provide a detailed analysis of the MEMRL algorithm, where we show a sublinear convergence rate to a first-order stationary point for non-convex policy gradient optimization. We finally show the effectiveness of MEMRL on a multi-task 2D-navigation problem.
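For context, the Moreau envelope of a function $f$ with parameter $\lambda > 0$ is the standard construction
\[
M_{\lambda} f(x) \;=\; \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\,\|x-y\|^{2} \Big\},
\]
a smooth approximation of $f$ whose gradient, for convex $f$, is $\nabla M_{\lambda} f(x) = \tfrac{1}{\lambda}\big(x - \mathrm{prox}_{\lambda f}(x)\big)$; how this envelope enters the surrogate regularizer of MEMRL is described in the talk.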

14:06
Bayesian optimization of risk based on Student-t process and extreme value statistics

ABSTRACT. We consider the problem of optimizing tail risk measures (particularly Conditional Value-at-Risk, CVaR) for black-box functions. We follow the Bayesian optimization framework with a few variations compared to traditional implementations. First, we employ the Student-t process instead of the commonly used Gaussian process as the prior in the Bayesian optimization framework. Since it places a larger emphasis on extreme outliers, we surmise that the Student-t process may be better suited to the task of optimizing risk. To this end, we provide a closed-form formula for the upper confidence bound acquisition function and then demonstrate how the Student-t process version of Bayesian optimization of risk can be organized. Second, we consider extreme value theory (EVT) as the basis for the estimation of the underlying risk measures. Specifically, we show how variance can be estimated in an efficient way to enable EVT-based risk surface estimation. We will then present some preliminary results with benchmark functions using CVaR as the objective.

13:00-14:30 Session SunB11
13:00
Relaxation & Approximation Algorithm for Solving Sparse PCA Under Robust Uncertainty Set

ABSTRACT. Sparse Principal Component Analysis (PCA) is a fundamental dimensionality reduction tool that ensures interpretability in various high-dimensional settings. This paper investigates a robust generalization of the sparse PCA problem, wherein the covariance matrix is contained within a specified uncertainty set. Notably, existing related literature mainly focuses on the statistical and computational aspects of sparse PCA under adversarially perturbed samples, with a noticeable gap in the theoretical properties of the proposed problem. This paper begins by exploring the support recovery properties of the proposed problem. Then, we propose an efficient approximation algorithm for solving sparse PCA under a robust uncertainty set.

13:22
Robust Personalized Design via Bayesian Optimization under Covariate Uncertainty

ABSTRACT. The Bayesian Optimization framework for personalized design was proposed to optimize a functional design policy that assigns the optimal design based on human covariates. Our method seeks to minimize expected responses across a population, considering the inherent uncertainties in covariate measurement. Despite the noiseless setting of the computer simulation model, real applications of robust personalized design present significant challenges due to the high computational cost of modeling the complex system and reliability concerns about potential differences between the selected and true optimal designs. The presence of measurement errors in covariates can also lead to discrepancies in system response performance. To address these challenges, we propose a robust Bayesian Optimization framework to minimize the discrepancy in system response by imposing penalties on the variation of system responses induced by covariate uncertainties. An appropriate acquisition function is introduced to handle the mixed type of categorical and continuous design variables when searching for new samples. This robust design policy reduces the sensitivity of the optimal design policy to covariate uncertainties while maintaining system performance. Both the numerical examples based on synthetic data and the vehicle restraint system design demonstrate the efficacy and robustness of the proposed method.

13:44
Forecasting Market-Suitable Vegetables by Using a Decision Tree Algorithm

ABSTRACT. One of the main sources of food losses occurs in the fruit and vegetable supply chain. These losses are mainly associated with poor operational management and coordination between the different stages of this chain. In this study, food losses are indirectly addressed in pre-harvest, that is, in the cultivation stage, using a decision tree algorithm. Thus, the algorithm predicts whether the vegetables planted on a farm will be suitable for a market. Consequently, vegetables that are not suitable for a market are considered as food losses. To evaluate the classification accuracy of the algorithm, it was applied to a case study, corresponding to a lettuce farm located in Chile. The results showed that the developed decision tree algorithm achieved reliable forecasts, even when analysing early stages of lettuce growth.

14:45-16:15 Session SunC 1
14:45
A Conditional Gradient-based Method for Simple Bilevel Optimization with Convex Lower-level Problem

ABSTRACT. In this talk, we consider a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex-constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.

15:07
Optimal Algorithms for Stochastic Bilevel Optimization under Relaxed Smoothness Conditions

ABSTRACT. Stochastic Bilevel optimization usually involves minimizing an upper-level (UL) function that is dependent on the arg-min of a strongly-convex lower-level (LL) function. Several algorithms utilize Neumann series to approximate certain matrix inverses involved in estimating the implicit gradient of the UL function (hypergradient). The state-of-the-art StOchastic Bilevel Algorithm (SOBA) instead uses stochastic gradient descent steps to solve the linear system associated with the explicit matrix inversion. This modification enables SOBA to match the lower bound of sample complexity for the single-level counterpart in non-convex settings. Unfortunately, the current analysis of SOBA relies on the assumption of higher-order smoothness for the UL and LL functions to achieve optimality. In this paper, we introduce a novel fully single-loop and Hessian-inversion-free algorithmic framework for stochastic bilevel optimization and present a tighter analysis under standard smoothness assumptions (first-order Lipschitzness of the UL function and second-order Lipschitzness of the LL function). Furthermore, we show that by a slight modification of our approach, our algorithm can handle a more general multi-objective robust bilevel optimization problem. For this case, we obtain the state-of-the-art oracle complexity results demonstrating the generality of both the proposed algorithmic and analytic frameworks. Numerical experiments demonstrate the performance gain of the proposed algorithms over existing ones.

15:29
SimFBO: Towards Simple, Flexible and Communication-efficient Federated Bilevel Learning

ABSTRACT. Federated bilevel optimization (FBO) has shown great potential recently in machine learning and edge computing due to the emerging nested optimization structure in meta-learning, fine-tuning, hyperparameter tuning, etc. However, existing FBO algorithms often involve complicated computations and require multiple sub-loops per iteration, each of which contains a number of communication rounds. In this paper, we propose a simple and flexible FBO framework named SimFBO, which is easy to implement without sub-loops, and includes a generalized server-side aggregation and update for improving communication efficiency. We further propose System-level heterogeneity robust FBO (ShroFBO) as a variant of SimFBO with stronger resilience to heterogeneous local computation. We show that SimFBO and ShroFBO provably achieve a linear convergence speedup with partial client participation and client sampling without replacement, as well as improved sample and communication complexities. Experiments demonstrate the effectiveness of the proposed methods over existing FBO algorithms.

15:51
On Penalty Methods for Bilevel Optimization

ABSTRACT. We study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter σ > 0. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be O(σ)-close. A by-product of our analysis is the explicit formula for the gradient of the hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as an O(σ)-approximation of the original BO, we propose a first-order algorithm that finds an ε-stationary solution by optimizing the penalty formulation with σ = O(ε). When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an ε-stationary point of the penalty function. We then discuss the iteration complexities to reach the stationary point under different gradient oracle assumptions.
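One common way to make the weighted-sum construction concrete (stated here only as an illustration of the type of penalty formulation involved; the talk's exact definitions may differ) is
\[
h_{\sigma}(x) \;=\; \frac{1}{\sigma}\Big( \min_{y}\big\{ \sigma f(x,y) + g(x,y) \big\} \;-\; \min_{y} g(x,y) \Big),
\]
which, under suitable conditions, is $O(\sigma)$-close in values and derivatives to the hyper-objective $h(x) = \min\{ f(x,y) : y \in \arg\min_{z} g(x,z) \}$.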

14:45-16:15 Session SunC 2
14:45
Learning Optimal and Fair Policies for Online Allocation of Scarce Societal Resources from Data Collected in Deployment

ABSTRACT. We study the problem of allocating scarce societal resources of different types (e.g., permanent housing, deceased donor kidneys for transplantation, ventilators) to heterogeneous allocatees on a waitlist (e.g., people experiencing homelessness, individuals suffering from end-stage renal disease, Covid-19 patients) based on their observed covariates. We leverage administrative data collected in deployment to design an online policy that maximizes expected outcomes while satisfying budget constraints, in the long run. Our proposed policy waitlists each individual for the resource maximizing the difference between their estimated mean treatment outcome and the estimated resource dual-price or, roughly, the opportunity cost of using the resource. Resources are then allocated as they arrive, in a first-come first-serve fashion. We demonstrate that our data-driven policy almost surely asymptotically achieves the expected outcome of the optimal out-of-sample policy under mild technical assumptions. We extend our framework to incorporate various fairness constraints. We evaluate the performance of our approach on the problem of designing policies for allocating scarce housing resources to people experiencing homelessness in Los Angeles based on data from the homeless management information system. In particular, we show that using our policies improves rates of exit from homelessness by 1.9% and that policies that are fair in either allocation or outcomes by race come at a very low price of fairness.

15:07
Wasserstein Robust Classification with Fairness Constraints

ABSTRACT. Data analytics models and machine learning algorithms are increasingly deployed to support consequential decision-making processes, from deciding which applicants will receive job offers and loans to university enrollments and medical interventions. However, recent studies show these models may unintentionally amplify human bias and yield significant unfavorable decisions to certain groups. We propose a distributionally robust classification model with a fairness constraint that encourages the classifier to be fair in the equality of opportunity criterion. We use a type-$\infty$ Wasserstein ambiguity set centered at the empirical distribution to model distributional uncertainty and derive a conservative reformulation for the worst-case equal opportunity unfairness measure. We establish that the model is equivalent to a mixed binary conic optimization problem, which standard off-the-shelf solvers can solve. We propose a convex, hinge-loss-based model for large problem instances whose reformulation does not incur binary variables to improve scalability. Moreover, we also consider the distributionally robust learning problem with a generic ground transportation cost to hedge against the label and sensitive attribute uncertainties. We numerically examine the performance of our proposed models on five real-world datasets related to individual analysis. Compared with the state-of-the-art methods, our proposed approaches significantly improve fairness with negligible loss of predictive accuracy in the testing dataset.

15:29
Learning Fair Policies for Multi-stage Selection Problems from Observational Data

ABSTRACT. We consider the problem of learning fair policies for multistage selection problems from observational data. This problem arises in several high-stakes domains such as company hiring, loan approval, or bail decisions where outcomes (e.g., career success, loan repayment, recidivism) are only observed for those selected. We propose a multi-stage framework that can be augmented with various fairness constraints, such as demographic parity or equal opportunity. This problem is a highly intractable infinite chance-constrained program involving the unknown joint distribution of covariates and outcomes. Motivated by the potential impact of selection decisions on people’s lives and livelihoods, we propose to focus on interpretable linear selection rules. Leveraging tools from causal inference and sample average approximation, we obtain an asymptotically consistent solution to this selection problem by solving a mixed binary conic optimization problem, which can be solved using standard off-the-shelf solvers. We conduct extensive computational experiments on a variety of datasets adapted from the UCI repository on which we show that our proposed approaches can achieve an 11.6% improvement in precision and a 38% reduction in the measure of unfairness compared to the existing selection policy.

15:51
On a Fair and Risk-averse Urban Air Mobility Resource Allocation Problem Under Demand and Capacity Uncertainties

ABSTRACT. Urban Air Mobility (UAM) is an emerging air transportation mode to alleviate the ground traffic burden and achieve zero direct aviation emissions. Due to the potential economic scaling effects, the UAM traffic flow is expected to increase dramatically once implemented, and its market can be substantially large. To be prepared for the era of UAM, we study the Fair and Risk-averse Urban Air Mobility resource allocation model (FairUAM) under passenger demand and airspace capacity uncertainties for fair, safe, and efficient aircraft operations. FairUAM is a two-stage model, where the first stage is the aircraft resource allocation, and the second stage is to fairly and efficiently assign the ground and airspace delays to each aircraft provided the realization of random airspace capacities and passenger demand. We show that FairUAM is NP-hard even when there is no delay assignment decision or no aircraft allocation decision. Thus, we recast FairUAM as a mixed-integer linear program (MILP) and explore model properties and strengthen the model formulation by developing multiple families of valid inequalities. The stronger formulation allows us to develop a customized exact decomposition algorithm with both Benders and L-shaped cuts, which significantly outperforms the off-the-shelf solvers. Finally, we numerically demonstrate the effectiveness of the proposed method and draw managerial insights when applying FairUAM to a real-world network.

14:45-16:15 Session SunC 3
14:45
A Cycle Cancelling Heuristic for The Traveling Salesman Problem

ABSTRACT. The Traveling Salesman Problem (TSP) is one of the most studied problems in optimization. Since it is an NP-Hard problem, in practice, heuristics are used to find locally optimal tours, as opposed to a globally optimal tour. We developed a new algorithm that uses a connection between Minimum Cost Flow Problems (MCF) and the TSP to improve on a given local optimum.

MCF problems can be solved efficiently through linear programming or combinatorial algorithms such as cycle-cancelling. We investigated the potential of cycle-cancelling in the context of the TSP. By restricting the search space of cycles to cancel for a given tour, practical results show that only a small number of subtours is created, and a simple patching step suffices for a high success rate in improving a locally optimal tour.

15:07
Who Plays First?

ABSTRACT. We consider the problem of determining the socially-optimal strategy in a sequential game among a finite number of players where the order of play can change. We show that under reasonable assumptions on players' payoff functions, a popular distributed control approach called sequential trajectory planning solves a non-cooperative dynamic game and leads to a local Stackelberg equilibrium. We formulate this problem as a mixed-integer optimization problem, and we introduce Branch-and-Play, an efficient algorithm that solves it by implicitly enumerating the permutations of the order of play. Computational experiments and simulations with quadrotors and autonomous vehicles show that Branch-and-Play can reliably determine the order of play and provide informative solutions for planners, regulators, and players.

15:29
Flexible Resource Job Scheduling - A Mixed Integer Programming Approach

ABSTRACT. This paper studies the minimization of makespan in job scheduling with variable machine speeds. In this problem, allocation of a resource to a machine increases its efficiency, thus reducing its processing time. We propose both a natural nonlinear disjunctive optimization model and a time-indexed mixed-integer optimization model for the problem. We show how the time-indexed model can be used to provide a lower bound on the makespan by coarsening the time discretization, and we show how to transform the coarsened solution into a feasible solution of high quality via the solution of a linear program. Computational results are presented on benchmark job-shop scheduling instances and for a commercial application with a hierarchical, two-level bill-of-materials work flow.

15:51
Handling polyhedral symmetries with a dedicated Branch&Cut: application to a knapsack variant

ABSTRACT. In this paper, we define a new variant of the knapsack problem, the Symmetric-weight Chain Precedence Knapsack problem (SCPKP). The (SCPKP) is the core structure of the Hydro Unit Commitment problem, the latter being a production scheduling problem for hydroelectric plants. The (SCPKP) is shown to be NP-hard. Polyhedral symmetries, featured by the (SCPKP), are introduced as a generalization of the classical symmetries, applying only to the constraints without restricting the values of two symmetric solutions to be equal. A polyhedral study focuses on inequalities with 0-1 coefficients. Necessary facet-defining conditions are described through a new structure, called pattern, encoding the polyhedral symmetries of the (SCPKP). A dedicated two-phase Branch & Cut scheme is defined to exploit the symmetries on the pattern inequalities. Experimental results demonstrate the efficiency of the proposed scheme, in particular with respect to default CPLEX and a family of cover inequalities related to the Precedence Knapsack Problem.

14:45-16:15 Session SunC 4
14:45
Stochastic Lagrangian-based method for nonconvex optimization with nonlinear constraints

ABSTRACT. The Augmented Lagrangian Method (ALM) is one of the most common approaches for solving linear and nonlinear constrained problems. However, for nonconvex objectives, the handling of nonlinear inequality constraints remains challenging. In this paper, we propose a stochastic ALM with Backtracking Line Search that operates on a subset (mini-batch) of randomly selected points for solving nonconvex problems. The considered class of problems includes both nonlinear equality and inequality constraints. Together with the formal proof of the convergence properties (in expectation) of the proposed algorithm and its computational complexity, the performance of the proposed algorithm is then numerically compared against both exact and inexact state-of-the-art ALM methods. Further, we apply the proposed stochastic ALM method to solve a multi-constrained network design problem. We perform extensive numerical executions on a set of instances extracted from SNDlib to study its behavior and performance, as well as potential improvements of this method. Analysis and comparison of the results against those obtained by extending to nonlinear constraints the method developed in [Contardo2021] for the approximation of separable nonconvex optimization programs are then provided.

15:07
A Near-Optimal Method for Constrained Composite Non-Convex Non-Smooth Problems

ABSTRACT. This paper studies first-order methods (FOMs) for solving composite non-convex non-smooth optimization with linear constraints. Recently, lower complexity bounds of FOMs for finding a (near) ϵ-stationary point of the considered problem have been established. Note, however, that optimization algorithms achieving this lower bound had not previously been developed. In this paper, we present a primal-dual recovery method for solving the aimed problem. The oracle complexity of the proposed method, to find a (near) ϵ-stationary point of the considered problem, matches the lower bound up to a logarithmic factor. Therefore, our proposed method is almost non-improvable. We illustrate the advantages of our proposed algorithm over existing methods in numerical experiments.

15:29
Solving nonmonotone inclusions and Application to nonconvex QP with equality constraints

ABSTRACT. In general, a monotone inclusion problem in a real Hilbert space H refers to finding a zero of a set-valued maximally monotone operator A, i.e., a point of the zero set zer(A). The term "warped proximal iteration" has recently been introduced as a generalization of the proximal point algorithm for finding a zero of a maximally monotone operator A acting on H. The maximal monotonicity of A restricts its applicability to convex optimization problems. On the other hand, solving the general nonmonotone inclusion, i.e., the inclusion where the operator A is nonmonotone, remains an open and challenging research problem. Indeed, compared to the case where the operator A is (maximally) monotone, solving the inclusion problem becomes considerably more complicated when the operator A fails to be monotone. From this perspective, the first objective of this paper is to extend the notion of parametric ρ-hypomonotone operators to (ρV,W)−hypomonotone operators, where V and W are single-valued operators with full domains, i.e., dom(W) = dom(V) = H; the second is to investigate the convergence properties of the warped proximal iteration for solving nonmonotone inclusions; and the third is to apply this algorithm to (constrained) nonconvex optimization problems. In particular, we focus on finding KKT points for nonconvex quadratic programming problems with equality constraints.

15:51
A Parameter-free and Projection-free Restarting Level Set Method for Constrained Convex Optimization Under the Error Bound Condition

ABSTRACT. Recent efforts to accelerate first-order methods have focused on convex optimization problems that satisfy a geometric property known as the error-bound condition, which covers a broad class of problems, including piecewise linear programs and strongly convex programs. Parameter-free first-order methods that employ projection-free updates have the potential to broaden the benefit of acceleration. Such a method has been developed for unconstrained convex optimization but is lacking for constrained convex optimization. We propose a parameter-free level-set method for the latter, constrained case based on projection-free subgradient descent that exhibits accelerated convergence for problems satisfying an error-bound condition.

14:45-16:15 Session SunC 5
14:45
Certification of sums of squares via low-rank optimization

ABSTRACT. To certify that a quadric on a real projective variety is a sum of k squares, one can minimize its distance from the sum of squares of k linear forms in the space of quadrics. When k is smaller than the dimension of the space of linear forms, the certification problem can be applied to low-rank semidefinite relaxations of polynomial optimization, in the spirit of the Burer-Monteiro method. We discuss the existence of spurious local minima in these nonconvex certification problems and show that, in some interesting cases, there are no spurious local minima, or any spurious local minimum must lie on the boundary of the sum-of-squares cone. These characterizations could potentially lead to efficient algorithms for polynomial and combinatorial optimization.
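Concretely, with $q$ the quadric to be certified and $x$ the underlying variables, the nonconvex certification problem takes a form such as
\[
\min_{a_1,\dots,a_k \in \mathbb{R}^{n}} \;\Big\|\, q - \sum_{i=1}^{k} (a_i^{\top} x)^2 \,\Big\|^2 ,
\]
where the distance is measured in the space of quadrics restricted to the variety; the choice of norm and parametrization shown here is an assumption for illustration. For $k$ below the Gram-matrix rank this is a Burer-Monteiro-type factorized problem.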

15:07
Revisiting Spectral Bundle Methods: Primal-dual (Sub)linear Convergence Rates

ABSTRACT. The spectral bundle method proposed by Helmberg and Rendl is well established for solving large-scale semidefinite programs (SDPs) thanks to its low per-iteration computational complexity and strong practical performance. In this paper, we revisit this classic method and show that it achieves sublinear convergence rates in terms of both the primal and dual SDPs under merely strong duality, complementing previous guarantees on primal-dual convergence. Moreover, we show the method speeds up to linear convergence if (1) structurally, the SDP admits strict complementarity, and (2) algorithmically, the bundle method captures the rank of the optimal solutions. Such complementarity and low-rank structure is prevalent in many modern and classical applications. The linear convergence result is established via an eigenvalue approximation lemma, which might be of independent interest. Numerically, we confirm our theoretical findings that the spectral bundle method, for modern and classical applications, speeds up under these conditions. Finally, we show that the spectral bundle method, combined with a recent matrix sketching technique, is able to solve an SDP with billions of decision variables in a matter of minutes.

15:29
Convex Relaxation for Fokker-Planck

ABSTRACT. We propose an approach to directly estimate the moments or marginals of a high-dimensional equilibrium distribution in statistical mechanics, via solving the high-dimensional Fokker-Planck equation in terms of low-order cluster moments or marginals. With this approach, we bypass the exponential complexity of estimating the full high-dimensional distribution and directly solve the simplified partial differential equations for low-order moments/marginals. Moreover, the proposed moment/marginal relaxation is fully convex and can be solved via off-the-shelf solvers. We further propose a time-dependent version of the convex programs to study non-equilibrium dynamics. We show that the proposed method can recover the mean-field approximation of an equilibrium density. Numerical results are provided to demonstrate the performance of the proposed algorithm for high-dimensional systems.

14:45-16:15 Session SunC 6
14:45
A Quantile Neural Network Framework for Two-stage Stochastic Optimization

ABSTRACT. Two-stage stochastic programming is a popular framework for optimization under uncertainty, where decision variables are split between first-stage decisions and second-stage (or recourse) decisions, the latter of which can be adjusted after uncertainty is realized. These problems are often formulated using Sample Average Approximation (SAA), where uncertainty is modeled as a finite set of scenarios, resulting in a large deterministic problem in which the model is repeated for each scenario. The resulting problems can be challenging to solve, and several decomposition approaches have been proposed. More recently, Dumouchelle et al. (2022) approximate the expected second-stage objective value for a set of scenarios using a neural network, which can then be embedded in the first-stage problem to produce good heuristic solutions. In this work, we propose approximating the second-stage objective value with a quantile neural network, which captures the uncertainty in the recourse cost and is not limited to expected-value optimization, e.g., it can optimize the Conditional Value at Risk (CVaR). We discuss optimization formulations for embedding the quantile neural network and demonstrate the effectiveness of the proposed framework on several computational case studies, including mixed-integer and nonlinear problems.
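As a rough sketch of the learning ingredient, a quantile network can be trained with the pinball loss on realized second-stage costs; the architecture below and the way its ReLU layers would later be embedded as mixed-integer constraints in the first-stage problem are illustrative assumptions, not the exact model from the talk.

# Pinball (quantile) loss: train a network to predict several quantiles of the
# second-stage cost Q(x, xi) as a function of the first-stage decision x.
# Illustrative sketch only.
import torch
import torch.nn as nn

class QuantileNet(nn.Module):
    def __init__(self, n_first_stage, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.quantiles = torch.tensor(quantiles)
        self.net = nn.Sequential(
            nn.Linear(n_first_stage, 64), nn.ReLU(),
            nn.Linear(64, len(quantiles)),   # one output per quantile level
        )

    def forward(self, x):
        return self.net(x)

def pinball_loss(pred, target, quantiles):
    # pred: (batch, n_quantiles); target: (batch,) realized second-stage costs
    err = target.unsqueeze(1) - pred
    return torch.mean(torch.maximum(quantiles * err, (quantiles - 1.0) * err))

# Usage: loss = pinball_loss(model(x_batch), cost_batch, model.quantiles)

Once trained, the upper-quantile outputs provide a tractable surrogate for tail-risk objectives such as CVaR when the network is embedded in the first-stage formulation.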

15:07
A Differentiable Decision Tree Induction Algorithm with Near-optimality

ABSTRACT. The efficient induction of decision trees is an ongoing focus in machine learning, driven by their notable interpretability. Traditional decision tree induction techniques, such as greedy algorithms and evolutionary algorithms, inevitably suffer from suboptimal performance, while globally optimal induction is often impractical due to its substantial computational cost. To address these challenges, we propose a scalable decision tree induction algorithm that employs a moving-horizon strategy, designed to achieve near-optimal solutions. Our work first approximates the decision tree learning problem as a differentiable unconstrained nonlinear optimization problem. To enhance computational efficiency, we integrate GPU-accelerated frameworks such as PyTorch, leveraging the decomposable nature of the objective function and automatic differentiation. This significantly boosts the scalability of the decision tree induction process, rendering our method highly applicable to large-scale industrial data analytics. Furthermore, we adopt a moving-horizon strategy, iteratively training shallow subtrees at each branch node to strike a balance between solution time and optimality. Extensive computational studies on 20 machine learning datasets confirm superior accuracy compared to greedy heuristic methods.
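To make the differentiable relaxation concrete, the toy module below replaces a hard axis-aligned split with a sigmoid routing probability, so the induction objective becomes smooth in the split parameters and can be optimized with autograd on a GPU; the actual objective, tree depth, and moving-horizon training loop from the talk are not reproduced.

# Differentiable (soft) decision stump: a sigmoid routing probability replaces the
# hard split, so predictions are differentiable in the split parameters.
# Illustrative only; not the authors' formulation.
import torch
import torch.nn as nn

class SoftStump(nn.Module):
    def __init__(self, n_features, n_classes, temperature=1.0):
        super().__init__()
        self.split = nn.Linear(n_features, 1)             # learned hyperplane split
        self.leaf_logits = nn.Parameter(torch.zeros(2, n_classes))
        self.temperature = temperature

    def forward(self, x):
        p_left = torch.sigmoid(self.split(x) / self.temperature)  # routing prob.
        leaves = torch.softmax(self.leaf_logits, dim=-1)           # leaf class dists
        return p_left * leaves[0] + (1.0 - p_left) * leaves[1]

# Training: minimize the negative log-likelihood of the returned class
# probabilities with any stochastic optimizer (e.g., Adam).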

15:29
Data-driven Decision-making with the Optimization & Machine Learning Toolkit (OMLT)

ABSTRACT. Increased interest in integrating machine learning and optimization has warranted the development of a framework to represent data-driven models in algebraic modeling languages. In this presentation, we provide updates on the Optimization and Machine Learning Toolkit (OMLT), which integrates trained neural networks, gradient-boosted trees, and linear model decision trees into Pyomo. As a demonstration, we include two case studies with additional insight into the computational performance of different mixed-integer programming formulations of these ML models.
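For readers unfamiliar with such embeddings, the snippet below hand-codes the standard big-M mixed-integer encoding of a single ReLU unit in Pyomo. It is meant only to illustrate the kind of formulation that tools like OMLT generate automatically; it does not use the OMLT API, and the bound big_m is an assumed constant.

# Hand-rolled big-M encoding of one ReLU neuron y = max(0, w'x + b) in Pyomo.
# Requires |w'x + b| <= big_m over the feasible region of x.
import pyomo.environ as pyo

def add_relu(model, x_vars, w, b, big_m=100.0, name="relu"):
    y = pyo.Var(domain=pyo.NonNegativeReals)
    z = pyo.Var(domain=pyo.Binary)          # z = 1 iff the unit is active
    setattr(model, f"{name}_y", y)
    setattr(model, f"{name}_z", z)
    pre = sum(w[i] * x_vars[i] for i in range(len(w))) + b   # pre-activation
    setattr(model, f"{name}_c1", pyo.Constraint(expr=y >= pre))
    setattr(model, f"{name}_c2", pyo.Constraint(expr=y <= pre + big_m * (1 - z)))
    setattr(model, f"{name}_c3", pyo.Constraint(expr=y <= big_m * z))
    return y

Stacking such constraints layer by layer yields a full-space MIP representation of a trained ReLU network, which is one of the formulation styles whose computational performance the case studies compare.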

15:51
Decomposition and adaptive sampling for data-driven inverse linear optimization

ABSTRACT. This work addresses inverse linear optimization, where the goal is to infer the unknown cost vector of a linear program. Specifically, we consider the data-driven setting in which the available data are noisy observations of optimal solutions that correspond to different instances of the linear program. We introduce a new formulation of the problem that, compared with other existing methods, allows the recovery of a less restrictive and generally more appropriate admissible set of cost estimates. It can be shown that this inverse optimization problem yields a finite number of solutions, and we develop an exact two-phase algorithm to determine all such solutions. Moreover, we propose an efficient decomposition algorithm to solve large instances of the problem. The algorithm extends naturally to an online learning environment where it can be used to provide quick updates of the cost estimate as new data become available over time. For the online setting, we further develop an effective adaptive sampling strategy that guides the selection of the next samples. The efficacy of the proposed methods is demonstrated in computational experiments involving two applications: customer preference learning and cost estimation for production planning. The results show significant reductions in computation and sampling efforts.

14:45-16:15 Session SunC 7
14:45
Robust Data-driven Prescriptiveness Optimization

ABSTRACT. The emergence of various optimization techniques, which leverage available side information, has motivated the development of a universal measure of performance called the coefficient of prescriptiveness. This coefficient quantifies both the quality of contextual decisions compared to a reference one and the prescriptive power of side information. To identify policies that maximize the former in a data-driven context, we introduce a distributionally robust contextual optimization model where the coefficient of prescriptiveness substitutes for the classical empirical risk minimization objective. We present a bisection algorithm to solve this model and study a contextual shortest path problem to evaluate the robustness of the resulting policies against alternative methods when the out-of-sample dataset is subject to a distribution shift.

15:07
Properties of Two-Stage Stochastic Multiobjective Linear Programs

ABSTRACT. To address decision making under conflict and uncertainty, we consider a two-stage stochastic multiobjective linear program (TSSMOLP). As a generalization of its single-objective counterpart, the two-stage stochastic linear program, the TSSMOLP has more than one conflicting objective in each stage. Thus, in the second stage, rather than solving a single-objective linear program resulting in a real optimal value, we instead solve a multiobjective linear program resulting in a nondominated set. In the first stage, a set dominance concept is applied to the collection of nondominated sets and yields a global Pareto set for the TSSMOLP. Solving the TSSMOLP requires (a) developing a theory to estimate the expected value of the second-stage nondominated set, and (b) proving properties of the TSSMOLP. We discuss structural properties of TSSMOLPs and recent theoretical advances for solving TSSMOLPs using a sample average approximation.

15:29
Optimizing infrastructure system monitoring under stochastic decision-dependent system failures

ABSTRACT. Modern society relies on infrastructure systems such as power grids, where aged components become unreliable, possibly triggering cascading failures, and thus should be monitored closely. We consider the problem of dispatching multiple maintenance fleets (e.g., drones) to examine different parts of a system, whose failures are probabilistic and also depend on inspection schedules and the fleets' arrival times. The goal is to balance the fleets' routing and operational costs with the uncertain cost of system failure and maintenance. We formulate a two-stage stochastic mixed-integer program with decision-dependent uncertainty and solve it via a stochastic decomposition algorithm. Moreover, we speed up the computation of subproblems using column generation and random coloring algorithms. We demonstrate the numerical performance of our approach by testing the problem on the IEEE 33-bus system.

15:51
A distributed algorithm for multi-scale multi-stage stochastic programs with applications to electricity capacity expansion

ABSTRACT. Progressive hedging (PH), which can be viewed as a two-block ADMM algorithm, has been a popular method for solving multi-stage stochastic programs (SPs). For convex SPs with block-separable structure, such as capacity expansion problems from power system planning, where the models can be naturally decomposed by different spatial and temporal resolutions, ADMM-type algorithms may result in nested loops and are not amenable to parallel computing.

In this work, we propose an algorithmic framework based on the predictor-corrector proximal method (PCPM) that decomposes the blocks and scenarios all together in one loop and is amenable to parallel computing. We show convergence of the algorithm for N-block convex multi-stage SPs with block-separable structure and nonlinear coupling constraints. We further extend the PCPM algorithm to an asynchronous iterative scheme, where a maximum tolerable delay is allowed for each distributed unit, and show favorable numerical performance compared with the PH method.
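For contrast with the proposed scheme, a toy progressive hedging loop on a one-dimensional two-stage problem is sketched below; it is illustrative only (the scenario subproblem here has a closed form, and the PCPM-based algorithm of the talk decomposes blocks and scenarios differently).

# Toy progressive hedging (PH) on min_x E_s[(x - xi_s)^2]: each scenario subproblem
# has a closed-form minimizer, and nonanticipativity (x_s = xbar) is enforced
# through the dual weights w_s.
import numpy as np

def progressive_hedging(xi, rho=1.0, iters=100):
    xi = np.asarray(xi, dtype=float)
    w = np.zeros_like(xi)                  # scenario dual weights
    xbar = xi.mean()
    for _ in range(iters):
        # Scenario subproblems: argmin_x (x - xi_s)^2 + w_s*x + (rho/2)(x - xbar)^2
        x = (2.0 * xi - w + rho * xbar) / (2.0 + rho)
        xbar = x.mean()                    # implementable (nonanticipative) solution
        w += rho * (x - xbar)              # dual update
    return xbar

print(progressive_hedging([1.0, 2.0, 4.0]))   # converges to the scenario mean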

14:45-16:15 Session SunC 8
14:45
On Polynomial-Time Solvability of Combinatorial Markov Random Fields

ABSTRACT. The problem of inferring Markov random fields (MRFs) with a sparsity or robustness prior can be naturally modeled as a mixed-integer program. This motivates us to study solution methods for a general class of convex submodular optimization problems with indicator variables, which we show to be polynomially solvable in this paper. The key insight is that, possibly after a suitable reformulation, indicator constraints preserve submodularity. Fast computations of the associated Lovász extensions are also discussed under certain smoothness conditions, and can be implemented using only linear-algebraic operations in the case of quadratic objectives.
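The Lovász extensions mentioned above can be evaluated with the standard sorting (greedy) procedure, sketched generically below; the fast linear-algebraic evaluation for quadratic objectives discussed in the talk is a specialization not shown here.

# Lovász extension of a set function F at a point x, via the sorting procedure:
#   f_L(x) = sum_k x_(k) * ( F(S_k) - F(S_{k-1}) ),
# where x_(1) >= ... >= x_(n) and S_k collects the k largest coordinates.
# For submodular F (with F(empty set) = 0) this extension is convex.
import numpy as np

def lovasz_extension(F, x):
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)                  # indices sorted by decreasing x
    value, prev, S = 0.0, F(frozenset()), set()
    for k in order:
        S.add(k)
        cur = F(frozenset(S))
        value += x[k] * (cur - prev)        # marginal gain weighted by x_(k)
        prev = cur
    return value

# Example: F = lambda S: min(len(S), 2) is submodular, and
# lovasz_extension(F, [0.5, 0.2, 0.9]) returns 1.4 (the sum of the two largest entries).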

15:07
Constrained Optimization of Low-rank Functions with Indicator Variables

ABSTRACT. Optimization problems involving minimization of low-rank convex functions over constraints modeling restrictions on the support of the decision variables emerge in various machine learning applications. These problems are often modeled with indicator variables for identifying the support of the continuous variables. In this paper we investigate compact extended formulations for such problems through perspective reformulation techniques. In contrast to the majority of previous work that relies on support function arguments and disjunctive programming techniques to provide convex hull results, we propose a constructive approach that exploits a hidden conic structure induced by perspective functions. To this end, we first establish a convex hull result for a general conic mixed-binary set in which each conic constraint involves a linear function of independent continuous variables and a set of binary variables. We then demonstrate that extended representations of sets associated with epigraphs of rank-one convex functions over constraints modeling indicator relations naturally admit such a conic representation. This enables us to systematically give perspective formulations for the convex hull descriptions of these sets with nonlinear separable or non-separable objective functions, sign constraints on continuous variables, and combinatorial constraints on indicator variables. We illustrate the efficacy of our results on sparse nonnegative logistic regression problems.
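As the simplest instance of the perspective technique underlying these results, the mixed-binary epigraph of a univariate quadratic with an indicator has the well-known conic hull description
\[
\operatorname{cl\,conv}\big\{ (x,z,t) : t \ge x^2,\; x(1-z)=0,\; z \in \{0,1\} \big\}
\;=\; \big\{ (x,z,t) : x^2 \le t z,\; 0 \le z \le 1,\; t \ge 0 \big\},
\]
where $x^2 \le tz$ is a rotated second-order cone constraint, i.e., the perspective of $x \mapsto x^2$. The talk extends such descriptions to low-rank, non-separable functions with sign constraints on the continuous variables and combinatorial constraints on the indicators.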

15:29
Higher order graphical models and binary polynomial optimization

ABSTRACT. Higher order graphical models are distinguished by their accommodation of complex interactions among variables within cliques of large size. We formulate the problem of inference on graphical models as a binary polynomial optimization problem. By studying the facial structure of the corresponding multilinear polytope, we obtain several classes of facet-defining inequalities for the convex hull of the feasible region, which is crucial for constructing strong Linear Programming (LP) relaxations for the inference problem. Additionally, we provide a theoretical comparison of our LP relaxations with existing relaxations. Numerical experiments for two important applications, namely image restoration in computer vision and error-control coding in digital communications, are presented.

15:51
Algorithms for DC Programming via Polyhedral Approximations of Convex Functions

ABSTRACT. There is an existing exact algorithm that solves DC programming problems if one component of the DC function is polyhedral convex [Löhne and Wagner 2017]. Motivated by this, first, we consider two cutting-plane algorithms for generating an $\epsilon$-polyhedral underestimator of a convex function $g$. The algorithms start with a polyhedral underestimator of $g$, and the epigraph of the current underestimator is intersected with either a single halfspace (Algorithm 1) or with possibly multiple halfspaces (Algorithm 2) in each iteration to obtain a better approximation. We prove the correctness and finiteness of both algorithms, establish the convergence rate of Algorithm 1, and show that after obtaining an $\epsilon$-polyhedral underestimator of the first component of a DC function, the algorithm from [Löhne and Wagner 2017] can be applied to compute an $\epsilon$-solution of the DC programming problem without further computational effort. We then propose an algorithm (Algorithm 3) for solving DC programming problems by iteratively generating a (not necessarily $\epsilon$-) polyhedral underestimator of $g$. We prove that Algorithm 3 stops after finitely many iterations and returns an $\epsilon$-solution to the DC programming problem. Moreover, the sequence $\{x_k\}_{k\geq 0}$ output by Algorithm 3 converges to a global minimizer of the DC problem when $\epsilon$ is set to zero. Computational results based on test instances from the literature are provided.
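A minimal sketch of the subgradient-cut construction behind such polyhedral underestimators is given below; how Algorithms 1-3 select the points and halfspaces, and the $\epsilon$-termination test, are not reproduced here.

# Polyhedral underestimator of a convex function g built from subgradient cuts:
# each visited point x_k contributes the affine minorant g(x_k) + <s_k, y - x_k>,
# and the pointwise maximum of the collected cuts underestimates g everywhere.
import numpy as np

class PolyhedralUnderestimator:
    def __init__(self):
        self.cuts = []                        # list of (slope s_k, intercept b_k)

    def add_cut(self, x, gx, subgrad):
        x, s = np.asarray(x, float), np.asarray(subgrad, float)
        self.cuts.append((s, gx - s @ x))     # cut: s @ y + b <= g(y) for all y

    def value(self, y):
        y = np.asarray(y, float)
        return max(s @ y + b for s, b in self.cuts)

# Example with g(y) = ||y||^2 (a subgradient is 2y):
P = PolyhedralUnderestimator()
for xk in [np.array([1.0, 0.0]), np.array([0.0, -1.0])]:
    P.add_cut(xk, xk @ xk, 2 * xk)
print(P.value(np.array([0.5, 0.5])))          # 0.0, a lower bound on g = 0.5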

14:45-16:15 Session SunC 9
14:45
Generalization Performances in Nonconvex Minimax Optimization: A Comprehensive Study

ABSTRACT. In this work we take an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC / NC-C) stochastic minimax optimization measured by the stationarity of primal functions, which fits better with the intrinsic nonconvex structure. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem, and we provide the sample complexities for achieving generalization for both NC-SC and NC-C settings, respectively. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for the stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms (SDA).

15:07
Improving Autoencoder Image Interpolation via Dynamic Optimal Transport

ABSTRACT. Autoencoders are important generative models that, among other capabilities, can interpolate image sequences. However, interpolated images are usually not semantically meaningful. In this paper, motivated by dynamic optimal transport, we consider image interpolation as a mass transfer problem and propose a novel regularization term to penalize non-smooth and unrealistic changes in the interpolation result. Specifically, we define the path energy function for each path connecting the source and target images. The autoencoder is trained to generate the $L^2$ optimal transport geodesic path when decoding a linear interpolation of their latent codes. With a simple extension, this model can handle complicated environments, such as mass transfer between obstacles and unbalanced optimal transport. A key feature of the proposed method is that it is physics-driven and can generate robust and realistic interpolation results even when only very limited training data are available.
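The path-energy regularization is in the spirit of the Benamou-Brenier dynamic formulation of $L^2$ optimal transport, recalled below; the exact discretized energy applied to the decoded image path in the talk may differ.
\[
W_2^2(\rho_0,\rho_1) \;=\; \min_{\rho,\,v} \int_0^1\!\!\int \rho(x,t)\,\| v(x,t) \|^2 \, dx\, dt
\quad \text{s.t.} \quad \partial_t \rho + \nabla \cdot (\rho v) = 0, \;\; \rho(\cdot,0)=\rho_0, \;\; \rho(\cdot,1)=\rho_1 .
\]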

15:29
ALEXR: Optimal Single-Loop Algorithms for Convex Finite-Sum Coupled Compositional Stochastic Optimization

ABSTRACT. This paper revisits a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with many applications, including group distributionally robust optimization (GDRO), reinforcement learning, and learning to rank. To better solve these problems, we introduce a unified family of efficient single-loop primal-dual block-coordinate proximal algorithms, dubbed ALEXR. This algorithm leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we present lower complexity bounds to demonstrate that the convergence rates of ALEXR are optimal among first-order block-coordinate stochastic algorithms for the considered class of cFCCO problems.

15:51
Dimension-Insensitive Stochastic First-Order Methods for Nonconvex Optimization

ABSTRACT. When a nonconvex problem is further complicated by stochasticity, the sample complexity of stochastic first-order methods may depend linearly on the problem dimension, which is undesirable for large-scale problems. In this work, we propose dimension-insensitive stochastic first-order methods (DISFOMs) to address nonconvex optimization with an expected-value objective function. Our algorithms allow for non-Euclidean and non-smooth distance functions as the proximal terms. Under mild assumptions, we show that DISFOM using mini-batches to estimate the gradient enjoys a sample complexity of $O((\log d)/\epsilon^4)$ to obtain an $\epsilon$-stationary point. Furthermore, we prove that DISFOM employing variance reduction can sharpen this bound to $O((\log d)^{2/3}/\epsilon^{10/3})$, which is possibly the best-known sample complexity result in terms of $d$. We provide two choices of the non-smooth distance functions, both of which allow for closed-form solutions to the proximal step. Numerical experiments are conducted to illustrate the effectiveness of the proposed frameworks.
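For intuition, methods of this type perform prox-mapping steps of the form below, where $G_k$ is the mini-batch gradient estimator and $\omega$ is a non-Euclidean distance-generating term; the specific distance functions and variance-reduction scheme from the talk are not reproduced, and the $\ell_p$ choice shown is only one classical way to obtain logarithmic dimension dependence.
\[
x_{k+1} \;=\; \operatorname*{argmin}_{x \in X} \;\Big\{ \langle G_k, x \rangle + \tfrac{1}{\gamma_k}\, \omega(x - x_k) \Big\},
\qquad \text{e.g.}\;\; \omega(u) = \tfrac{1}{2}\, \| u \|_p^2 \;\text{ with } p = 1 + \tfrac{1}{\log d}.
\]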

14:45-16:15 Session SunC10
14:45
Scalable Projection-Free Optimization Methods via MultiRadial Duality Theory

ABSTRACT. Recent works have developed new projection-free first-order methods that rely on radial linesearches and normal vector computations to maintain feasibility. These oracles are often cheaper than orthogonal projection or linear optimization subroutines but have the drawback of requiring a known strictly feasible point with respect to which the linesearches are performed. In this work, we develop new theory and algorithms that can operate using these cheaper linesearches while only requiring knowledge of points strictly satisfying each constraint separately. Convergence theory for several resulting "multiradial" gradient methods is established. We additionally provide numerical results on synthetic quadratically constrained quadratic programming problems as preliminary evidence for the potential practical effectiveness of such methods.

15:07
Binary Integer Program Approximation and the Induced Structural Duality

ABSTRACT. We present a generic reformulation framework for binary integer programs (BIPs) that imposes no additional specifications on the objective function or constraints. To facilitate this level of generality, we introduce a set system approximation theory designed to identify the tightest inner and outer approximations of any binary solution space using special types of set systems. This development leads to an exact reformulation framework for general BIPs, centered around set covering and subtour elimination inequalities. We investigate the implications of this methodology on various problem classes, uncovering new solution strategies and structural insights. Building upon these advancements, we also extend the classic max-flow min-cut theorem into a broader context of set system duality.

15:29
Rectified Relaxation: A new learning approach for linear optimization with linear and complementary constraints

ABSTRACT. Linear optimization with linear and complementary constraints (LOLCC) arise frequently from various applications in engineering and economics. It covers several optimization models such as bilinear optimization and bi-level linear/quadratic optimization that are known to be NP-hard. Even finding a feasible solution to LOLCC is NP-hard. To address LOLCC, we consider the issue of how to determine whether a system of linear and complementary constraints (SLCC) is feasible. For this, we first establish the so-called universal relaxation theory, which shows that determining the feasibility of an SLCC is equivalent to solving some bilinear optimization problem with linear constraints. A deep learning approach, called rectified relaxation (ReCR), is introduced to solve the constructed bilinear optimization problem. We show that ReCR can skip the trap of the local minimum of the underlying bilinear optimization problem and converges directly to its global optimal solution. Moreover, ReCR provides a fully polynomial time approximation scheme (FPTAS) to the underlying bilinear optimization problem. Finally, we combine ReCR with the classical bisection search to develop a new FPTAS for the original LOLCC. Preliminary numerical experiments are conducted to test the efficacy of the new DL approach.