Convex Optimization Notes

1. Lecture 1: Introduction and Convexity

At a high level, optimization packages a decision task into three objects: a variable x, a feasible set \Omega, and an objective f. The feasible set records what is allowed; the objective records what is preferred. This language is broad enough to cover resource allocation, profit maximization, training a machine learning model, maximum-likelihood estimation, control, and minimum-energy problems. Once a model is written this way, the basic question becomes: among all feasible choices, which one has the smallest cost? In this course, we focus on continuous optimization problems, where the domain E is a finite-dimensional real space.

Definition
1.1 Optimization problem

Let f : E \to \mathbb{R} and let \Omega \subseteq E be nonempty. The optimization problem associated with (f,\Omega) is

p^\star := \inf\{f(x) : x \in \Omega\}.

A point x^\star \in \Omega is a global minimizer if f(x^\star) = p^\star, i.e., x^\star \in \arg\min_{x \in \Omega} f(x).

Definition
1.2 Approximate optimality

Let (f,\Omega) be an optimization problem, and let

p^\star := \inf\{f(x) : x \in \Omega\}.

For every \varepsilon > 0, a point \widehat x \in \Omega is called \varepsilon-optimal if

f(\widehat x) \le p^\star + \varepsilon.

Basic outcomes of an optimization problem. Even at this level, several different things can happen. The feasible set may be empty, the infimum may be finite but not attained, or the problem may be unbounded below. This is one reason a near-optimal solution concept is introduced: exact minimizers may fail to exist, and even when they do exist they may be harder to compute than near-optimal points. Before discussing certificates of optimality, it is therefore natural to ask a more basic question: when does an exact minimizer exist at all? The next theorem gives the standard topological answer under compactness and continuity.
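These outcomes can be made concrete on the real line. The sketch below (Python; the toy instances are illustrative choices, not taken from the lecture) exhibits an unattained infimum together with its \varepsilon-optimal points, an unbounded problem, and an empty feasible set.

```python
import math

# Three possible outcomes for p* = inf { f(x) : x in Omega }, illustrated
# on the real line (toy instances chosen for illustration):

# (a) Unattained infimum: f(x) = exp(x) on Omega = R has p* = 0, but no x
#     achieves it.  Still, eps-optimal points exist for every eps > 0:
def eps_optimal_point_for_exp(eps):
    """Return x with exp(x) <= 0 + eps, i.e. an eps-optimal point."""
    return math.log(eps / 2)

# (b) Unbounded below: f(x) = x on Omega = R has p* = -infinity; no point
#     is eps-optimal because p* + eps is still -infinity.

# (c) Empty feasible set: Omega = {x : x <= 0 and x >= 1} is empty, and by
#     convention p* = +infinity.

eps = 1e-3
x_hat = eps_optimal_point_for_exp(eps)
print(math.exp(x_hat) <= eps)  # x_hat is eps-optimal in case (a)
```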

Theorem
1.1 Weierstrass

Let \Omega \subseteq E be nonempty and compact, and let f : \Omega \to \mathbb{R} be continuous. Then there exists x^\star \in \Omega such that

f(x^\star) = \min_{x \in \Omega} f(x).

Proof

Because \Omega is compact and f is continuous on \Omega, the image set f(\Omega) \subseteq \mathbb{R} is compact. In particular, f(\Omega) is nonempty and closed and bounded below, so it contains its infimum. Choose m \in f(\Omega) such that

m = \inf f(\Omega).

Then there exists x^\star \in \Omega with f(x^\star) = m. For every x \in \Omega one has f(x) \in f(\Omega) and therefore m \le f(x). Hence

f(x^\star) = m = \min_{x \in \Omega} f(x).
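As a numerical companion to the theorem: on a compact interval a continuous function attains its minimum, and a fine grid search approximates both the minimizer and the minimum value. The function and interval below are illustrative choices, not from the lecture.

```python
def grid_min(f, a, b, n):
    """Approximate the min of f on the compact interval [a, b]
    using n + 1 equally spaced sample points."""
    best_x, best_val = a, f(a)
    for k in range(1, n + 1):
        x = a + (b - a) * k / n
        v = f(x)
        if v < best_val:
            best_x, best_val = x, v
    return best_x, best_val

f = lambda x: (x - 0.3) ** 2 + 1.0   # continuous on the compact set [0, 1]
x_star, p_star = grid_min(f, 0.0, 1.0, 10_000)
print(x_star, p_star)  # close to the true minimizer 0.3 and minimum 1.0
```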

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_weier.

theorem proof_of_Lecture01_thm_l1_weier {E : Type*} [TopologicalSpace E]
    {Ω : Set E} {f : E → ℝ} (hΩ_compact : IsCompact Ω)
    (hΩ_nonempty : Ω.Nonempty) (hf : ContinuousOn f Ω) :
    ∃ xStar ∈ Ω, IsMinOn f Ω xStar :=
  hΩ_compact.exists_isMinOn hΩ_nonempty hf

Specification versus computation. An optimization specification tells us what the mathematical problem is: it determines the feasible set, the objective, the optimal value, and the solution notions we care about, such as exact minimizers or \varepsilon-optimal points. But this still does not determine a computational problem. To speak about algorithms and complexity, we must also specify how the instance is presented, which primitive operations are available, what cost model is being counted, and what kind of output is required.

For finitely described models such as linear and quadratic programs, the input is a finite list of numbers, and one counts arithmetic or bit operations. For large language model pretraining in practice, the input is a huge text corpus together with code for the model architecture, and one measures the total training wall-clock time on a GPU cluster. The same mathematical specification can therefore lead to very different computational questions under different access models. For example, the computational question for LLM pretraining changes if the GPU cluster is rented rather than owned, so that the cost is the rental bill rather than wall-clock time. Even for simple problems that admit closed-form solutions, the computational question changes depending on whether the unit-cost arithmetic operations are finite-precision or infinite-precision; indeed, many algorithms are preferred precisely because of their numerical stability.

For the purposes of this course, we will mostly work in an abstract, and therefore general, setting: we assume that queries to an oracle for local information about the objective, such as its value, gradient, or Hessian at a given point, are the relatively cheap primitive operations. Throughout the course we also assume we can work with real numbers in infinite precision.
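The oracle model just described can be sketched as follows (Python; the interface and names are illustrative assumptions, not a fixed API of the course): the algorithm interacts with f only through local queries, here values and gradients.

```python
def make_first_order_oracle(f, grad_f):
    """Wrap (f, grad_f) as a first-order oracle that also counts queries."""
    calls = {"count": 0}
    def oracle(x):
        calls["count"] += 1
        return f(x), grad_f(x)
    return oracle, calls

# Example: f(x) = (x - 2)^2 with gradient f'(x) = 2 (x - 2).
oracle, calls = make_first_order_oracle(lambda x: (x - 2.0) ** 2,
                                        lambda x: 2.0 * (x - 2.0))

# Plain gradient descent uses nothing but oracle answers.
x = 0.0
for _ in range(100):
    _, g = oracle(x)
    x -= 0.25 * g          # fixed step size 1/4, an illustrative choice
print(x, calls["count"])   # x is near the minimizer 2.0 after 100 queries
```

Complexity in this model is then measured by the number of oracle calls an algorithm needs to reach a target accuracy, independently of how f is represented internally.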

Definition
1.3 Differentiability on an open set

Let U \subseteq E be open and let f : U \to \mathbb{R}. We say that f is differentiable on U if it is differentiable at every point of U. In that case the first-order object at x \in U is the differential

\nabla f(x) := Df(x) \in E^\ast, \qquad x \in U,

and expressions such as \langle \nabla f(x), h \rangle are to be read as dual pairings.

Remark
1.1 Ambient convention for Lecture 1

Throughout this lecture, let E be a finite-dimensional real normed space and let E^\ast be its dual. We write

\langle \xi, h \rangle, \qquad \xi \in E^\ast,\ h \in E,

for the dual pairing. If a genuine inner product is being used, we will say so explicitly or decorate the notation, for instance by writing \langle u, v \rangle_H or \langle u, v \rangle_2.

Definition
1.4 Convex set, convex function, and strict convexity

A set C \subseteq E is convex if

\forall x,y \in C,\ \forall \theta \in [0,1],\qquad \theta x + (1-\theta)y \in C.

Let C \subseteq E be convex and let f : C \to \mathbb{R}. The function f is convex if

\forall x,y \in C,\ \forall \theta \in [0,1],\qquad f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y).

It is strictly convex if

\forall x,y \in C \text{ with } x \ne y,\ \forall \theta \in (0,1),\qquad f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y).
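The convexity and strict-convexity inequalities can be spot-checked numerically. The sketch below (Python) uses the standard strictly convex example f(x) = x^2, which is not taken from the text.

```python
f = lambda x: x * x   # strictly convex on C = R

def convex_combination_gap(x, y, theta):
    """theta*f(x) + (1-theta)*f(y) - f(theta*x + (1-theta)*y).
    Nonnegative iff the convexity inequality holds at this triple;
    strictly positive for x != y, theta in (0, 1), iff strictly convex there."""
    return theta * f(x) + (1 - theta) * f(y) - f(theta * x + (1 - theta) * y)

samples = [(-3.0, 5.0, 0.25), (1.0, 2.0, 0.5), (-1.0, -4.0, 0.9)]
gaps = [convex_combination_gap(x, y, t) for (x, y, t) in samples]
print(all(g > 0 for g in gaps))  # strict inequality at each sampled triple
```

For f(x) = x^2 the gap equals \theta(1-\theta)(x-y)^2, which explains why every sampled value is strictly positive.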

1.1. Why Convexity?

The point of studying convex optimization is not that all realistic optimization problems are convex. Rather, convexity is a good first structural assumption because it is strong enough to yield real global theorems, but not so strong that the subject collapses into a few toy examples.

  1. Convexity is strong enough to make local information globally meaningful. Without additional structure, local information is usually only local. A derivative, an affine approximation, or a second-order expansion near one point says very little about what happens elsewhere. Convexity is the first major structural assumption in the course under which this changes in a robust way. Gradients, subgradients, supporting hyperplanes, and dual certificates stop being merely local descriptions and become global certificates and lower bounds.

  2. Convexity is broad and stable enough to support a systematic theory. A useful structural assumption should not apply only to a tiny family of specially engineered problems. Convexity still contains linear programs, least squares, logistic regression, constrained quadratic programs, second-order cone programs, semidefinite programs, and many regularized learning models. It is also stable under the operations optimization repeatedly uses, such as nonnegative linear combinations, affine changes of variables, epigraph constructions, partial minimization, and conjugation. Because of that, convex optimization can be developed as a real theory rather than a disconnected collection of tricks.

  3. Convex optimization is the right baseline testbed for algorithms and complexity. Even when the eventual application is nonconvex, convex optimization is the cleanest setting in which one can first isolate the role of local primitives, prove global guarantees, and identify genuine complexity barriers. If a method cannot even be explained or stabilized on convex problems, then its behavior on more complicated problems is harder to interpret, not easier.

  4. Convex optimization is also a source of ideas that survive beyond the convex setting. Its value is not limited to problems that are themselves convex. Many methods and viewpoints that later matter more broadly were first discovered, justified, or conceptually clarified in convex and online convex optimization. A concrete example is modern LLM training: the objective is highly nonconvex, yet the default optimizer AdamW belongs to a line of adaptive first-order methods whose ancestry runs through AdaGrad, a method that emerged from theoretical work in online convex optimization.

1.2. First Consequences of Convexity

The next results are the first concrete consequences of convexity. For convex functions, local optimality is already global. On a convex feasible set, differentiability yields a first-order necessary condition for minimizers. Once convexity is added, that same first-order sign condition becomes not only necessary but also sufficient for global optimality.

Theorem
1.2 Local minima are global for convex functions

Let \Omega \subseteq E be nonempty and convex, let f : \Omega \to \mathbb{R} be convex, and let x^\star \in \Omega. If there exists r > 0 such that

\forall x \in \Omega \cap \{y \in E : \|y - x^\star\| < r\},\qquad f(x^\star) \le f(x),

then

\forall x \in \Omega,\qquad f(x^\star) \le f(x).

Proof

Assume for contradiction that there exists x \in \Omega such that

f(x) < f(x^\star).

Because \Omega is convex, for every \theta \in (0,1) the point

x_\theta := \theta x + (1-\theta)x^\star

belongs to \Omega. By convexity of f,

f(x_\theta) \le \theta f(x) + (1-\theta)f(x^\star) < f(x^\star).

Moreover,

\|x_\theta - x^\star\| = \theta \|x - x^\star\|.

Choosing \theta > 0 sufficiently small makes \|x_\theta - x^\star\| < r. Then x_\theta \in \Omega \cap \{y \in E : \|y - x^\star\| < r\} and

f(x_\theta) < f(x^\star),

contradicting the assumed local minimality of x^\star. Therefore no such x exists, and so

\forall x \in \Omega,\qquad f(x^\star) \le f(x).

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_local_global.

theorem proof_of_Lecture01_thm_l1_local_global {E : Type*} [NormedAddCommGroup E]
    [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ} {xStar : E}
    (hxStar : xStar ∈ Ω) (hconv : ConvexOn ℝ Ω f)
    (hlocal : ∃ r > 0, ∀ ⦃x : E⦄, x ∈ Ω → ‖x - xStar‖ < r → f xStar ≤ f x) :
    ∀ x ∈ Ω, f xStar ≤ f x := by
  obtain ⟨r, hr_pos, hlocal⟩ := hlocal
  have hlocalMin : IsLocalMinOn f Ω xStar := by
    filter_upwards [inter_mem_nhdsWithin Ω (Metric.ball_mem_nhds xStar hr_pos)]
      with x hx
    exact hlocal hx.1 (by simpa using hx.2)
  exact fun x hx => IsMinOn.of_isLocalMinOn_of_convexOn hxStar hlocalMin hconv hx
Lemma
1.3 First-order necessary condition on a convex feasible set

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, and let x^\star \in \Omega be a global minimizer of f over \Omega. Then

\forall x \in \Omega,\qquad \langle \nabla f(x^\star), x - x^\star \rangle \ge 0.

Proof

Fix any x \in \Omega and define

\phi(t) := f\bigl(x^\star + t(x-x^\star)\bigr), \qquad t \in [0,1].

Because \Omega is convex, one has x^\star + t(x-x^\star) \in \Omega for every t \in [0,1]. Since x^\star is a global minimizer of f over \Omega,

\phi(t) \ge \phi(0) \qquad \forall t \in [0,1].

Hence the right derivative of \phi at 0 is nonnegative:

\phi'_+(0) \ge 0.

Because f is differentiable,

\phi'_+(0) = \langle \nabla f(x^\star), x-x^\star \rangle.

Therefore

\langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

Since x \in \Omega was arbitrary, the claim follows.
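As a concrete instance of the lemma (with toy data assumed for illustration, not from the text): take f(x) = x^2 over the convex feasible set \Omega = [1,2]. The global minimizer over \Omega is x^\star = 1 with \nabla f(x^\star) = 2, and the first-order pairing is nonnegative over \Omega even though the gradient does not vanish.

```python
# f(x) = x^2 over Omega = [1, 2]: the minimizer x* = 1 sits on the boundary,
# so grad f(x*) = 2 is nonzero, yet <grad f(x*), x - x*> >= 0 on Omega.
x_star = 1.0
grad_at_xstar = 2.0   # f'(1) for f(x) = x^2

feasible_points = [1.0 + 0.1 * k for k in range(11)]   # samples of [1, 2]
pairings = [grad_at_xstar * (x - x_star) for x in feasible_points]
print(all(p >= 0 for p in pairings))
```

Note the contrast with unconstrained minimization: the stationarity condition \nabla f(x^\star) = 0 is replaced by a sign condition on feasible directions.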

Formal Statement and Proof

Lean theorem: Lecture01.lem_l1_first_order_necessary.

theorem proof_of_Lecture01_lem_l1_first_order_necessary {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ} {xStar : E}
    (hΩ_convex : Convex ℝ Ω) (hxStar : xStar ∈ Ω)
    (hmin : ∀ y ∈ Ω, f xStar ≤ f y) (hf : Differentiable ℝ f) :
    ∀ x ∈ Ω, 0 ≤ fderiv ℝ f xStar (x - xStar) := by
  -- The machine-checked tactic proof is not reproduced here.  It restricts f
  -- to the segment φ t = f (AffineMap.lineMap xStar x t), shows that t = 0 is
  -- a local minimum of φ on Set.Icc 0 1, and concludes with
  -- IsLocalMinOn.hasFDerivWithinAt_nonneg applied to the tangent direction
  -- 1 ∈ posTangentConeAt (Set.Icc 0 1) 0.
  sorry
Lemma
1.4 Differentiable convex functions admit a global linear lower bound

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, and assume that f is convex on \Omega. Then

\forall x,y \in \Omega,\qquad f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle.

Proof

Fix x,y \in \Omega. For every t \in (0,1], convexity of f on \Omega gives

f\bigl(x+t(y-x)\bigr) \le (1-t)f(x) + t f(y).

Rearranging, we obtain

\frac{f\bigl(x+t(y-x)\bigr)-f(x)}{t} \le f(y)-f(x).

Because f is differentiable at x, letting t \downarrow 0 yields

\langle \nabla f(x), y-x \rangle \le f(y)-f(x).

Equivalently,

f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle.

Since x,y \in \Omega were arbitrary, the claim follows.
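The global linear lower bound can be spot-checked on the standard convex example f = \exp (an illustrative choice, not from the text), where the inequality reads \exp(y) \ge \exp(x) + \exp(x)(y-x).

```python
import math

# Spot-check of the gradient lower bound for f = exp on R, where f' = exp:
# exp(y) >= exp(x) + exp(x) * (y - x) for all x, y.
def linear_lower_bound_holds(x, y):
    return math.exp(y) >= math.exp(x) + math.exp(x) * (y - x)

pairs = [(0.0, 1.0), (1.0, -2.0), (-3.0, 3.0), (0.5, 0.5)]
print(all(linear_lower_bound_holds(x, y) for x, y in pairs))
```

Equality holds exactly when y = x, reflecting that the tangent line touches the graph at the base point and lies below it everywhere else.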

Formal Statement and Proof

Lean theorem: Lecture01.lem_l1_gradient_lower_bound.

theorem proof_of_Lecture01_lem_l1_gradient_lower_bound {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ}
    (hconv : ConvexOn ℝ Ω f) (hf : Differentiable ℝ f) :
    ∀ x ∈ Ω, ∀ y ∈ Ω, f y ≥ f x + fderiv ℝ f x (y - x)

(Proof sketch in Lean: restrict f to the segment g := AffineMap.lineMap x y, obtain convexity of t ↦ f (g t) on Set.Icc 0 1 via ConvexOn.comp_affineMap, compute the derivative of t ↦ f (g t) at 0 with HasFDerivAt.comp_hasDerivAt_of_eq, and bound that derivative by the slope from 0 to 1 using ConvexOn.le_slope_of_hasDerivAt.)
Theorem
1.5 Differentiable convex first-order characterization on a convex feasible set

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, assume that f is convex on \Omega, and let x^\star \in \Omega. Then the following are equivalent:

  1. f(x^\star) \le f(x) for every x \in \Omega;

  2. \langle \nabla f(x^\star), x - x^\star \rangle \ge 0 for every x \in \Omega.

Proof

We first prove that item (1) implies item (2). If f(x^\star) \le f(x) for every x \in \Omega, then x^\star is a global minimizer of f on \Omega. Applying the first-order necessary condition yields

\forall x \in \Omega,\qquad \langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

We next prove that item (2) implies item (1). Fix any x \in \Omega. Applying the global linear lower bound with x^\star and x, we obtain

f(x) \ge f(x^\star) + \langle \nabla f(x^\star), x-x^\star \rangle.

By item (2),

\langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

Hence

f(x) \ge f(x^\star).

Because x \in \Omega was arbitrary, item (1) follows.
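Theorem 1.5 can be illustrated on a hand-picked instance: minimizing f(x) = \|x-z\|^2 over the box \Omega = [0,1]^3. The minimizer is the coordinatewise clip of z onto the box; the sketch below checks both items of the theorem at random feasible points (z is made-up data).

```python
# Illustration of the first-order characterization: minimize
# f(x) = ||x - z||^2 over the box Omega = [0, 1]^3. The minimizer is
# the coordinatewise clip of z onto the box; both items of the
# theorem are checked at random feasible points. z is made-up data.
import random

z = [2.0, -1.0, 0.5]
x_star = [min(max(c, 0.0), 1.0) for c in z]  # clip z onto the box

def f(v):
    return sum((a - b) ** 2 for a, b in zip(v, z))

grad_star = [2.0 * (a - b) for a, b in zip(x_star, z)]  # grad f(x*)

random.seed(1)
for _ in range(1000):
    x = [random.random() for _ in range(3)]  # uniform point of the box
    gap = sum(g * (a - b) for g, a, b in zip(grad_star, x, x_star))
    assert gap >= -1e-12              # item (2): <grad f(x*), x - x*> >= 0
    assert f(x_star) <= f(x) + 1e-12  # item (1): global optimality
```

Note that the gradient at x^\star need not vanish here: on the clipped coordinates it points out of the box, which is exactly what the variational inequality permits.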

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_first_order_char.

theorem proof_of_Lecture01_thm_l1_first_order_char {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ}
    {xStar : E} (hconv : ConvexOn ℝ Ω f) (hxStar : xStar ∈ Ω)
    (hf : Differentiable ℝ f) :
    (∀ x ∈ Ω, f xStar ≤ f x) ↔ (∀ x ∈ Ω, 0 ≤ fderiv ℝ f xStar (x - xStar))

(Proof sketch in Lean: the forward direction is the first-order necessary condition at the global minimizer xStar; the reverse direction instantiates lem_l1_gradient_lower_bound at xStar and x and combines it with the sign hypothesis on fderiv ℝ f xStar (x - xStar).)

1.3. Examples of Convex Functions🔗

Example
1.1 Least squares

In the Euclidean model E = \mathbb{R}^d, given data (a_i,b_i)_{i=1}^m with a_i \in \mathbb{R}^d, consider

\min_{x \in \mathbb{R}^d}\ \frac{1}{2m}\sum_{i=1}^m (a_i^\top x-b_i)^2.

This problem is explicit and convex, and in favorable full-rank settings it has a closed-form normal-equation solution. But in large-scale settings one still uses iterative algorithms. Even a problem with a recognizable formula is therefore not automatically computationally trivial.
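The two solution routes can be compared on a tiny synthetic instance. The sketch below (made-up data, plain Python rather than a linear-algebra library) solves the 2-by-2 normal equations in closed form and checks that gradient descent on the same objective reaches the same point; the step size and iteration count are illustration choices.

```python
# Least squares on a made-up 4-point, 2-feature instance: closed-form
# normal-equation solution versus plain gradient descent.
m = 4
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 2.0, 4.0]

# Normal equations (A^T A) x = A^T b, solved by Cramer's rule (2x2).
g11 = sum(r[0] * r[0] for r in A)
g12 = sum(r[0] * r[1] for r in A)
g22 = sum(r[1] * r[1] for r in A)
c1 = sum(r[0] * y for r, y in zip(A, b))
c2 = sum(r[1] * y for r, y in zip(A, b))
det = g11 * g22 - g12 * g12
x_closed = [(c1 * g22 - c2 * g12) / det, (g11 * c2 - g12 * c1) / det]

# Gradient descent on (1/2m) sum_i (a_i^T x - b_i)^2.
x = [0.0, 0.0]
step = 0.1
for _ in range(5000):
    r = [sum(a * v for a, v in zip(row, x)) - y for row, y in zip(A, b)]
    grad = [sum(ri * row[j] for ri, row in zip(r, A)) / m for j in range(2)]
    x = [v - step * g for v, g in zip(x, grad)]

assert all(abs(u - v) < 1e-6 for u, v in zip(x, x_closed))
```

On this instance both routes agree; the computational point is that only the iterative route scales when forming and factoring A^\top A becomes expensive.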

Example
1.2 Logistic regression

Given labels y_i \in \{\pm1\}, consider

\min_{x \in \mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m \log\!\bigl(1+e^{-y_i a_i^\top x}\bigr)+\frac{\lambda}{2}\|x\|_2^2.

This problem is again explicit and convex, but usually has no closed-form minimizer. Its importance is computational rather than symbolic: its value and gradient are both cheap to evaluate. Convexity here does not mean a closed form; it means that local information can become globally meaningful.
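As a sketch of this computational point, the following runs plain gradient descent on the regularized logistic loss for a made-up one-dimensional dataset; the step size, iteration count, and \lambda are illustration choices, not tuned recommendations.

```python
# Gradient descent on the regularized logistic loss for a made-up
# one-dimensional dataset of (a_i, y_i) pairs. No closed form exists;
# value and gradient evaluations are all the method needs.
import math

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]
lam = 0.1
m = len(data)

def loss(x):
    return sum(math.log(1 + math.exp(-y * a * x)) for a, y in data) / m \
        + 0.5 * lam * x * x

def grad(x):
    # d/dx log(1 + exp(-y*a*x)) = -y*a / (1 + exp(y*a*x))
    s = sum(-y * a / (1 + math.exp(y * a * x)) for a, y in data) / m
    return s + lam * x

x = 0.0
for _ in range(2000):
    x -= 0.5 * grad(x)

assert abs(grad(x)) < 1e-8  # first-order stationarity at the iterate
assert loss(x) < loss(0.0)  # strictly better than the starting point
```

Because the regularized loss is strongly convex, the vanishing gradient certifies that the final iterate is (numerically) the unique global minimizer, which is exactly the local-to-global promise of convexity.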

Example
1.3 A constrained convex quadratic program

In the Euclidean model E = \mathbb{R}^n, let Q \succeq 0, b \in \mathbb{R}^n, and let

\Omega := \{x \in \mathbb{R}^n : Cx=d,\ x \ge 0\}.

Then the problem is

\min\left\{\frac12 x^\top Qx+b^\top x : x\in\Omega\right\}.

This is a convex optimization problem with both objective geometry and explicit constraints. It previews later themes all at once: constrained optimality, dual variables, and KKT conditions.
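A toy instance previews these themes numerically: take Q = I, b = 0, and \Omega the probability simplex \{x \ge 0,\ \mathbf{1}^\top x = 1\} (a hypothetical choice of the data C, d). By symmetry the minimizer is the uniform vector, and the variational inequality \langle Qx^\star + b, x - x^\star \rangle \ge 0 holds with equality on the whole simplex.

```python
# Toy constrained QP: Q = I, b = 0, Omega the probability simplex.
# The minimizer of (1/2) x^T Q x + b^T x is the uniform vector; both
# optimality and the variational inequality are checked numerically.
import random

n = 4
x_star = [1.0 / n] * n

def obj(v):
    return 0.5 * sum(c * c for c in v)  # Q = I, b = 0

random.seed(2)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    x = [c / s for c in w]                # random point of the simplex
    assert obj(x_star) <= obj(x) + 1e-12  # global optimality
    # gradient at x* is Q x* + b = x*, constant-coordinate on the simplex
    gap = sum(g * (a - c) for g, a, c in zip(x_star, x, x_star))
    assert abs(gap) < 1e-12               # <Q x* + b, x - x*> = 0 here
```

That the inequality holds with equality reflects the equality constraint: the gradient at x^\star is orthogonal to the constraint plane \mathbf{1}^\top x = 1, a first glimpse of the dual variables discussed later.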

1.4. Dependency and Proof Sketch🔗

  1. The local-to-global theorem uses only the convexity inequality on the segment joining x^\star to an arbitrary x \in \Omega. If some distant x were strictly better than a local minimizer x^\star, then points on the segment sufficiently close to x^\star would already be strictly better, contradicting local optimality.

  2. The first-order necessary condition is proved by differentiating the function

    \phi(t) := f(x^\star+t(x-x^\star))

    at t = 0^+, using global minimality of x^\star over a convex feasible set.

  3. The global linear lower bound is proved by applying convexity to

    f\bigl(x+t(y-x)\bigr)\le (1-t)f(x)+t f(y)

    for t \in (0,1], rearranging, and letting t \downarrow 0.

  4. The first-order characterization is the combination of the first-order necessary condition and the global linear lower bound,

    f(y)\ge f(x)+\langle \nabla f(x), y-x\rangle.
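Step 3 rests on the fact that for a convex function the difference quotient t \mapsto \bigl(f(x+t(y-x))-f(x)\bigr)/t is nondecreasing in t, so letting t \downarrow 0 picks out its smallest value, the directional derivative. A quick numeric check for the convex function f(u) = e^u:

```python
# The difference quotient t -> (f(x + t(y - x)) - f(x)) / t of a convex
# function is nondecreasing in t; its limit as t -> 0+ (the directional
# derivative) is therefore its smallest value, which is the content of
# the linear lower bound. Checked for f(u) = exp(u) on a sample segment.
import math

x, y = 0.0, 2.0
f = math.exp
quotients = [(f(x + t * (y - x)) - f(x)) / t
             for t in [0.001, 0.01, 0.1, 0.5, 1.0]]
assert all(a <= b + 1e-12 for a, b in zip(quotients, quotients[1:]))
# The t = 1 quotient equals f(y) - f(x), the right-hand side of the
# inequality rearranged in step 3.
assert abs(quotients[-1] - (f(y) - f(x))) < 1e-12
```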

1.5. Exercises🔗

  1. Give an example of a nonconvex differentiable function on \mathbb{R}^2 that has a strict local minimizer that is not global. Then explain precisely where the local-to-global theorem fails.

  2. Prove that if C \subseteq \mathbb{R}^n is convex and f : C \to \mathbb{R} is strictly convex, then f has at most one global minimizer on C.

  3. Let f(x)=\max\{x_1,x_2,0\} on \mathbb{R}^2. Determine all global minimizers and explain why the differentiable convex first-order characterization does not apply.

  4. A set C \subseteq \mathbb{R}^n is midpoint-convex if

    \forall x,y \in C,\qquad \frac{x+y}{2}\in C.

    Prove that every convex set is midpoint-convex. Then prove that every closed midpoint-convex set is convex. Finally, give a counterexample showing that midpoint-convexity alone does not imply convexity.