Convex Optimization Notes

1. Lecture 1: Introduction and Convexity

At a high level, optimization packages a decision task into three objects: a variable x, a feasible set \Omega, and an objective f. The feasible set records what is allowed; the objective records what is preferred. This language is broad enough to cover resource allocation, profit maximization, training a machine learning model, maximum-likelihood estimation, control, and minimum-energy problems. Once a model is written this way, the basic question becomes: among all feasible choices, which one has the smallest cost? In this course, we focus on continuous optimization problems, where the domain E is a finite-dimensional real space.

Definition
1.1 Optimization problem

Let f : E \to \mathbb{R} and let \Omega \subseteq E be nonempty. The optimization problem associated with (f,\Omega) is

p^\star := \inf\{f(x) : x \in \Omega\}.

A point x^\star \in \Omega is a global minimizer if f(x^\star) = p^\star, i.e., x^\star \in \arg\min_{x \in \Omega} f(x).

Definition
1.2 Approximate optimality

Let (f,\Omega) be an optimization problem, and let

p^\star := \inf\{f(x) : x \in \Omega\}.

For every \varepsilon > 0, a point \widehat x \in \Omega is called \varepsilon-optimal if

f(\widehat x) \le p^\star + \varepsilon.

Basic outcomes of an optimization problem. Even at this level, several different things can happen. The feasible set may be empty, the infimum may be finite but not attained, or the problem may be unbounded below. This is one reason a near-optimal solution concept is introduced: exact minimizers may fail to exist, and even when they do exist they may be harder to compute than near-optimal points. Before discussing certificates of optimality, it is therefore natural to ask a more basic question: when does an exact minimizer exist at all? The next theorem gives the standard topological answer under compactness and continuity.
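These outcomes can be made concrete on the real line. The sketch below (Python; the toy instances are illustrative choices, not taken from the lecture) exhibits an unattained infimum together with its \varepsilon-optimal points, an unbounded problem, and an empty feasible set.

```python
import math

# Three possible outcomes for p* = inf { f(x) : x in Omega }, illustrated
# on the real line (toy instances chosen for illustration):

# (a) Unattained infimum: f(x) = exp(x) on Omega = R has p* = 0, but no x
#     achieves it.  Still, eps-optimal points exist for every eps > 0:
def eps_optimal_point_for_exp(eps):
    """Return x with exp(x) <= 0 + eps, i.e. an eps-optimal point."""
    return math.log(eps / 2)

# (b) Unbounded below: f(x) = x on Omega = R has p* = -infinity; no point
#     is eps-optimal because p* + eps is still -infinity.

# (c) Empty feasible set: Omega = {x : x <= 0 and x >= 1} is empty, and by
#     convention p* = +infinity.

eps = 1e-3
x_hat = eps_optimal_point_for_exp(eps)
print(math.exp(x_hat) <= eps)  # x_hat is eps-optimal in case (a)
```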

Theorem
1.1 Weierstrass

Let \Omega \subseteq E be nonempty and compact, and let f : \Omega \to \mathbb{R} be continuous. Then there exists x^\star \in \Omega such that

f(x^\star) = \min_{x \in \Omega} f(x).

Proof

Because \Omega is compact and f is continuous on \Omega, the image set f(\Omega) \subseteq \mathbb{R} is compact. In particular, f(\Omega) is nonempty and closed and bounded below, so it contains its infimum. Choose m \in f(\Omega) such that

m = \inf f(\Omega).

Then there exists x^\star \in \Omega with f(x^\star) = m. For every x \in \Omega one has f(x) \in f(\Omega) and therefore m \le f(x). Hence

f(x^\star) = m = \min_{x \in \Omega} f(x).
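As a numerical companion to the theorem: on a compact interval a continuous function attains its minimum, and a fine grid search approximates both the minimizer and the minimum value. The function and interval below are illustrative choices, not from the lecture.

```python
def grid_min(f, a, b, n):
    """Approximate the min of f on the compact interval [a, b]
    using n + 1 equally spaced sample points."""
    best_x, best_val = a, f(a)
    for k in range(1, n + 1):
        x = a + (b - a) * k / n
        v = f(x)
        if v < best_val:
            best_x, best_val = x, v
    return best_x, best_val

f = lambda x: (x - 0.3) ** 2 + 1.0   # continuous on the compact set [0, 1]
x_star, p_star = grid_min(f, 0.0, 1.0, 10_000)
print(x_star, p_star)  # close to the true minimizer 0.3 and minimum 1.0
```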

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_weier.

theorem proof_of_Lecture01_thm_l1_weier {E : Type*} [TopologicalSpace E]
    {Ω : Set E} {f : E → ℝ} (hΩ_compact : IsCompact Ω)
    (hΩ_nonempty : Ω.Nonempty) (hf : ContinuousOn f Ω) :
    ∃ xStar ∈ Ω, IsMinOn f Ω xStar :=
  hΩ_compact.exists_isMinOn hΩ_nonempty hf

Specification versus computation. An optimization specification tells us what the mathematical problem is: it determines the feasible set, the objective, the optimal value, and the solution notions we care about, such as exact minimizers or \varepsilon-optimal points. But this still does not determine a computational problem. To speak about algorithms and complexity, we must also specify how the instance is presented, which primitive operations are available, what cost model is being counted, and what kind of output is required.

For finitely described models such as linear and quadratic programs, the input is a finite list of numbers, and one counts arithmetic or bit operations. For large language model pretraining in practice, the input is a huge text corpus together with code for the model architecture, and one measures the total training wall-clock time on a GPU cluster. The same mathematical specification can therefore lead to very different computational questions under different access models. For example, the computational question for LLM pretraining changes if the GPU cluster is rented rather than owned, so that the cost is the rental bill rather than wall-clock time. Even for simple problems that admit closed-form solutions, the computational question changes depending on whether the unit-cost arithmetic operations are finite-precision or infinite-precision; indeed, many algorithms are preferred precisely because of their numerical stability.

For the purposes of this course, we will mostly work in an abstract, and therefore general, setting: we assume that queries to an oracle for local information about the objective, such as its value, gradient, or Hessian at a given point, are the relatively cheap primitive operations. Throughout the course we also assume we can work with real numbers in infinite precision.
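The oracle model just described can be sketched as follows (Python; the interface and names are illustrative assumptions, not a fixed API of the course): the algorithm interacts with f only through local queries, here values and gradients.

```python
def make_first_order_oracle(f, grad_f):
    """Wrap (f, grad_f) as a first-order oracle that also counts queries."""
    calls = {"count": 0}
    def oracle(x):
        calls["count"] += 1
        return f(x), grad_f(x)
    return oracle, calls

# Example: f(x) = (x - 2)^2 with gradient f'(x) = 2 (x - 2).
oracle, calls = make_first_order_oracle(lambda x: (x - 2.0) ** 2,
                                        lambda x: 2.0 * (x - 2.0))

# Plain gradient descent uses nothing but oracle answers.
x = 0.0
for _ in range(100):
    _, g = oracle(x)
    x -= 0.25 * g          # fixed step size 1/4, an illustrative choice
print(x, calls["count"])   # x is near the minimizer 2.0 after 100 queries
```

Complexity in this model is then measured by the number of oracle calls an algorithm needs to reach a target accuracy, independently of how f is represented internally.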

Definition
1.3 Differentiability on an open set

Let U \subseteq E be open and let f : U \to \mathbb{R}. We say that f is differentiable on U if it is differentiable at every point of U. In that case the first-order object at x \in U is the differential

\nabla f(x) := Df(x) \in E^\ast, \qquad x \in U,

and expressions such as \langle \nabla f(x), h \rangle are to be read as dual pairings.

Remark
1.1 Ambient convention for Lecture 1

Throughout this lecture, let E be a finite-dimensional real normed space and let E^\ast be its dual. We write

\langle \xi, h \rangle, \qquad \xi \in E^\ast,\ h \in E,

for the dual pairing. If a genuine inner product is being used, we will say so explicitly or decorate the notation, for instance by writing \langle u, v \rangle_H or \langle u, v \rangle_2.

Definition
1.4 Convex set, convex function, and strict convexity

A set C \subseteq E is convex if

\forall x,y \in C,\ \forall \theta \in [0,1],\qquad \theta x + (1-\theta)y \in C.

Let C \subseteq E be convex and let f : C \to \mathbb{R}. The function f is convex if

\forall x,y \in C,\ \forall \theta \in [0,1],\qquad f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y).

It is strictly convex if

\forall x,y \in C \text{ with } x \ne y,\ \forall \theta \in (0,1),\qquad f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y).
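The convexity and strict-convexity inequalities can be spot-checked numerically. The sketch below (Python) uses the standard strictly convex example f(x) = x^2, which is not taken from the text.

```python
f = lambda x: x * x   # strictly convex on C = R

def convex_combination_gap(x, y, theta):
    """theta*f(x) + (1-theta)*f(y) - f(theta*x + (1-theta)*y).
    Nonnegative iff the convexity inequality holds at this triple;
    strictly positive for x != y, theta in (0, 1), iff strictly convex there."""
    return theta * f(x) + (1 - theta) * f(y) - f(theta * x + (1 - theta) * y)

samples = [(-3.0, 5.0, 0.25), (1.0, 2.0, 0.5), (-1.0, -4.0, 0.9)]
gaps = [convex_combination_gap(x, y, t) for (x, y, t) in samples]
print(all(g > 0 for g in gaps))  # strict inequality at each sampled triple
```

For f(x) = x^2 the gap equals \theta(1-\theta)(x-y)^2, which explains why every sampled value is strictly positive.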

1.1. Why Convexity?

The point of studying convex optimization is not that all realistic optimization problems are convex. Rather, convexity is a good first structural assumption because it is strong enough to yield real global theorems, but not so strong that the subject collapses into a few toy examples.

  1. Convexity is strong enough to make local information globally meaningful. Without additional structure, local information is usually only local. A derivative, an affine approximation, or a second-order expansion near one point says very little about what happens elsewhere. Convexity is the first major structural assumption in the course under which this changes in a robust way. Gradients, subgradients, supporting hyperplanes, and dual certificates stop being merely local descriptions and become global certificates and lower bounds.

  2. Convexity is broad and stable enough to support a systematic theory. A useful structural assumption should not apply only to a tiny family of specially engineered problems. Convexity still contains linear programs, least squares, logistic regression, constrained quadratic programs, second-order cone programs, semidefinite programs, and many regularized learning models. It is also stable under the operations optimization repeatedly uses, such as nonnegative linear combinations, affine changes of variables, epigraph constructions, partial minimization, and conjugation. Because of that, convex optimization can be developed as a real theory rather than a disconnected collection of tricks.

  3. Convex optimization is the right baseline testbed for algorithms and complexity. Even when the eventual application is nonconvex, convex optimization is the cleanest setting in which one can first isolate the role of local primitives, prove global guarantees, and identify genuine complexity barriers. If a method cannot even be explained or stabilized on convex problems, then its behavior on more complicated problems is harder to interpret, not easier.

  4. Convex optimization is also a source of ideas that survive beyond the convex setting. Its value is not limited to problems that are themselves convex. Many methods and viewpoints that later matter more broadly were first discovered, justified, or conceptually clarified in convex and online convex optimization. A concrete example is modern LLM training: the objective is highly nonconvex, yet the default optimizer AdamW belongs to a line of adaptive first-order methods whose ancestry runs through AdaGrad, a method that emerged from theoretical work in online convex optimization.

1.2. First Consequences of Convexity

The next results are the first concrete consequences of convexity. For convex functions, local optimality is already global. On a convex feasible set, differentiability yields a first-order necessary condition for minimizers. Once convexity is added, that same first-order sign condition becomes not only necessary but also sufficient for global optimality.

Theorem
1.2 Local minima are global for convex functions

Let \Omega \subseteq E be nonempty and convex, let f : \Omega \to \mathbb{R} be convex, and let x^\star \in \Omega. If there exists r > 0 such that

\forall x \in \Omega \cap \{y \in E : \|y - x^\star\| < r\},\qquad f(x^\star) \le f(x),

then

\forall x \in \Omega,\qquad f(x^\star) \le f(x).

Proof

Assume for contradiction that there exists x \in \Omega such that

f(x) < f(x^\star).

Because \Omega is convex, for every \theta \in (0,1) the point

x_\theta := \theta x + (1-\theta)x^\star

belongs to \Omega. By convexity of f,

f(x_\theta) \le \theta f(x) + (1-\theta)f(x^\star) < f(x^\star).

Moreover,

\|x_\theta - x^\star\| = \theta \|x - x^\star\|.

Choosing \theta > 0 sufficiently small makes \|x_\theta - x^\star\| < r. Then x_\theta \in \Omega \cap \{y \in E : \|y - x^\star\| < r\} and

f(x_\theta) < f(x^\star),

contradicting the assumed local minimality of x^\star. Therefore no such x exists, and so

\forall x \in \Omega,\qquad f(x^\star) \le f(x).

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_local_global.

theorem proof_of_Lecture01_thm_l1_local_global {E : Type*} [NormedAddCommGroup E]
    [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ} {xStar : E}
    (hxStar : xStar ∈ Ω) (hconv : ConvexOn ℝ Ω f)
    (hlocal : ∃ r > 0, ∀ ⦃x : E⦄, x ∈ Ω → ‖x - xStar‖ < r → f xStar ≤ f x) :
    ∀ x ∈ Ω, f xStar ≤ f x := by
  obtain ⟨r, hr_pos, hlocal⟩ := hlocal
  have hlocalMin : IsLocalMinOn f Ω xStar := by
    filter_upwards [inter_mem_nhdsWithin Ω (Metric.ball_mem_nhds xStar hr_pos)]
      with x hx
    exact hlocal hx.1 (by simpa using hx.2)
  exact fun x hx => IsMinOn.of_isLocalMinOn_of_convexOn hxStar hlocalMin hconv hx
Lemma
1.3 First-order necessary condition on a convex feasible set

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, and let x^\star \in \Omega be a global minimizer of f over \Omega. Then

\forall x \in \Omega,\qquad \langle \nabla f(x^\star), x - x^\star \rangle \ge 0.

Proof

Fix any x \in \Omega and define

\phi(t) := f\bigl(x^\star + t(x-x^\star)\bigr), \qquad t \in [0,1].

Because \Omega is convex, one has x^\star + t(x-x^\star) \in \Omega for every t \in [0,1]. Since x^\star is a global minimizer of f over \Omega,

\phi(t) \ge \phi(0) \qquad \forall t \in [0,1].

Hence the right derivative of \phi at 0 is nonnegative:

\phi'_+(0) \ge 0.

Because f is differentiable,

\phi'_+(0) = \langle \nabla f(x^\star), x-x^\star \rangle.

Therefore

\langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

Since x \in \Omega was arbitrary, the claim follows.
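As a concrete instance of the lemma (with toy data assumed for illustration, not from the text): take f(x) = x^2 over the convex feasible set \Omega = [1,2]. The global minimizer over \Omega is x^\star = 1 with \nabla f(x^\star) = 2, and the first-order pairing is nonnegative over \Omega even though the gradient does not vanish.

```python
# f(x) = x^2 over Omega = [1, 2]: the minimizer x* = 1 sits on the boundary,
# so grad f(x*) = 2 is nonzero, yet <grad f(x*), x - x*> >= 0 on Omega.
x_star = 1.0
grad_at_xstar = 2.0   # f'(1) for f(x) = x^2

feasible_points = [1.0 + 0.1 * k for k in range(11)]   # samples of [1, 2]
pairings = [grad_at_xstar * (x - x_star) for x in feasible_points]
print(all(p >= 0 for p in pairings))
```

Note the contrast with unconstrained minimization: the stationarity condition \nabla f(x^\star) = 0 is replaced by a sign condition on feasible directions.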

Formal Statement and Proof

Lean theorem: Lecture01.lem_l1_first_order_necessary.

theorem proof_of_Lecture01_lem_l1_first_order_necessary {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ} {xStar : E}
    (hΩ_convex : Convex ℝ Ω) (hxStar : xStar ∈ Ω)
    (hmin : ∀ y ∈ Ω, f xStar ≤ f y) (hf : Differentiable ℝ f) :
    ∀ x ∈ Ω, 0 ≤ fderiv ℝ f xStar (x - xStar) := by
  -- The machine-checked tactic proof is not reproduced here.  It restricts f
  -- to the segment φ t = f (AffineMap.lineMap xStar x t), shows that t = 0 is
  -- a local minimum of φ on Set.Icc 0 1, and concludes with
  -- IsLocalMinOn.hasFDerivWithinAt_nonneg applied to the tangent direction
  -- 1 ∈ posTangentConeAt (Set.Icc 0 1) 0.
  sorry
Lemma
1.4 Differentiable convex functions admit a global linear lower bound

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, and assume that f is convex on \Omega. Then

\forall x,y \in \Omega,\qquad f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle.

Proof

Fix x,y \in \Omega. For every t \in (0,1], convexity of f on \Omega gives

f\bigl(x+t(y-x)\bigr) \le (1-t)f(x) + t f(y).

Rearranging, we obtain

\frac{f\bigl(x+t(y-x)\bigr)-f(x)}{t} \le f(y)-f(x).

Because f is differentiable at x, letting t \downarrow 0 yields

\langle \nabla f(x), y-x \rangle \le f(y)-f(x).

Equivalently,

f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle.

Since x,y \in \Omega were arbitrary, the claim follows.
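The global linear lower bound can be spot-checked on the standard convex example f = \exp (an illustrative choice, not from the text), where the inequality reads \exp(y) \ge \exp(x) + \exp(x)(y-x).

```python
import math

# Spot-check of the gradient lower bound for f = exp on R, where f' = exp:
# exp(y) >= exp(x) + exp(x) * (y - x) for all x, y.
def linear_lower_bound_holds(x, y):
    return math.exp(y) >= math.exp(x) + math.exp(x) * (y - x)

pairs = [(0.0, 1.0), (1.0, -2.0), (-3.0, 3.0), (0.5, 0.5)]
print(all(linear_lower_bound_holds(x, y) for x, y in pairs))
```

Equality holds exactly when y = x, reflecting that the tangent line touches the graph at the base point and lies below it everywhere else.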

Formal Statement and Proof

Lean theorem: Lecture01.lem_l1_gradient_lower_bound.

theorem proof_of_Lecture01_lem_l1_gradient_lower_bound {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ}
    (hconv : ConvexOn ℝ Ω f) (hf : Differentiable ℝ f) :
    ∀ x ∈ Ω, ∀ y ∈ Ω, f y ≥ f x + fderiv ℝ f x (y - x)

(Proof sketch in Lean: restrict f to the segment g := AffineMap.lineMap x y, obtain convexity of t ↦ f (g t) on Set.Icc 0 1 via ConvexOn.comp_affineMap, compute the derivative of t ↦ f (g t) at 0 with HasFDerivAt.comp_hasDerivAt_of_eq, and bound that derivative by the slope from 0 to 1 using ConvexOn.le_slope_of_hasDerivAt.)
Theorem
1.5 Differentiable convex first-order characterization on a convex feasible set

Let \Omega \subseteq E be nonempty and convex, let f : E \to \mathbb{R} be differentiable, assume that f is convex on \Omega, and let x^\star \in \Omega. Then the following are equivalent:

  1. f(x^\star) \le f(x) for every x \in \Omega;

  2. \langle \nabla f(x^\star), x - x^\star \rangle \ge 0 for every x \in \Omega.

Proof

We first prove that item (1) implies item (2). If f(x^\star) \le f(x) for every x \in \Omega, then x^\star is a global minimizer of f on \Omega. Applying the first-order necessary condition yields

\forall x \in \Omega,\qquad \langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

We next prove that item (2) implies item (1). Fix any x \in \Omega. Applying the global linear lower bound with x^\star and x, we obtain

f(x) \ge f(x^\star) + \langle \nabla f(x^\star), x-x^\star \rangle.

By item (2),

\langle \nabla f(x^\star), x-x^\star \rangle \ge 0.

Hence

f(x) \ge f(x^\star).

Because x \in \Omega was arbitrary, item (1) follows.
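Theorem 1.5 can be illustrated on a hand-picked instance: minimizing f(x) = \|x-z\|^2 over the box \Omega = [0,1]^3. The minimizer is the coordinatewise clip of z onto the box; the sketch below checks both items of the theorem at random feasible points (z is made-up data).

```python
# Illustration of the first-order characterization: minimize
# f(x) = ||x - z||^2 over the box Omega = [0, 1]^3. The minimizer is
# the coordinatewise clip of z onto the box; both items of the
# theorem are checked at random feasible points. z is made-up data.
import random

z = [2.0, -1.0, 0.5]
x_star = [min(max(c, 0.0), 1.0) for c in z]  # clip z onto the box

def f(v):
    return sum((a - b) ** 2 for a, b in zip(v, z))

grad_star = [2.0 * (a - b) for a, b in zip(x_star, z)]  # grad f(x*)

random.seed(1)
for _ in range(1000):
    x = [random.random() for _ in range(3)]  # uniform point of the box
    gap = sum(g * (a - b) for g, a, b in zip(grad_star, x, x_star))
    assert gap >= -1e-12              # item (2): <grad f(x*), x - x*> >= 0
    assert f(x_star) <= f(x) + 1e-12  # item (1): global optimality
```

Note that the gradient at x^\star need not vanish here: on the clipped coordinates it points out of the box, which is exactly what the variational inequality permits.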

Formal Statement and Proof

Lean theorem: Lecture01.thm_l1_first_order_char.

theorem proof_of_Lecture01_thm_l1_first_order_char {E : Type*}
    [NormedAddCommGroup E] [NormedSpace ℝ E] {Ω : Set E} {f : E → ℝ}
    {xStar : E} (hconv : ConvexOn ℝ Ω f) (hxStar : xStar ∈ Ω)
    (hf : Differentiable ℝ f) :
    (∀ x ∈ Ω, f xStar ≤ f x) ↔ (∀ x ∈ Ω, 0 ≤ fderiv ℝ f xStar (x - xStar))

(Proof sketch in Lean: the forward direction is the first-order necessary condition at the global minimizer xStar; the reverse direction instantiates lem_l1_gradient_lower_bound at xStar and x and combines it with the sign hypothesis on fderiv ℝ f xStar (x - xStar).)

1.3. Examples of Convex Functions🔗

Example
1.1 Least squares

In the Euclidean model E = \mathbb{R}^d, given data (a_i,b_i)_{i=1}^m with a_i \in \mathbb{R}^d, consider

\min_{x \in \mathbb{R}^d}\ \frac{1}{2m}\sum_{i=1}^m (a_i^\top x-b_i)^2.

This problem is explicit and convex, and in favorable full-rank settings it has a closed-form normal-equation solution. But in large-scale settings one still uses iterative algorithms. Even a problem with a recognizable formula is therefore not automatically computationally trivial.
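The two solution routes can be compared on a tiny synthetic instance. The sketch below (made-up data, plain Python rather than a linear-algebra library) solves the 2-by-2 normal equations in closed form and checks that gradient descent on the same objective reaches the same point; the step size and iteration count are illustration choices.

```python
# Least squares on a made-up 4-point, 2-feature instance: closed-form
# normal-equation solution versus plain gradient descent.
m = 4
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 2.0, 4.0]

# Normal equations (A^T A) x = A^T b, solved by Cramer's rule (2x2).
g11 = sum(r[0] * r[0] for r in A)
g12 = sum(r[0] * r[1] for r in A)
g22 = sum(r[1] * r[1] for r in A)
c1 = sum(r[0] * y for r, y in zip(A, b))
c2 = sum(r[1] * y for r, y in zip(A, b))
det = g11 * g22 - g12 * g12
x_closed = [(c1 * g22 - c2 * g12) / det, (g11 * c2 - g12 * c1) / det]

# Gradient descent on (1/2m) sum_i (a_i^T x - b_i)^2.
x = [0.0, 0.0]
step = 0.1
for _ in range(5000):
    r = [sum(a * v for a, v in zip(row, x)) - y for row, y in zip(A, b)]
    grad = [sum(ri * row[j] for ri, row in zip(r, A)) / m for j in range(2)]
    x = [v - step * g for v, g in zip(x, grad)]

assert all(abs(u - v) < 1e-6 for u, v in zip(x, x_closed))
```

On this instance both routes agree; the computational point is that only the iterative route scales when forming and factoring A^\top A becomes expensive.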

Example
1.2 Logistic regression

Given labels y_i \in \{\pm1\}, consider

\min_{x \in \mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^m \log\!\bigl(1+e^{-y_i a_i^\top x}\bigr)+\frac{\lambda}{2}\|x\|_2^2.

This problem is again explicit and convex, but usually has no closed-form minimizer. Its importance is computational rather than symbolic: its value and gradient are both cheap to evaluate. Convexity here does not mean a closed form; it means that local information can become globally meaningful.
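As a sketch of this computational point, the following runs plain gradient descent on the regularized logistic loss for a made-up one-dimensional dataset; the step size, iteration count, and \lambda are illustration choices, not tuned recommendations.

```python
# Gradient descent on the regularized logistic loss for a made-up
# one-dimensional dataset of (a_i, y_i) pairs. No closed form exists;
# value and gradient evaluations are all the method needs.
import math

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]
lam = 0.1
m = len(data)

def loss(x):
    return sum(math.log(1 + math.exp(-y * a * x)) for a, y in data) / m \
        + 0.5 * lam * x * x

def grad(x):
    # d/dx log(1 + exp(-y*a*x)) = -y*a / (1 + exp(y*a*x))
    s = sum(-y * a / (1 + math.exp(y * a * x)) for a, y in data) / m
    return s + lam * x

x = 0.0
for _ in range(2000):
    x -= 0.5 * grad(x)

assert abs(grad(x)) < 1e-8  # first-order stationarity at the iterate
assert loss(x) < loss(0.0)  # strictly better than the starting point
```

Because the regularized loss is strongly convex, the vanishing gradient certifies that the final iterate is (numerically) the unique global minimizer, which is exactly the local-to-global promise of convexity.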

Example
1.3 A constrained convex quadratic program

In the Euclidean model E = \mathbb{R}^n, let Q \succeq 0, b \in \mathbb{R}^n, and let

\Omega := \{x \in \mathbb{R}^n : Cx=d,\ x \ge 0\}.

Then the problem is

\min\left\{\frac12 x^\top Qx+b^\top x : x\in\Omega\right\}.

This is a convex optimization problem with both objective geometry and explicit constraints. It previews later themes all at once: constrained optimality, dual variables, and KKT conditions.
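A toy instance previews these themes numerically: take Q = I, b = 0, and \Omega the probability simplex \{x \ge 0,\ \mathbf{1}^\top x = 1\} (a hypothetical choice of the data C, d). By symmetry the minimizer is the uniform vector, and the variational inequality \langle Qx^\star + b, x - x^\star \rangle \ge 0 holds with equality on the whole simplex.

```python
# Toy constrained QP: Q = I, b = 0, Omega the probability simplex.
# The minimizer of (1/2) x^T Q x + b^T x is the uniform vector; both
# optimality and the variational inequality are checked numerically.
import random

n = 4
x_star = [1.0 / n] * n

def obj(v):
    return 0.5 * sum(c * c for c in v)  # Q = I, b = 0

random.seed(2)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    x = [c / s for c in w]                # random point of the simplex
    assert obj(x_star) <= obj(x) + 1e-12  # global optimality
    # gradient at x* is Q x* + b = x*, constant-coordinate on the simplex
    gap = sum(g * (a - c) for g, a, c in zip(x_star, x, x_star))
    assert abs(gap) < 1e-12               # <Q x* + b, x - x*> = 0 here
```

That the inequality holds with equality reflects the equality constraint: the gradient at x^\star is orthogonal to the constraint plane \mathbf{1}^\top x = 1, a first glimpse of the dual variables discussed later.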

1.4. Dependency and Proof Sketch🔗

  1. The local-to-global theorem uses only the convexity inequality on the segment joining x^\star to an arbitrary x \in \Omega. If some distant x were strictly better than a local minimizer x^\star, then points on the segment sufficiently close to x^\star would already be strictly better, contradicting local optimality.

  2. The first-order necessary condition is proved by differentiating the function

    \phi(t) := f(x^\star+t(x-x^\star))

    at t = 0^+, using global minimality of x^\star over a convex feasible set.

  3. The global linear lower bound is proved by applying convexity to

    f\bigl(x+t(y-x)\bigr)\le (1-t)f(x)+t f(y)

    for t \in (0,1], rearranging, and letting t \downarrow 0.

  4. The first-order characterization is the combination of the first-order necessary condition and the global linear lower bound,

    f(y)\ge f(x)+\langle \nabla f(x), y-x\rangle.
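Step 3 rests on the fact that for a convex function the difference quotient t \mapsto \bigl(f(x+t(y-x))-f(x)\bigr)/t is nondecreasing in t, so letting t \downarrow 0 picks out its smallest value, the directional derivative. A quick numeric check for the convex function f(u) = e^u:

```python
# The difference quotient t -> (f(x + t(y - x)) - f(x)) / t of a convex
# function is nondecreasing in t; its limit as t -> 0+ (the directional
# derivative) is therefore its smallest value, which is the content of
# the linear lower bound. Checked for f(u) = exp(u) on a sample segment.
import math

x, y = 0.0, 2.0
f = math.exp
quotients = [(f(x + t * (y - x)) - f(x)) / t
             for t in [0.001, 0.01, 0.1, 0.5, 1.0]]
assert all(a <= b + 1e-12 for a, b in zip(quotients, quotients[1:]))
# The t = 1 quotient equals f(y) - f(x), the right-hand side of the
# inequality rearranged in step 3.
assert abs(quotients[-1] - (f(y) - f(x))) < 1e-12
```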

1.5. Exercises🔗

  1. Give an example of a nonconvex differentiable function on \mathbb{R}^2 that has a strict local minimizer that is not global. Then explain precisely where the local-to-global theorem fails.

  2. Prove that if C \subseteq \mathbb{R}^n is convex and f : C \to \mathbb{R} is strictly convex, then f has at most one global minimizer on C.

  3. Let f(x)=\max\{x_1,x_2,0\} on \mathbb{R}^2. Determine all global minimizers and explain why the differentiable convex first-order characterization does not apply.

  4. A set C \subseteq \mathbb{R}^n is midpoint-convex if

    \forall x,y \in C,\qquad \frac{x+y}{2}\in C.

    Prove that every convex set is midpoint-convex. Then prove that every closed midpoint-convex set is convex. Finally, give a counterexample showing that midpoint-convexity alone does not imply convexity.