## Introduction to Mathematical Optimization

Chapter 2: an introduction to unconstrained optimization.

This chapter introduces the unconstrained optimization problem. Taylor's Theorem is discussed in detail and used to derive the first-order and second-order necessary and sufficient conditions for a local minimizer in unconstrained optimization tasks. Examples are supplied to aid understanding of these conditions. The Python package scipy.optimize, which will form an integral part of solving many optimization problems in the later chapters of this book, is also introduced. The chapter ends with an overview of how algorithms for unconstrained minimization work, briefly covering two procedures: the line search descent method and the trust region method.

## 2.1 The Unconstrained Optimization Problem

As discussed in the first chapter, an unconstrained optimization problem deals with finding the local minimizer \(\mathbf{x}^*\) of a real-valued, smooth objective function \(f(\mathbf{x})\) of \(n\) variables, given by \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), formulated as,

\[\begin{equation} \underset{\mathbf{x} \in \mathbb{R}^n}{\min}\ f(\mathbf{x}) \tag{2.1} \end{equation}\] with no restrictions on the decision variables \(\mathbf{x}\). We work towards computing \(\mathbf{x}^*\), such that \(\forall\ \mathbf{x}\) near \(\mathbf{x}^*\), the following inequality is satisfied: \[\begin{equation} f(\mathbf{x}^*) \leq f(\mathbf{x}) \tag{2.2} \end{equation}\]

## 2.2 Smooth Functions

In terms of analysis, the smoothness of a function is characterized by the number of continuous derivatives it has.

Some examples of smooth functions are \(f(x) = x\), \(f(x)=e^x\), and \(f(x)=\sin(x)\). To study a local minimizer \(\mathbf{x}^*\) of a smooth objective function \(f(\mathbf{x})\), we rely on Taylor's theorem for a multivariate function, and thus focus on the computation of the gradient vector \(\nabla f(\mathbf{x})\) and the Hessian matrix \(\mathbf{H} f(\mathbf{x})\).

## 2.3 Taylor’s Theorem

The \(m\)-th order Taylor polynomial of the function \(f\) around the point \(p\) is given by: \[\begin{align} P_m(x)&=f(p) + (x - p)f^{'}(p) + \frac{(x-p)^2}{2!}f^{''}(p) + \ldots \\ &+ \frac{(x-p)^m}{m!}f^{(m)}(p) \tag{2.5} \end{align}\]

Now let \(f\) be a smooth, continuously differentiable function of multiple variables, i.e., \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), and let \(\mathbf{x}, \mathbf{p}, \mathbf{\delta} \in \mathbb{R}^n\), where \(\mathbf{\delta}\) is the direction in which the line \(\mathbf{x} = \mathbf{p}+\alpha \mathbf{\delta}\) passes through the point \(\mathbf{p}\) [Snyman, Jan A. Practical Mathematical Optimization. Springer Science+Business Media, 2005]. Here, \(\alpha \in [0,1]\). We have, \[\begin{equation} f(\mathbf{x}) = f(\mathbf{p} + \alpha \mathbf{\delta}) \tag{2.6} \end{equation}\]

From the definition of the directional derivative, we get, \[\begin{equation} \frac{df(\mathbf{x})}{d\alpha}\bigg|_{\mathbf{\delta}} = \nabla^T f(\mathbf{x})\mathbf{\delta}=\hat{f}(\mathbf{x}) \tag{2.7} \end{equation}\] Differentiating \(\hat{f}(\mathbf{x})\) with respect to \(\alpha\) again, we get,

\[\begin{equation} \frac{d \hat{f}(\mathbf{x})}{d \alpha}\bigg|_{\mathbf{\delta}} = \frac{d^2 f(\mathbf{x})}{d \alpha^2}=\nabla^T\hat{f}(\mathbf{x})\mathbf{\delta}=\mathbf{\delta}^T\mathbf{H}f(\mathbf{x})\mathbf{\delta} \tag{2.8} \end{equation}\]

So, using Eq. (2.5) together with Eqs. (2.7) and (2.8), we can generate the Taylor expansion of a multivariable function around the point \(\mathbf{p}\), i.e., around \(\alpha = 0\): \[\begin{equation} f(\mathbf{x}) = f(\mathbf{p}+\alpha \mathbf{\delta}) = f(\mathbf{p}) + \nabla^Tf(\mathbf{p})\alpha \mathbf{\delta} + \frac{1}{2}\alpha \mathbf{\delta}^T\mathbf{H} f(\mathbf{p})\alpha \mathbf{\delta} + \ldots \tag{2.9} \end{equation}\]

The truncated Taylor expansion of the multivariable function, where the higher order terms are ignored, can be written as, \[\begin{equation} f(\mathbf{x}) = f(\mathbf{p}+\alpha \mathbf{\delta}) = f(\mathbf{p}) + \nabla^Tf(\mathbf{p})\alpha \mathbf{\delta} + \frac{1}{2}\alpha \mathbf{\delta}^T\mathbf{H} f(\mathbf{p}+\beta\mathbf{\delta})\alpha \mathbf{\delta} \tag{2.10} \end{equation}\] where, \(\beta \in [0,1]\) .
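To see the quality of the truncated expansion numerically, here is a minimal sketch; the test function \(f(x_1, x_2) = e^{x_1}\cos x_2\), the expansion point, and the direction are illustrative assumptions, not taken from the text:

```python
import numpy as np

def f(x):
    # Illustrative test function f(x1, x2) = exp(x1) * cos(x2)
    return np.exp(x[0]) * np.cos(x[1])

def grad_f(x):
    # Hand-derived gradient
    return np.array([np.exp(x[0]) * np.cos(x[1]),
                     -np.exp(x[0]) * np.sin(x[1])])

def hess_f(x):
    # Hand-derived Hessian
    return np.array([[np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])],
                     [-np.exp(x[0]) * np.sin(x[1]), -np.exp(x[0]) * np.cos(x[1])]])

p = np.array([0.5, 0.3])       # expansion point p
delta = np.array([1.0, -1.0])  # direction delta
alpha = 1e-2                   # small step length

step = alpha * delta
exact = f(p + step)
first_order = f(p) + grad_f(p) @ step
second_order = first_order + 0.5 * step @ hess_f(p) @ step

# The second-order model tracks f much more closely than the first-order one
print(abs(exact - first_order), abs(exact - second_order))
```

For a small step length the error of the second-order model shrinks much faster than that of the first-order model, as Eq. (2.9) predicts.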

## 2.4 Necessary and Sufficient Conditions for Local Minimizer in Unconstrained Optimization

## 2.4.1 First-Order Necessary Condition

If there exists a local minimizer \(\mathbf{x}^*\) of a real-valued smooth function \(f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}\) in an open neighborhood \(\subset \mathbb{R}^n\) of \(\mathbf{x}^*\) along the direction \(\mathbf{\delta}\), then the first-order necessary condition for the minimizer is given by: \[\begin{equation} \nabla^Tf(\mathbf{x}^*)\mathbf{\delta}=0\ \forall\ \mathbf{\delta} \neq 0 \tag{2.11} \end{equation}\] i.e., the directional derivative is \(0\), which reduces to the equation: \[\begin{equation} \nabla f(\mathbf{x}^*)=0 \tag{2.12} \end{equation}\]

Proof. Let a real-valued smooth function \(f(\mathbf{x})\) be differentiable at the point \(\mathbf{x}^* \in \mathbb{R}^n\) . Using the Taylor expansion , we can write:

\[\begin{equation} f(\mathbf{x})=f(\mathbf{x}^*) + \nabla^T f(\mathbf{x}^*) (\mathbf{x} - \mathbf{x}^*)+\sum_{2 \leq |\gamma|\leq m}\frac{\mathfrak{D}^{\gamma}f(\mathbf{x}^*)}{\gamma!}(\mathbf{x}-\mathbf{x}^*)^{\gamma} + \sum_{|\gamma|=m}j_{\gamma}(\mathbf{x})(\mathbf{x} - \mathbf{x}^*)^{\gamma} \tag{2.13} \end{equation}\]

where \(\mathfrak{D}\) represents the differential operator and \(m\) is the order of smoothness of the objective function \(f\). Also, \(\lim_{\mathbf{x} \to \mathbf{x}^*}j_{\gamma}(\mathbf{x})=0\). Clubbing together the higher-order terms, we can write Eq. (2.13) as, \[\begin{equation} \label{eq:2.14} f(\mathbf{x})=f(\mathbf{x}^*) + \nabla^T f(\mathbf{x}^*) (\mathbf{x} - \mathbf{x}^*)+ \mathcal{O}(\|\mathbf{x} - \mathbf{x}^*\|) \tag{2.14} \end{equation}\]

In this case, \[\begin{equation} \lim_{\mathbf{x} \to \mathbf{x}^*}\frac{\mathcal{O}(\|\mathbf{x} - \mathbf{x}^*\|)}{\|\mathbf{x} - \mathbf{x}^*\|}=0 \tag{2.15} \end{equation}\] Let us consider, \(\mathbf{x} = \mathbf{x}^*-\beta \nabla f(\mathbf{x}^*)\) , where \(\beta \in [0,1]\) . From Eq. (2.14) we can write, \[\begin{equation} f(\mathbf{x}^*-\beta \nabla f(\mathbf{x}^*))=f(\mathbf{x}^*)-\beta\|\nabla f(\mathbf{x}^*)\|^2+\mathcal{O}(\beta\|\nabla f(\mathbf{x}^*)\|) \tag{2.16} \end{equation}\]

Now, subtracting \(f(\mathbf{x}^*)\) from both sides of Eq. (2.16), dividing by \(\beta\), and using the fact that \(\mathbf{x}^*\) is a local minimizer (so the numerator is nonnegative for small \(\beta\)), we get,

\[\begin{equation} \frac{f(\mathbf{x}^*-\beta \nabla f(\mathbf{x}^*))-f(\mathbf{x}^*)}{\beta} = -\|\nabla f(\mathbf{x}^*)\|^2 + \frac{\mathcal{O}(\beta \|\nabla f(\mathbf{x}^*)\|)}{\beta} \geq 0 \tag{2.17} \end{equation}\]

Now, taking the limit \(\beta \to 0^{+}\), the remainder term vanishes by Eq. (2.15), and we get, \[\begin{equation} -\|\nabla f(\mathbf{x}^*)\|^2 \geq 0 \tag{2.18} \end{equation}\] Since \(\|\nabla f(\mathbf{x}^*)\|^2\) cannot be negative, this forces \(\nabla f(\mathbf{x}^*) = 0\), which proves Eq. (2.12).

Example 2.1 The Rosenbrock function of \(n\) -variables is given by: \[\begin{equation} f(\mathbf{x}) = \sum_{i=1}^{n-1}(100(x_{i+1}-x_i^2)^2 + (1-x_i)^2) \tag{2.20} \end{equation}\]

where, \(\mathbf{x} \in \mathbb{R}^n\) . For this example let us consider the Rosenbrock function for two variables, given by: \[\begin{equation} f(\mathbf{x}) = 100(x_2-x_1^2)^2+(1-x_1)^2 \tag{2.21} \end{equation}\]

We will show that the first-order necessary condition is satisfied for the local minimizer \(\mathbf{x^*}=\begin{bmatrix} 1 \\ 1 \end{bmatrix}\). We first check whether \(\mathbf{x}^*\) is a minimizer. Putting \(x_1=x_2=1\) in \(f(\mathbf{x})\), we get \(f(\mathbf{x}^*)=0\). Now, we check whether \(\mathbf{x^*}\) satisfies the first-order necessary condition. For that we calculate \(\nabla f(\mathbf{x}^*)\). \[\begin{equation} \nabla f(\mathbf{x}^*) = \begin{bmatrix} -400x_1(x_2-x_1^2)-2(1-x_1) \\ 200(x_2-x_1^2)\end{bmatrix}_{\mathbf{x}^*} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{2.22} \end{equation}\]

So, we see that the first-order necessary condition is satisfied. We can perform a similar analysis using the scipy.optimize package in Python. The SciPy official reference states that the scipy.optimize package provides the user with many commonly used optimization algorithms and test functions. It packages the following functionalities:

- Minimization of multivariate scalar objective functions covering both the unconstrained and constrained domains, using a range of optimization algorithms,
- Algorithms for minimization of scalar univariate functions,
- A variety of brute-force optimization algorithms, also called global optimization algorithms,
- Algorithms like minimization of least-squares and curve-fitting,
- Root finding algorithms, and
- Algorithms for solving multivariate equation systems.
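A minimal sketch of this check, using the predefined Rosenbrock function and its gradient from scipy.optimize:

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

x_star = np.array([1.0, 1.0])

# Value of the Rosenbrock function at the candidate minimizer
print(rosen(x_star))       # 0.0

# Gradient of the Rosenbrock function at the candidate minimizer
print(rosen_der(x_star))   # [0. 0.]
```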

Evaluating the predefined Rosenbrock function at \(\mathbf{x}^*\) gives \(0.0\), so \(\mathbf{x}^*\) is a minimizer. Evaluating the gradient at \(\mathbf{x}^*\) returns the zero vector, which matches our hand calculations and satisfies the first-order necessary condition.

## 2.4.2 Second-Order Necessary Conditions

If there exists a local minimizer \(\mathbf{x}^*\) of a real-valued smooth function \(f(\mathbf{x}): \mathbb{R}^n \rightarrow \mathbb{R}\) in an open neighborhood \(\subset \mathbb{R}^n\) of \(\mathbf{x}^*\) along the feasible direction \(\mathbf{\delta}\), and \(\mathbf{H} f(\mathbf{x})\) exists and is continuous in the open neighborhood, then the second-order necessary conditions for the minimizer are given by: \[\begin{equation} \nabla^T f(\mathbf{x}^*)\mathbf{\delta} = 0, \forall\ \mathbf{\delta} \neq 0 \tag{2.23} \end{equation}\] and \[\begin{equation} \mathbf{\delta}^T\mathbf{H}f(\mathbf{x}^*)\mathbf{\delta} \geq 0, \forall\ \mathbf{\delta} \neq 0 \tag{2.24} \end{equation}\] which reduce to the following: \[\begin{equation} \nabla f(\mathbf{x}^*) = 0 \tag{2.25} \end{equation}\] and \[\begin{equation} \mathbf{H}f(\mathbf{x}^*) \succeq 0 \tag{2.26} \end{equation}\] where Eq. (2.26) means that the Hessian matrix must be positive semi-definite.

Proof. From the proof of the first-order necessary condition, we know that \(\nabla f(\mathbf{x}^*) = 0\), which satisfies Eq. (2.25). To prove Eq. (2.26), we use the Taylor series expansion again. We have, \[\begin{equation} f(\mathbf{x}) = f(\mathbf{x}^*)+\nabla^T f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) + \frac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\mathbf{H} f(\mathbf{x}^*)(\mathbf{x} - \mathbf{x}^*) + \mathcal{O}(\|\mathbf{x} - \mathbf{x}^*\|^2) \tag{2.27} \end{equation}\]

Given, the feasible direction \(\mathbf{\delta}\) , we take \(\beta \in [0,1]\) , such that \(\mathbf{x} = \mathbf{x}^* + \beta \mathbf{\delta}\) . Putting this in Eq. (2.27) and rearranging, we get, \[\begin{equation} f(\mathbf{x}^* + \beta \mathbf{\delta}) - f(\mathbf{x}^*) = \beta \nabla^T f(\mathbf{x}^*)\mathbf{\delta} + \frac{\beta^2}{2}\mathbf{\delta}^T \mathbf{H} f(\mathbf{x}^*)\mathbf{\delta} + \mathcal{O}(\beta^2) \tag{2.28} \end{equation}\]

As, \(\nabla f(\mathbf{x}^*)=0\) , we have \[\begin{equation} f(\mathbf{x}^* + \beta \mathbf{\delta}) - f(\mathbf{x}^*) = \frac{\beta^2}{2}\mathbf{\delta}^T \mathbf{H} f(\mathbf{x}^*)\mathbf{\delta} + \mathcal{O}(\beta^2) \tag{2.29} \end{equation}\]

Now, dividing Eq. (2.29) by \(\beta^2\), and noting that the left-hand side is nonnegative because \(\mathbf{x}^*\) is a local minimizer, we have, \[\begin{equation} \frac{f(\mathbf{x}^* + \beta \mathbf{\delta}) - f(\mathbf{x}^*)}{\beta^2} = \frac{1}{2}\mathbf{\delta}^T \mathbf{H} f(\mathbf{x}^*)\mathbf{\delta} + \frac{\mathcal{O}(\beta^2)}{\beta^2} \geq 0 \tag{2.30} \end{equation}\] Taking the limit \(\beta \to 0^{+}\) then yields \(\mathbf{\delta}^T \mathbf{H} f(\mathbf{x}^*)\mathbf{\delta} \geq 0\), which proves Eq. (2.24).

## 2.4.3 Second-Order Sufficient Conditions

For a real-valued smooth objective function \(f(\mathbf{x})\), suppose \(\mathbf{H} f(\mathbf{x})\) exists and is continuous in an open neighborhood \(\subset \mathbb{R}^n\) of a point \(\mathbf{x}^*\) along the feasible direction \(\mathbf{\delta}\). Then the conditions: \[\begin{equation} \nabla^T f(\mathbf{x}^*)\mathbf{\delta} = 0 \tag{2.33} \end{equation}\] i.e., \[\begin{equation} \nabla f(\mathbf{x}^*) = 0 \tag{2.34} \end{equation}\] and

\[\begin{equation} \mathbf{\delta}^T \mathbf{H} f(\mathbf{x}^*) \mathbf{\delta} > 0 \tag{2.35} \end{equation}\]

i.e., a positive definite Hessian matrix, together imply that \(\mathbf{x}^*\) is a strong local minimizer of \(f(\mathbf{x})\).

Proof. Let \(\mathfrak{r} >0\) be a radius such that the Hessian of the objective function, \(\mathbf{H} f(\mathbf{x})\), is positive definite \(\forall\ \mathbf{x}\) in the ball \(\mathfrak{B} = \{\mathbf{y} \mid \|\mathbf{y} - \mathbf{x}^*\| \leq \mathfrak{r}\}\). Such a radius exists because \(\mathbf{H} f\) is continuous and positive definite at \(\mathbf{x}^*\). Now, consider a nonzero vector \(\mathbf{\eta}\) such that \(\|\mathbf{\eta}\| < \mathfrak{r}\) and \(c \in [0,1]\), so that \(\mathbf{x}=\mathbf{x}^*+c\mathbf{\eta} \in \mathfrak{B}\). Using the Taylor expansion with the exact second-order remainder, as in Eq. (2.10), we have,

\[\begin{equation} f(\mathbf{x}^* + c\mathbf{\eta}) = f(\mathbf{x}^*) + c\nabla^Tf(\mathbf{x}^*)\mathbf{\eta} + \frac{c^2}{2}\mathbf{\eta}^T\mathbf{H} f(\mathbf{y})\mathbf{\eta} \tag{2.36} \end{equation}\]

where \(\mathbf{y} = \mathbf{x}^* + \beta c\mathbf{\eta}\) for some \(\beta \in [0,1]\). Since \(\nabla f(\mathbf{x}^*) = 0\), we have, \[\begin{equation} f(\mathbf{x}^* + c\mathbf{\eta}) - f(\mathbf{x}^*) = \frac{c^2}{2}\mathbf{\eta}^T\mathbf{H}f(\mathbf{y})\mathbf{\eta} \tag{2.37} \end{equation}\]

As \(\mathbf{y} \in \mathfrak{B}\), the Hessian there is positive definite, so \[\begin{equation} \mathbf{\eta}^T\mathbf{H}f(\mathbf{y})\mathbf{\eta} > 0 \tag{2.38} \end{equation}\]

Hence, for every such step \(c\mathbf{\eta}\) with \(c \neq 0\), \[\begin{equation} f(\mathbf{x}^* + c\mathbf{\eta}) - f(\mathbf{x}^*) > 0 \tag{2.39} \end{equation}\] which shows that \(\mathbf{x}^*\) is a strong local minimizer.

Example 2.2 Let us now work with a new test function called Himmelblau’s function, given by, \[\begin{equation} f(\mathbf{x}) = (x_1^2+x_2-11)^2+(x_1+x_2^2-7)^2 \tag{2.41} \end{equation}\]

We first check whether the candidate point \(\mathbf{x}^*\) is a minimizer by evaluating \(f(\mathbf{x}^*)\).

Next, we calculate the gradient vector and the Hessian matrix of the function at \(\mathbf{x}^*\) and inspect the results.

We see that \(\mathbf{x}^*\) satisfies the second-order sufficient conditions and is a strong local minimizer. We perform these analyses using the autograd package instead of scipy.optimize because there may be cases where the test function we need is not predefined in the scipy.optimize package, unlike the Rosenbrock function.
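The analysis above can be reproduced with autograd as described; the sketch below instead hard-codes the hand-derived gradient and Hessian with NumPy so that it is self-contained. The candidate minimizer \(\mathbf{x}^* = (3, 2)\) is an assumption, chosen from the known minimizers of Himmelblau's function:

```python
import numpy as np

def himmelblau(x):
    return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

def grad(x):
    # Hand-derived gradient of Himmelblau's function
    return np.array([4*x[0]*(x[0]**2 + x[1] - 11) + 2*(x[0] + x[1]**2 - 7),
                     2*(x[0]**2 + x[1] - 11) + 4*x[1]*(x[0] + x[1]**2 - 7)])

def hess(x):
    # Hand-derived Hessian
    return np.array([[12*x[0]**2 + 4*x[1] - 42, 4*x[0] + 4*x[1]],
                     [4*x[0] + 4*x[1], 4*x[0] + 12*x[1]**2 - 26]])

x_star = np.array([3.0, 2.0])  # one known local minimizer (assumed candidate)

print(himmelblau(x_star))                 # 0.0
print(grad(x_star))                       # [0. 0.]
print(np.linalg.eigvalsh(hess(x_star)))   # both eigenvalues positive
```

The gradient vanishes and the Hessian eigenvalues are positive, so the second-order sufficient conditions hold at this point.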

## 2.5 Algorithms for Solving Unconstrained Minimization Tasks

An optimization algorithm for solving an unconstrained minimization problem requires an initial point \(\mathbf{x}_0\) to start with. The choice of \(\mathbf{x}_0\) depends either on the practitioner, who may have some idea about the data and the task at hand, or it can be set by the algorithm itself. Starting with \(\mathbf{x}_0\), the optimization algorithm iterates through a sequence of successive points \(\{\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\}\), stopping at the point where all the termination conditions for approximating the minimizer \(\mathbf{x}^*\) are met. The algorithm generates this sequence taking into consideration the value of the objective function at the current point, \(f(\mathbf{x}_n)\). A new iterate \(\mathbf{x}_{n+1}\) is added to the sequence if the condition \(f(\mathbf{x}_{n+1}) < f(\mathbf{x}_n)\) is satisfied. Although in many special cases the algorithm might fail to find such a point in each and every step, it must ensure that after some stipulated number \(k\) of steps, the following condition is met: \[f(\mathbf{x}_{n+k}) < f(\mathbf{x}_n)\] One important terminating condition, for a smooth objective function, is to check whether the first-order necessary condition holds to sufficient accuracy, i.e., \(\|\nabla f(\mathbf{x}_n)\| < \epsilon\), where \(\epsilon\) is a small tolerance value. We will discuss these conditions further in the subsequent chapters.

Fundamentally, there are two approaches available to generate \(\mathbf{x}_{n+1}\) from \(\mathbf{x}_n\):

Line Search Descent Method: Using this method, the optimization algorithm first picks a direction \(\mathbf{\delta}_n\) for the \(n^{th}\) step and searches along this direction from the previously generated iterate \(\mathbf{x}_{n-1}\) for a new iterate \(\mathbf{x}_n\) such that the condition \(f(\mathbf{x}_n) < f(\mathbf{x}_{n-1})\) is satisfied. A direction \(\mathbf{\delta}_n\) is selected for the next iterate if the following condition is satisfied:

\[\begin{equation} \nabla^T f(\mathbf{x}_{n-1})\mathbf{\delta}_n < 0 \tag{2.42} \end{equation}\]

i.e., if the directional derivative in the direction \(\mathbf{\delta}_n\) is negative. Here \(f\) is the objective function. The algorithm then needs to ascertain the distance it has to move along the direction \(\mathbf{\delta}_n\) to locate \(\mathbf{x}_n\). This distance \(\beta >0\), called the step length, can be found by solving the one-dimensional minimization problem formulated as:

\[\begin{equation} \underset{\beta > 0}{\min} \tilde{f}(\beta) = \underset{\beta > 0}{\min} f(\mathbf{x}_{n-1} + \beta \mathbf{\delta}_n) \tag{2.43} \end{equation}\]
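A minimal sketch of a complete line search descent loop, using the steepest descent direction \(\mathbf{\delta}_n = -\nabla f(\mathbf{x}_{n-1})\) and a simple backtracking rule to pick the step length \(\beta\); the quadratic test function, starting point, and backtracking constants are illustrative assumptions:

```python
import numpy as np

def f(x):
    # Illustrative convex quadratic objective (an assumption for the sketch)
    return x[0] ** 2 + 10 * x[1] ** 2

def grad_f(x):
    # Hand-derived gradient of f
    return np.array([2 * x[0], 20 * x[1]])

x = np.array([5.0, 2.0])  # initial point x_0

for _ in range(500):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:       # first-order termination test
        break
    delta = -g                          # descent direction: grad^T delta < 0
    beta = 1.0                          # trial step length
    # Backtracking: halve beta until a sufficient decrease is obtained
    while f(x + beta * delta) > f(x) + 1e-4 * beta * (g @ delta):
        beta *= 0.5
    x = x + beta * delta

print(x)  # approximately [0, 0]
```

Each iteration guarantees \(f(\mathbf{x}_n) < f(\mathbf{x}_{n-1})\), and the loop terminates when the first-order condition holds to the chosen tolerance.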

Line Search Descent Method

Trust Region Method: Using this method, the optimization algorithm develops a model function [refer to Nocedal & Wright], \(M_n\), whose behavior inside a boundary set around the current iterate \(\mathbf{x}_n\) matches that of the objective function at that point. The model function is not expected to give a reasonable approximation of the objective function at a point \(\mathbf{x}_t\) far away from \(\mathbf{x}_n\), i.e., not lying inside the boundary defined around \(\mathbf{x}_n\). As a result, the algorithm restricts the search for the minimizer of \(M_n\) to inside the boundary region, called the trust region and denoted by \(\mathcal{T}\), and finds the step \(\mathbf{\zeta}\) by solving the minimization problem formulated by:

\[\begin{equation} \underset{\mathbf{\zeta}}{\min\ } M_n(\mathbf{x}_n+\mathbf{\zeta}),\text{ where }\mathbf{x}_n+\mathbf{\zeta}\in \mathcal{T} \tag{2.44} \end{equation}\] Using this \(\mathbf{\zeta}\), if the decrease in the value of \(f(\mathbf{x}_{n+1})\) from \(f(\mathbf{x}_n)\) is not sufficient, it can be inferred that the selected trust region is unnecessarily large. The algorithm then reduces the size of \(\mathcal{T}\) and re-solves the problem given by Eq. (2.44). Most often, the trust region \(\mathcal{T}\) is a ball of radius \(\mathcal{T_r}>0\) defined by the condition \(\|\mathbf{\zeta}\| \leq \mathcal{T_r}\): a circle in a two-dimensional problem, or a sphere in a three-dimensional one. In special cases, the shape of the trust region might vary. The model function is usually quadratic, given by, \[\begin{equation} M_n(\mathbf{x}_n+\mathbf{\zeta}) = f(\mathbf{x}_n) + \mathbf{\zeta}^T\nabla f(\mathbf{x}_n)+\frac{1}{2}\mathbf{\zeta}^T\mathbf{B} f(\mathbf{x}_n)\mathbf{\zeta} \tag{2.45} \end{equation}\] where \(\mathbf{B}f(\mathbf{x}_n)\) is either the Hessian matrix \(\mathbf{H}f(\mathbf{x}_n)\) or an approximation to it.
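A minimal sketch of one trust region step, approximately minimizing the quadratic model of Eq. (2.45) at the Cauchy point (the minimizer of the model along the steepest descent direction, restricted to the region); the test problem and radius are illustrative assumptions:

```python
import numpy as np

def cauchy_point(g, B, radius):
    # Approximate minimizer of the model
    #   M(zeta) = f_n + g^T zeta + 0.5 * zeta^T B zeta
    # along the direction -g, subject to ||zeta|| <= radius
    gBg = g @ B @ g
    tau = 1.0
    if gBg > 0:
        tau = min(1.0, np.linalg.norm(g) ** 3 / (radius * gBg))
    return -tau * (radius / np.linalg.norm(g)) * g

# Illustrative quadratic objective f(x) = 0.5 x^T B x, with gradient g = B x
B = np.array([[2.0, 0.0],
              [0.0, 10.0]])
x_n = np.array([3.0, 1.0])
g = B @ x_n
radius = 0.5  # trust region radius T_r

zeta = cauchy_point(g, B, radius)
model_change = g @ zeta + 0.5 * zeta @ B @ zeta  # negative: the model decreases
print(zeta, model_change)
```

A full trust region algorithm would compare the actual decrease in \(f\) against this predicted model decrease and enlarge or shrink the radius accordingly.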

Before moving into detailed discussions on line search descent methods and trust region methods in the later chapters, we will first deal with solving equation Eq. (2.43) in the immediate next chapter, which is itself an unconstrained one dimensional minimization problem, where we have to solve for \[\underset{\beta>0}{\min\ }\tilde{f}(\beta)\] and deduce the value of \(\beta^*\) , which is the minimizer for \(\tilde{f}(\beta)\) .

ChemE PhD in Progress

© 2024. MIT License.

## Unconstrained Optimization - Explanation and Examples

In plain terms, optimization is the task of solving for the best option given a goal and some constraints. In today's post, we are going to look into solving convex optimization problems without constraints. In more mathematical terms, we would read the line below as "minimize f with respect to x, such that x is in $\mathcal{X}$," where $\mathcal{X}$ may be $\mathbb{R}$ (the real numbers) or $\mathbb{R}^{n}$ (vectors), or we may even let $x$ be a function itself.

$$\min_{x \in \mathcal{X}} f(x)$$

Unconstrained optimization crops up in many fields under different names, and understanding how to solve these types of problems is incredibly powerful. Some examples include regressing a linear function onto the data set. Here we are minimizing the sum of squared error.

Or, more generally, we could regress any arbitrary function to a dataset in this manner. Here is an example of regressing a $k^{th}$ order polynomial to a dataset.
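Such a polynomial fit can be sketched with NumPy's least-squares routine; the synthetic dataset and the choice $k = 3$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-1, 1, 50)
y = 2 * t**3 - t + 0.1 * rng.standard_normal(t.size)  # noisy cubic data

k = 3  # polynomial order
coeffs = np.polyfit(t, y, k)  # solves the least-squares (min-norm) problem
print(coeffs)  # close to [2, 0, -1, 0]
```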

These types of optimization problems are typically known as min-norm problems.

In a more complicated example, say we had an objective function (our goal) that took in a function, such as finding the path between two points that minimizes the length you need to travel. Here $x$ belongs to $\mathcal{L}_2$, the space of all square integrable functions. Clearly, the solution is a line between the two points, but the solution process is not intuitive.

## Optimality Conditions - What makes a solution

How do we know we have a minimum of $f(x)$? We need to use necessary and sufficient conditions to guide us on this path.

- Necessary - $A\implies B$, for example, "If I square an integer then it is positive." While all squared integers are positive, positivity alone is not enough (sufficient) to prove that a given positive integer is a perfect square.
- Sufficient - $A\impliedby B$, to continue the example “If the square root of my number is a positive integer then my number is a perfect square”. This property is sufficient to prove that my number is in fact a perfect square.
- If and only if - $A\iff B$, if it is both necessary and sufficient. A implies B, and B implies A.

Here I will list the necessary and the sufficient conditions for unconstrained optimization (of functions with continuous first and second derivatives).

Necessary - $f(x^ * )\leq f(x), x\in\mathcal{X} \implies \nabla f(x^ * ) = 0, \nabla^2f(x^ * )\succcurlyeq 0$ (being a minimizer implies these conditions, not the other way around)

Sufficient - $\nabla f(x^ * ) = 0, \nabla^2f(x^ * )\succ 0 \implies f(x^ * )< f(x), x\in\mathcal{X}$ (these conditions guarantee a strict local minimizer, but a minimizer need not satisfy them, so this is a one-way implication)

I will do a series of worked-out examples to show the application of these conditions.

## Worked out example of finding a minimum of a 1D Function

For the first example, let's start easy with $f(x) = x^2$.

If we apply the sufficiency conditions, we have $\nabla f(x) = 2x = 0$ and $\nabla^2 f(x) = 2 > 0$. Solving the first equation gives $x^ * = 0$; then, checking $\nabla^2 f(x^ * ) > 0$, we can state that $x^ * = 0$ is the global optimum (since this function is convex, local optimality $\implies$ global optimality).

And how we would solve it in python 3.7 using SymPy.
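The original listing is not shown here; a minimal SymPy sketch for $f(x) = x^2$ might look like this:

```python
import sympy as sp

x = sp.Symbol('x', real=True)
f = x**2

grad = sp.diff(f, x)                      # 2*x
stationary = sp.solve(sp.Eq(grad, 0), x)  # [0]
second = sp.diff(f, x, 2)                 # 2 > 0, so x* = 0 is a minimum

print(stationary, second)
```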

## Worked out example of finding a minimum of a Vector Function

Now for something more challenging. Here we have a function that takes in a vector $x$ and returns the sum of the vector components’ squares.

We need to apply the conditions.

\(\nabla f(x)_i = 2x_i = 0 \rightarrow x_i = 0\) and \(\nabla^2 f(x) = 2I_n \succ 0\)

So $x^* $ is the zero vector

And how we would solve it in python 3.7 using SymPy for the case n=3.
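Again the original listing is not shown; a minimal SymPy sketch for $f(x) = \sum_{i=1}^{3} x_i^2$:

```python
import sympy as sp

xs = sp.symbols('x1 x2 x3', real=True)
f = sum(xi**2 for xi in xs)

grad = [sp.diff(f, xi) for xi in xs]  # [2*x1, 2*x2, 2*x3]
sol = sp.solve(grad, xs)              # {x1: 0, x2: 0, x3: 0}
H = sp.hessian(f, xs)                 # 2*I, positive definite

print(sol, H.is_positive_definite)
```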

## Worked out example of finding a minimum of a Vector Function (Part 2)

What if we have something more complicated? Here we assume that $Q$ is positive definite and that the dimensions of the vectors and matrices are consistent.

Again we apply the conditions.

Here we see the first increase in computational difficulty. We have to solve a system of linear equations which can be nontrivial at large $n$. And by the fact that $Q$ is positive definite, we automatically pass the second condition.

And how we would solve it in python 3.7 using NumPy.
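The original listing is not shown; the sketch below assumes the quadratic has the standard form $f(x) = \frac{1}{2}x^TQx - c^Tx$, whose first-order condition is the linear system $Qx = c$ (the particular $Q$ and $c$ are illustrative):

```python
import numpy as np

# Illustrative positive definite Q and vector c (assumptions for the sketch)
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, 2.0])

# First-order condition for f(x) = 0.5 x^T Q x - c^T x is the system Q x = c
x_star = np.linalg.solve(Q, c)

# Q positive definite means the second-order condition holds automatically
print(x_star, np.all(np.linalg.eigvalsh(Q) > 0))
```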

## Worked out example of Linear Regression

Here we look at regressing the slope and intercept of a linear model onto some data $y$.

Remember that we aren’t taking the partial derivative with respect to $x$ but with respect to $\alpha_0$ and $\alpha_1$, because we are trying to find the optimal parameters for this regression.

With a little rearranging, we can see that this is the same as solving a $2\times2$ linear system.

And the solution is nearly identical in form to the previous example; we can see that $\vec{\alpha} = Q^{-1}c$.
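A sketch of the normal-equations solve with NumPy; the synthetic data (true intercept 1.5, slope 0.7) are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 40)
y = 1.5 + 0.7 * t + 0.2 * rng.standard_normal(t.size)  # data near y = 1.5 + 0.7 t

# Build the 2x2 normal equations Q alpha = c from the least-squares conditions
A = np.column_stack([np.ones_like(t), t])  # design matrix with columns [1, t]
Q = A.T @ A
c = A.T @ y
alpha = np.linalg.solve(Q, c)  # [intercept, slope]

print(alpha)  # close to [1.5, 0.7]
```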

## Worked out example of Shortest Path

We will solve for the shortest path between 2 points by utilizing some aspects of the calculus of variations. By the same argument as $\nabla f(x) = 0$ for a function, we can derive the celebrated Euler-Lagrange equation for a functional.
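Written out explicitly (a reconstruction of the omitted formulas, assuming the functional is the arc length of a path $x(t)$ between fixed endpoints), the functional and the Euler-Lagrange equation are:

\[ J[x] = \int_{t_0}^{t_1} \sqrt{1 + x'(t)^2}\, dt, \qquad \frac{\partial L}{\partial x} - \frac{d}{dt}\frac{\partial L}{\partial x'} = 0 \]

with $L(t, x, x') = \sqrt{1 + x'(t)^2}$, so that

\[ \frac{\partial L}{\partial x} = 0, \qquad \frac{\partial L}{\partial x'} = \frac{x'(t)}{\sqrt{1 + x'(t)^2}} \]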

Here, to get the solution, we need to solve a differential equation instead of a system of linear equations.

Substituting these partial derivatives back into the Euler-Lagrange equation gives $\frac{d}{dt}\left(\frac{x'(t)}{\sqrt{1+x'(t)^2}}\right) = 0$, so $\frac{x'(t)}{\sqrt{1+x'(t)^2}} = c$ for some constant $c$.

With a small amount of rearrangement, we see that $x’(t)=\frac{c}{\sqrt{1-c^2}}$, in other words, the function we are looking for is a straight line. I will leave the python code up to the reader :)




## Unconstrained optimization


A problem devoid of constraints is, well, an unconstrained optimization problem. Much of modern machine learning and deep learning depends on formulating and solving an unconstrained optimization problem, by incorporating constraints as additional elements of the loss with suitable penalties.

In this article, we will give a broad intuition on solving unconstrained optimization problems.

## Prerequisites

To understand this article, we recommend familiarity with the concepts in

- Mathematical notation
- Functions in mathematics
- Derivatives
- Higher-order derivatives
- Taylor's Theorem for function approximation
- Identifying minima, maxima, and saddle points

Follow the above links to first get acquainted with the corresponding concepts.

## Problem statement

Unconstrained optimization problems are easy to express as they are devoid of constraints.

In this article, we will consider an objective function \( f : \mathbb{R}^n \to \mathbb{R} \) that we wish to minimize. We express this as,

\[ \min_{\mathbf{x}} \; f(\mathbf{x}) \]

If we are interested in the value of \( \mathbf{x} \) that achieves the minimum, we shall write this as

\[ \mathbf{x}^* = \operatorname*{arg\,min}_{\mathbf{x}} \; f(\mathbf{x}) \]

In machine learning, we are usually interested in this latter formulation. In that context, \( \mathbf{x}^* \) denotes the parameters of the model being fit to the training data, and \( f(\mathbf{x}) \) denotes the loss that we wish to minimize, as a function of the parameters.

## Calculus recap: Maxima and minima

Here are some essential facts about function maxima, minima, and saddle points that we already know from calculus .

- At the stationary points, the gradient of a function is zero.
- At a local minimum, the Hessian is positive semidefinite; if it is positive definite, the point is a strict local minimum.
- At a local maximum, the Hessian is negative semidefinite; if it is negative definite, the point is a strict local maximum.
- At other stationary points, such as saddle points, the Hessian is indefinite.

Thus, to find a local minimum, we need to find points where the gradient is zero and the Hessian is positive definite .

If we can enumerate all such local minimizers, then the point with the lowest function value among them is the global minimizer.

## Conditions for optimality

The conditions for a point \( \star{\vx} \) to be a local minimizer can be mathematically expressed as

- \( f(\star{\vx}) \le f(\vx)\) , for all \(\vx \in \mathcal{N}_{\star{\vx}} \), where \( \mathcal{N}_{\star{\vx}} \) is the immediate neighborhood of \( \star{\vx} \).
- \( \nabla f(\star{\vx} ) = 0 \)
- \( \nabla^2 f(\star{\vx}) \) is positive definite. In other words, \( \vy^T \nabla^2 f(\star{\vx}) \vy > 0 \), for all nonzero \( \vy \in \real^n \).

The first condition above is just the definition of a local minimizer in a neighborhood.

Condition 2 above is known as the first-order necessary condition for a local minimizer, as it involves only first-order derivatives. Any minimizer is a stationary point and must therefore satisfy this condition; hence it is deemed necessary. But it is not a sufficient condition, since it merely identifies a stationary point. Such a point could be a minimizer, a maximizer, or a saddle point.

Conditions 2 and 3 above are together known as the second-order sufficient conditions for a local minimizer, as they also involve second-order derivatives. Any point satisfying both conditions is guaranteed to be a local minimizer.
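To make these conditions concrete, here is a minimal Python check; the function \( f(x, y) = x^2 + xy + y^2 \) and the candidate point are our own illustrative choices.

```python
def grad(x, y):
    # Gradient of f(x, y) = x**2 + x*y + y**2
    return (2 * x + y, x + 2 * y)

def hessian(x, y):
    # Hessian of f is constant: [[2, 1], [1, 2]]
    return ((2, 1), (1, 2))

# Candidate point: the origin.
gx, gy = grad(0.0, 0.0)
H = hessian(0.0, 0.0)

# First-order necessary condition: the gradient is zero.
first_order = (gx == 0.0 and gy == 0.0)

# Second-order sufficient condition: the Hessian is positive definite.
# For a symmetric 2x2 matrix, check the leading principal minors.
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
second_order = (H[0][0] > 0 and det > 0)

print(first_order and second_order)  # both conditions hold at (0, 0)
```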

## Need for an iterative process

Equipped with this knowledge about gradients and Hessians, we can solve an unconstrained optimization problem as follows:

- Identify points where the gradient is zero
- Filter out the ones where the Hessian is not positive definite
- Among the remaining points, find the point with the least function value. That is the solution.
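The three steps above can be carried out by hand for a simple function. A sketch in Python, using \( f(x) = x^4 - 2x^2 \) as an illustrative choice of our own:

```python
# f(x) = x**4 - 2*x**2 has f'(x) = 4*x**3 - 4*x = 4*x*(x - 1)*(x + 1),
# so the stationary points can be read off analytically.
def f(x):
    return x**4 - 2 * x**2

def f2(x):                      # second derivative f''(x) = 12*x**2 - 4
    return 12 * x**2 - 4

stationary = [-1.0, 0.0, 1.0]   # roots of f'(x) = 0

# Filter out points where the (1x1) Hessian f''(x) is not positive definite.
minima = [x for x in stationary if f2(x) > 0]

# Among the remaining points, pick the least function value.
best = min(minima, key=f)
print(best, f(best))
```

Both \( x = -1 \) and \( x = 1 \) attain the least value \( f = -1 \), so either is a global minimizer.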

Practical optimization problems seldom lend themselves to straightforward solutions that can be identified with such steps. There are many practical challenges:

- The function being optimized may not be differentiable!
- It may be infeasible to compute the gradient or the Hessian
- Even when they can be computed, it may be infeasible to solve the resulting equations directly to identify the extrema.

Thus, optimization approaches typically utilize an intuitive iterative approach:

Start from an arbitrary point and keep moving towards a point with lower function value, until no further reduction in function value is possible.

## Choosing the next point

Optimization is an iterative process, progressing through points with lower function values, until convergence. The key question is

From a given point, how do we identify a point with a lower function value?

It all depends on being able to model the function value at any new point \( f(\vx_{i+1}) \) based on the function value at the current point \( f(\vx_i) \). This is the realm of Taylor's Theorem, a powerful tool for function approximation that we introduced in our module on calculus.

Taylor's Theorem is by far the most important concept for understanding the theory and reasoning behind optimization algorithms, as you will soon realize.

## Taylor's Theorem and optimization

By Taylor's Theorem, for a twice differentiable function \( f(\vx) \), the value of the function at a point \(\va+\vp \) can be approximated in terms of the function value at a point \( \va \) as follows.

\begin{equation} f(\va+\vp) \approx f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, \( \nabla_{\va} \) and \( \mH_{\va} \) denote the gradient and Hessian of the function, evaluated at the point \( \va \).

To choose the next point \( \va + \vp \), we need to discover a suitable value of \( \vp \). Since \( \va \) is already known (the current point), this is a function in \( \vp \). At its local minimizer, \( \star{\vp} \), the gradient with respect to \( \vp \) will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of \( \vp \) to choose our next point \( \va + \vp \).

\begin{aligned} \frac{\partial}{\partial \vp} \left( f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \right) = \nabla_{\va} + \mH_{\va} \vp = 0 \end{aligned}

Methods for unconstrained optimization differ in ways of arriving at a suitable \( \vp \) from the above equation.

- Newton's method calculates \( \vp \) by solving the above equation in its entirety.
- Quasi-Newton methods, such as the BFGS method, approximate the second-order term. This may be necessary when the Hessian of the function is computationally infeasible to obtain.
- Gradient descent ignores the second-order term and works only with the gradient term, hence the name.
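To see why solving the full equation pays off, here is a sketch (the quadratic and starting point are our own choices) showing that a single Newton step, obtained by solving \( H \vp = -g \), lands exactly on the minimizer of a quadratic, where the Taylor model is exact:

```python
# For a quadratic f(x) = 0.5 * x^T A x - b^T x the second-order Taylor model
# is exact, so the Newton step reaches the minimizer in one iteration.
A = ((3.0, 1.0), (1.0, 2.0))        # Hessian (symmetric positive definite)
b = (1.0, 1.0)

def grad(x):
    return (A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1])

def solve2(M, rhs):
    # Solve a 2x2 linear system M y = rhs by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return ((rhs[0] * M[1][1] - rhs[1] * M[0][1]) / det,
            (rhs[1] * M[0][0] - rhs[0] * M[1][0]) / det)

x = (5.0, -7.0)                     # arbitrary starting point
g = grad(x)
p = solve2(A, (-g[0], -g[1]))       # Newton step: solve H p = -g
x = (x[0] + p[0], x[1] + p[1])      # gradient vanishes after one step
```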

These methods are further adapted for scalability with extensions such as inexact Newton's method , L-BFGS , and stochastic gradient descent (SGD) .

We know from the Taylor's Theorem demo that including more terms in the expansion leads to a better function approximation. Therefore, gradient descent usually converges more slowly than Newton's method. However, we often deal with complicated objective functions whose Hessians are hard to compute. Hence, gradient descent, particularly SGD, is the weapon of choice in modern machine learning and deep learning.

Follow the links in this section to study each of these approaches in detail and build intuition through interactive problems.

## Gradient descent demo: \( \min x^2 \)

Let's see gradient descent in action on a simple univariate function \( f(x) = x^2 \), where \( x \in \real \).

Note that the function has a global minimum at \( x = 0 \). The goal of the gradient descent method is to discover this point of least function value, starting at any arbitrary point.

In the accompanying demo, changing the starting point shows the trajectory of the gradient descent method towards the global minimum. The learning rate also has a marked effect on the convergence rate.
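The same experiment can be reproduced in a few lines of Python (the starting point and learning rates are our own illustrative choices):

```python
def gd(x0, lr, steps):
    # Gradient descent on f(x) = x**2 with derivative f'(x) = 2*x.
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

# Each update scales the iterate: x_{k+1} = (1 - 2*lr) * x_k.
x_small = gd(5.0, lr=0.1, steps=100)    # converges smoothly toward 0
x_large = gd(5.0, lr=0.9, steps=100)    # overshoots each step, still converges
```

With `lr=0.1` the iterates shrink monotonically; with `lr=0.9` they flip sign every step but the magnitude still decays, since \( |1 - 2 \cdot 0.9| < 1 \).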

To know more about gradient descent and learning rate, head on to our comprehensive articles on these topics in optimization .

## Gradient descent demo: Himmelblau's function

As another example, let's apply gradient descent to a multivariate function with multiple stationary points .

We introduced Himmelblau's function in our article on multivariate functions in calculus . It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

It turns out that Himmelblau's function, in spite of its bumps, is actually quite easy for gradient descent.
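A minimal sketch of this experiment, with the analytic gradient of Himmelblau's function and an arbitrarily chosen start at the origin:

```python
def f(x, y):
    # Himmelblau's function
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def grad(x, y):
    # Analytic partial derivatives of Himmelblau's function
    gx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    gy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return gx, gy

x, y, lr = 0.0, 0.0, 0.01        # start at the origin
for _ in range(5000):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
```

From the origin, the trajectory descends into the basin of the minimum near \( (3, 2) \), where the function value is zero.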

## Large scale optimization

Newton's method, quasi-Newton methods, and the gradient descent method all have their large-scale counterparts.

Inexact Newton methods work by directly approximating the product \( \inv{\mH} \nabla \) instead of computing the factors separately. This brings significant savings in memory, since the Hessian is never stored explicitly and the product is just a vector.

L-BFGS, the limited-memory version of the quasi-Newton BFGS method, works on a similar principle. In addition to working with an approximated Hessian inverse, as BFGS does, it stores only a short history of past updates instead of a full dense matrix, thereby significantly reducing the memory footprint and the computations involved.

Stochastic gradient descent (SGD) scales up the gradient descent approach by decomposing the large-scale optimization problem into mini-problems that are optimized in sequence to arrive at the final solution. This is one of the most general and easily implemented strategies for optimization. SGD and its variants are the weapon of choice in machine learning and deep learning frameworks such as Theano, PyTorch, and TensorFlow. We will investigate SGD and its variants in more detail in the next few sections.
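As a sketch of the mini-problem idea, here is SGD on a toy least-squares fit; the data and hyperparameters are our own illustrative choices.

```python
import random

# Toy noise-free data from y = 2*x + 1 (target parameters chosen for illustration).
data = [(x / 10, 2 * (x / 10) + 1) for x in range(40)]

w, b, lr = 0.0, 0.0, 0.05
random.seed(0)
for epoch in range(300):
    random.shuffle(data)            # visit samples in random order each epoch
    for x, y in data:               # one sample per parameter update
        err = (w * x + b) - y       # residual; gradient of 0.5*err**2
        w -= lr * err * x
        b -= lr * err
```

Each update touches a single sample, so memory and per-step cost stay constant regardless of the dataset size; the parameters still converge to the true values \( w = 2, b = 1 \).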

## Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject .

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus . Or start with strong mathematical foundations .

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning .

## Unconstrained Multivariate Optimization

For contrast, a constrained optimization problem takes the form

\[ \min_{x} f(x), \quad \text{subject to } a < x < b \]

where \( f(x) \) is the objective function, \( x \) is the decision variable, and \( a < x < b \) is the constraint. An unconstrained problem simply drops the constraint.

For a multivariate function \( z = f(x_1, x_2, x_3, \ldots, x_n) \), the unconstrained problem is

\[ \min_{\bar{x}} f(\bar{x}), \quad \bar{x} \in \mathbb{R}^n. \]

The necessary and sufficient conditions for \( \bar{x}^* \) to be the minimizer of the function \( f(\bar{x}) \) are:

- First-order necessary condition: \( \nabla f(\bar{x}^*) = 0 \)
- Second-order sufficiency condition: \( \nabla^2 f(\bar{x}^*) \) (the Hessian) is positive definite.

## Numerical Example
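The original worked example is not reproduced here. As an illustrative stand-in, here is Newton's method applied to a one-dimensional function of our own choosing, \( f(x) = x^4 - 3x^2 \), whose positive minimizer is \( x^* = \sqrt{3/2} \):

```python
# Newton's method in one dimension: x <- x - f'(x) / f''(x).
def fp(x):   # first derivative of f(x) = x**4 - 3*x**2
    return 4 * x**3 - 6 * x

def fpp(x):  # second derivative
    return 12 * x**2 - 6

x = 2.0      # starting point
for _ in range(10):
    x -= fp(x) / fpp(x)

print(x)     # close to sqrt(1.5)
```

At the limit point, \( f'(x^*) = 0 \) and \( f''(x^*) = 12 \cdot 1.5 - 6 > 0 \), so both the first-order and second-order conditions hold.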


## Unconstrained Nonlinear Optimization Algorithms

Unconstrained optimization definition.

Unconstrained minimization is the problem of finding a vector x that is a local minimum to a scalar function f ( x ):

\[ \min_{x} f(x) \]

The term unconstrained means that no restriction is placed on the range of x .

## fminunc trust-region Algorithm

Trust-region methods for nonlinear minimization.

Many of the methods used in Optimization Toolbox™ solvers are based on trust regions, a simple yet powerful concept in optimization.

To understand the trust-region approach to optimization, consider the unconstrained minimization problem, minimize f ( x ), where the function takes vector arguments and returns scalars. Suppose you are at a point x in n -space and you want to improve, i.e., move to a point with a lower function value. The basic idea is to approximate f with a simpler function q , which reasonably reflects the behavior of function f in a neighborhood N around the point x . This neighborhood is the trust region. A trial step s is computed by minimizing (or approximately minimizing) q over N . This is the trust-region subproblem,

The current point is updated to be x + s if f ( x + s ) < f ( x ) ; otherwise, the current point remains unchanged and N , the region of trust, is shrunk and the trial step computation is repeated.

The key questions in defining a specific trust-region approach to minimizing f ( x ) are how to choose and compute the approximation q (defined at the current point x ), how to choose and modify the trust region N , and how accurately to solve the trust-region subproblem. This section focuses on the unconstrained problem. Later sections discuss additional complications due to the presence of constraints on the variables.

In the standard trust-region method ( [48] ), the quadratic approximation q is defined by the first two terms of the Taylor approximation to f at x ; the neighborhood N is usually spherical or ellipsoidal in shape. Mathematically the trust-region subproblem is typically stated

\[ \min_{s} \left\{ \tfrac{1}{2} s^T H s + s^T g \ \text{ such that } \ \lVert D s \rVert \le \Delta \right\}, \tag{2} \]

where g is the gradient of f at the current point x , H is the Hessian matrix (the symmetric matrix of second derivatives), D is a diagonal scaling matrix, Δ is a positive scalar, and ‖ . ‖ is the 2-norm. Good algorithms exist for solving Equation 2 (see [48] ); such algorithms typically involve the computation of all eigenvalues of H and a Newton process applied to the secular equation

\[ \frac{1}{\Delta} - \frac{1}{\lVert s \rVert} = 0. \]

Such algorithms provide an accurate solution to Equation 2 . However, they require time proportional to several factorizations of H . Therefore, for large-scale problems a different approach is needed. Several approximation and heuristic strategies, based on Equation 2 , have been proposed in the literature ( [42] and [50] ). The approximation approach followed in Optimization Toolbox solvers is to restrict the trust-region subproblem to a two-dimensional subspace S ( [39] and [42] ). Once the subspace S has been computed, the work to solve Equation 2 is trivial even if full eigenvalue/eigenvector information is needed (since in the subspace, the problem is only two-dimensional). The dominant work has now shifted to the determination of the subspace.

The two-dimensional subspace S is determined with the aid of a preconditioned conjugate gradient process described below. The solver defines S as the linear space spanned by s 1 and s 2 , where s 1 is in the direction of the gradient g , and s 2 is either an approximate Newton direction, i.e., a solution to

\[ H \cdot s_2 = -g, \]

or a direction of negative curvature,

\[ s_2^T \cdot H \cdot s_2 < 0. \]

The philosophy behind this choice of S is to force global convergence (via the steepest descent direction or negative curvature direction) and achieve fast local convergence (via the Newton step, when it exists).

A sketch of unconstrained minimization using trust-region ideas is now easy to give:

1. Formulate the two-dimensional trust-region subproblem.
2. Solve Equation 2 to determine the trial step s .
3. If f ( x + s ) < f ( x ) , then x = x + s .
4. Adjust Δ.

These four steps are repeated until convergence. The trust-region dimension Δ is adjusted according to standard rules. In particular, it is decreased if the trial step is not accepted, i.e., f ( x + s ) ≥ f ( x ) . See [46] and [49] for a discussion of this aspect.
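The loop above can be sketched in Python using the Cauchy point, a simplification that minimizes the model only along the steepest descent direction rather than over the two-dimensional subspace; the test function and update constants are our own illustrative choices.

```python
import numpy as np

def f(x):
    return x[0]**2 + 10 * x[1]**2          # simple ill-conditioned quadratic

def grad(x):
    return np.array([2 * x[0], 20 * x[1]])

def hess(x):
    return np.diag([2.0, 20.0])

x = np.array([2.0, 1.0])
delta = 1.0                                 # trust-region radius
for _ in range(200):
    g, H = grad(x), hess(x)
    gnorm = np.linalg.norm(g)
    if gnorm < 1e-10:
        break
    # Cauchy point: minimize the quadratic model along -g within the region.
    gHg = g @ H @ g
    tau = min(1.0, gnorm**3 / (delta * gHg)) if gHg > 0 else 1.0
    s = -tau * (delta / gnorm) * g
    predicted = -(g @ s + 0.5 * s @ H @ s)  # model reduction (positive)
    actual = f(x) - f(x + s)
    rho = actual / predicted                # agreement between model and f
    if rho > 0.1:                           # accept the trial step
        x = x + s
    if rho < 0.25:                          # poor model: shrink the region
        delta *= 0.5
    elif rho > 0.75 and np.isclose(np.linalg.norm(s), delta):
        delta *= 2.0                        # good model at boundary: grow it
print(x)
```

For this quadratic the model is exact, so every step is accepted and the iterates converge to the origin.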

Optimization Toolbox solvers treat a few important special cases of f with specialized functions: nonlinear least-squares, quadratic functions, and linear least-squares. However, the underlying algorithmic ideas are the same as for the general case. These special cases are discussed in later sections.

## Preconditioned Conjugate Gradient Method

A popular way to solve large, symmetric, positive definite systems of linear equations Hp = – g is the method of Preconditioned Conjugate Gradients (PCG). This iterative approach requires the ability to calculate matrix-vector products of the form H·v where v is an arbitrary vector. The symmetric positive definite matrix M is a preconditioner for H . That is, M = C 2 , where C –1 HC –1 is a well-conditioned matrix or a matrix with clustered eigenvalues.

In a minimization context, you can assume that the Hessian matrix H is symmetric. However, H is guaranteed to be positive definite only in the neighborhood of a strong minimizer. Algorithm PCG exits when it encounters a direction of negative (or zero) curvature, that is, d T Hd ≤ 0 . The PCG output direction p is either a direction of negative curvature or an approximate solution to the Newton system Hp = – g . In either case, p helps to define the two-dimensional subspace used in the trust-region approach discussed in Trust-Region Methods for Nonlinear Minimization .
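A minimal PCG sketch with a Jacobi (diagonal) preconditioner and the negative-curvature exit described above; the matrix, right-hand side, and preconditioner choice are assumptions for illustration.

```python
import numpy as np

def pcg(H, g, tol=1e-10, max_iter=100):
    """Approximately solve H p = -g with Jacobi-preconditioned CG.

    Exits early with the current search direction if negative (or zero)
    curvature is encountered, i.e., d^T H d <= 0."""
    p = np.zeros(len(g))
    r = -g.copy()                  # residual of H p = -g at p = 0
    M = np.diag(H).copy()          # Jacobi preconditioner M = diag(H)
    z = r / M
    d = z.copy()
    for _ in range(max_iter):
        Hd = H @ d
        curvature = d @ Hd
        if curvature <= 0:
            return d               # direction of negative (or zero) curvature
        alpha = (r @ z) / curvature
        p += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        z_new = r_new / M
        beta = (r_new @ z_new) / (r @ z)
        d = z_new + beta * d
        r, z = r_new, z_new
    return p

H = np.array([[4.0, 1.0], [1.0, 3.0]])     # symmetric positive definite
g = np.array([1.0, 2.0])
p = pcg(H, g)                               # approximate Newton direction
```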

## fminunc quasi-newton Algorithm

Basics of unconstrained optimization.

Although a wide spectrum of methods exists for unconstrained optimization, methods can be broadly categorized in terms of the derivative information that is, or is not, used. Search methods that use only function evaluations (e.g., the simplex search of Nelder and Mead [30] ) are most suitable for problems that are not smooth or have a number of discontinuities. Gradient methods are generally more efficient when the function to be minimized is continuous in its first derivative. Higher order methods, such as Newton's method, are only really suitable when the second-order information is readily and easily calculated, because calculation of second-order information, using numerical differentiation, is computationally expensive.

Gradient methods use information about the slope of the function to dictate a direction of search where the minimum is thought to lie. The simplest of these is the method of steepest descent in which a search is performed in a direction, –∇ f ( x ) , where ∇ f ( x ) is the gradient of the objective function. This method is very inefficient when the function to be minimized has long narrow valleys as, for example, is the case for Rosenbrock's function

\[ f(x) = 100 \left( x_2 - x_1^2 \right)^2 + \left( 1 - x_1 \right)^2. \]

The minimum of this function is at x = [1,1] , where f ( x ) = 0 . A contour map of this function is shown in the figure below, along with the solution path to the minimum for a steepest descent implementation starting at the point [-1.9,2]. The optimization was terminated after 1000 iterations, still a considerable distance from the minimum. The black areas are where the method is continually zigzagging from one side of the valley to another. Note that toward the center of the plot, a number of larger steps are taken when a point lands exactly at the center of the valley.

Figure 5-1, Steepest Descent Method on Rosenbrock's Function

This function, also known as the banana function, is notorious in unconstrained examples because of the way the curvature bends around the origin. Rosenbrock's function is used throughout this section to illustrate the use of a variety of optimization techniques. The contours have been plotted in exponential increments because of the steepness of the slope surrounding the U-shaped valley.

For a more complete description of this figure, including scripts that generate the iterative points, see Banana Function Minimization .

## Quasi-Newton Methods

The optimal solution point, \( x^* \), can be written as

\[ x^* = x_k - H^{-1} \nabla f(x_k). \]

Newton-type methods (as opposed to quasi-Newton methods) calculate H directly and proceed in a direction of descent to locate the minimum after a number of iterations. Calculating H numerically involves a large amount of computation. Quasi-Newton methods avoid this by using the observed behavior of f ( x ) and ∇ f ( x ) to build up curvature information to make an approximation to H using an appropriate updating technique.

A large number of Hessian updating methods have been developed. However, the formula of Broyden [3] , Fletcher [12] , Goldfarb [20] , and Shanno [37] (BFGS) is thought to be the most effective for use in a general purpose method.

The formula given by BFGS is

\[ H_{k+1} = H_k + \frac{q_k q_k^T}{q_k^T s_k} - \frac{H_k s_k s_k^T H_k^T}{s_k^T H_k s_k}, \]

where

\[ s_k = x_{k+1} - x_k, \qquad q_k = \nabla f(x_{k+1}) - \nabla f(x_k). \]

As a starting point, H 0 can be set to any symmetric positive definite matrix, for example, the identity matrix I . To avoid the inversion of the Hessian H , you can derive an updating method that avoids the direct inversion of H by using a formula that makes an approximation of the inverse Hessian H –1 at each update. A well-known procedure is the DFP formula of Davidon [7] , Fletcher, and Powell [14] . This uses the same formula as the BFGS method ( Equation 9 ) except that q k is substituted for s k .
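The BFGS update can be written as a one-line function. The sketch below (data chosen arbitrarily for illustration) verifies the secant condition \( H_{k+1} s_k = q_k \), which the update satisfies by construction:

```python
import numpy as np

def bfgs_update(H, s, q):
    """One BFGS update of the Hessian approximation H."""
    Hs = H @ s
    return H + np.outer(q, q) / (q @ s) - np.outer(Hs, Hs) / (s @ Hs)

rng = np.random.default_rng(0)
H0 = np.eye(3)                         # symmetric positive definite start
s = rng.standard_normal(3)             # step x_{k+1} - x_k
q = s + 0.1 * rng.standard_normal(3)   # gradient change; q^T s > 0 here
H1 = bfgs_update(H0, s, q)             # satisfies H1 @ s == q (secant condition)
```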

The gradient information is either supplied through analytically calculated gradients, or derived by partial derivatives using a numerical differentiation method via finite differences. This involves perturbing each of the design variables, x , in turn and calculating the rate of change in the objective function.

At each major iteration, k , a line search is performed in the direction

\[ d = -H_k^{-1} \nabla f(x_k). \]

The quasi-Newton method is illustrated by the solution path on Rosenbrock's function in Figure 5-2, BFGS Method on Rosenbrock's Function . The method is able to follow the shape of the valley and converges to the minimum after 140 function evaluations using only finite difference gradients.

Figure 5-2, BFGS Method on Rosenbrock's Function

## Line Search

Line search is a search method that is used as part of a larger optimization algorithm. At each step of the main algorithm, the line-search method searches along the line containing the current point, x k , parallel to the search direction , which is a vector determined by the main algorithm. That is, the method finds the next iterate x k +1 of the form

\[ x_{k+1} = x_k + \alpha^* d_k, \]

where x k denotes the current iterate, d k is the search direction, and α * is a scalar step length parameter.

The line search method attempts to decrease the objective function along the line x k + α * d k by repeatedly minimizing polynomial interpolation models of the objective function. The line search procedure has two main steps:

The bracketing phase determines the range of points on the line x k + 1 = x k + α * d k to be searched. The bracket corresponds to an interval specifying the range of values of α .

The sectioning step divides the bracket into subintervals, on which the minimum of the objective function is approximated by polynomial interpolation.

The resulting step length α satisfies the Wolfe conditions:

\[ f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k, \]

\[ \nabla f(x_k + \alpha_k d_k)^T d_k \ge c_2 \nabla f(x_k)^T d_k, \]

where c 1 and c 2 are constants with 0 < c 1 < c 2 < 1.

The first condition ( Equation 12 ) requires that α k sufficiently decreases the objective function. The second condition ( Equation 13 ) ensures that the step length is not too small. Points that satisfy both conditions ( Equation 12 and Equation 13 ) are called acceptable points .

The line search method is an implementation of the algorithm described in Section 2-6 of [13] . See also [31] for more information about line search.
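As a simplified sketch, here is a backtracking line search that enforces only the sufficient-decrease (Armijo) condition, rather than the full bracketing and sectioning procedure; the example function and constants are our own choices.

```python
def backtracking_line_search(f, grad_dot_d, x, d, alpha=1.0, c1=1e-4, shrink=0.5):
    # Halve alpha until the sufficient-decrease condition holds:
    #   f(x + alpha*d) <= f(x) + c1 * alpha * grad_f(x)^T d
    fx = f(x)
    while f(x + alpha * d) > fx + c1 * alpha * grad_dot_d:
        alpha *= shrink
    return alpha

# Example: f(x) = x**2 at x = 2, stepping along d = -f'(x) = -4.
f = lambda x: x * x
alpha = backtracking_line_search(f, grad_dot_d=4 * (-4), x=2.0, d=-4.0)
```

Here the full step `alpha = 1.0` overshoots past the minimum without sufficient decrease, so the search halves the step once and accepts `alpha = 0.5`.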

## Hessian Update

Many of the optimization functions determine the direction of search by updating the Hessian matrix at each iteration, using the BFGS method ( Equation 9 ). The function fminunc also provides an option to use the DFP method given in Quasi-Newton Methods (set HessUpdate to 'dfp' in options to select the DFP method). The Hessian, H , is always maintained to be positive definite so that the direction of search, d , is always in a descent direction. This means that for some arbitrarily small step α in the direction d , the objective function decreases in magnitude. You achieve positive definiteness of H by ensuring that H is initialized to be positive definite and thereafter q k T s k (from Equation 14 ) is always positive. The term q k T s k is a product of the line search step length parameter α k and a combination of the search direction d with past and present gradient evaluations,

\[ q_k^T s_k = \alpha_k \left( \nabla f(x_{k+1})^T d - \nabla f(x_k)^T d \right). \]

You always achieve the condition that q k T s k is positive by performing a sufficiently accurate line search. This is because the search direction, d , is a descent direction, so that α k and the negative gradient –∇ f ( x k ) T d are always positive. Thus, the possible negative term –∇ f ( x k +1 ) T d can be made as small in magnitude as required by increasing the accuracy of the line search.

## LBFGS Hessian Approximation

For large problems, the BFGS Hessian approximation method can be relatively slow and use a large amount of memory. To circumvent these issues, use the LBFGS Hessian approximation by setting the HessianApproximation option to 'lbfgs' . This causes fminunc to use the Low-memory BFGS Hessian approximation, described next. For the benefit of using LBFGS in a large problem, see Solve Nonlinear Problem with Many Variables .

As described in Nocedal and Wright [31] , the Low-memory BFGS Hessian approximation is similar to the BFGS approximation described in Quasi-Newton Methods , but uses a limited amount of memory for previous iterations. The Hessian update formula given in Equation 9 is

\[ H_{k+1} = H_k + \frac{q_k q_k^T}{q_k^T s_k} - \frac{H_k s_k s_k^T H_k^T}{s_k^T H_k s_k}, \]

Another description of the BFGS procedure is

\[ x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k), \]

where \( \alpha_k \) is the step length chosen by line search, and \( H_k \) is an inverse Hessian approximation. The formula for \( H_k \) is:

\[ H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \]

where s k and q k are defined as before, and

\[ \rho_k = \frac{1}{q_k^T s_k}, \qquad V_k = I - \rho_k q_k s_k^T. \]

For the LBFGS algorithm, the algorithm keeps a fixed, finite number m of the parameters s k and q k from the immediately preceding iterations. Starting from an initial H 0 , the algorithm computes an approximate H k for obtaining the step from Equation 15 . The computation of H k ∇ f k proceeds as a recursion from the preceding equations using the most recent m values of ρ j , q j , and s j . For details, see Algorithm 7.4 of Nocedal and Wright [31] .
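The two-loop recursion can be sketched as follows; this is a simplification that omits the line search and history-management logic, and the sanity-check data are our own choice.

```python
import numpy as np

def lbfgs_direction(g, s_hist, q_hist):
    """Two-loop recursion: compute the search direction -H_k * g from the
    m most recent (s_j, q_j) pairs without ever forming a matrix."""
    rhos = [1.0 / (q @ s) for s, q in zip(s_hist, q_hist)]
    alphas = []
    r = g.copy()
    for s, q, rho in reversed(list(zip(s_hist, q_hist, rhos))):
        alpha = rho * (s @ r)
        r -= alpha * q
        alphas.append(alpha)
    # Initial inverse-Hessian scaling H_0 = gamma * I
    gamma = (s_hist[-1] @ q_hist[-1]) / (q_hist[-1] @ q_hist[-1])
    r *= gamma
    for (s, q, rho), alpha in zip(zip(s_hist, q_hist, rhos), reversed(alphas)):
        beta = rho * (q @ r)
        r += (alpha - beta) * s
    return -r                      # quasi-Newton search direction

# Sanity check: with history consistent with the identity Hessian (q_j = s_j),
# the direction reduces to plain steepest descent, -g.
g = np.array([3.0, 4.0])
s_hist = [np.array([1.0, 0.0])]
q_hist = [np.array([1.0, 0.0])]
d = lbfgs_direction(g, s_hist, q_hist)
```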


## 2.7: Constrained Optimization - Lagrange Multipliers

Michael Corral, Schoolcraft College

In Sections 2.5 and 2.6 we were concerned with finding maxima and minima of functions without any constraints on the variables (other than being in the domain of the function). What would we do if there were constraints on the variables? The following example illustrates a simple case of this type of problem.

Example 2.24

For a rectangle whose perimeter is 20 m, find the dimensions that will maximize the area.

The area \(A\) of a rectangle with width \(x\) and height \(y\) is \(A = x y\). The perimeter \(P\) of the rectangle is then given by the formula \(P = 2x+2y\). Since we are given that the perimeter \(P = 20\), this problem can be stated as:

\[\nonumber \begin{align}\text{Maximize : }&f (x, y) = x y \\[4pt] \nonumber \text{given : }&2x+2y = 20 \end{align}\]

The reader is probably familiar with a simple method, using single-variable calculus, for solving this problem. Since we must have \(2x + 2y = 20\), then we can solve for, say, \(y\) in terms of \(x\) using that equation. This gives \(y = 10− x\), which we then substitute into \(f\) to get \(f (x, y) = x y = x(10 − x) = 10x − x^2\). This is now a function of \(x\) alone, so we now just have to maximize the function \(f (x) = 10x− x^2\) on the interval [0,10]. Since \(f ′ (x) = 10−2x = 0 \Rightarrow x = 5 \text{ and }f ′′(5) = −2 < 0\), then the Second Derivative Test tells us that \(x = 5\) is a local maximum for \(f\), and hence \(x = 5\) must be the global maximum on the interval [0,10] (since \(f = 0\) at the endpoints of the interval). So since \(y = 10 − x = 5\), then the maximum area occurs for a rectangle whose width and height both are 5 m.

Notice in the above example that the ease of the solution depended on being able to solve for one variable in terms of the other in the equation \(2x+2y = 20\). But what if that were not possible (which is often the case)? In this section we will use a general method, called the Lagrange multiplier method , for solving constrained optimization problems:

\[\nonumber \begin{align} \text{Maximize (or minimize) : }&f (x, y)\quad (\text{or }f (x, y, z)) \\[4pt] \nonumber \text{given : }&g(x, y) = c \quad (\text{or }g(x, y, z) = c) \text{ for some constant } c \end{align}\]

The equation \(g(x, y) = c\) is called the constraint equation , and we say that \(x\) and \(y\) are constrained by \(g(x, y) = c\). Points \((x, y)\) which are maxima or minima of \(f (x, y)\) with the condition that they satisfy the constraint equation \(g(x, y) = c\) are called constrained maximum or constrained minimum points, respectively. Similar definitions hold for functions of three variables.

The Lagrange multiplier method for solving such problems can now be stated:

Theorem 2.7: The Lagrange Multiplier Method

Let \(f (x, y)\text{ and }g(x, y)\) be smooth functions, and suppose that \(c\) is a scalar constant such that \(\nabla g(x, y) \neq \textbf{0}\) for all \((x, y)\) that satisfy the equation \(g(x, y) = c\). Then to solve the constrained optimization problem

\[\nonumber \begin{align} \text{Maximize (or minimize) : }&f (x, y) \\[4pt] \nonumber \text{given : }&g(x, y) = c ,\end{align}\]

find the points \((x, y)\) that solve the equation \(\nabla f (x, y) = \lambda \nabla g(x, y)\) for some constant \(\lambda\) (the number \(\lambda\) is called the Lagrange multiplier). If there is a constrained maximum or minimum, then it must be such a point.

A rigorous proof of the above theorem requires use of the Implicit Function Theorem, which is beyond the scope of this text. Note that the theorem only gives a necessary condition for a point to be a constrained maximum or minimum. Whether a point \((x, y)\) that satisfies \(\nabla f (x, y) = \lambda \nabla g(x, y)\) for some \(\lambda\) actually is a constrained maximum or minimum can sometimes be determined by the nature of the problem itself. For instance, in Example 2.24 it was clear that there had to be a global maximum.

So how can you tell when a point that satisfies the condition in Theorem 2.7 really is a constrained maximum or minimum? The answer is that it depends on the constraint function \(g(x, y)\), together with any implicit constraints. It can be shown that if the constraint equation \(g(x, y) = c\) (plus any hidden constraints) describes a bounded set \(B\) in \(\mathbb{R}^2\), then the constrained maximum or minimum of \(f (x, y)\) will occur either at a point \((x, y)\) satisfying \(\nabla f (x, y) = \lambda \nabla g(x, y)\) or at a “boundary” point of the set \(B\).

In Example 2.24 the constraint equation \(2x+2y = 20\) describes a line in \(\mathbb{R}^2\), which by itself is not bounded. However, there are “hidden” constraints, due to the nature of the problem, namely \(0 ≤ x, y ≤ 10\), which cause that line to be restricted to a line segment in \(\mathbb{R}^2\) (including the endpoints of that line segment), which is bounded.

Example 2.25

For a rectangle whose perimeter is 20 m, use the Lagrange multiplier method to find the dimensions that will maximize the area.

As we saw in Example 2.24, with \(x\) and \(y\) representing the width and height, respectively, of the rectangle, this problem can be stated as:

\[\nonumber \begin{align} \text{Maximize : }&f (x, y) = x y \\[4pt] \nonumber \text{given : }&g(x, y) = 2x+2y = 20 \end{align}\]

Then solving the equation \(\nabla f (x, y) = \lambda \nabla g(x, y)\) for some \(\lambda\) means solving the equations \(\dfrac{∂f}{∂x} = \lambda \dfrac{∂g}{∂x}\text{ and }\dfrac{∂f}{∂y} = \lambda \dfrac{∂g}{∂y}\), namely:

\[\nonumber \begin{align} y &=2\lambda ,\\[4pt] \nonumber x &=2\lambda \end{align}\]

The general idea is to solve for \(\lambda\) in both equations, then set those expressions equal (since they both equal \(\lambda\)) to solve for \(x \text{ and }y\). Doing this we get

\[\nonumber \dfrac{y}{2} = \lambda = \dfrac{x}{2} \Rightarrow x = y ,\]

so now substitute either of the expressions for \(x \text{ or }y\) into the constraint equation to solve for \(x \text{ and }y\):

\[\nonumber 20 = g(x, y) = 2x+2y = 2x+2x = 4x \quad \Rightarrow \quad x = 5 \quad \Rightarrow \quad y = 5\]

There must be a maximum area, since the minimum area is 0 and \(f(5,5) = 25 > 0\), so the point \((5,5)\) that we found (called a constrained critical point) must be the constrained maximum.

\(\therefore\) The maximum area occurs for a rectangle whose width and height both are 5 m.
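The Lagrange system in this example can also be handed to a numerical root finder; a sketch using scipy.optimize.fsolve on the two multiplier equations together with the constraint:

```python
from scipy.optimize import fsolve

# Lagrange conditions for Example 2.25 plus the constraint:
#   y = 2*lam,  x = 2*lam,  2x + 2y = 20
def system(v):
    x, y, lam = v
    return [y - 2 * lam, x - 2 * lam, 2 * x + 2 * y - 20]

x, y, lam = fsolve(system, [1.0, 1.0, 1.0])
print(round(x, 6), round(y, 6), round(lam, 6))  # x = y = 5 with lam = 2.5
```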

Example 2.26

Find the points on the circle \(x^2 + y^2 = 80\) which are closest to and farthest from the point \((1,2)\).

The distance \(d\) from any point \((x, y)\) to the point \((1,2)\) is

\[\nonumber d = \sqrt{ (x−1)^2 +(y−2)^2} ,\]

and minimizing the distance is equivalent to minimizing the square of the distance. Thus the problem can be stated as:

\[\nonumber \begin{align}\text{Minimize (and maximize) : }&f (x, y) = (x−1)^2 +(y−2)^2 \\[4pt] \nonumber \text{given : }&g(x, y) = x^2 + y^2 = 80 \end{align} \]

Solving \(\nabla f (x, y) = \lambda \nabla g(x, y)\) means solving the following equations:

\[\nonumber \begin{align}2(x−1) &= 2\lambda x , \\[4pt] \nonumber 2(y−2) &= 2\lambda y \end{align} \]

Note that \(x \neq 0\) since otherwise we would get −2 = 0 in the first equation. Similarly, \(y \neq 0\). So we can solve both equations for \(\lambda\) as follows:

\[\nonumber \dfrac{x−1}{x} = \lambda = \dfrac{y−2}{y} \Rightarrow x y− y = x y−2x \quad \Rightarrow \quad y = 2x\]

Substituting this into \(g(x, y) = x^2 + y^2 = 80\) yields \(5x^2 = 80\), so \(x = \pm 4\). So the two constrained critical points are \((4,8)\text{ and }(−4,−8)\). Since \(f (4,8) = 45 \text{ and }f (−4,−8) = 125\), and since there must be points on the circle closest to and farthest from \((1,2)\), then it must be the case that \((4,8)\) is the point on the circle closest to \((1,2)\text{ and }(−4,−8)\) is the farthest from \((1,2)\) (see Figure 2.7.1).

Notice that since the constraint equation \(x^2+y^2 = 80\) describes a circle, which is a bounded set in \(\mathbb{R}^2\), then we were guaranteed that the constrained critical points we found were indeed the constrained maximum and minimum.
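Numerically, the two constrained critical points fall out of the same Lagrange system by starting scipy.optimize.fsolve from different initial guesses (the starting points below are illustrative choices):

```python
from scipy.optimize import fsolve

# Lagrange conditions for Example 2.26 plus the constraint:
#   2(x - 1) = 2*lam*x,  2(y - 2) = 2*lam*y,  x^2 + y^2 = 80
def system(v):
    x, y, lam = v
    return [2 * (x - 1) - 2 * lam * x,
            2 * (y - 2) - 2 * lam * y,
            x ** 2 + y ** 2 - 80]

def sq_dist(x, y):  # squared distance to the point (1, 2)
    return (x - 1) ** 2 + (y - 2) ** 2

near = fsolve(system, [4.0, 8.0, 0.5])    # converges to (4, 8)
far = fsolve(system, [-4.0, -8.0, 1.5])   # converges to (-4, -8)
print(near[:2], sq_dist(near[0], near[1]))  # squared distance 45
print(far[:2], sq_dist(far[0], far[1]))     # squared distance 125
```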

The Lagrange multiplier method can be extended to functions of three variables.

Example 2.27

\[\nonumber \begin{align} \text{Maximize (and minimize) : }&f (x, y, z) = x+ z \\[4pt] \nonumber \text{given : }&g(x, y, z) = x^2 + y^2 + z^2 = 1 \end{align}\]

Solve the equation \(\nabla f (x, y, z) = \lambda \nabla g(x, y, z)\):

\[\nonumber \begin{align} 1 &= 2\lambda x \\[4pt] 0 &= 2\lambda y \\[4pt] \nonumber 1 &= 2\lambda z \end{align}\]

The first equation implies \(\lambda \neq 0\) (otherwise we would have 1 = 0), so we can divide by \(\lambda\) in the second equation to get \(y = 0\) and we can divide by \(\lambda\) in the first and third equations to get \(x = \dfrac{1}{2\lambda} = z\). Substituting these expressions into the constraint equation \(g(x, y, z) = x^2 + y^2 + z^2 = 1\) yields the constrained critical points \(\left (\dfrac{1}{\sqrt{2}},0,\dfrac{1}{\sqrt{2}} \right )\) and \(\left ( \dfrac{−1}{\sqrt{2}} ,0,\dfrac{ −1}{\sqrt{2}}\right )\). Since \(f \left ( \dfrac{1}{\sqrt{2}} ,0,\dfrac{ 1}{\sqrt{2}}\right ) > f \left ( \dfrac{−1}{\sqrt{2}} ,0,\dfrac{ −1}{\sqrt{2}}\right )\), and since the constraint equation \(x^2 + y^2 + z^2 = 1\) describes a sphere (which is bounded) in \(\mathbb{R}^ 3\), then \(\left ( \dfrac{1}{\sqrt{2}} ,0,\dfrac{ 1}{\sqrt{2}}\right )\) is the constrained maximum point and \(\left ( \dfrac{−1}{\sqrt{2}} ,0,\dfrac{ −1}{\sqrt{2}}\right )\) is the constrained minimum point.
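As a cross-check, the sphere problem can be given directly to a numerical constrained solver; a sketch using scipy.optimize.minimize with an equality constraint (maximizing \(x + z\) by minimizing its negative):

```python
import numpy as np
from scipy.optimize import minimize

# Constraint: x^2 + y^2 + z^2 = 1 (the unit sphere).
cons = {"type": "eq", "fun": lambda v: v[0] ** 2 + v[1] ** 2 + v[2] ** 2 - 1}

# Maximize f(x, y, z) = x + z by minimizing -(x + z).
res = minimize(lambda v: -(v[0] + v[2]), x0=[0.5, 0.5, 0.5],
               method="SLSQP", constraints=[cons])
print(np.round(res.x, 4))  # approximately (1/sqrt(2), 0, 1/sqrt(2))
```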

So far we have not attached any significance to the value of the Lagrange multiplier \(\lambda\). We needed \(\lambda\) only to find the constrained critical points, but made no use of its value. It turns out that \(\lambda\) gives an approximation of the change in the value of the function \(f(x, y)\) that we wish to maximize or minimize, when the constant \(c\) in the constraint equation \(g(x, y) = c\) is changed by 1.

For example, in Example 2.25 we showed that the constrained optimization problem

\[\nonumber \begin{align}\text{Maximize : }&f (x, y) = x y \\[4pt] \nonumber \text{given : }&g(x, y) = 2x+2y = 20 \end{align}\]

had the solution \((x, y) = (5,5)\), and that \(\lambda = \dfrac{x}{2} = \dfrac{y}{2}\). Thus, \(\lambda = 2.5\). In a similar fashion we could show that the constrained optimization problem

\[\nonumber \begin{align} \text{Maximize : }&f (x, y) = x y \\[4pt] \nonumber \text{given : }&g(x, y) = 2x+2y = 21 \end{align}\]

has the solution \((x, y) = (5.25,5.25)\). So we see that the value of \(f (x, y)\) at the constrained maximum increased from \(f (5,5) = 25 \text{ to }f (5.25,5.25) = 27.5625\), i.e. it increased by 2.5625 when we increased the value of \(c\) in the constraint equation \(g(x, y) = c \text{ from }c = 20 \text{ to }c = 21\). Notice that \(\lambda = 2.5\) is close to 2.5625, that is,

\[\nonumber \lambda \approx \Delta f = f (\text{new max. pt})− f (\text{old max. pt})\]
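A sketch that verifies this sensitivity numerically, solving the perimeter problem for \(c = 20\) and \(c = 21\) and comparing the change in maximal area with \(\lambda = 2.5\):

```python
from scipy.optimize import fsolve

def solve_for(c):
    # Lagrange system: y = 2*lam, x = 2*lam, 2x + 2y = c
    x, y, lam = fsolve(lambda v: [v[1] - 2 * v[2], v[0] - 2 * v[2],
                                  2 * v[0] + 2 * v[1] - c], [1.0, 1.0, 1.0])
    return x * y, lam

area20, lam = solve_for(20)
area21, _ = solve_for(21)
print(lam, area21 - area20)  # lambda = 2.5 vs. actual change 2.5625
```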

Finally, note that solving the equation \(\nabla f (x, y) = \lambda \nabla g(x, y)\) means having to solve a system of two (possibly nonlinear) equations in three unknowns, which as we have seen before, may not be possible to do. And the 3-variable case can get even more complicated. All of this somewhat restricts the usefulness of Lagrange’s method to relatively simple functions. Luckily there are many numerical methods for solving constrained optimization problems, though we will not discuss them here.


## Introduction to Unconstrained Optimization

The Wolfram Language has a collection of commands that do unconstrained optimization ( FindMinimum and FindMaximum ) and solve nonlinear equations ( FindRoot ) and nonlinear fitting problems ( FindFit ). All these functions work, in general, by doing a search, starting at some initial values and taking steps that decrease (or for FindMaximum , increase) an objective or merit function.

The search process for FindMaximum is somewhat analogous to a climber trying to reach a mountain peak in a thick fog; at any given point, basically all that climbers know is their position, how steep the slope is, and the direction of the fall line. One approach is always to go uphill. As long as climbers go uphill steeply enough, they will eventually reach a peak, though it may not be the highest one. Similarly, in a search for a maximum, most methods are ascent methods where every step increases the height and stops when it reaches any peak, whether it is the highest one or not.

The analogy with hill climbing can be reversed to consider descent methods for finding local minima. For the most part, the literature in optimization considers the problem of finding minima, and since this applies to most of the Wolfram Language commands, from here on, this documentation will follow that convention.

The FindMinimumPlot command is defined in the Optimization`UnconstrainedProblems` package loaded automatically by this notebook. It runs FindMinimum; keeps track of the function evaluations, gradient evaluations, and steps taken during the search (using the EvaluationMonitor and StepMonitor options); and shows them superimposed on a plot of the function. Steps are indicated with blue lines, function evaluations with green points, and gradient evaluations with red points. The minimum found is shown with a large black point. From the plot, it is clear that FindMinimum has found a local minimum point.

Starting at 2, FindMinimum heads to a different local minimum, at which the function value is smaller than at the first minimum found.

From these two plots, you might come to the conclusion that if you start at a point where the function is sloping downward, you will always head toward the next minimum in that direction. However, this is not always the case; the steps FindMinimum takes are typically determined using the value of the function and its derivatives, so if the derivative is quite small, FindMinimum may think it has to go quite a long way to find a minimum point.

When starting at x=7 , which is near a local maximum, the first step is quite large, so FindMinimum returns a completely different local minimum.
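The same dependence on the starting point can be reproduced with scipy.optimize (a hypothetical sketch; the multimodal function and the starting points below are illustrative, not the ones used in the documentation's plots):

```python
import numpy as np
from scipy.optimize import minimize

# A function with several local minima.
f = lambda x: np.sin(x[0]) + 0.05 * x[0] ** 2

starts = (-3.0, 2.0)
minima = [minimize(f, [x0]).x[0] for x0 in starts]
for x0, xm in zip(starts, minima):
    print(f"start {x0}: local minimum near x = {xm:.3f}")
```

Different starting points settle into different local minima, and neither is guaranteed to be the global one.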

All these commands have "find" in their name because, in general, their design is to search to find any point where the desired condition is satisfied. The point found may not be the only one (in the case of roots) or even the best one (in the case of fits, minima, or maxima), or, as you have seen, not even the closest one to the starting condition. In other words, the goal is to find any point at which there is a root or a local maximum or minimum. In contrast, the function NMinimize tries harder to find the global minimum for the function, but NMinimize is also generally given constraints to bound the problem domain. However, there is a price to pay for this generality — NMinimize has to do much more work and, in fact, may call one of the "Find" functions to polish a result at the end of its process, so it generally takes much more time than the "Find" functions.

In two dimensions, the minimization problem is more complicated because both a step direction and step length need to be determined.

The FindMinimumPlot command for two dimensions is similar to the one-dimensional case, but it shows the steps and evaluations superimposed on a contour plot of the function. In this example, it is apparent that FindMinimum needed to change direction several times to get to the local minimum. You may notice that the first step starts in the direction of steepest descent (i.e. perpendicular to the contour or parallel to the gradient). Steepest descent is indeed a possible strategy for local minimization, but it often does not converge quickly. In subsequent steps in this example, you may notice that the search direction is not exactly perpendicular to the contours. The search is using information from past steps to try to get information about the curvature of the function, which typically gives it a better direction to go. Another strategy, which usually converges faster but can be more expensive, is to use the second derivative of the function. This is usually referred to as Newton's method .

In this example, it is clear that the extra information that Newton's method uses about the curvature of the function makes a big difference in how many steps it takes to get to the minimum. Even though Newton's method takes fewer steps, it may take more total execution time since the symbolic Hessian has to be computed once and then evaluated numerically at each step.
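The tradeoff can be observed with scipy.optimize: BFGS estimates curvature from past steps, while Newton-CG uses the exact Hessian (a sketch on the classic Rosenbrock test function; iteration counts vary with version and tolerances):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])  # standard starting point for the Rosenbrock function

bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")
newton = minimize(rosen, x0, jac=rosen_der, hess=rosen_hess, method="Newton-CG")

print("BFGS iterations:     ", bfgs.nit)
print("Newton-CG iterations:", newton.nit)
print("minimizer found:", np.round(newton.x, 4))  # the true minimum is at (1, 1)
```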

Usually, there are tradeoffs between the rate of convergence or total number of steps taken and cost per step. Depending on the size of the problems you want to solve, you may want to pick a particular method to best match that tradeoff for a particular problem. This documentation is intended to help you understand those choices as well as some ways to get the best results from the functions in the Wolfram Language. For the most part, examples will be used to illustrate the ideas, but a limited exposition on the mathematical theory behind the methods will be given so that you can better understand how the examples work.

For symbolic minimization of a univariate smooth function, all that is necessary is to find a point at which the derivative is zero and the second derivative is positive. In multiple dimensions, this means that the gradient vanishes and the Hessian needs to be positive definite. (If the Hessian is only positive semidefinite, the second-order test is inconclusive: the point may or may not be a local minimizer, and higher-order information is needed.) As a numerical algorithm converges, it is necessary to keep track of the convergence and make some judgment as to when a minimum has been approached closely enough. This is based on the sequence of steps taken and the values of the function, its gradient, and possibly its Hessian at these points. Usually, the Wolfram Language Find* functions will issue a message if they cannot be fairly certain that this judgment is correct. However, keep in mind that discontinuous functions or functions with rapid changes of scale can fool any numerical algorithm.
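The second-order check can be sketched numerically: at a candidate point, confirm that the gradient vanishes and that the Hessian's eigenvalues are all positive (the function and candidate point below are illustrative):

```python
import numpy as np

# Illustrative function f(x, y) = (x - 1)^2 + 2(y + 3)^2 with minimum at (1, -3).
def grad(x, y):
    return np.array([2 * (x - 1), 4 * (y + 3)])

def hess(x, y):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

cand = (1.0, -3.0)
g = grad(*cand)
eigs = np.linalg.eigvalsh(hess(*cand))
print("gradient:", g, "| Hessian eigenvalues:", eigs)
print("strict local minimizer:", bool(np.allclose(g, 0) and np.all(eigs > 0)))
```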

When solving nonlinear equations, many of the same issues arise as when finding a local minimum. In fact, by considering a so-called merit function, which is zero at a root of the equations, it is possible to use many of the same techniques as for minimization, but with the advantage of knowing that the minimum value of the function is 0. It is not always advantageous to use this approach, and there are some methods specialized for nonlinear equations.
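The merit-function idea can be sketched as follows: a square system \(F(\mathbf{x}) = \mathbf{0}\) is recast as minimizing \(\|F(\mathbf{x})\|^2\), whose known minimum value of 0 certifies a root (the system below is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative system: x^2 + y^2 = 4 and y = x (roots at (±sqrt(2), ±sqrt(2))).
def F(v):
    x, y = v
    return np.array([x ** 2 + y ** 2 - 4, y - x])

merit = lambda v: np.sum(F(v) ** 2)  # zero exactly at a root of the system
res = minimize(merit, [1.0, 0.5])
print(np.round(res.x, 4), "merit value:", res.fun)
```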

Most examples shown will be from one and two dimensions. This is by no means because the Wolfram Language is restricted to computing with such small examples, but because it is much easier to visually illustrate the main principles behind the theory and methods with such examples.


## Electrical Engineering and Systems Science > Systems and Control

Title: PI-CoF: A Bilevel Optimization Framework for Solving Active Learning Problems Using Physics Information

Abstract: Physics informed neural networks (PINNs) have recently been proposed as surrogate models for solving process optimization problems. However, in an active learning setting collecting enough data for reliably training PINNs poses a challenge. This study proposes a broadly applicable method for incorporating physics information into existing machine learning (ML) models of any type. The proposed method - referred to as PI-CoF for Physics-Informed Correction Factors - introduces additive or multiplicative correction factors for pointwise inference, which are identified by solving a regularized unconstrained optimization problem for reconciliation of physics information and ML model predictions. When ML models are used in an optimization context, using the proposed approach translates into a bilevel optimization problem, where the reconciliation problem is solved as an inner problem each time before evaluating the objective and constraint functions of the outer problem. The utility of the proposed approach is demonstrated through a numerical example, emphasizing constraint satisfaction in a safe Bayesian optimization (BO) setting. Furthermore, a simulation study is carried out by using PI-CoF for the real-time optimization of a fuel cell system. The results show reduced fuel consumption and better reference tracking performance when using the proposed PI-CoF approach in comparison to a constrained BO algorithm not using physics information.


## Hybrid ACO-CI Algorithm for Beam Design Problems

- Original Research
- Published: 22 February 2024
- Volume 5, article number 282 (2024)

- Ishaan R. Kale (ORCID: orcid.org/0000-0003-2983-5004)
- Mandar S. Sapre
- Ayush Khedkar
- Kaustubh Dhamankar
- Abhinav Anand
- Aayushi Singh

The complexity of real-world problems motivates the development of several optimization algorithms. In this paper, a novel hybrid metaheuristic algorithm is presented by combining the complementary properties of a swarm-based Ant Colony Optimization (ACO) and a socio-inspired Cohort Intelligence (CI) algorithm. It is referred to as hybrid ACO-CI. The ACO-CI algorithm is tested by solving unconstrained standard benchmark test functions. The proposed algorithm is further modified to solve constrained problems in the design engineering domain. The effectiveness of the proposed algorithm is evaluated relative to widely used metaheuristic algorithms. It is observed that the hybrid ACO-CI algorithm obtains comparable results with less computation cost.


## Data Availability

The source codes and datasets will be made available on request.


This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

## Author information

Authors and Affiliations

Institute of Artificial Intelligence, Dr Vishwanath Karad MIT World Peace University, Pune, 411038, India

Ishaan R. Kale

Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, 412115, India

Mandar S. Sapre, Ayush Khedkar, Kaustubh Dhamankar, Abhinav Anand & Aayushi Singh


## Contributions

IRK and MS contributed to conceptualizing the Hybrid ACO-CI algorithm and manuscript writing and preparation of final draft. AK, KD, AA, and AS carried out the computation work and preparation of draft manuscript.

## Corresponding author

Correspondence to Ishaan R. Kale.

## Ethics declarations

Conflict of interest.

The authors have no conflicts of interest to declare that are relevant to the content of this article.

## Research Involving Humans and Animals

This work does not contain any studies with human participants or animals performed by any of the authors.

## Informed Consent

Informed consent was obtained from all individual participants included in this study.


This article is part of the topical collection “Soft Computing and its Engineering Applications” guest edited by Kanubhai K. Patel, Deepak Garg, Atul Patel and Pawan Lingras.


## About this article

Kale, I.R., Sapre, M.S., Khedkar, A. et al. Hybrid ACO-CI Algorithm for Beam Design Problems. SN COMPUT. SCI. 5, 282 (2024). https://doi.org/10.1007/s42979-024-02612-y


Received : 26 March 2023

Accepted : 08 January 2024

Published : 22 February 2024

DOI : https://doi.org/10.1007/s42979-024-02612-y


- Ant Colony Optimization
- Cohort Intelligence algorithm
- Hybridization
- Design optimization problem


## Advanced Concepts Team

Informatics: Structure-Guided Quantum Computing

Quantum computing [1] is an emerging computational model that tries to utilize quantum mechanical effects. It still needs to be clarified when and why computations involving quantum effects are advantageous over classical algorithms. Since quantum computers are not yet available at scale, it is of immense importance to develop high-quality simulators to understand possible use cases of this technology for the European Space Agency. Currently, only noisy intermediate-scale quantum (NISQ) machines are available, which use some quantum effects but are sensitive to their environment (noisy) and susceptible to decoherence. Parts of this technology fall into the broader class of Ising machines. Various implementations of such machines are available today but still subject to size limitations.

## Hybrid Approaches to Utilize Ising Machines

Many problems of great importance in operations research, formal verification, and artificial intelligence are NP -complete and, thus, deemed infeasible. A recent route in tackling such problems are hardware accelerators , which are specialized chips build to solve one particular optimization problem. An Ising machine or Ising processing unit (IPU) [3] is a piece of hardware cable of finding the ground state of an Ising model . A prominent incarnation of Ising machines are quantum annealers that try to find the ground state using quantum tunneling [2] .

However, hardware accelerators of this kind have a fundamental drawback: They can only solve instances of a fixed size, that is, the number of variables they can handle is determined during the fabrication of the chip. The Advanced Concepts Team studies hybrid approaches in which a classical algorithm utilizes an IPU to solve subproblems of affordable size. To that end, the quadratic unconstrained binary optimization ( QUBO ) problem that underlies the Ising model is translated into a logic problem ( MaxSAT ) in a structure-preserving way. The first goal of this project is the development of a pipeline that partitions the resulting MaxSAT formula into a well-structured part and an unstructured rest. We then plan to guide the solution process along the well-structured part using dynamic programming while the unstructured parts of the formula are cast back into a QUBO and solved using an IPU.

## Graphical and Structural Representations of Quantum Processes

Tensor networks are graphical representations of computations on high-dimensional arrays used to study many-body quantum systems [3] . Tree decompositions are structural representations of graphs to measure their similarity to trees. It is well-known that these concepts are tightly linked [4] . In particular, quantum computations can be expressed as tensor networks and simulated using tree decompositions.

In the second part of this project, we investigate whether recent advances in structural graph theory can be used to speed up computations involving tensor networks. In particular, we aim to develop faster simulators for quantum circuits using these techniques. In the opposite direction, we investigate the effect common assumptions and restrictions on quantum computations have on the structural properties of the underlying tensor network. New insights into these effects will help to understand when and why a quantum computer may (or may not) have an advantage over a classical computer.

## Entangled Project

This project is accompanied by the application-driven project Advancing electronic structure simulations, a multifaceted exploration .

[1] Michael A. Nielsen and Isaac L. Chuang Quantum Computation and Quantum Information Cambridge (2011)

[2] Mark W. Johnson et al. Quantum annealing with manufactured spins Nature 473, 194–198 (2011)

[3] Naeimeh Mohseni, Peter L. McMahon, and Tim Byrnes Ising machines as hardware solvers of combinatorial optimization problems Nature Reviews Physics 4, pages 363–379 (2022)

[4] Igor L. Markov and Yaoyun Shi Simulating Quantum Computation by Contracting Tensor Networks SIAM Journal on Computing 38, pages 963-981 (2008)
