Statistics/Numerical Methods/Optimization

Introduction

In the following we will provide some notes on numerical optimization algorithms. As there are numerous methods out there, we will restrict ourselves to the so-called Gradient Methods. There are basically two arguments why we consider this class as a natural starting point when thinking about numerical optimization algorithms. On the one hand, these methods are really workhorses in the field, so their frequent use in practice justifies their coverage here. On the other hand, this approach is highly intuitive in the sense that it somewhat follow naturally from the well-known properties of optima. In particular we will concentrate on three examples of this class: the Newtonian Method, the Method of Steepest Descent and the class of Variable Metric Methods, nesting amongst others the Quasi Newtonian Method.

Before we start we will nevertheless stress that there does not seem to be a "one and only" algorithm but the performance of specific algorithms is always contingent on the specific problem to be solved. Therefore both experience and "trial-and-error" are very important in applied work. To clarify this point we will provide a couple of applications where the performance of different algorithms can be compared graphically. Furthermore a specific example on Maximum Likelihood Estimation can be found at the end. Especially for statisticians and econometricians the Maximum Likelihood Estimator is probably the most important example of having to rely on numerical optimization algorithms in practice.

Theoretical Motivation

Any numerical optimization algorithm has solve the problem of finding "observable" properties of the function such that the computer program knows that a solution is reached. As we are dealing with problems of optimization two well-known results seem to be sensible starting points for such properties.

If f is differentiable and $x^{\star }$ is a (local) minimum, then

$(1a)\quad Df(x^{\star })=0$

i.e. the Jacobian $Df(x)$ is equal to zero

and

If f is twice differentiable and $x^{\star }$ is a (local) minimum, then

$(1b)\quad x^{T}D^{2}f(x^{\star })x\geq 0$

i.e. the Hessian $D^{2}f(x)$ is pos. semidefinite.

In the following we will always denote the minimum by $x^{\star }$ . Although these two conditions seem to represent statements that help in finding the optimum $x^{\star }$ , there is the little catch that they give the implications of $x^{\star }$ being an optimum for the function $f$ . But for our purposes we would need the opposite implication, i.e. finally we want to arrive at a statement of the form: "If some condition $g(f(x^{\star }))$ is true, then $x^{\star }$ is a minimum". But the two conditions above are clearly not sufficient in achieving this (consider for example the case of $f(x)=x^{3}$ , with $Df(0)=D^{2}f(0)=0$ but $x^{\star }\neq 0$ ). Hence we have to look at an entire neighborhood of $x^{\star }$ as laid out in the following sufficient condition for detecting optima:

If $Df(x^{\star })=0$ and $x^{T}D^{2}f(z)x\geq 0,\forall x\in \mathbb {R} ^{n}$ and $z\in {\mathcal {B}}(x^{\star },\delta )$ , then: $x^{\star }$ is a local minimum.

Proof: For $x\in {\mathcal {B}}(x^{\star },\delta )$ let $z=x^{\star }+t(x-x^{\star })\in {\mathcal {B}}$ . The Taylor approximation yields: $f(x)-f(x^{\star })=0+{\frac {1}{2}}(x-x^{\star })^{T}D^{2}f(z)(x-x^{\star })\geq 0$ , where ${\mathcal {B}}(x^{\star },\delta )$ denotes an open ball around $x^{\star }$ , i.e. ${\mathcal {B}}(x^{\star },\delta )=\{x:||x-x^{\star }||\leq \delta \}$ for $\delta >0$ .

In contrast to the two conditions above, this condition is sufficient for detecting optima - consider the two trivial examples

$f(x)=x^{3}$ with $Df(x^{\star }=0)=0$ but $x^{T}D^{2}f(z)x=6zx^{2}\not \geq 0\quad (e.g.\,\,z=-{\frac {\delta }{2}})$

and

$f(x)=x^{4}$ with $Df(x^{\star }=0)=0$ and $x^{T}D^{2}f(z)x=12z^{2}x^{2}\geq 0\quad \forall z$ .

Keeping this little caveat in mind we can now turn to the numerical optimization procedures.

Numerical Solutions

All the following algorithms will rely on the following assumption:

(A1) The set $N(f,f(x^{(0)})=\{x\in \mathbb {R} ^{n}|f(x)\leq f(x^{(0)})\}$ is compact

where $x^{(0)}$ is some given starting value for the algorithm. The significance of this assumption has to be seen in the Weierstrass Theorem which states that every compact set contains its supremum and its infimum. So (A1) ensures that there is some solution in $N(f,f(x^{(0)})$ . And at this global minimum $x^{\star }$ it of course holds true that $D(f(x^{\star }))=0$ . So - keeping the discussion above in mind - the optimization problem basically boils down to the question of solving set of equations $D(f(x^{\star }))=0$ .

The Direction of Descent

The problems with this approach are of course rather generically as $D(f(x^{\star }))=0$ does hold true for maxima and saddle points as well. Hence, good algorithms should ensure that both maxima and saddle points are ruled out as potential solutions. Maxima can be ruled out very easily by requiring $f(x^{(k+1)})<f(x^{(k)})$ i.e. we restrict ourselves to a sequence $\{x^{(k)}\}_{k}$ such that the function value decreases in every step. The question is of course if this is always possible. Fortunately it is. The basic insight why this is the case is the following. When constructing the mapping $x^{(k+1)}=\varphi (x^{(k)})$ (i.e. the rule how we get from $x^{(k)}$ to $x^{(k+1)}$ ) we have two degrees of freedoms, namely

the direction $d^{(k)}$ and

the step length $\sigma ^{(k)}$ .

Hence we can choose in which direction we want to move to arrive at $x^{(k+1)}$ and how far this movement has to be. So if we choose $d^{(k)}$ and $\sigma ^{(k)}$ in the "right way" we can effectively ensure that the function value decreases. The formal representation of this reasoning is provided in the following

Lemma: If $d^{(k)}\in \mathbb {R} ^{n}$ and $Df(x)^{T}d^{(k)}<0$ then: $\exists {\bar {\sigma }}>0$ such that

$f(x+\sigma ^{(k)}d^{(k)})<f(x)\quad \forall \sigma \in (0,{\bar {\sigma }})$

Proof: As $Df(x)^{T}d^{(k)}<0$ and $Df(x)^{T}d^{(k)}=\lim _{\sigma \rightarrow 0}{\frac {f(x+\sigma ^{(k)}d^{(k)})-f(x)}{\sigma ^{(k)}}}$ , it follows that $f(x+\sigma ^{(k)}d^{(k)})<f(x)$ for $\sigma ^{(k)}$ small enough.

The General Procedure of Descending Methods

A direction vector $d^{(k)}$ that satisfies this condition is is called a Direction of Descent. In practice this Lemma allows us to use the following procedure to numerically solve optimization problems.

1. Define the sequence $\{x^{(k)}\}_{k}$ recursively via $x^{(k+1)}=x^{(k)}+\sigma ^{(k)}d^{(k)}$

2. Choose the direction $d^{(k)}$ from local information at the point $x^{(k)}$

3. Choose a step size $\sigma ^{(k)}$ that ensures convergence of the algorithm.

4. Stop the iteration if $|f(x^{(k+1)})-f(x^{(k)})|<\epsilon$ where $\epsilon >0$ is some chosen tolerance value for the minimum

This procedure already hints that the choice of $d^{(k)}$ and $\sigma ^{(k)}$ are not separable, but rather dependent. Especially note that even if the method is a descending method (i.e. both $d^{(k)}$ and $\sigma ^{(k)}$ are chosen according to Lemma 1) the convergence to the minimum is not guaranteed. At a first glance this may seem a bit puzzling. If we found a sequence $\{x^{(k)}\}_{k}$ such that the function value decreases at every step, one might think that at some stage, i.e. in the limit of k tending to infinity we should reach the solution. Why this is not the case can be seen from the following example borrowed from W. Alt (2002, p. 76).

Example 1

Consider the following example which does not converge although it is clearly descending. Let the criterion function be given by

$f(x)=x^{2}$ , let the starting value be $x^{(0)}=1$ , consider a (constant) direction vector $d^{(k)}=-1$ and choose a step width of $\sigma ^{(k)}=({\frac {1}{2}})^{k+2}$ . Hence the recursive definition of the sequence $\{x^{(k)}\}_{k}$ follows as

$(2)\quad x^{(k+1)}=x^{(k)}+({\frac {1}{2}})^{k+2}(-1)=x^{(k-1)}-\left({\frac {1}{2}}\right)^{k+1}-\left({\frac {1}{2}}\right)^{k+2}=x^{(0)}-\sum _{j=0}^{k}\left({\frac {1}{2}}\right)^{j+2}.$

Note that $x^{(k)}>0\,\,\forall \,\,k$ and hence $f(x^{(k+1)})<f(x^{(k)})\,\,\forall \,\,k$ , so that it is clearly a descending method. Nevertheless we find that

$(3)\quad \lim _{k\rightarrow \infty }x^{(k)}=\lim _{k\rightarrow \infty }x^{(0)}-\sum _{j=0}^{k-1}\left({\frac {1}{2}}\right)^{j+2}=\lim _{k\rightarrow \infty }1-{\frac {1}{4}}\left({\frac {1-\left({\frac {1}{2}}\right)^{k}}{\frac {1}{2}}}\right)=\lim _{k\rightarrow \infty }{\frac {1}{2}}+\left({\frac {1}{2}}\right)^{k+1}={\frac {1}{2}}\neq 0=x^{\star }.$

The reason for this non-convergence has to be seen in the stepsize $\sigma ^{(k)}$ decreasing too fast. For large k the steps $x^{(k+1)}-x^{(k)}$ get so small that convergence is precluded. Hence we have to link the stepsize to the direction of descend $d^{(k)}$ .

Efficient Stepsizes

The obvious idea of such a linkage is to require that the actual descent is proportional to a first order approximation, i.e. to choose $\sigma ^{(k)}$ such that there is a constant $c_{1}>0$ such that

$(4)\quad f(x^{(k)}+\sigma ^{(k)}d^{(k)})-f(x^{(k)})\leq c_{1}\sigma ^{(k)}D(f(x^{(k)}))d^{(k)}<0.$

Note that we still look only at descending directions, so that $Df(x^{(k)})^{T}d^{(k)}<0$ as required in Lemma 1 above. Hence, the compactness of $N(f,f(x^{(k)}))$ implies the convergence of the LHS and by (4)

$(5)\quad \lim _{k\rightarrow \infty }\sigma ^{(k)}D(f(x^{(k)}))d^{(k)}=0.$

Finally we want to choose a sequence $\{x^{(k)}\}_{k}$ such that $\lim _{k\rightarrow \infty }D(f(x^{(k)}))=0$ because that is exactly the necessary first order condition we want to solve. Under which conditions does (5) in fact imply $\lim _{k\rightarrow \infty }D(f(x^{(k)}))=0$ ? First of all the stepsize $\sigma ^{(k)}$ must not go to zero too quickly. That is exactly the case we had in the example above. Hence it seems sensible to bound the stepsize from below by requiring that

$(6)\quad \sigma ^{(k)}\geq -c_{2}{\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||^{2}}}>0$

for some constant $c_{2}>0$ . Substituting (6) into (5) finally yields

$(7)\quad f(x^{(k)}+\sigma ^{(k)}d^{(k)})-f(x^{(k)})\leq -c\left({\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}}\right)^{2},\quad c=c_{1}c_{2}$

where again the compactness of $N(f,f(x^{(k)}))$ ensures the convergence of the LHS and hence

$(8)\quad \lim _{k\rightarrow \infty }-c({\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}})^{2}=\lim _{k\rightarrow \infty }{\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}}=0$

Stepsizes that satisfy (4) and (6) are called efficient stepsizes and will be denoted by $\sigma _{E}^{(k)}$ . The importance of condition (6) is illustrated in the following continuation of Example 1.

Example 1 (continued)

Note that it is exactly the failure of (6) that induced Example 1 not to converge. Substituting the stepsize of the example into (6) yields

$(6.1)\quad \sigma ^{(k)}=\left({\frac {1}{2}}\right)^{(k+2)}\geq -c_{2}{\frac {2x^{(k)}(-1)}{1}}=c_{2}\cdot 2({\frac {1}{2}}+\left({\frac {1}{2}})^{k+1}\right)\Leftrightarrow {\frac {1}{4(1+2^{(k)})}}\geq c_{2}>0$

so there is no constant $c_{2}>0$ satisfying this inequality for all k as required in (6). Hence the stepsize is not bounded from below and decreases too fast. To really acknowledge the importance of (6), let us change the example a bit and assume that $\sigma ^{(k)}=\left({\frac {1}{2}}\right)^{k+1}$ . Then we find that

$(6.2)\quad \lim _{k\rightarrow \infty }x^{(k+1)}=\lim _{k\rightarrow \infty }x^{(0)}-{\frac {1}{2}}\sum _{i}\left({\frac {1}{2}}\right)^{i}=\lim _{k\rightarrow \infty }\left({\frac {1}{2}}\right)^{k+1}=0=x^{\star },$

i.e. convergence actually does take place. Furthermore recognize that this example actually does satisfy condition (6) as

$(6.3)\quad \sigma ^{(k)}=\left({\frac {1}{2}}\right)^{(k+1)}\geq -c_{2}{\frac {2x^{(k)}(-1)}{1}}=c_{2}\cdot 2\left({\frac {1}{2}}\right)^{k}\Leftrightarrow {\frac {1}{4}}\geq c_{2}>0.$

Choosing the Direction d

We have already argued that the choice of $\sigma ^{(k)}$ and $d^{(k)}$ is intertwined. Hence the choice of the "right" $d^{(k)}$ is always contingent on the respective stepsize $\sigma ^{(k)}$ . So what does "right" mean in this context? Above we showed in equation (8) that choosing an efficient stepsize implied

$(8')\quad \lim _{k\rightarrow \infty }-c({\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}})^{2}=\lim _{k\rightarrow \infty }{\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}}=0.$

The "right" direction vector will therefore guarantee that (8') implies that

$(9)\quad \lim _{k\rightarrow \infty }Df(x^{(k)})=0$

as (9) is the condition for the chosen sequence $\{x^{(k)}\}_{k}$ to converge. So let us explore what directions could be chosen to yield (9). Assume that the stepsize $\sigma _{k}$ is efficient and define

$(10)\quad \beta ^{(k)}={\frac {Df(x^{(k)})^{T}d^{(k)}}{||Df(x^{(k)})||||d^{(k)}||}}\quad \Leftrightarrow \quad \beta ^{(k)}||Df(x^{(k)})||={\frac {Df(x^{(k)})^{T}d^{(k)}}{||d^{(k)}||}}$

By (8') and (10) we know that

$(11)\quad \lim _{k\rightarrow \infty }\beta ^{(k)}||Df(x^{(k)})||=0.$

So if we bound $\beta ^{(k)}$ from below (i.e. $\beta ^{(k)}\leq -\delta <0$ ), (11) implies that

$(12)\quad \lim _{k\rightarrow \infty }\beta ^{(k)}||Df(x^{(k)})||=\lim _{k\rightarrow \infty }||Df(x^{(k)})||=\lim _{k\rightarrow \infty }Df(x^{(k)})=0,$

where (12) gives just the condition of the sequence $\{x^{(k)}\}_{k}$ converging to the solution $x^{\star }$ . As (10) defines the direction vector $d^{(k)}$ implicitly by $\beta ^{(k)}$ , the requirements on $\beta ^{(k)}$ translate directly into requirements on $d^{(k)}$ .

Why Gradient Methods?

When considering the conditions on $\beta ^{(k)}$ it is clear where the term Gradient Methods originates from. With $\beta ^{(k)}$ given by

$\beta _{k}={\frac {D(f(x))d^{(k)}}{||Df(x^{(k)})||||d^{(k)}||}}=\cos(Df(x^{(k)}),d^{(k)})$

we have the following result

Given that $\sigma ^{(k)}$ was chosen efficiently and $d^{(k)}$ satisfies

$(13)\quad \cos(Df(x^{(k)}),d^{(k)})=\beta _{k}\leq -\delta <0$

we have

$(14)\quad \lim _{k\rightarrow \infty }Df(x^{(k)})\rightarrow 0$

Hence: Convergence takes place if the angle between the negative gradient at $x^{(k)}$ and the direction $d^{(k)}$ is consistently smaller than the right angle. Methods relying on $d^{(k)}$ satisfying (13) are called Gradient Methods.

In other words: As long as one is not moving orthogonal to the gradient and if the stepsize is chosen efficiently, Gradient Methods guarantee convergence to the solution $x^{\star }$ .

Some Specific Algorithms in the Class of Gradient Methods

Let us now explore three specific algorithms of this class that differ in their respective choice of $d^{(k)}$ .

The Newtonian Method

The Newtonian Method is by far the most popular method in the field. It is a well known method to solve for the roots of all types of equations and hence can be easily applied to optimization problems as well. The main idea of the Newtonian method is to linearize the system of equations to arrive at

$(15)\quad g(x)=g({\hat {x}})+Dg({\hat {x}})^{T}(x-{\hat {x}})=0.$

(15) can easily be solved for x as the solution is just given by (assuming $Dg({\hat {x}})^{T}$ to be non-singular)

$(16)\quad x={\hat {x}}-[Dg({\hat {x}})^{T}]^{-1}g({\hat {x}}).$

For our purposes we just choose $g(x)$ to be the gradient $Df(x)$ and arrive at

$(17)\quad d_{N}^{(k)}=x^{(k+1)}-x^{(k)}=-[D^{2}f(x^{(k)})]^{-1}Df(x^{(k)})$

where $d_{N}^{(k)}$ is the so-called Newtonian Direction.

Properties of the Newtonian Method

Analyzing (17) elicits the main properties of the Newtonian method:

If $D^{2}f(x^{(k)})$ is positive definite, $d_{N}^{k}$ is a direction of descent in the sense of Lemma 1.

The Newtonian Method uses local information of the first and second derivative to calculate $d_{N}^{k}$ .

As

$(18)\quad x^{(k+1)}=x^{(k)}+d_{N}^{(k)}$

the Newtonian Method uses a fixed stepsize of $\sigma ^{(k)}=1$ . Hence the Newtonian method is not necessarily a descending method in the sense of Lemma 1. The reason is that the fixed stepsize $\sigma ^{(k)}=1$ might be larger than the critical stepsize ${\bar {\sigma _{k}}}$ given in Lemma 1. Below we provide the Rosenbrock function as an example where the Newtonian Method is not descending.

The Method can be time-consuming as calculating $[D^{2}f(x^{(k)})]^{-1}$ for every step k can be cumbersome. In applied work one could think about approximations. One could for example update the Hessian only every sth step or one could rely on local approximations. This is known as the Quasi-Newtonian-Method and will be discussed in the section about Variable Metric Methods.

To ensure the method to be decreasing one could use an efficient stepsize $\sigma _{E}^{(k)}$ and set

$(19)\quad x^{(k+1)}=x^{(k)}-\sigma _{E}^{(k)}d_{N}^{(k)}=x^{(k)}-\sigma _{E}^{(k)}[D^{2}f(x^{k})]^{-1}Df(x^{(k)})$

Method of Steepest Descent

Another frequently used method is the Method of Steepest Descent. The idea of this method is to choose the direction $d^{(k)}$ so that the decrease in the function value f is maximal. Although this procedure seems at a first glance very sensible, it suffers from the fact that it uses effectively less information than the Newtonian Method by ignoring the Hessian's information about the curvature of the function. Especially in the applications below we will see a couple of examples of this problem.

The direction vector of the Method of Steepest Descent is given by

$(20)\quad d_{SD}^{(k)}=argmax_{d:||d||=r}\{-Df(x^{(k)})^{T}d\}=argmin_{d:||d||=r}\{Df(x^{(k)})^{T}d\}=-r{\frac {Df(x)}{||Df(x)||}}$

Proof: By the Cauchy-Schwartz Inequality it follows that

$(21)\quad {\frac {Df(x)^{T}d}{||Df(x)||||d||}}\geq -1\quad \Leftrightarrow \quad Df(x)^{T}d\geq -r||Df(x)||.$

Obviously (21) holds with equality for $d^{(k)}=d_{SD}^{(k)}$ given in (20).

Note especially that for $r=||Df(x)||$ we have $d_{SD}^{(k)}=-Df(x^{(k)})$ , i.e. we just "walk" in the direction of the negative gradient. In contrast to the Newtonian Method the Method of Steepest Descent does not use a fixed stepsize but chooses an efficient stepsize $\sigma _{E}^{(k)}$ . Hence the Method of Steepest Descent defines the sequence $\{x^{(k)}\}_{k}$ by

$(22)\quad x^{(k+1)}=x^{(k)}+\sigma _{E}^{(k)}d_{SD}^{(k)},$

where $\sigma _{E}^{(k)}$ is an efficient stepsize and $d_{SD}^{(k)}$ the Direction of Steepest Descent given in (20).

Properties of the Method of Steepest Descent

With $d_{SD}^{(k)}=-r{\frac {Df(x)}{||Df(x)||}}$ the Method of Steepest Descent defines a direction of descent in the sense of Lemma 1, as

$Df(x)^{T}d_{SD}^{(k)}=Df(x)^{T}\left(-r{\frac {Df(x)}{||Df(x)||}}\right)=-{\frac {r}{||Df(x)||}}Df(x)^{T}Df(x)<0.$

The Method of Steepest Descent is only locally sensible as it ignores second order information.

Especially when the criterion function is flat (i.e. the solution $x^{\star }$ lies in a "valley") the sequence defined by the Method of Steepest Descent fluctuates wildly (see the applications below, especially the example of the Rosenbrock function).

As it does not need the Hessian, calculation and implementation of the Method of Steepest Descent is easy and fast.

Variable Metric Methods

A more general approach than both the Newtonian Method and the Method of Steepest Descent is the class of Variable Metric Methods. Methods in this class rely on the updating formula

$(23)\quad x^{k+1}=x^{k}-\sigma _{E}^{(k)}[A^{k}]^{-1}Df(x^{k}).$

If $A^{k}$ is a symmetric and positive definite matrix, (23) defines a descending method as $[A^{k}]^{-1}$ is positive definite if and only if $A^{k}$ is positive definite as well. To see this: just consider the spectral decomposition

$(24)\quad A^{k}=\Gamma \Lambda \Gamma ^{T}$

where $\Gamma$ and $\Lambda$ are the matrices with eigenvectors and eigenvalues respectively. If $A^{k}$ is positive definite, all eigenvalues $\lambda _{i}$ are strictly positive. Hence their inverse $\lambda _{i}^{-1}$ are positive as well, so that $[A^{k}]^{-1}=\Gamma \Lambda ^{-1}\Gamma ^{T}$ is clearly positive definite. But then, substitution of $d^{(k)}=[A^{k}]^{-1}Df(x^{k})$ yields

$(25)\quad Df(x^{k})^{T}d^{(k)}=-Df(x^{k})^{T}[A^{k}]^{-1}Df(x^{k})\equiv -v^{T}[A^{k}]^{-1}v\leq 0,$

i.e. the method is indeed descending. Up to now we have not specified the matrix $A^{k}$ , but is easily seen that for two specific choices, the Variable Metric Method just coincides with the Method of Steepest Descent and the Newtonian Method respectively.

For $A^{k}={\mathcal {I}}$ (with ${\mathcal {I}}$ being the identity matrix) it follows that

$(22')\quad x^{k+1}=x^{k}-\sigma _{E}^{(k)}Df(x^{k})$

which is just the Method of Steepest Descent.

For $A^{k}=D^{2}f(x^{k})$ it follows that

$(19')\quad x^{k+1}=x^{k}-\sigma _{E}^{(k)}[D^{2}f(x^{k})]^{-1}Df(x^{k})$

which is just the Newtonian Method using a stepsize $\sigma _{E}^{(k)}$ .

The Quasi Newtonian Method

A further natural candidate for a Variable Metric Method is the Quasi Newtonian Method. In contrast to the standard Newtonian Method it uses an efficient stepsize so that it is a descending method and in contrast to the Method of Steepest Descent it does not fully ignore the local information about the curvature of the function. Hence the Quasi Newtonian Method is defined by the two requirements on the matrix $A^{k}$ :

$A^{k}$ should approximate the Hessian $D^{2}f(x^{k})$ to make use of the information about the curvature and

the update $A^{k}\rightarrow A^{k+1}$ should be easy so that the algorithm is still relatively fast (even in high dimensions).

To ensure the first requirement, $A^{k+1}$ should satisfy the so-called Quasi-Newtonian-Equation

$(26)\quad A^{k+1}(x^{(k+1)}-x^{(k)})=Df(x^{(k+1)})-Df(x^{(k)})$

as all $A^{k}$ satisfying (26) reflect information about the Hessian. To see this, consider the function $g(x)$ defined as

$(27)\quad g(x)=f(x^{k+1})+Df(x^{k+1})^{T}(x-x^{k+1})+{\frac {1}{2}}(x-x^{k+1})^{T}A^{k+1}(x-x^{k+1}).$

Then it is obvious that $g(x^{k+1})=f(x^{k+1})$ and $Dg(x^{k+1})=Df(x^{k+1})$ . So $g(x)$ and $f(x)$ are reasonably similar in the neighborhood of $x^{(k+1)}$ . In order to ensure that $g(x)$ is also a good approximation at $x^{(k)}$ , we want to choose $A^{k+1}$ such that the gradients at $x^{(k)}$ are identical. With

$(28)\quad Dg(x^{k})=Df(x^{k+1})-A^{k+1}(x^{k+1}-x^{k})$

it is clear that $Dg(x^{k})=Df(x^{k})$ if $A^{k+1}$ satisfies the Quasi Newtonian Equation given in (26). But then it follows that

$(29)\quad A^{k+1}(x^{k+1}-x^{k})=Df(x^{k+1})-Dg(x^{k})=Df(x^{k+1})-Df(x^{k})=D^{2}f(\lambda x^{(k)}+(1-\lambda )x^{(k+1)})(x^{k+1}-x^{k}).$

Hence as long as $x^{(k+1)}$ and $x^{(k)}$ are not too far apart, $A^{k+1}$ satisfying (26) is a good approximation of $D^{2}f(x^{(k)})$ .

Let us now come to the second requirement that the update of the $A^{k}$ should be easy. One specific algorithm to do so is the so-called BFGS-Algorithm. The main merit of this algorithm is the fact that it uses only the already calculated elements $\{x^{(k)}\}_{k}$ and $\{Df(x^{(k)})\}_{k}$ to construct the update $A^{(k+1)}$ . Hence no new entities have to be calculated but one has only to keep track of the x-sequence and sequence of gradients. As a starting point for the BFGS-Algorithm one can provide any positive definite matrix (e.g. the identity matrix or the Hessian at $x^{(0)}$ ). The BFGS-Updating-Formula is then given by

$(30)\quad A^{k}=A^{k-1}-{\frac {(A^{k-1})^{T}\gamma _{k-1}^{T}\gamma _{k-1}A^{k-1}}{\gamma _{k-1}^{T}A^{k-1}\gamma _{k-1}}}+{\frac {\Delta _{k-1}\Delta _{k-1}^{T}}{\Delta _{k-1}^{T}\gamma _{k-1}}}$

where $\Delta _{k-1}=Df(x^{(k)})-Df(x^{(k-1)})$ and $\gamma _{k-1}=x^{(k)}-x^{(k-1)}$ . Furthermore (30) ensures that all $A^{k}$ are positive definite as required by Variable Metric Methods to be descending.

Properties of the Quasi Newtonian Method

It uses second order information about the curvature of $f(x)$ as the matrices $A^{k}$ are related to the Hessian $D^{2}f(x)$ .

Nevertheless it ensures easy and fast updating (e.g. by the BFGS-Algorithm) so that it is faster than the standard Newtonian Method.

It is a descending method as $A^{k}$ are positive definite.

It is relatively easy to implement as the BFGS-Algorithm is available in most numerical or statistical software packages.

Applications

To compare the methods and to illustrate the differences between the algorithms we will now evaluate the performance of the Steepest Descent Method, the standard Newtonian Method and the Quasi Newtonian Method with an efficient stepsize. We use two classical functions in this field, namely the Himmelblau and the Rosenbrock function.

Application I: The Himmelblau Function

The Himmelblau function is given by

$(31)\quad f(x,y)=(x^{2}+y-11)^{2}+(x+y^{2}-7)^{2}$

This fourth order polynomial has four minima, four saddle points and one maximum so there are enough possibilities for the algorithms to fail. In the following pictures we display the contour plot and the 3D plot of the function for different starting values.

In Figure 1 we display the function and the paths of all three methods at a starting value of $(2,-4)$ . Obviously the three methods do not find the same minimum. The reason is of course the different direction vector of the Method of Steepest Descent - by ignoring the information about the curvature it chooses a totally different direction than the two Newtonian Methods (see especially the right panel of Figure 1).

Figure 1: The two Newton Methods converge to the same, the Method of Steepest Descent to a different minimum.

Consider now the starting value $(4.5,-0.5)$ , displayed in Figure 2. The most important thing is of course that now all methods find different solutions. That the Method of Steepest Descent finds a different solution than the two Newtonian Methods is again not that surprising. But that the two Newtonian Methods converge to different solution shows the significance of the stepsize $\sigma$ . With the Quasi-Newtonian Method choosing an efficient stepsize in the first iteration, both methods have different stepsizes and direction vectors for all iterations after the first one. And as seen in the picture: the consequence may be quite significant.

Application II: The Rosenbrock Function

The Rosenbrock function is given by

$(32)\quad f(x,y)=100(y-x^{2})^{2}+(1-x)^{2}\$

Although this function has only one minimum it is an interesting function for optimization problems. The reason is the very flat valley of this U-shaped function (see the right panels of Figures 3 and 4). Especially for econometricians this function may be interesting because in the case of Maximum Likelihood estimation flat criterion functions occur quite frequently. Hence the results displayed in Figures 3 and 4 below seem to be rather generic for functions sharing this problem.

My experience when working with this function and the algorithms I employed is that Figure 3 (given a starting value of $(2,-5)$ ) seems to be quite characteristic. In contrast to the Himmelblau function above, all algorithms found the same solution and given that there is only one minimum this could be expected. More important is the path the different methods choose as is reflects the different properties of the respective methods. It is seen that the Method of Steepest Descent fluctuates rather wildly. This is due to the fact that it does not use information about the curvature but rather jumps back and forth between the "hills" adjoining the valley. The two Newtonian Methods choose a more direct path as they use the second order information. The main difference between the two Newtonian Methods is of course the stepsize. Figure 3 shows that the Quasi Newtonian Method uses very small stepsizes when working itself through the valley. In contrast, the stepsize of the Newtonian Method is fixed so that it jumps directly in the direction of the solution. Although one might conclude that this is a disadvantage of the Quasi Newtonian Method, note of course that in general these smaller stepsizes come with benefit of a higher stability, i.e. the algorithm is less likely to jump to a different solution. This can be seen in Figure 4.

Figure 3: All methods find the same solution, but the Method of Steepest Descent fluctuates heavily.

Figure 4, which considers a starting value of $(-2,-2)$ , shows the main problem of the Newtonian Method using a fixed stepsize - the method might "overshoot" in that it is not descending. In the first step, the Newtonian Method (displayed as the purple line in the figure) jumps out of the valley to only bounce back in the next iteration. In this case convergence to the minimum still occurs as the gradient at each side points towards the single valley in the center, but one can easily imagine functions where this is not the case. The reason of this jump are the second derivatives which are very small so that the step $[Df(x^{(k)})]^{-1}Df(x^{(k)}))$ gets very large due to the inverse of the Hessian. In my experience I would therefore recommend to use efficient stepsizes to have more control over the paths the respective Method chooses.

Figure 2: Overshooting of the Newtonian Method due to the fixed stepsize.

Application III: Maximum Likelihood Estimation

For econometricians and statisticians the Maximum Likelihood Estimator is probably the most important application of numerical optimization algorithms. Therefore we will briefly show how the estimation procedure fits in the framework developed above.

As usual let

$(33)\quad f(Y|X;\theta )$

be the conditional density of $Y$ given $X$ with parameter $\theta$ and

$(34)\quad l(\theta ;Y|X)$

the conditional likelihood function for the parameter $\theta$

If we assume the data to be independently, identically distributed (iid) then the sample log-likelihood follows as

$(35)\quad {\mathcal {L}}(\theta ;Y_{1},\dots ,Y_{N})=\sum _{i}^{N}{\mathcal {L}}(\theta ;Y_{i})=\sum _{i}^{N}\log(l(\theta ;Y_{i})).$

Maximum Likelihood estimation therefore boils down to maximize (35) with respect to the parameter $\theta$ . If we for simplicity just decide to use the Newtonian Method to solve that problem, the sequence $\{\theta ^{(k)}\}_{k}$ is recursively defined by

$(36)\quad D_{\theta }{\mathcal {L}}(\theta ^{(k+1)})=D_{\theta }{\mathcal {L}}(\theta ^{(k)})+D_{\theta \theta }{\mathcal {L}}(\theta ^{(k)})(\theta ^{(k+1)}-\theta ^{(k)})=0\Leftrightarrow \theta ^{(k+1)}=\theta ^{(k)}-[D_{\theta \theta }{\mathcal {L}}(\theta ^{(k)})]^{-1}D_{\theta }{\mathcal {L}}(\theta ^{(k)})$

where $D_{\theta }{\mathcal {L}}$ and $D_{\theta \theta }{\mathcal {L}}$ denotes the first and second derivative with respect to the parameter vector $\theta$ and $[D_{\theta \theta }{\mathcal {L}}(\theta ^{(k)})]^{-1}D_{\theta }{\mathcal {L}}(\theta ^{(k)})$ defines the Newtonian Direction given in (17). As Maximum Likelihood estimation always assumes that the conditional density (i.e. the distribution of the error term) is known up to the parameter $\theta$ , the methods described above can readily be applied.

A Concrete Example of Maximum Likelihood Estimation

Assume a simple linear model

$(37a)\quad Y_{i}=\beta _{1}+\beta _{x}X_{i}+U_{i}$

with $\theta =(\beta _{1},\beta _{2})'$ . The conditional distribution Y is then determined by the one of U, i.e.

$(37b)\quad p(Y_{i}-\beta _{1}-\beta _{x}X_{i})\equiv p_{\mid X_{i}}(Y_{i})=p(U_{i}),$

where $p$ denotes the density function. Generally, there is no closed form solution of maximizing (35) (at least if U does not happen to be normally distributed), so that numerical methods have to be employed. Hence assume that U follows Student's t-distribution with m degrees of freedom so that (35) is given by

$(38)\quad {\mathcal {L}}(\theta ;Y_{\mid X})=\sum \log \left({\frac {\Gamma ({\frac {m+1}{2}})}{{\sqrt {\pi m}}\Gamma ({\frac {m}{2}})}}(1+{\frac {(y_{i}-x_{i}^{T}\beta )^{2}}{m}})^{-{\frac {m+1}{2}}}\right)$

where we just used the definition of the density function of the t-distribution. (38) can be simplified to

$(39)\quad {\mathcal {L}}(\theta ;Y_{\mid X})=N\left[\log \left(\Gamma \left({\frac {m+1}{2}}\right)\right)-\log \left({\sqrt {\pi m}}\Gamma \left({\frac {m}{2}}\right)\right)\right]-{\frac {m+1}{2}}\sum \log \left(1+{\frac {(y_{i}-x_{i}^{T}\beta )^{2}}{m}}\right)$

so that (if we assume that the degrees of freedom m are known)

$(40)\quad {\textit {argmax}}\left\{{\mathcal {L}}(\theta ;Y_{\mid X})\right\}={\textit {argmax}}\left\{-{\frac {m+1}{2}}\sum \log \left(1+{\frac {(y_{i}-x_{i}^{T}\beta )^{2}}{m}}\right)\right\}={\textit {argmin}}\left\{\sum \log \left(1+{\frac {(y_{i}-x_{i}^{T}\beta )^{2}}{m}}\right)\right\}.$

With the criterion function

$(41)\quad f(\beta _{1},\beta _{2})=\sum \log \left(1+{\frac {(y_{i}-\beta _{1}-\beta _{2}x_{i})^{2}}{m}}\right)$

the methods above can readily applied to calculate the Maximum Likelihood Estimator $({\hat {\beta }}_{1,ML},{\hat {\beta }}_{2,ML})$ maximizing (41).

References

Alt, W. (2002): "Nichtlineare Optimierung", Vieweg: Braunschweig/Wiesbaden
Härdle, W. and Simar, L. (2003): "Applied Multivariate Statistical Analysis", Springer: Berlin Heidelberg
Königsberger, K. (2004): "Analysis I", Springer: Berlin Heidelberg
Ruud, P. (2000): "Classical Econometric Theory", Oxford University Press: New York