- Theorem
Let $g \colon \mathbb{R}^n \to \mathbb{R}^m$ be a function and let $f \colon \mathbb{R}^m \to \mathbb{R}^l$ be another function. Assume that $g$ is differentiable at $x_0 \in \mathbb{R}^n$ and that $f$ is differentiable at $g(x_0)$. Then $f \circ g$ is differentiable at $x_0$ and
$$(f \circ g)'(x_0) = f'\big(g(x_0)\big)\, g'(x_0).$$
- Proof
We prove that $f'\big(g(x_0)\big)\, g'(x_0)$ is a valid differential of $f \circ g$ at $x_0$, thereby proving differentiability.
We begin by noting, from the second triangle inequality, that
$$\left|\frac{\big\|g(x_0+\mathbf{h})-g(x_0)\big\|}{\|\mathbf{h}\|}-\frac{\|g'(x_0)\mathbf{h}\|}{\|\mathbf{h}\|}\right|\leq \frac{\Big\|g(x_0+\mathbf{h})-\big[g(x_0)+g'(x_0)\mathbf{h}\big]\Big\|}{\|\mathbf{h}\|}\to 0,\quad \mathbf{h}\to 0,$$
and hence the boundedness of
$$\frac{\|g'(x_0)\mathbf{h}\|}{\|\mathbf{h}\|} \leq mn \max_{1\leq i\leq m,\, 1\leq j\leq n}|a_{i,j}|$$
implies that of
$$\frac{\big\|g(x_0+\mathbf{h})-g(x_0)\big\|}{\|\mathbf{h}\|},$$
where $A = (a_{i,j})$ is the matrix of $g'(x_0)$.
Now we note, by the triangle inequality, that
$$\begin{aligned}
&\frac{\Big\|(f\circ g)(x_0+\mathbf{h})-\big[(f\circ g)(x_0)+f'\big(g(x_0)\big)g'(x_0)\mathbf{h}\big]\Big\|}{\|\mathbf{h}\|}\\
&\qquad\leq \frac{\bigg\|f\big(g(x_0+\mathbf{h})\big)-\Big[f\big(g(x_0)\big)+f'\big(g(x_0)\big)\big[g(x_0+\mathbf{h})-g(x_0)\big]\Big]\bigg\|}{\|\mathbf{h}\|}\\
&\qquad\quad+\frac{\bigg\|f\big(g(x_0)\big)+f'\big(g(x_0)\big)\big[g(x_0+\mathbf{h})-g(x_0)\big]-\Big[(f\circ g)(x_0)+f'\big(g(x_0)\big)g'(x_0)\mathbf{h}\Big]\bigg\|}{\|\mathbf{h}\|}.
\end{aligned}$$
We first treat the first summand, which is the more difficult one, though not by much. We rewrite it as
$$\frac{\bigg\|f\big(g(x_0+\mathbf{h})\big)-\Big[f\big(g(x_0)\big)+f'\big(g(x_0)\big)\big[g(x_0+\mathbf{h})-g(x_0)\big]\Big]\bigg\|}{\big\|g(x_0+\mathbf{h})-g(x_0)\big\|}\cdot\frac{\big\|g(x_0+\mathbf{h})-g(x_0)\big\|}{\|\mathbf{h}\|}.$$
The latter factor is bounded due to the above considerations, and the first one converges to 0 as $\mathbf{h}\to 0$ by the differentiability of $f$ at $g(x_0)$, since $g(x_0+\mathbf{h})\to g(x_0)$ due to the same boundedness (multiply by $\|\mathbf{h}\|$; in fact, differentiability thus implies continuity).
Now for the second summand, which, by elementary cancellation and linearity of differentials, equals
$$\frac{\bigg\|f'\big(g(x_0)\big)\Big[\big[g(x_0+\mathbf{h})-g(x_0)\big]-g'(x_0)\mathbf{h}\Big]\bigg\|}{\|\mathbf{h}\|}\leq lm\max_{1\leq i\leq l,\,1\leq j\leq m}|b_{i,j}|\,\frac{\Big\|\big[g(x_0+\mathbf{h})-g(x_0)\big]-g'(x_0)\mathbf{h}\Big\|}{\|\mathbf{h}\|},$$
where $B = (b_{i,j})$ is the matrix of the differential of $f$ at $g(x_0)$. This goes to 0 as $\mathbf{h}\to 0$ due to the definition of the differential of $g$ at $x_0$.
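To see the theorem in action numerically, here is a small sketch (assuming Python with NumPy; the particular maps $f$ and $g$ below are arbitrary illustrative choices, not taken from the text) that compares a finite-difference Jacobian of $f \circ g$ with the product of the finite-difference Jacobians of $f$ and $g$:

```python
import numpy as np

def g(x):
    # example map g: R^2 -> R^3 (arbitrary smooth choice)
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(y):
    # example map f: R^3 -> R^2 (arbitrary smooth choice)
    return np.array([y[0] + y[1] * y[2], np.exp(y[0]) - y[2]])

def jacobian(func, x, eps=1e-6):
    """Approximate the Jacobian matrix of func at x by central differences."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((func(x + e) - func(x - e)) / (2 * eps))
    return np.column_stack(cols)

x0 = np.array([0.7, -1.3])
lhs = jacobian(lambda x: f(g(x)), x0)       # (f o g)'(x0)
rhs = jacobian(f, g(x0)) @ jacobian(g, x0)  # f'(g(x0)) g'(x0)
print(np.max(np.abs(lhs - rhs)))            # small, limited only by finite-difference error
```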
The first application of the chain rule that we shall present concerns the gradient, which is defined for functions $f \colon \mathbb{R}^n \to \mathbb{R}$, that is, functions whose image is one-dimensional (in the special case $n = 2$, such functions look like "mountains" over the plane $\mathbb{R}^2$).
- Definition
Let $f \colon \mathbb{R}^n \to \mathbb{R}$ be differentiable. Then the column vector
$$\nabla f(x) := \begin{pmatrix}\partial_{x_1} f(x)\\ \vdots\\ \partial_{x_n} f(x)\end{pmatrix}$$
is called the gradient of $f$ at $x$.
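As a quick illustration of the definition (a sketch only; Python with NumPy is assumed, and the scalar field below is an arbitrary example), one can approximate the gradient by central difference quotients and compare it with the exact partial derivatives:

```python
import numpy as np

def f(x):
    # example scalar field f: R^3 -> R
    return x[0] ** 2 * x[1] + np.sin(x[2])

def gradient(func, x, eps=1e-6):
    """Approximate the gradient (vector of partial derivatives) by central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return grad

x0 = np.array([1.0, 2.0, 0.5])
print(gradient(f, x0))                                            # numerical gradient
print(np.array([2 * x0[0] * x0[1], x0[0] ** 2, np.cos(x0[2])]))   # exact partials, for comparison
```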
Theorem:
Let $f, g \colon \mathbb{R}^n \to \mathbb{R}$ be two functions totally differentiable at $x_0$. Since they both map to $\mathbb{R}$, their product is defined, and we have
$$\nabla (fg)(x_0) = f(x_0)\,\nabla g(x_0) + g(x_0)\,\nabla f(x_0).$$
Proof:
One could compute this directly from the definition of the gradient and the usual one-dimensional product rule (which actually has the merit of not requiring total differentiability), but there is a clever trick using the chain rule, which I found in Terence Tao's lecture notes, on which I based my review of this part of mathematics.
We simply define $h \colon \mathbb{R}^n \to \mathbb{R}^2$ by $h(x) := \big(f(x), g(x)\big)$ and $p \colon \mathbb{R}^2 \to \mathbb{R}$ by $p(y_1, y_2) := y_1 y_2$. Then the function $fg$ equals $p \circ h$. Now the differential of $h$ is given by the Jacobian matrix
$$h'(x_0) = \begin{pmatrix}\partial_{x_1} f(x_0) & \cdots & \partial_{x_n} f(x_0)\\ \partial_{x_1} g(x_0) & \cdots & \partial_{x_n} g(x_0)\end{pmatrix} = \begin{pmatrix}\nabla f(x_0)^T\\ \nabla g(x_0)^T\end{pmatrix}$$
and the differential of $p$ is given by the Jacobian matrix
$$p'(y_1, y_2) = \begin{pmatrix}y_2 & y_1\end{pmatrix}.$$
Hence, the chain rule implies that the differential of $fg$ at $x_0$ is given by
$$(fg)'(x_0) = p'\big(h(x_0)\big)\, h'(x_0) = \begin{pmatrix}g(x_0) & f(x_0)\end{pmatrix}\begin{pmatrix}\nabla f(x_0)^T\\ \nabla g(x_0)^T\end{pmatrix} = g(x_0)\,\nabla f(x_0)^T + f(x_0)\,\nabla g(x_0)^T,$$
and from the definition of the gradient we see that the differential is nothing but the transpose of the gradient (and vice versa, as taking the transpose is an involution). Transposing the last identity therefore yields the claimed formula for $\nabla (fg)(x_0)$.
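A short numerical sanity check of this product rule (again a sketch with arbitrarily chosen example functions, assuming NumPy; finite differences stand in for the exact differentials):

```python
import numpy as np

def f(x):
    return x[0] ** 2 + x[1]      # example f: R^2 -> R

def g(x):
    return np.sin(x[0]) * x[1]   # example g: R^2 -> R

def gradient(func, x, eps=1e-6):
    """Central-difference approximation of the gradient."""
    x = np.asarray(x, dtype=float)
    return np.array([(func(x + eps * e) - func(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

x0 = np.array([0.4, 1.7])
lhs = gradient(lambda x: f(x) * g(x), x0)                # gradient of the product
rhs = f(x0) * gradient(g, x0) + g(x0) * gradient(f, x0)  # product rule
print(np.max(np.abs(lhs - rhs)))                         # close to 0
```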
Now we shall use the chain rule to generalize a well-known theorem from one dimension, the mean value theorem, to several dimensions.
Theorem:
Let $f \colon \mathbb{R}^n \to \mathbb{R}$ be totally differentiable, and let $x, y \in \mathbb{R}^n$. Then there exists $t \in (0, 1)$ such that
$$f(y) - f(x) = \big\langle \nabla f\big(x + t(y - x)\big),\, y - x \big\rangle,$$
where $\langle \cdot, \cdot \rangle$ is the standard scalar product on $\mathbb{R}^n$.
Proof:
This is actually a straightforward application of the chain rule.
We set
$$\gamma \colon [0, 1] \to \mathbb{R}^n, \quad \gamma(s) := x + s(y - x),$$
thus $\gamma(0) = x$ and $\gamma(1) = y$. By the one-dimensional mean-value theorem,
$$f(y) - f(x) = (f \circ \gamma)(1) - (f \circ \gamma)(0) = (f \circ \gamma)'(t)$$
for a suitable $t \in (0, 1)$. Now by the chain rule, $(f \circ \gamma)'(t) = f'\big(\gamma(t)\big)\,\gamma'(t) = \big\langle \nabla f\big(x + t(y - x)\big),\, y - x \big\rangle$.
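The theorem only asserts the existence of a suitable $t$; the following sketch (Python/NumPy assumed, example function and points chosen arbitrarily) locates such a $t$ numerically by scanning $\varphi(t) = \langle \nabla f(x + t(y - x)), y - x\rangle - (f(y) - f(x))$ on a grid:

```python
import numpy as np

def f(x):
    # example totally differentiable scalar field f: R^2 -> R
    return np.exp(x[0]) * np.sin(x[1]) + x[0] * x[1]

def gradient(func, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    return np.array([(func(x + eps * e) - func(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

x = np.array([0.0, 0.5])
y = np.array([1.0, 2.0])

def phi(t):
    # phi(t) = <grad f(x + t(y - x)), y - x> - (f(y) - f(x)); it must vanish for some t in (0, 1)
    return gradient(f, x + t * (y - x)) @ (y - x) - (f(y) - f(x))

ts = np.linspace(0.0, 1.0, 10001)
vals = np.array([phi(t) for t in ts])
t_star = ts[np.argmin(np.abs(vals))]
print(t_star, phi(t_star))   # residual at the best grid point should be close to 0
```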
The next theorem shows that the order of differentiation does not matter, provided that the considered function is sufficiently differentiable. We will not need the general chain rule or any of its consequences during the course of the proof, but we will use the one-dimensional mean-value theorem.
Theorem (Clairaut's theorem):
Let $f \colon \mathbb{R}^2 \to \mathbb{R}$ be such that the partial derivatives up to order 2 exist and are continuous. Then
$$\partial_x \partial_y f = \partial_y \partial_x f.$$
Proof:
We begin with the following lemma:
Lemma:
$$\partial_x \partial_y f(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + h, y_0 + h) - f(x_0 + h, y_0) - f(x_0, y_0 + h) + f(x_0, y_0)}{h^2}$$
Proof: We first apply the fundamental theorem of calculus to obtain that the above limit equals
$$\lim_{h \to 0} \frac{\int_{y_0}^{y_0 + h} \partial_y f(x_0 + h, y)\, dy - \int_{y_0}^{y_0 + h} \partial_y f(x_0, y)\, dy}{h^2}.$$
Using integration by substitution ($y = y_0 + th$) and linearity of the integral, we may rewrite this as
$$\lim_{h \to 0} \frac{\int_0^1 \big[\partial_y f(x_0 + h, y_0 + th) - \partial_y f(x_0, y_0 + th)\big]\, dt}{h}.$$
Now we apply the mean value theorem in one variable to obtain
$$\partial_y f(x_0 + h, y_0 + th) - \partial_y f(x_0, y_0 + th) = h\, \partial_x \partial_y f(\xi_{t,h}, y_0 + th)$$
for a suitable $\xi_{t,h}$ between $x_0$ and $x_0 + h$. Hence, the above limit equals
$$\lim_{h \to 0} \int_0^1 \partial_x \partial_y f(\xi_{t,h}, y_0 + th)\, dt.$$
This is the average of $\partial_x \partial_y f$ over a certain subset of the square with corners $(x_0, y_0)$ and $(x_0 + h, y_0 + h)$, and therefore converges to $\partial_x \partial_y f(x_0, y_0)$ by the continuity of $\partial_x \partial_y f$ (you can prove this rigorously by using
$$\partial_x \partial_y f(x_0, y_0) = \int_0^1 \partial_x \partial_y f(x_0, y_0)\, dt$$
and subtracting the integrals and applying the triangle inequality for integrals).
Now the expression of the lemma is totally symmetric in $x$ and $y$, which is why Clairaut's theorem follows.
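Finally, a numerical illustration of Clairaut's theorem (a sketch assuming Python/NumPy; the function below is an arbitrary example with continuous second partials): both orders of the mixed partial derivative, approximated by nested central differences, agree up to discretization error, and so does the symmetric second difference quotient from the lemma.

```python
import numpy as np

def f(x, y):
    # example function with continuous second-order partial derivatives
    return np.exp(x * y) + x ** 3 * np.sin(y)

def mixed_partial(func, x0, y0, order, h=1e-4):
    """Approximate d/dx d/dy f (order='xy') or d/dy d/dx f (order='yx') by nested central differences."""
    if order == "xy":
        dy = lambda x, y: (func(x, y + h) - func(x, y - h)) / (2 * h)
        return (dy(x0 + h, y0) - dy(x0 - h, y0)) / (2 * h)
    else:
        dx = lambda x, y: (func(x + h, y) - func(x - h, y)) / (2 * h)
        return (dx(x0, y0 + h) - dx(x0, y0 - h)) / (2 * h)

x0, y0 = 0.3, -0.8
print(mixed_partial(f, x0, y0, "xy"))
print(mixed_partial(f, x0, y0, "yx"))

# the symmetric second difference quotient from the lemma approximates the same value
h = 1e-4
print((f(x0 + h, y0 + h) - f(x0 + h, y0) - f(x0, y0 + h) + f(x0, y0)) / h ** 2)
```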