
Isotonic regression


Linear order isotonic regression can be thought of as approximating a given series of 1-dimensional observations with a non-decreasing function. It is similar to smoothing splines, with the difference that we use monotonicity, rather than smoothness, to remove noise from the data.

General isotonic regression is the approximation of a given series of values by values that satisfy a given partial order.

Problem


Given values $a_1, \dots, a_n$ and their positive weights $w_1, \dots, w_n$, approximate them as closely as possible by $y_1, \dots, y_n$, subject to a set of constraints of the kind $y_i \le y_j$:

Given
values $a_1, \dots, a_n$,
weights $w_1, \dots, w_n$ such that $w_i > 0$ for all $i$,
a set of constraints $E \subseteq \{1, \dots, n\} \times \{1, \dots, n\}$,
minimize $\sum_{i=1}^{n} w_i (y_i - a_i)^2$ with respect to $y_1, \dots, y_n$ subject to $y_i \le y_j$ for all $(i, j) \in E$

If all weights are equal to 1, the problem is called unweighted isotonic regression, otherwise it is called weighted isotonic regression.

The graph of constraints must be acyclic. For example, the constraints $y_1 \le y_2$ and $y_2 \le y_1$ cannot both be present.

Example:

Minimize $w_1 (y_1 - a_1)^2 + w_2 (y_2 - a_2)^2 + w_3 (y_3 - a_3)^2$ with respect to $y_1, y_2, y_3$ subject to $y_1 \le y_2$ and $y_1 \le y_3$

Linear order


Of particular interest is linear order isotonic regression.

Given
values $a_1, \dots, a_n$,
weights $w_1, \dots, w_n$ such that $w_i > 0$ for all $i$,
minimize $\sum_{i=1}^{n} w_i (y_i - a_i)^2$ with respect to $y_1, \dots, y_n$ subject to $y_1 \le y_2 \le \dots \le y_n$
Linear ordering isotonic regression for 100 points.

For linear order isotonic regression, there is a simple linear-time algorithm, the Pool Adjacent Violators Algorithm (PAVA), described below.

If all weights are equal to 1, the problem is called unweighted linear ordering isotonic regression.

Example

Linear ordering isotonic regression for $10^6$ points.
  • Isotonic regression for $10^6$ points.
  • Left: points $(x_i, y_i)$, where $y_i \in \{0, 1\}$. The probability that $y_i = 1$ is determined by a logistic function of $x_i$. Only 1000 points are shown.
  • Right: the result of isotonic regression (black graph) versus the logistic function (red graph). The logistic function was recovered with high accuracy.

Linear order isotonic regression as a model


Sometimes, linear ordering isotonic regression is applied to a set of observations $(x_i, y_i)$, where $x$ is the explanatory variable and $y$ is the dependent variable. The observations are sorted by their $x$'s, and then isotonic regression is applied to the $y$'s, with the additional constraint that $y_i^{fit} = y_j^{fit}$ for all $i, j$ with $x_i = x_j$.
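As a sketch of this procedure in R (the data here are invented for illustration; compare with the one-call form isoreg(x, y) shown in the isoreg section below):

# illustrative data: y depends monotonically on x, plus noise
set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)

ord   <- order(x)            # sort the observations by the explanatory variable
y.fit <- isoreg(y[ord])$yf   # isotonic regression applied to the y's in that order

plot(x[ord], y[ord], cex = 0.2)
lines(x[ord], y.fit, lwd = 2, col = 2)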

Other variants


Non-Euclidean metrics


Sometimes other metrics are used instead of the Euclidean ($L_2$) metric, for example weighted $L_p$ metrics:

$\left( \sum_{i=1}^{n} w_i \, |y_i - a_i|^p \right)^{1/p}$

or unweighted metrics:

$\left( \sum_{i=1}^{n} |y_i - a_i|^p \right)^{1/p}$

Points on a grid


Sometimes, the values $a_{ij}$ are placed on a 2-d or higher-dimensional grid. The fitted values must be non-decreasing along each dimension, e.g.:

Minimize $\sum_{i,j} w_{ij} (y_{ij} - a_{ij})^2$ with respect to $y$ subject to $y_{i,j} \le y_{i+1,j}$ and $y_{i,j} \le y_{i,j+1}$ for all $i, j$
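A minimal sketch of a 2-d fit in R, using the biviso function from the Iso package described below (the matrix m here is invented for illustration):

install.packages("Iso")   # should be done only once
library("Iso")            # should be done once per session

# noisy 20 x 20 grid whose true values increase along both rows and columns
set.seed(1)
m <- outer(1:20, 1:20, "+") + matrix(rnorm(400, sd = 3), 20, 20)

m.iso <- biviso(m)        # fitted values, non-decreasing along every row and every column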

Algorithms


Pool Adjacent Violators Algorithm


The Pool Adjacent Violators Algorithm (PAVA) is a linear-time (and linear-memory) algorithm for linear order isotonic regression.

Preliminary considerations


The algorithm is based on the following theorem:

Theorem: In an optimal solution, if $a_i \ge a_{i+1}$, then $y_i = y_{i+1}$.

Proof: Suppose the opposite, i.e. $y_i < y_{i+1}$. Since $a_i \ge a_{i+1}$, at least one of the following must hold: $y_i < a_i$ or $y_{i+1} > a_{i+1}$ (otherwise $a_i \le y_i < y_{i+1} \le a_{i+1}$ would contradict $a_i \ge a_{i+1}$). Then for a sufficiently small $\varepsilon > 0$, we can set

$y_i \leftarrow y_i + \varepsilon$ (in the first case) or $y_{i+1} \leftarrow y_{i+1} - \varepsilon$ (in the second case),

which decreases the sum without violating the constraints. Therefore, our original solution was not optimal. Contradiction.

Since $y_i = y_{i+1}$, we can combine the two points $(a_i, w_i)$ and $(a_{i+1}, w_{i+1})$ into a single new point with value $\frac{w_i a_i + w_{i+1} a_{i+1}}{w_i + w_{i+1}}$ and weight $w_i + w_{i+1}$.

One step of the PAVA algorithm.

However, after combining two points $i$ and $i+1$ into the new point, this new point can violate the constraint $y_{i-1} \le y_i$. In this case it should be combined with the $(i-1)$-th point. If the combined point again violates its constraint, it should be combined with the previous point, and so on.
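As a small illustration (the numbers are chosen arbitrarily): take unit weights and values $(1, 3, 2, 4)$. Points 2 and 3 violate the constraint $y_2 \le y_3$, so they are pooled into a single point with value $(3 + 2)/2 = 2.5$ and weight $2$. The pooled value does not violate the constraint with point 1 (since $1 \le 2.5$), so no further pooling is needed, and the result is $(1, 2.5, 2.5, 4)$.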


Algorithm


Input:

array of n values: initial_values[1] ... initial_values[n]
array of n weights: weights[1] ... weights[n]

Output:

array result (result[1] ... result[n]) minimizing the sum over i of weights[i] * (initial_values[i] - result[i]) ** 2, subject to result[1] <= result[2] <= ... <= result[n]

The algorithm:

Initialize:
    pooled_value[1]  = initial_values[1]
    pooled_weight[1] = weights[1]
    num_segments     = 1
    segment_end[0]   = 0
    segment_end[1]   = 1
For current_index = 2 to n do:
    # start a new segment containing only the current point
    num_segments++
    pooled_value[num_segments] = initial_values[current_index]
    pooled_weight[num_segments] = weights[current_index]
    # pool adjacent segments while they violate the monotonicity constraint
    while num_segments > 1 and pooled_value[num_segments] < pooled_value[num_segments - 1] do:
        pooled_value[num_segments - 1] =
            (pooled_weight[num_segments] * pooled_value[num_segments] + pooled_weight[num_segments - 1] * pooled_value[num_segments - 1]) /
            (pooled_weight[num_segments] + pooled_weight[num_segments - 1])
        pooled_weight[num_segments - 1] += pooled_weight[num_segments]
        num_segments--
    segment_end[num_segments] = current_index
# expand the pooled segment values back to the original points
for segment_index = 1 to num_segments do:
    for value_index = segment_end[segment_index - 1] + 1 to segment_end[segment_index] do:
        result[value_index] = pooled_value[segment_index]


Here segment_end defines which old points each new pooled point corresponds to: pooled point segment_index covers the original indices segment_end[segment_index - 1] + 1 through segment_end[segment_index].
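For reference, here is a direct R translation of the pseudocode above — a sketch, not an optimized implementation (the function name pava_sketch is made up for this example; the packaged implementations listed below should normally be preferred):

pava_sketch <- function(values, weights = rep(1, length(values))) {
  n <- length(values)
  pooled_value  <- numeric(n)
  pooled_weight <- numeric(n)
  segment_end   <- integer(n + 1)   # shifted by one: segment_end[k + 1] here is segment_end[k] above

  pooled_value[1]  <- values[1]
  pooled_weight[1] <- weights[1]
  num_segments   <- 1
  segment_end[1] <- 0
  segment_end[2] <- 1

  for (i in seq_len(n)[-1]) {       # i = 2, ..., n
    # start a new segment containing only the current point
    num_segments <- num_segments + 1
    pooled_value[num_segments]  <- values[i]
    pooled_weight[num_segments] <- weights[i]
    # pool adjacent segments while they violate the monotonicity constraint
    while (num_segments > 1 &&
           pooled_value[num_segments] < pooled_value[num_segments - 1]) {
      w <- pooled_weight[num_segments] + pooled_weight[num_segments - 1]
      pooled_value[num_segments - 1] <-
        (pooled_weight[num_segments] * pooled_value[num_segments] +
         pooled_weight[num_segments - 1] * pooled_value[num_segments - 1]) / w
      pooled_weight[num_segments - 1] <- w
      num_segments <- num_segments - 1
    }
    segment_end[num_segments + 1] <- i
  }

  # expand the pooled segment values back to the original points
  result <- numeric(n)
  for (s in seq_len(num_segments)) {
    result[(segment_end[s] + 1):segment_end[s + 1]] <- pooled_value[s]
  }
  result
}

pava_sketch(c(1, 3, 2, 4))   # 1.0 2.5 2.5 4.0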

Arbitrary case algorithms


In the arbitrary case, the problem can be solved as a quadratic program; the best known algorithms run in polynomial time, see:

  • Maxwell, WL and Muckstadt, JA (1985), "Establishing consistent and realistic reorder intervals in production-distribution systems", Operations Research 33, pp. 1316-1341.
  • Spouge J, Wan H, and Wilbur WJ (2003), "Least squares isotonic regression in two dimensions", J. Optimization Theory and Apps. 117, pp. 585-605.

Implementations


isoreg


The base R function isoreg performs unweighted linear ordering isotonic regression. It does not require any additional packages and is sufficient for many simple cases.

Example of usage:

x=sort(rnorm(10000))
y=x+rnorm(10000)
y.iso=isoreg(y)$yf
plot(x,y,cex=0.2)
lines(x,y.iso,lwd=2,col=2)

The isoreg function also implements linear ordering isotonic regression as a model:

x=rnorm(10000)
y=x+rnorm(10000)
y.iso=isoreg(x,y)$yf
plot(x,y,cex=0.2)
lines(sort(x),y.iso,lwd=2,col=2)

Iso

The package Iso contains three functions:

  • pava - linear ordering isotonic regression, weighted or unweighted.
  • biviso - 2-d isotonic regression
  • ufit - unimodal order (increasing then decreasing)

Example of usage:

install.packages("Iso") # should be done only once
library("Iso") # should be done once per session
x=sort(rnorm(10000))
y=x+rnorm(10000)
y.iso=pava(y)
plot(x,y,cex=0.2); lines(x,y.iso,lwd=2,col=2)

isotone


isotone is the most advanced of the three packages. It contains two functions:

  • gpava - linear ordering isotonic regression, weighted or unweighted, for any metrics. Similarly to isoreg, gpava can implement linear ordering isotonic regression as a model.
  • activeSet - general isotonic regression for any metrics.

Example of usage:

install.packages("isotone") # should be done only once
library("isotone") # should be done once per session
x=sort(rnorm(10000))
y=x+rnorm(10000)
y.iso=gpava(y)$x
plot(x,y,cex=0.2); lines(x,y.iso,lwd=2,col=2)

Comparison of speed


Since all three packages implement the PAVA algorithm in one form or another, we can compare their speed.

Based on this comparison, for unweighted linear order isotonic regression (LOIR), either isoreg or pava should be used, depending on the number of points $n$. For weighted LOIR, pava should be used. As for gpava, it should be used only for non-Euclidean metrics.

Also, the existing R implementations of weighted simple ordering isotonic regression are far from perfect.


Weka

Weka, a free software collection of machine learning algorithms for data mining tasks, developed at the University of Waikato, contains, among other things, an isotonic regression classifier. The classifier is deeply integrated into Weka's class hierarchy and cannot be used without Weka.

Python


While scikit-learn implements isotonic regression, Andrew Tulloch made a Cython implementation of the algorithm for linear ordering isotonic regression, which is 14-5000 times faster, depending on the data size. See [https://tullo.ch/articles/speeding-up-isotonic-regression/ Speeding up isotonic regression in scikit-learn by 5,000x]. If you just want the code, click [https://gist.github.com/ajtulloch/9447845#file-_isotonic-pyx here].

Usage


Calibrating the output for categorical probabilities


For more information, see "Predicting Good Probabilities With Supervised Learning", by Alexandru Niculescu-Mizil and Rich Caruana.

Most supervised learning algorithms, for example random forests[1], boosted trees, SVMs, etc., are good at predicting the most probable category, but not its probability. Some of them tend to overestimate high probabilities and underestimate low probabilities. (One notable exception is neural networks, which by themselves produce well-calibrated predictions.)

In order to obtain the correct probability, unweighted linear order isotonic regression can be used:

$p = f(\hat{p})$

where

$p$ is the final prediction for the probability,
$\hat{p}$ is the prediction given by the model,
$f$ is a non-decreasing function.

To find $f$, the model's predictions $\hat{p}_i$ on the validation set are matched with the observed values $y_i$ of the output variable, and isotonic regression is applied to the pairs $(\hat{p}_i, y_i)$.
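A sketch of this procedure in R, using the base isoreg function together with as.stepfun to turn the fit into a calibration map (the data and variable names are invented for illustration):

# validation set: raw model scores p_hat and observed 0/1 outcomes y_val
set.seed(1)
p_hat <- runif(5000)                             # illustrative model scores
y_val <- rbinom(5000, 1, plogis(6 * p_hat - 3))  # illustrative outcomes

fit <- isoreg(p_hat, y_val)    # fits a non-decreasing f to the (p_hat, y_val) pairs
f   <- as.stepfun(fit)         # calibration map: raw score -> calibrated probability

p_new        <- c(0.1, 0.5, 0.9)   # new raw model outputs
p_calibrated <- f(p_new)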

An alternative approach is to pass $\hat{p}$ through a sigmoid function:

$p = \frac{1}{1 + \exp(A \hat{p} + B)}$

where the parameters $A$ and $B$ are fitted on the validation set. This is called Platt calibration.

For small data sets, Platt calibration is better than isotonic regression. Starting at roughly 200-5000 observations, isotonic regression begins to slightly outperform Platt calibration. Note also that this kind of isotonic regression is simpler and faster than the logistic regression required by Platt calibration.

Calibration of recommendation models


The problem


The following example is for recommendation engines, whose general purpose is to predict the grade $r_{ui}$ given by user $u$ to item $i$.

We have a model $M$ that predicts the residuals $r$ of another model, trained on the training set. For users or items with a small number of observations, the model is overfitted, and the predictions need to be relaxed:

$\hat{r}_{ui} = f(s) \, M(u, i)$

where

$\hat{r}_{ui}$ is the final prediction for user $u$ and item $i$,
$M(u, i)$ is the prediction given by the model,
$f$ is a non-decreasing function,
$s$ may be either $s_u$ (the number of observations with user $u$ in the training set), or $s_i$ (the number of observations with item $i$ in the training set), or a combination of the two, depending on the nature of the model. The letter $s$ stands for support.

Solution


Given the set of triplets $(s_j, M_j, r_j)$ from the validation database, we have:

$\sum_j \left( r_j - f(s_j) M_j \right)^2 = \sum_j M_j^2 \left( \frac{r_j}{M_j} - f(s_j) \right)^2$

Therefore, for the $j$-th observation we should set weight $w_j = M_j^2$ and value $a_j = r_j / M_j$. The triplets are then sorted with respect to $s_j$, and the isotonic regression is applied to them.

When a model directly predicts the rating or acceptance rate, rather than the residual of some other model, we can define $r$ as the difference between this model's prediction and the prediction of some simpler model (e.g. the average rating for the item plus an average rating correction for the user). The simpler model is called the basic model.
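A sketch in R of the weighted fit described above, using the pava function from the Iso package (the vectors s, M and r are invented stand-ins for the support, the model prediction and the observed residual of each validation observation):

library("Iso")

# illustrative validation data
set.seed(1)
n <- 2000
s <- rpois(n, 20) + 1                              # support of each observation
M <- rnorm(n)                                      # model's predicted residual
r <- (1 - exp(-s / 30)) * M + rnorm(n, sd = 0.1)   # observed residual: shrinkage grows with support

ord <- order(s)                # sort by support
a   <- (r / M)[ord]            # values  a_j = r_j / M_j
w   <- (M^2)[ord]              # weights w_j = M_j^2

f_at_s <- pava(a, w)           # weighted isotonic regression: estimates f(s) at the sorted support values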

Analysis and calibrating input variables


Often, we know that an output variable depends monotonically on an input variable, but otherwise we have very little prior knowledge about the dependence. In this case, isotonic regression can provide important clues about the dependence.


Non-metric multidimensional scaling


A multidimensional scaling (MDS) algorithm, given a matrix of item-item similarities, places the items into an N-dimensional space so that similar items end up closer to each other than dissimilar items. This is especially useful for visualization purposes.

There are several types of MDS, depending on the relationship between similarity and distance, as well as on the definition of distance. For non-metric MDS, the distance can be any monotone function of similarity. The function is typically found using isotonic regression; see Non-metric multidimensional scaling in Wikipedia. Non-metric multidimensional scaling is implemented as isoMDS in R's MASS library.
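A minimal example of non-metric MDS in R with MASS::isoMDS (the built-in swiss data set is used purely for illustration):

library("MASS")

d   <- dist(scale(swiss))     # item-item dissimilarities
fit <- isoMDS(d, k = 2)       # non-metric MDS into 2 dimensions

plot(fit$points, type = "n")
text(fit$points, labels = rownames(swiss), cex = 0.7)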

Advantages, disadvantages, and post-processing


Advantages:

  • Non-parametric
  • Fast (linear time)
  • Simple

Disadvantages:

  • one or more points at the ends of the interval are sometimes noisy
  • it compares favorably with other approaches only if the sample size is large enough

These drawbacks can be mitigated by smoothing the result of the isotonic regression. In this way, we can get the best of both worlds (smoothness and monotonicity).

Further reading

  • Isotonic Regression Algorithms, by Quentin F. Stout - a review of the best currently available algorithms
  • Predicting Good Probabilities With Supervised Learning, by Alexandru Niculescu-Mizil and Rich Caruana
  • Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, by John C. Platt