# Statistics/Curve fitting

When collected data are examined, patterns often appear, such as a slope of -1 in a scatter plot from ray optics. The goal is then often to find a mathematical function that "fits" the data, that is, a function whose values are close to the observed values of the dependent variable at the corresponding values of the independent variable. The usual way of doing this is called the method of "least squares"; the reason for the name is explained later.

## Sales Example

A store sells whatsits at a price of P = 3.49 each, and the average number of whatsits sold per day (the volume) is V = 100. The total money received per day is therefore T = P × V = 349.00. If the price is reduced then, maybe, more whatsits will be sold, but T may turn out higher or lower. Obviously, if P = 0 then T will also be zero. Different prices gave the following results:

| P | V | T |
|---|---|---|
| 2.99 | 130 | 388.70 |
| 3.29 | 123 | 404.67 |
| 3.49 | 100 | 349.00 |
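As a quick check of the T column, the totals can be recomputed from P and V. The following is a minimal Python sketch; the names `observations` and `total_received` are chosen here for illustration and do not come from the original example.

```python
# Observed (price P, average daily volume V) pairs from the table above.
observations = [(2.99, 130), (3.29, 123), (3.49, 100)]


def total_received(price, volume):
    """Total money received per day, T = P * V, rounded to cents."""
    return round(price * volume, 2)


totals = [total_received(p, v) for p, v in observations]
for (p, v), t in zip(observations, totals):
    print(f"P={p:.2f}  V={v}  T={t:.2f}")
```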

The "best" price, the one that gives the largest T, appears to lie somewhere between 2.99 and 3.49. Curve fitting provides an equation for T versus P under each of the many available models, so that the models can be compared.

### Linear model

The linear model is based on __the "best" straight line__. Using a calculator that can do regression, we find that the line closest to the above data on a graph of T versus P is

- T = 605.268605263 - 68.9289473684 * P, and the correlation coefficient for this model is about -0.61, that is, a magnitude of about 60%.
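The quoted correlation can be checked without a regression calculator. The sketch below assumes that "correlation" means the Pearson correlation coefficient r; the source does not say which measure the calculator reports, and the function name `pearson_r` is illustrative.

```python
import math

prices = [2.99, 3.29, 3.49]        # P
totals = [388.70, 404.67, 349.00]  # T


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(prices, totals)
print(round(r, 3))  # roughly -0.605: negative, because T falls as P rises
```

The sign is negative because the fitted line slopes downward; the "about 60%" in the text matches the magnitude of r.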

Let us examine it in more detail:

| P | Actual T | Calculated T | Difference | Difference^{2} |
|---|---|---|---|---|
| 2.99 | 388.70 | 399.17105263159 | -10.4710526316 | 109.642943214 |
| 3.29 | 404.67 | 378.49236842106 | 26.1776315789 | 685.268395081 |
| 3.49 | 349.00 | 364.70657894738 | -15.7065789474 | 246.696622231 |

Adding the differences, we find that their sum is zero (apart from rounding error), as it must be for a least squares line with a constant term. Because positive and negative differences cancel in this way, their plain sum cannot measure how well the line fits. Squaring a negative number always gives a positive number, so the **SUM OF SQUARES** of the differences gives us an indication of the **GOODNESS OF FIT**. Here the **SUM OF SQUARES** is 1041.60796053. We can compute it for each of the different models and finally select the model that has the **LEAST SQUARES**.
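The differences and their squares can be recomputed from the fitted line. This sketch takes the intercept and slope quoted above as given; the variable names are illustrative.

```python
# Coefficients of the fitted line T = intercept + slope * P, as quoted above.
intercept = 605.268605263
slope = -68.9289473684

data = [(2.99, 388.70), (3.29, 404.67), (3.49, 349.00)]  # (P, actual T)

# Difference = actual T minus calculated T for each observation.
differences = [t - (intercept + slope * p) for p, t in data]

sum_of_differences = sum(differences)
sum_of_squares = sum(d * d for d in differences)

print(f"sum of differences: {sum_of_differences:.6f}")
print(f"sum of squares:     {sum_of_squares:.6f}")
```

The first figure comes out as zero to within rounding, and the second reproduces the sum of squares used to compare models.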

If you do NOT have a calculator or a computer that can do regression, the line can be calculated by hand as follows.

#### Calculation of the least squares line to fit the given points

We are looking for a and b in the equation of the straight line y=a+b*x, where x is the price P and y is the total T.

For the above example we have:

| | x | x^{2} | y | y^{2} | xy |
|---|---|---|---|---|---|
| | 2.99 | 8.9401 | 388.70 | 151087.69 | 1162.213 |
| | 3.29 | 10.8241 | 404.67 | 163757.8089 | 1331.3643 |
| | 3.49 | 12.1801 | 349.00 | 121801 | 1218.01 |
| Sum | 9.77 | 31.9443 | 1142.37 | 436646.4989 | 3711.5873 |

We have: n = number of points = 3

ax=average of x=9.77/3=3.2567 (rounded)

ay=average of y=1142.37/3=380.79

x1=sum of x=9.77

x2=sum of x^{2}=31.9443

y1=sum of y=1142.37

y2=sum of y^{2}=436646.4989

s1=sum of xy=3711.5873

z1=s1-(x1*y1/n)=3711.5873-(9.77*1142.37/3)= -8.731

z2=x2-(x1^{2}/n)=31.9443-9.77^{2}/3=0.12667 (rounded; keep full precision when dividing)

b=z1/z2=-68.9289473682

a=ay-b*ax=380.79-(-68.9289473682)*(9.77/3)=605.268605263

Thus we have y=605.268605263-68.9289473682*x as the best line to fit the given points of this example.
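The hand calculation above translates directly into code. The following Python sketch reuses the names x1, x2, y1, s1, z1 and z2 from the steps above; the function name `least_squares_line` is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    x1 = sum(xs)                              # sum of x
    x2 = sum(x * x for x in xs)               # sum of x^2
    y1 = sum(ys)                              # sum of y
    s1 = sum(x * y for x, y in zip(xs, ys))   # sum of x*y
    z1 = s1 - x1 * y1 / n
    z2 = x2 - x1 * x1 / n
    b = z1 / z2                               # slope
    a = y1 / n - b * x1 / n                   # intercept = ay - b*ax
    return a, b


a, b = least_squares_line([2.99, 3.29, 3.49], [388.70, 404.67, 349.00])
print(f"y = {a:.6f} + ({b:.6f})*x")
```

Note that z1 and z2 are small differences of larger numbers, so intermediate values should not be rounded before b is computed.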

### Parabolic Model

If we have n points with distinct x values, then a polynomial of degree at most n-1 will fit these n points exactly. In this example we are given 3 points, so a polynomial of the 2nd degree (a parabola) should give us an exact fit. The calculator provides the equation

(-663.1666666653)x^{2} + 4217.91999999x-6294.10448332, giving us

| P | Actual T | Calculated T | Difference |
|---|---|---|---|
| 2.99 | 388.70 | 388.6999999956 | 4.4E-9 (zero plus rounding error) |
| 3.29 | 404.67 | 404.6699999951 | 4.9E-9 (zero plus rounding error) |
| 3.49 | 349.00 | 348.999999995 | 5.0E-9 (zero plus rounding error) |

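That is a perfect fit: the sum of squares is zero apart from rounding error, so the **LEAST SQUARES** criterion indicates that this model be used. Keep in mind that a parabola has three coefficients and therefore passes through any three points with distinct x values, so the perfect fit alone says little about how well the model describes T at other prices.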
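The parabola's coefficients can be reproduced with Newton's divided differences, one of several equivalent ways to pass a polynomial through given points. This is a sketch under that assumption; the source does not say which method the calculator uses, and the names `parabola_through` and `vertex_price` are illustrative.

```python
def parabola_through(p0, p1, p2):
    """Coefficients (c2, c1, c0) of c2*x^2 + c1*x + c0 through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    d01 = (y1 - y0) / (x1 - x0)        # first divided differences
    d12 = (y2 - y1) / (x2 - x1)
    c2 = (d12 - d01) / (x2 - x0)       # second divided difference
    c1 = d01 - c2 * (x0 + x1)
    c0 = y0 - c1 * x0 - c2 * x0 * x0
    return c2, c1, c0


points = [(2.99, 388.70), (3.29, 404.67), (3.49, 349.00)]  # (P, T)
c2, c1, c0 = parabola_through(*points)
print(c2, c1, c0)

# Largest gap between the data and the parabola (rounding error only).
worst = max(abs(t - (c2 * p * p + c1 * p + c0)) for p, t in points)

# Price at the vertex of the parabola, where this model puts the largest T.
vertex_price = -c1 / (2 * c2)
print(worst, vertex_price)
```

Under this model the vertex falls at a price of about 3.18, inside the range between 2.99 and 3.49, though with only three observations that figure should be treated as a rough indication.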

### Other models

Some of the many other models are based on the exponential function, on logarithms, or on various transformations of the independent and/or the dependent variable(s). The "best fit" is usually the one that gives the **LEAST SQUARES**. The data can also be weighted when some points on a graph are more important than others (end points, for example).
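As an illustration of these ideas, the sketch below fits an exponential model T = A*exp(B*P) by fitting a straight line to ln(T) versus P, and also shows a weighted straight-line fit. The helper names and the weights [2, 1, 2] are hypothetical choices made for this example only; they are not part of the original data.

```python
import math


def fit_line(xs, ys, weights=None):
    """Weighted least squares line y = a + b*x (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def sum_of_squares(model, xs, ys):
    """Sum of squared differences between the data and model(x)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


prices = [2.99, 3.29, 3.49]
totals = [388.70, 404.67, 349.00]

# Ordinary linear model.
a_lin, b_lin = fit_line(prices, totals)
sse_lin = sum_of_squares(lambda p: a_lin + b_lin * p, prices, totals)

# Exponential model: a straight line fitted to ln(T), then transformed back.
log_a, b_exp = fit_line(prices, [math.log(t) for t in totals])
a_exp = math.exp(log_a)
sse_exp = sum_of_squares(lambda p: a_exp * math.exp(b_exp * p), prices, totals)

# Linear model with hypothetical weights that favor the two end points.
a_w, b_w = fit_line(prices, totals, weights=[2.0, 1.0, 2.0])

print(f"linear:      SSE = {sse_lin:.3f}")
print(f"exponential: SSE = {sse_exp:.3f}")
print(f"weighted line: T = {a_w:.3f} + ({b_w:.3f})*P")
```

Fitting on the logarithmic scale minimizes the squared differences of ln(T) rather than of T itself, so the sum of squares is recomputed in the original units before the two models are compared.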

- Caution: Some calculators may require consecutive, equally spaced values of the independent variable for curve fitting. Always compare the original graph with the "fitted" graph.