# Statistics/Distributions

How did students perform on the latest SAT? What is the average height of females under 21 in Zambia? How does beer consumption among students at engineering colleges compare with that of students at liberal arts colleges?

To answer these questions, we would collect data and put them in a form that is easy to summarize, visualize, and discuss. Loosely speaking, the collection and aggregation of data result in a distribution. Distributions are most often in the form of a histogram or a table. That way, we can "see" the data immediately and begin our scientific inquiry.
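
As a rough illustration of turning raw data into a distribution, the sketch below groups a handful of scores into bins and prints a text histogram. The scores are made up for the example, and `frequency_table` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up scores, for illustration only (not real SAT data)
scores = [410, 455, 520, 530, 580, 610, 615, 640, 700, 760]

table = frequency_table(scores, 100)

# Print one row per bin, with one '#' per score in that bin
for start, count in table.items():
    print(f"{start}-{start + 99}: {'#' * count}")
```

The dictionary is the "table" form of the distribution, and the printed rows are a crude histogram of the same counts.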

For example, if we want to know more about students' latest performance on the SAT, we would collect SAT scores from ETS, compile them in a way that is pertinent to us, and then form a distribution of these scores. The result may be a data table or it may be a plot. Regardless, once we "see" the data, we can begin asking more interesting research questions about our data.

The distributions we create often parallel distributions that are mathematically generated. For example, if we obtain the heights of all high school students and plot this data, the graph may resemble a normal distribution, which is generated mathematically. Then, instead of painstakingly collecting heights of all high school students, we could simply use a normal distribution to approximate the heights without sacrificing too much accuracy.
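
The idea of replacing collected data with a mathematical distribution can be sketched as follows. Since no real height data is given here, the "measurements" are synthetic values drawn from a normal distribution with an assumed mean of 170 cm and standard deviation of 8 cm, so the good fit is expected by construction. Real data would need to be checked rather than assumed to be normal.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic "measured" heights in cm: a stand-in for real collected data
heights = [random.gauss(170, 8) for _ in range(10_000)]

# Fit a normal distribution using the sample mean and standard deviation
model = NormalDist.from_samples(heights)

# Share of the data between 160 and 180 cm, counted directly
empirical = sum(160 <= h <= 180 for h in heights) / len(heights)

# The same share, estimated from the fitted normal distribution
approx = model.cdf(180) - model.cdf(160)

print(f"empirical: {empirical:.3f}, normal approximation: {approx:.3f}")
```

Once the fitted `model` is available, questions about the data can be answered from just two numbers, the mean and the standard deviation, instead of the full list of measurements.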

In the study of statistics, we focus on mathematical distributions for the sake of simplicity and relevance to the real world. Understanding these distributions will enable us to visualize data more easily and build models more quickly. However, they cannot and do not replace the work of collecting data and generating the actual data distribution.

A distribution shows what percentage of the data lies within a certain range. So, given a distribution and a range of values, we can determine the probability that an observation will fall within that range.
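
For a normal distribution, the probability of a range is the difference of two values of the cumulative distribution function. The sketch below assumes a hypothetical population of heights with mean 165 cm and standard deviation 7 cm. These parameters are chosen only for illustration, and `prob_between` is a helper name introduced here.

```python
from statistics import NormalDist

# Hypothetical parameters (cm), chosen only for illustration
heights = NormalDist(mu=165, sigma=7)

def prob_between(dist, a, b):
    """Probability that a value drawn from dist lies between a and b."""
    return dist.cdf(b) - dist.cdf(a)

# Range of one standard deviation on either side of the mean
p = prob_between(heights, 158, 172)
print(f"P(158 < X < 172) = {p:.4f}")
```

The range used here spans one standard deviation on each side of the mean, so the result is close to the familiar figure of about 68%.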

The same data may lead to different conclusions if it is fitted to different distributions. So, it is vital in any statistical analysis that the data be modeled with an appropriate distribution.
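
A small sketch of how the choice of distribution changes the answer, using the binomial and hypergeometric formulas from the comparison table in this section. The scenario (a lot of 20 items, 5 of them defective, 5 items drawn) is invented for the example. The hypergeometric model treats the draws as made without replacement, while the binomial model treats each draw as an independent trial with the same probability.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes in n independent trials, each with probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(x, N, n, k):
    """P(x successes in n draws without replacement from N items, k of them successes)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Invented scenario: 20 items, 5 defective, draw 5
N, k, n = 20, 5, 5

# Probability of drawing no defective items under each model
p_without = hypergeometric_pmf(0, N, n, k)   # without replacement
p_with = binomial_pmf(0, n, k / N)           # independent draws, p = 5/20

print(f"hypergeometric: {p_without:.4f}, binomial: {p_with:.4f}")
```

The two models give noticeably different probabilities for the same event, so a conclusion based on one could differ from a conclusion based on the other.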


## Comparison of Some Distributions

Some Distributions

| Name | Notation | Formula | Symbols | Use | Continuous/discrete | Notes |
|---|---|---|---|---|---|---|
| Bernoulli | $f(x)$ | $p^{x}(1-p)^{1-x}$ | $p$ probability of success; $x$ outcome (0 or 1) | 2 outcomes | Discrete | 1 trial |
| Binomial | $b(k;n,p)$ | $\binom{n}{k}p^{k}(1-p)^{n-k}$ | $n$ trials; $k$ successes; $p$ probability of success | number of times success occurs; specific probabilities; not random | Discrete | |
| Poisson | $P(x)$ | $\frac{e^{-\lambda t}(\lambda t)^{x}}{x!}$ | $\mu=\lambda t$; $\sigma^{2}=\lambda t$ | outcomes per unit of time; outcomes per region | Discrete | |
| Hypergeometric | $h(x;N,n,k)$ | $\frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}$ | $n$ samples from $N$ items; $k$ of the $N$ items are successes, $N-k$ are failures | number of times $x$ that success occurs, regardless of location; is random | Discrete | Without replacement |
| Multivariate Hypergeometric | $h(x_{1},x_{2},\dots,x_{k};a_{1},a_{2},\dots,a_{k},N,n)$ | $\frac{\binom{a_{1}}{x_{1}}\binom{a_{2}}{x_{2}}\cdots\binom{a_{k}}{x_{k}}}{\binom{N}{n}}$ | $n$ sample size; $N$ items; $k$ cells $A_{1}\dots A_{k}$ with $a_{1}\dots a_{k}$ elements each | | Discrete | Without replacement |
| Normal | $n(x;\mu,\sigma)$ | $\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^{2}/(2\sigma^{2})}$, standardized with $Z=\frac{x-\mu}{\sigma}$ | $x$ value; $\mu$ average; $\sigma$ standard deviation | $\int_{a}^{b}n(x;\mu,\sigma)\,dx$ is the probability of a value between $a$ and $b$ | Continuous | $Z$ is a random variable with $\mu=0$ and $\sigma^{2}=1$ |
| Chi-Square | $\chi^{2}$ | $\frac{(n-1)S^{2}}{\sigma^{2}}$ | $S^{2}$ variance of a random sample of size $n$ taken from a normal population with variance $\sigma^{2}$ | relating the variance of a random sample to the population variance | Continuous | |
| Student-t | $T$ | $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ | $\bar{X}$ mean of a random sample of size $n$ | when $\sigma$ is not known | Continuous | $v=n-1$ degrees of freedom |
| F | $F$ | $\frac{\sigma_{2}^{2}S_{1}^{2}}{\sigma_{1}^{2}S_{2}^{2}}=\frac{S_{1}^{2}}{S_{2}^{2}}$ | | | Continuous | The second equality holds when $\sigma_{1}^{2}=\sigma_{2}^{2}$ |
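
The Poisson row states that both the mean and the variance equal $\lambda t$. The sketch below checks this numerically by summing over the formula directly. The value $\lambda t = 3$ is an arbitrary choice for the example, and the sums are truncated at 60 terms, beyond which the remaining probability is negligible for this value.

```python
from math import exp, factorial

def poisson_pmf(x, lam_t):
    """P(x outcomes) when the expected number of outcomes is lam_t."""
    return exp(-lam_t) * lam_t**x / factorial(x)

lam_t = 3.0       # arbitrary expected count, for illustration only
xs = range(60)    # truncation point: terms beyond this are negligible here

total = sum(poisson_pmf(x, lam_t) for x in xs)                 # should be close to 1
mean = sum(x * poisson_pmf(x, lam_t) for x in xs)              # should be close to lam_t
var = sum((x - mean) ** 2 * poisson_pmf(x, lam_t) for x in xs) # should be close to lam_t

print(f"total={total:.6f}, mean={mean:.6f}, variance={var:.6f}")
```

The same pattern of summing over the formula can be used to sanity-check any of the discrete rows in the table.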