R Programming/Descriptive Statistics

R Basics
Data Management
Graphics
- Grammar of graphics
Publication quality output
Descriptive Statistics
Mathematics
Statistical Core Methods
Regression Models
Time Series
Factor Analysis
Classification
- Ordination
- Clustering
Network Analysis
High Performance Computing
- Profiling R code
- Parallel computing with R
Appendix
- Sources
- Index

In this section, we present descriptive statistics, ie a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.

Generic Functions

We introduce some functions to describe a dataset.

names() gives the names of each variable
str() gives the structure of the dataset
summary() gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.

> summary(mydat)

describe() (Hmisc package) gives more details than summary()

> library("Hmisc")
> describe(mydat)

contents() (Hmisc package)
dims() in the Zelig package.
descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
whatis() (YaleToolkit) gives a good description of a dataset.
detail() in the SciencesPo package gives a broad range of statistics for continuous variables, frequency tables for factors and length for character vectors.
describe() in the psych package also provides summary statistics:

> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11

Univariate analysis

Continuous variable

Moments

mean() computes the mean
the variance : var().
the standard deviation sd().
the skewness skewness() (fUtilities, moment or e1071)
the kurtosis : kurtosis() (fUtilities, moment or e1071)
all the moments : moment() (moment) and all.moments() (moment).

> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # mean, variance, skewness and kurtosis
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the skewness
[1] 0.0626506

Order statistics

the range, the minimum and the maximum : range() returns the range of a vector (minimum and maximum of a vector), min() the minimum and max() the maximum.
IQR() computes the interquartile range. median() computes the median and mad() the median absolute deviation.
quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg packages computes the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is big.

> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"

Inequality Index

The gini coefficient : Gini() (ineq) and gini() (reldist).
ineq() (ineq) gives all inequalities index.

> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694

Concentration index

> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646

Poverty index

> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index

Plotting the distribution

We can plot the distribution using a box plot (boxplot()), an histogram (hist()), a kernel estimator (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and the third quartile.

A box-plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum.
stripchart() and stem() are also availables.

> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # it does not do the plot but adds a line to existing one

Goodness of fit tests

Kolmogorov Smirnov Test :

The KS test is one sample goodness of fit test. The test statistic is simply the maximum of the absolute value of the difference between the empirical cumulative distribution function and the theoritical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and we test if it fits a Beta(2,2) a Beta(1,1) and a uniform distribution.

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if its fit a uniform distribution (in fact the beta(1,1) is a uniform distribution)

Some tests are specific to the normal distribution. The Lillie Test is an extension of the KS test when the parameters are unknown. This is implemented with the lillie.test() in the nortest package. shapiro.test() implements the Shapiro Wilk Normality Test

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)

         Lilliefors (Kolmogorov-Smirnov) normality test

data:  x 
D = 0.0955, p-value = 0.9982*
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9916, p-value = 0.7902

Andersen Darling Test :

> library("nortest")
> ad.test(x)

	Anderson-Darling normality test

data:  x 
A = 0.2541, p-value = 0.7247

See also the package ADGofTest for another version of this test^[1].

Shapiro-Francia normality test :

> sf.test(x)

	Shapiro-Francia normality test

data:  x 
W = 0.9866, p-value = 0.9953

Pearson chi-square normality test :

> library("nortest")
> pearson.test(x)

	Pearson chi-square normality test

data:  x 
P = 0.8, p-value = 0.8495

Cramer-von Mises normality test

> cvm.test(x)

	Cramer-von Mises normality test

data:  x 
W = 0.0182, p-value = 0.9756

Jarque-Bera test :

> jarque.bera.test(x)

	Jarque Bera Test

data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

Discrete variable

We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart (pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot() (lattice)).

freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.

> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100

Multivariate analysis

Continuous variables

Covariance : cov()
Pearson's linear correlation : cor().
Pearson's correlation test cor.test() performs the test.
Spearman's rank correlation :
- cor() with method = "spearman".
- spearman() (Hmisc)
Spearman's rank correlation test :
- spearman2() (Hmisc)
- spearman.test() (Hmisc)
- spearman.test() (pspearman package) performs the Spearman’s rank correlation test with precomputed exact null distribution for n <= 22.
Kendall's correlation : cor() with method = "kendall". See also the Kendall package.

> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")

Discrete variables

table(), xtabs() and prop.table() for contingency tables. ftable() (stats package) for a flat (nested) table.
assocplot() and mosaicplot() for graphical display of contingency table.
CrossTable() (descr) is similar to SAS Proc Freq. It returns a contingency table with Chi square and Fisher independence tests.
my.table.NA() and my.table.margin() (cwhmisc)
chisq.detail() (TeachingDemos)

Discrete and Continuous variables

bystats() Statistics by Categories in the Hmisc package
summaryBy() (doBy)
Multiple box plots : plot() or boxplot()

> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)

Equality of two sample mean t.test() and wilcox.test(), Equality of variance var.test(), equality of two distributions ks.test().

N <- 100
x <- sample(0:1,N, replace = T) 
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)

References

↑ Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest

Previous: Graphics

Index

Next: Linear Models

[1] Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest

[1]