# Biostatistics with R/Descriptive Statistics

## The Frequency Distribution[edit]

Example 2.2.1 detailed the procedure to sort an array. This array is a series of ages in subjects received two kinds of smoking cessation program. Suppose you already import the data set using the following command:

> SmokeCProg <- read.csv("/EXA_C01_S04_01.csv", header=T, na.strings=NA)

It is better to use a descriptive name (SmokeCProg for Smoking Cessation Program) rather than commonly used place holder name such as x,y. We can obtain a sorted array of ages using the following command:

> sort(SmokeCProg$AGE)

The frequency distribution of Ages as shown in table 2.3.1 can be obtained using:

> table(cut(SmokeCProg$AGE, b=c(0,39,49,59,69,79,89))) (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 11 46 70 45 16 1

*cut* command break up AGE variables based on the break points (0,39,49,59,69,79,89) provided. In table 2.3.2, the frequency table of age was provided. As suggested by Venables et al. in the book "An Introduction to R", statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Compared to other statistical packages, R will only give minimal output. We will demonstrate this important characteristic in this example. In previous example, we calculated the frequency distribution of Ages using *table()* and *cut()* command. We can store the results in form of a object called "AgeFreqTable" using:

> AgeFreqTable <- table(cut(SmokeCProg$AGE, b=c(0,39,49,59,69,79,89)))

You will get no output. Until you call the object "AgeFreqTable"

> AgeFreqTable (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 11 46 70 45 16 1

In order to obtain the cumulative frequency, we can process the object "AgeFreqTable" using *cumsum()* command

> cumsum(AgeFreqTable) (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 11 57 127 172 188 189

Before we jump to the calculation of relative frequency, we can obtain the total number of observations in a variable using *length()* function

> length(SmokeCProg$AGE) [1] 189

We can calculate the relative frequency by dividing each items in the object "AgeFreqTable" by the total number of observations using

> AgeFreqTable/length(SmokeCProg$AGE) (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 0.058201058 0.243386243 0.370370370 0.238095238 0.084656085 0.005291005

Similarly, the cummulative relative frequency can be calculated using

> cumsum(AgeFreqTable)/length(SmokeCProg$AGE) (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 0.05820106 0.30158730 0.67195767 0.91005291 0.99470899 1.00000000

If you would like to round the results of relative frequency to 4 digits, you can use the *round()* function

> round (AgeFreqTable/length(SmokeCProg$AGE),digits=4) (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 0.0582 0.3016 0.6720 0.9101 0.9947 1.0000

Alternatively, you can store the results of relative frequency in a new object and then process that object with *round()* function

> AgeRelFreqTable <- AgeFreqTable/length(SmokeCProg$AGE) > round (AgeRelFreqTable, digits=4)

**Exercise:** Try to round the results of cummulative relative frequency to 4 digits using R command To plot a histogram, you can use the *hist()* function, e.g.

> hist(SmokeCProg$AGE)

You can customize the histogram by adding some arguments (i.e. options), you may type ?hist to learn more about the argument of hist() function. For example, if you want to plot a histogram with only five bars (similar to Figure 2.3.2)

> hist(SmokeCProg$AGE, breaks=5)

You can add more arguments to *hist()* functions, e.g.

> hist(SmokeCProg$AGE, breaks=5, ylim=c(0,70), main="Histogram of Ages of 189 subjects", col="red", xlab="Age")

Remember, always consult the document (e.g. ?hist or help.search("histogram") ) when you have question. In 95% of the time, you can find the answer in help document. For example, you don't know how to plot a stem-and-leaf graph to display your data. You don't even know the name of the function. You can use help.search() to search for the keyword "stem", i.e.

> help.search("stem")

A function called *stem()* should be in the results. We then try to use this function to visual our data

> stem(SmokeCProg$AGE) The decimal point is 1 digit(s) to the right of the | 3 | 04 3 | 577888899 4 | 00223333334444444 4 | 55566666677777788888889999999 5 | 0000000011112222223333333333333333344444444444 5 | 555666666777777788999999 6 | 000011111111111222222233444444 6 | 556666667888999 7 | 0111111123 7 | 567888 8 | 2

Not similar to MINITAB, the steam unit is adjusted by the scale argument. The plot above using a default scale of 1 which is equivalent to steam unit =5. To change the steam unit to 10, the value of scale argument should be change to 0.5

> stem(SmokeCProg$AGE, scale=0.5) The decimal point is 1 digit(s) to the right of the | 3 | 04577888899 4 | 0022333333444444455566666677777788888889999999 5 | 00000000111122222233333333333333333444444444445556666667777777889999 6 | 000011111111111222222233444444556666667888999 7 | 0111111123567888 8 | 2