Introducing Julia/DataFrames

DataFrames

This chapter is a brief introduction to Julia's DataFrames package.

A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.

To explore the use of dataframes, we'll start by examining a well-known statistics dataset called Anscombe's Quartet.

You'll probably have to download and install the DataFrames and RDatasets packages, if you've not used them before, because they're not (yet) part of a standard Julia installation:

Pkg.add("DataFrames")
Pkg.add("RDatasets")

You have to do this just once.
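
Once a package is installed, you load it with using. As a quick check that everything works, you can build a tiny DataFrame by hand (a minimal sketch; the column names here are invented for illustration):

using DataFrames

# a table with two rows and two columns: a string column and a Float64 column
df = DataFrame(Name = ["Hydrogen", "Helium"], AtomicWeight = [1.0079, 4.0026])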

Loading data into dataframes

To get started, create a new DataFrame by loading the Anscombe dataset and assigning it to a variable anscombe:

julia> using RDatasets, DataFrames
julia> anscombe = dataset("datasets","anscombe")
11x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3    | Y4   |
|-----|----|----|----|----|-------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04  | 9.14 | 7.46  | 6.58 |
| 2   | 8  | 8  | 8  | 8  | 6.95  | 8.14 | 6.77  | 5.76 |
| 3   | 13 | 13 | 13 | 8  | 7.58  | 8.74 | 12.74 | 7.71 |
| 4   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11  | 8.84 |
| 5   | 11 | 11 | 11 | 8  | 8.33  | 9.26 | 7.81  | 8.47 |
| 6   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84  | 7.04 |
| 7   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08  | 5.25 |
| 8   | 4  | 4  | 4  | 19 | 4.26  | 3.1  | 5.39  | 12.5 |
| 9   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15  | 5.56 |
| 10  | 7  | 7  | 7  | 8  | 4.82  | 7.26 | 6.42  | 7.91 |
| 11  | 5  | 5  | 5  | 8  | 5.68  | 4.74 | 5.73  | 6.89 |

The output you see depends on whether you're working in a terminal or using an IJulia notebook. But in either case you can see that you have a table of data, with 8 named columns (X1 to Y4), and 11 rows (1 to 11). The first of Anscombe's four datasets consists of columns X1 and Y1, the second consists of columns X2 and Y2, and so on. Because the columns are named, it's easy to refer to specific columns when processing or analyzing the data.

julia> typeof(anscombe)
DataFrame (constructor with 11 methods)

To obtain a list of column names, use the names() function:

julia> names(anscombe)
8-element Array{Symbol,1}:
 :X1
 :X2
 :X3
 :X4
 :Y1
 :Y2
 :Y3
 :Y4

Other useful functions:

julia> size(anscombe) # number of rows, number of columns
(11,8)
julia> length(anscombe) # number of columns
8
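
DataFrames also provides nrow() and ncol(), which should report the same information by name (nrow() turns up again later in this chapter):

julia> nrow(anscombe)
11
julia> ncol(anscombe)
8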

The head() function (and the matching tail()) is useful for larger datasets: it quickly shows the first (or last) six rows, so you can check what you've loaded:

julia> head(anscombe)
6x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3    | Y4   |
|-----|----|----|----|----|------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04 | 9.14 | 7.46  | 6.58 |
| 2   | 8  | 8  | 8  | 8  | 6.95 | 8.14 | 6.77  | 5.76 |
| 3   | 13 | 13 | 13 | 8  | 7.58 | 8.74 | 12.74 | 7.71 |
| 4   | 9  | 9  | 9  | 8  | 8.81 | 8.77 | 7.11  | 8.84 |
| 5   | 11 | 11 | 11 | 8  | 8.33 | 9.26 | 7.81  | 8.47 |
| 6   | 14 | 14 | 14 | 8  | 9.96 | 8.1  | 8.84  | 7.04 |

The showcols() function shows the type of each column, along with a count of any missing values.

julia> showcols(anscombe)
11x8 DataFrame
| Col # | Name | Eltype  | Missing |
|-------|------|---------|---------|
| 1     | X1   | Int64   | 0       |
| 2     | X2   | Int64   | 0       |
| 3     | X3   | Int64   | 0       |
| 4     | X4   | Int64   | 0       |
| 5     | Y1   | Float64 | 0       |
| 6     | Y2   | Float64 | 0       |
| 7     | Y3   | Float64 | 0       |
| 8     | Y4   | Float64 | 0       |

Notice that some of the columns (all the X columns) contain integer values, and others (all the Y columns) contain floating-point numbers. Every element in a column of a DataFrame has the same type, but different columns can have different types. This makes a DataFrame ideal for storing tabular data: strings in one column, numeric values in another, and so on.
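
To check a single column's element type, you can use Julia's standard eltype() function on that column:

julia> eltype(anscombe[:X1])
Int64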

Referring to columns and rows

You can refer to columns by number or name. Following the general Julia convention for symbols, you precede a column name with a colon (:), so :Y2 refers to the column called Y2, which is also column number 6. Here's how to refer to column Y2 by name:

julia> anscombe[:Y2]
11-element DataArray{Float64,1}:
 9.14
 8.14
 8.74
 8.77
 9.26
 8.1 
 6.13
 3.1 
 9.13
 7.26
 4.74

and here's how to refer to the same column by its number:

julia> anscombe[6]
11-element DataArray{Float64,1}:
 9.14
 8.14
 8.74
 8.77
 9.26
 8.1 
 6.13
 3.1 
 9.13
 7.26
 4.74


Referring to specific columns

To access the 3rd and 5th columns of the dataset anscombe, for example, you can use any of several methods:

1. anscombe[:, [:X3, :Y1]]

2. anscombe[:, [3, 5]]

3. anscombe[:, [Symbol("X3"), Symbol("Y1")]]

The second and third methods are useful if you want to use variables to access columns. For example, if you set a to 3 and b to 5, you can access columns 3 and 5 with:

anscombe[:, [a, b]]

Similarly:

a="X3"
b="Y1"
anscombe[:,[Symbol(a), Symbol(b)]] 

would also work.

Rectangular "slices"

You can look at any rectangular slice or subset of the dataset, using either index numbers or names.

julia> anscombe[4:6,[:X2,:X4]]
3x2 DataFrame
| Row | X2 | X4 |
|-----|----|----|
| 1   | 9  | 8  |
| 2   | 11 | 8  |
| 3   | 14 | 8  |

which shows rows 4, 5, and 6, and columns X2 and X4. Instead of a range of rows, you can specify individual rows using commas.

julia> anscombe[[4,6, 9],[:X2,:X4]]
3x2 DataFrame
| Row | X2 | X4 |
|-----|----|----|
| 1   | 9  | 8  |
| 2   | 14 | 8  |
| 3   | 12 | 8  |

which shows rows 4, 6, and 9, and columns X2 and X4.

Or, using index numbers:

julia> anscombe[[4,6,8],[2, 6, 8]]
3x3 DataFrame
| Row | X2 | Y2   | Y4   |
|-----|----|------|------|
| 1   | 9  | 8.77 | 8.84 |
| 2   | 14 | 8.1  | 7.04 |
| 3   | 4  | 3.1  | 12.5 |

This is how to specify a range of columns — use index numbers:

julia> anscombe[4:6,3:5]
3x3 DataFrame
| Row | X3 | X4 | Y1   |
|-----|----|----|------|
| 1   | 9  | 8  | 8.81 |
| 2   | 11 | 8  | 8.33 |
| 3   | 14 | 8  | 9.96 |

Notice that the row numbering for the returned dataframes is different — in the last example, rows 4, 5, and 6 became rows 1, 2, and 3 in the new dataframe.

As with arrays, use the colon on its own to specify 'all' columns or rows:

Rows 4, 6, 8, and 11, showing all columns:

julia> anscombe[[4,6,8,11],:]
4x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3   | Y4   |
|-----|----|----|----|----|------|------|------|------|
| 1   | 9  | 9  | 9  | 8  | 8.81 | 8.77 | 7.11 | 8.84 |
| 2   | 14 | 14 | 14 | 8  | 9.96 | 8.1  | 8.84 | 7.04 |
| 3   | 4  | 4  | 4  | 19 | 4.26 | 3.1  | 5.39 | 12.5 |
| 4   | 5  | 5  | 5  | 8  | 5.68 | 4.74 | 5.73 | 6.89 |

All rows, columns X1 and Y1:

julia> anscombe[:, [:X1, :Y1]]
11x2 DataFrame
| Row | X1 | Y1    |
|-----|----|-------|
| 1   | 10 | 8.04  |
| 2   | 8  | 6.95  |
| 3   | 13 | 7.58  |
| 4   | 9  | 8.81  |
| 5   | 11 | 8.33  |
| 6   | 14 | 9.96  |
| 7   | 6  | 7.24  |
| 8   | 4  | 4.26  |
| 9   | 12 | 10.84 |
| 10  | 7  | 4.82  |
| 11  | 5  | 5.68  |

Selecting rows with conditions

You can select the rows of a dataframe where the elements satisfy one or more conditions.

Here's how to select rows where the value of the element in column Y1 is greater than 7.0:

julia> anscombe[anscombe[:Y1] .> 7.0, :]
7x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3    | Y4   |
|-----|----|----|----|----|-------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04  | 9.14 | 7.46  | 6.58 |
| 2   | 13 | 13 | 13 | 8  | 7.58  | 8.74 | 12.74 | 7.71 |
| 3   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11  | 8.84 |
| 4   | 11 | 11 | 11 | 8  | 8.33  | 9.26 | 7.81  | 8.47 |
| 5   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84  | 7.04 |
| 6   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08  | 5.25 |
| 7   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15  | 5.56 |

The 'inner' phrase anscombe[:Y1] .> 7.0 carries out an element-wise comparison of the values in column Y1, returning an array of Boolean true or false values, one for each row. These are then used to select rows from the dataframe. It's as if you'd entered:

julia> anscombe[[true, false, true, true, true, true, true, false, true, false, false], :]
7x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3    | Y4   |
|-----|----|----|----|----|-------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04  | 9.14 | 7.46  | 6.58 |
| 2   | 13 | 13 | 13 | 8  | 7.58  | 8.74 | 12.74 | 7.71 |
| 3   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11  | 8.84 |
| 4   | 11 | 11 | 11 | 8  | 8.33  | 9.26 | 7.81  | 8.47 |
| 5   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84  | 7.04 |
| 6   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08  | 5.25 |
| 7   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15  | 5.56 |

In a similar way, here's a result that contains every row where the value in column Y1 is greater than the value in column Y2:

julia> anscombe[anscombe[:Y1] .> anscombe[:Y2], :]
6x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3   | Y4   |
|-----|----|----|----|----|-------|------|------|------|
| 1   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11 | 8.84 |
| 2   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84 | 7.04 |
| 3   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08 | 5.25 |
| 4   | 4  | 4  | 4  | 19 | 4.26  | 3.1  | 5.39 | 12.5 |
| 5   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15 | 5.56 |
| 6   | 5  | 5  | 5  | 8  | 5.68  | 4.74 | 5.73 | 6.89 |

Combining two or more conditions is also possible. Here's a result consisting of rows where the value of Y1 is greater than 5.0 and that of Y2 is less than 7.0:

julia> anscombe[(anscombe[:Y1] .> 5.0) .& (anscombe[:Y2] .< 7.0), :]
2x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3   | Y4   |
|-----|----|----|----|----|------|------|------|------|
| 1   | 6  | 6  | 6  | 8  | 7.24 | 6.13 | 6.08 | 5.25 |
| 2   | 5  | 5  | 5  | 8  | 5.68 | 4.74 | 5.73 | 6.89 |
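
Conditions can also be combined with a broadcast 'or', using .| by analogy with the .& example above (a sketch, not captured output):

julia> anscombe[(anscombe[:Y1] .> 9.0) .| (anscombe[:Y4] .> 9.0), :]

This should return the three rows (rows 6, 8, and 9 of the original dataframe) where at least one of the two conditions holds.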

Applying functions to columns and rows

You can apply a function to a column. To find out the mean of the values in the column named X2:

julia> mean(anscombe[:X2])
9.0

You can apply a function to a group of columns at the same time by using colwise(). For example, to apply the mean() function to every column of the dataframe:

julia> colwise(mean, anscombe)
8-element Array{Any,1}:
 [9.0]    
 [9.0]    
 [9.0]    
 [9.0]    
 [7.50091]
 [7.50091]
 [7.5]    
 [7.50091]

which returns a new array containing the mean values for each column.

To apply a function to one or more specific columns, use the following syntax:

julia> colwise(mean, anscombe[[:X1, :X2, :Y3, :Y4]])
4-element Array{Any,1}:
 [9.0]    
 [9.0]    
 [7.5]    
 [7.50091]

The DataFrames package provides two convenient iterators, eachcol() and eachrow(), for working through every column or every row. With eachcol(), each value is a tuple of a Symbol (the column heading) and the column's DataArray; with eachrow(), each value is a DataFrameRow.

for col in eachcol(anscombe)
   println(col)
end

(:X1,[10,8,13,9,11,14,6,4,12,7,5])
(:X2,[10,8,13,9,11,14,6,4,12,7,5])
(:X3,[10,8,13,9,11,14,6,4,12,7,5])
(:X4,[8,8,8,8,8,8,8,19,8,8,8])
(:Y1,[8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68])
(:Y2,[9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74])
(:Y3,[7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73])
(:Y4,[6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89])

Here's the mean of each column, taking the second element of each tuple provided by eachcol():

julia> for col in eachcol(anscombe)
           println(mean(col[2]))
        end
9.0
9.0
9.0
9.0
7.500909090909093
7.500909090909091
7.500000000000001
7.50090909090909

The eachrow() function provides an iterator for rows:

for r in eachrow(anscombe)
    println(r)
end

DataFrameRow (row 1)
X1  10
X2  10
X3  10
X4  8
Y1  8.04
Y2  9.14
Y3  7.46
Y4  6.58
...
DataFrameRow (row 11)
X1  5
X2  5
X3  5
X4  8
Y1  5.68
Y2  4.74
Y3  5.73
Y4  6.89 

Since every element of each row is a number, we can in this case use eachrow() to find the (fairly meaningless) mean of each row:

julia> for row in eachrow(anscombe)
          println(mean(convert(Array, row)))
       end
8.6525
7.4525
10.47125
8.56625
9.358749999999999
10.492500000000001
6.3375
7.03125
9.71
6.92625
5.755000000000001

Plotting Anscombe's Quartet

Now let’s shift focus to statistics and compute the mean and standard deviation of every column, and the correlation coefficient (the Pearson product-moment correlation coefficient, or r) between x and y for each dataset. The functions we will use are mean(), median(), std(), and cor().

A first rough version of code to do this uses an anonymous function with two arguments, xcol and ycol, which is mapped over the X and Y columns of the dataframe:

# print a header
println("Column\tMeanX\tMedianX\tStdDev X\tMeanY\t\t\tStdDev Y\t\tCorr\t")

map((xcol,ycol) -> println(
    xcol,                   "\t",
    mean(anscombe[xcol]),   "\t", 
    median(anscombe[xcol]), "\t", 
    std(anscombe[xcol]),    "\t", 
    mean(anscombe[ycol]),   "\t", 
    std(anscombe[ycol]),    "\t", 
    cor(anscombe[xcol], anscombe[ycol])), 
    
    [:X1, :X2, :X3, :X4], 
    [:Y1, :Y2, :Y3, :Y4]);

Column  MeanX   MedianX StdDev X        MeanY                   StdDev Y                Corr    
X1      9.0     9.0     3.3166247903554 7.500909090909093       2.031568135925815       0.81642051634484
X2      9.0     9.0     3.3166247903554 7.500909090909091       2.0316567355016177      0.8162365060002428
X3      9.0     9.0     3.3166247903554 7.500000000000001       2.0304236011236667      0.8162867394895981
X4      9.0     8.0     3.3166247903554 7.50090909090909        2.0305785113876023      0.8165214368885028

Anscombe's Quartet comprises four datasets that have nearly identical simple statistical properties but look very different when graphed. Each dataset consists of eleven (x, y) points. The statistician Francis Anscombe constructed them in 1973 to demonstrate both the importance of graphing data before relying on summary statistics, and the effect of outliers on statistical properties.

As you can see from the output, the four datasets have very nearly the same means, standard deviations, and correlation coefficients between x and y. If you relied only on these summary statistics, you would assume that the four sets were pretty similar.

By the way, there's a quicker and easier way to do this, if you don't want to write specific code and don't mind the formatting. Use the built-in describe() function:

describe(anscombe)
X1
Min      4.0
1st Qu.  6.5
Median   9.0
Mean     9.0
3rd Qu.  11.5
Max      14.0
NAs      0
NA%      0.0%

X2
Min      4.0
1st Qu.  6.5
Median   9.0
Mean     9.0
3rd Qu.  11.5
Max      14.0
NAs      0
NA%      0.0%
...
Y3
Min      5.39
1st Qu.  6.25
Median   7.11
Mean     7.500000000000001
3rd Qu.  7.98
Max      12.74
NAs      0
NA%      0.0%

Y4
Min      5.25
1st Qu.  6.17
Median   7.04
Mean     7.50090909090909
3rd Qu.  8.190000000000001
Max      12.5
NAs      0
NA%      0.0%

Having looked at the summary statistics, it's time to reveal Anscombe's true purpose, by plotting the four sets of his quartet.

using StatPlots
@df anscombe scatter([:X1 :X2 :X3 :X4], [:Y1 :Y2 :Y3 :Y4],
           smooth=true,
           line = :red,
           linewidth = 2,
           title= ["X$i vs Y$i" for i in (1:4)'],
           legend = false,
           layout = 4,
           xlimits = (2, 20),
           ylimits = (2, 14))

[Image: plotting the Anscombe relationships]

Now we can see what Anscombe was getting at. The simple summary statistics for the four datasets (mean, standard deviation, and correlation) are almost identical, and yet it's clear that the datasets are quite different when plotted.

The first appears to show a simple linear relationship between two correlated variables, consistent with an assumption of normality.

The second set of points is not distributed normally; there is an obvious relationship between the two variables, but it isn't linear, and the Pearson correlation coefficient is not really relevant.

In the third set, the relationship is linear, but a single outlier exerts enough influence to shift the regression line and lower the correlation coefficient from 1 to 0.816.

Finally, the fourth set shows how a single outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze it, and the inadequacy of basic statistical properties for describing realistic datasets.

Regression and Models

If you want to find a linear regression line for the datasets, you can turn to the GLM (Generalized Linear Models) package.

julia> using GLM

To create a linear model, you specify a formula using the @formula macro (supplying the column names) and pass it to fit() together with the DataFrame. The result is a regression model:

julia> linearmodel = fit(LinearModel, @formula(Y1 ~ X1), anscombe)
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y1 ~ 1 + X1

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   3.00009   1.12475 2.66735   0.0257
X1           0.500091  0.117906 4.24146   0.0022

Useful functions in the GLM package for working with linear models include names(), summary(), and coef().

julia> names(linearmodel)
3-element Array{Symbol,1}:
 :model
 :mf   
 :mm   
julia> summary(linearmodel)
"DataFrameRegressionModel{LinearModel{DensePredQR{Float64}},Float64}"

The coef() function returns the two useful coefficients that define the regression line: the estimated intercept and the estimated slope:

julia> coef(linearmodel)
2-element Array{Float64,1}:
 3.00009 
 0.500091

It's now easy to produce a function for the regression line in the form y = a x + c:

julia> f(x) = coef(linearmodel)[2] * x + coef(linearmodel)[1]
f (generic function with 1 method)
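
If you'd rather not build the function by hand, GLM's predict() should return the fitted values directly (a hedged aside, not part of the original walkthrough):

julia> predict(linearmodel)   # fitted Y1 values, one for each X1 in anscombe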

Now that we have f() as a function describing the regression line, it can be drawn in a plot. Here we plot the first series, and add a plot of the function f(x) with x running from 2 to 20, and see how it compares with the smoothing line we used earlier.

p1 = plot(anscombe[:X1], anscombe[:Y1], 
    smooth=true, 
    seriestype=:scatter, 
    title = "X1 vs Y1", 
    linewidth=8,
    linealpha=0.5,
    label="data")

plot!(f, 2, 20, label="correlation")

[Image: plotting the Anscombe quartet with the regression line]

If you're using another plotting package, such as Bokeh.jl, you can produce a similar graph. Plotting multiple layers is achieved here by 'holding' the plot before showing it:

using RDatasets, DataFrames, Bokeh, GLM
anscombe = dataset("datasets","anscombe")
linearmodel = fit(LinearModel, @formula(Y1 ~ X1), anscombe)
f(x) = coef(linearmodel)[2] * x + coef(linearmodel)[1]
plotfile("anscombe.html")
plot(anscombe[:X1], anscombe[:Y1], "rs|bo|g*")
hold(true)
plot(collect(1:20), map(f, collect(1:20)))
showplot()

[Image: plot of Anscombe's first dataset using Julia and Bokeh.jl]

(The original file is interactive when viewed in the browser.)

Working with dataframes

Not all datasets are as consistent and tidy as the examples in RDatasets. In the real world, it's possible that you'll read some data into a dataframe, only to discover that it's got a few problems, with inconsistently formatted and/or missing elements.

For this section, we'll create a simple test DataFrame 'by hand', using the @data macro to create the columns. It's a short extract from what might be a periodic table.

ptable = DataFrame(  Number       =   @data([1,   2,    6,    8,    26    ]),
                     Name         =   @data(["Hydrogen",   "Helium",   "Carbon",   "Oxygen",   "Iron"   ]),
                     AtomicWeight =   @data([1.0079,    4.0026,  12.0107, 15.9994, 55.845   ]),
                     Symbol       =   @data(["H",    "He",    "C",    "O",  "Fe"  ]),
                     Discovered   =   @data([1776,   1895,    0,    1774,    NA    ])
)
5x5 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered |
|-----|--------|------------|--------------|--------|------------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | 0          |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         |

The NA problem

A first look with showcols() reveals that one of the columns has been marked as containing a "missing" value. It's the column that contains the discovery year for Iron, which was marked with "NA" (Not Available or Not Applicable).

julia> showcols(ptable)
5x5 DataFrame
| Col # | Name         | Eltype      | Missing |
|-------|--------------|-------------|---------|
| 1     | Number       | Int64       | 0       |
| 2     | Name         | ASCIIString | 0       |
| 3     | AtomicWeight | Float64     | 0       |
| 4     | Symbol       | ASCIIString | 0       |
| 5     | Discovered   | Int64       | 1       |

The problem of missing fields is one of the important issues you have to confront when working with tabular data. Sometimes, not every field in a table has data: for any number of reasons, there might be missing observations or data points. Unless you're aware of these missing values, they can cause problems. For example, the mean and other statistical calculations will be incorrect if you don't notice and account for the strange numbers halfway down the table that someone used to indicate missing values. (Values of -99 in temperature datasets have been known to cause problems...) Also, it can be difficult to apply numeric functions to a mixture of numeric and string values.
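
As a tiny illustration of the sentinel-value problem (the numbers here are invented):

temps = [21.3, 22.0, 19.8, -99.0]   # -99.0 is really a missing reading, not a temperature
mean(temps)                         # roughly -9, which is nonsense for this series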

You'll come across various ways in which the compilers of the data have indicated that values are missing. Sometimes the values are represented by 0 or some other 'impossible' number, or by a text string such as "NA" or "n/a". Sometimes, particularly with tab- and comma-separated files, they're just left empty.

To address this issue, the DataFrames package provides a special data type, the NA type, which indicates that there isn't a usable value at this location. If used consistently, the DataFrames package and its support for NA fields will allow you to get the most out of your datasets, while making sure your calculations are accurate and not 'poisoned' by missing values. The NA in the Iron/Discovered cell allows the DataFrames package to 'tag' that cell as being Missing/NotAvailable.

But there's another problem that's not revealed here. The Discovered year of Carbon was set to 0, chosen to mean 'a long time ago', but 0 isn't a valid year (after 1 BC comes AD 1), and carbon has in fact been known since prehistory. This 0 value should also be marked as being Missing/NotAvailable.

There are three approaches to making a dataframe use NA values consistently:

1: Edit the file/data outside Julia before importing, using a text editor or similar. This is sometimes the quickest and easiest way. In our simple example, we typed the missing value as NA directly, which was enough to have that location marked as an NA value.

2: Use the options for readtable() when importing the data. These let you specify rules for converting missing or unwanted values to NA values.

3: Repair the already loaded values using some Julia code.

How to fix missing values with readtable()

There's a lot of flexibility built in to readtable().

By default, any missing values (empty fields in the table) are replaced by NA values. With the nastrings option, you can specify a group of strings that will all be converted to NA values if encountered in the source text:

pt1 = readtable("file.tsv", nastrings=["NA", "na", "n/a", "missing"])

This addresses most of the problems caused by the varying ways of indicating unavailable values in the original file. It's also possible to add "0" (0 in string form) to the list of nastrings, provided that 0 isn't used for any legitimate values.
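
For example, using the same hypothetical file as above:

pt1 = readtable("file.tsv", nastrings=["NA", "na", "n/a", "missing", "0"])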

Using dataframes with NA values

If a column contains one or more NA values, you'll find that some calculations don't work.

For example, you can apply a function to ordinary columns easily enough. So it's easy to calculate the mean of the AtomicWeight column:

julia> mean(ptable[:AtomicWeight])
17.77312

But, because there is an NA value in the Discovered column, you can't apply a function to it:

julia> mean(ptable[:Discovered])
NA

There are two ways to fix this: edit the table so that the NA is converted to a real value, or, when running calculations, make sure that an NA-affected cell isn't included.

Looking for NA values and others in DataFrames

To look for an NA value in a dataframe, you can try writing code using the isna() function. This lets you test a dataframe cell for NA-ness:

julia> nrows, ncols = size(ptable)
julia> for row in 1:nrows
           for col in 1:ncols
               if isna(ptable[row, col])
                   println("$(names(ptable)[col]) value for $(ptable[row, :Name]) is NA")
               end
           end
       end
Discovered value for Iron is NA

You might think of using similar code to check for the 0 values in Discovered and replace them with NA (should you consider that a better way of marking an element with no known discovery year). But this doesn't work:

julia> for row in 1:nrows
           if ptable[row, :Discovered] == 0
               println("it's zero")
           end
       end

giving

it's zero
ERROR: type: non-boolean (NAtype) used in boolean context

The problem here is that you're checking the value of each cell in a Boolean comparison, but you can't compare NA values with numbers: any comparison involving NA returns NA ("not available") rather than true or false.
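
You can see this at the REPL: comparing the NA in the Iron row with a number just produces another NA (assuming the DataArrays comparison semantics described above):

julia> ptable[5, :Discovered] == 0
NA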

Instead, you can write the loop like this:

julia> for row in 1:nrows
           if isna(ptable[row, :Discovered])
               println("skipping NA value")
           else
               println("the value is $(ptable[row, :Discovered])")
           end
       end
the value is 1776
the value is 1895
the value is 0
the value is 1774
skipping NA value

The broadcast form isna.() is also useful in other contexts. This is a quick way of locating the index numbers of NA values in a column (notice the broadcasting dot):

julia> find(isna.(ptable[:, :Discovered]))
1-element Array{Int64,1}:
5

and this next line returns a dataframe of all rows where the Discovered column contains NA-affected values:

julia> ptable[find(isna.(ptable[:,:Discovered])), :]
1x5 DataFrame
| Row | Number | Name   | AtomicWeight | Symbol | Discovered |
|-----|--------|--------|--------------|--------|------------|
| 1   | 26     | "Iron" | 55.845       | "Fe"   | NA         |

By negating the test, you can return a DataFrame that contains no NA-affected rows.
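
For example, using the broadcast negation operator .! (a sketch, assuming the same Julia 0.6-style dot syntax used elsewhere in this chapter):

julia> ptable[.!isna.(ptable[:, :Discovered]), :]

which at this point would keep every row except Iron.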

You can use isna.() to select rows with NA values in specific columns and set them all to a value:

julia> ptable[isna.(ptable[:Discovered]), :Discovered] = 0
0
julia> ptable
5x5 DataFrames.DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered |
|-----|--------|------------|--------------|--------|------------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | 0          |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | 0          |

Repairing dataframes

To clean your data, you can write a short fragment of code to change values that aren't acceptable. This code looks at every cell and changes any "n/a", "0", or 0 values to NA. Notice that the first test is isna() — that takes care of cases where the element is already an NA value; these are skipped (otherwise the comparisons that follow might fail).

for row in 1:nrows
    for col in 1:ncols
        println("processing row $row column $col ")
        temp = ptable[row, col]
        if isna(temp)
            println("skipping NA")
        elseif temp == "n/a" || temp == "0" || temp == 0
            ptable[row, col] = NA
            println("changed row $row column $col ")
        end
    end
end
processing row 1 column 1
processing row 1 column 2
processing row 1 column 3
processing row 1 column 4
processing row 1 column 5
processing row 2 column 1
processing row 2 column 2
processing row 2 column 3
processing row 2 column 4
processing row 2 column 5
processing row 3 column 1
processing row 3 column 2
processing row 3 column 3
processing row 3 column 4
processing row 3 column 5
changed row 3 column 5
processing row 4 column 1
processing row 4 column 2
processing row 4 column 3
processing row 4 column 4
processing row 4 column 5
processing row 5 column 1
processing row 5 column 2
processing row 5 column 3
processing row 5 column 4
processing row 5 column 5
skipping NA
julia> ptable
5x5 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered |
|-----|--------|------------|--------------|--------|------------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         |

Now the Discovered column has two NA values, as the discovery date of Carbon is also now considered to be NA.

Working with NA values: completecases() and dropna()

To find, say, the maximum value of the Discovered column (which we know contains NA values), you can use the completecases() function. This takes a DataFrame and returns an array of Boolean flags indicating which rows are complete, i.e. free of NA values. These can then be used to select only those rows from the dataframe:

julia> maximum(ptable[completecases(ptable), :][:Discovered])
1895

This allows you to write code that is guaranteed to work, because rows with NA values are excluded from consideration.

Each column of a DataFrame is stored as a DataArray type. The dropna() function returns a copy of a column without any NA values.

julia> dropna(ptable[:Discovered])
3-element Array{Int64,1}:
 1776
 1895
 1774

So an alternative to the completecases() approach is:

julia> maximum(dropna(ptable[:Discovered]))
1895

Modifying DataFrames

Adding, deleting, and renaming columns

To add a column, you could do this:

hcat(ptable, collect(1:size(ptable,1)))

which adds another column of integers from 1 to n (it will be called :x1), although we haven't changed ptable or assigned the resulting DataFrame to a variable.

But let's add Melting and Boiling points of our chosen elements:

ptable[:MP] = @data([-259, -272, 3500, -218, 1535])
ptable[:BP] = @data([-253, -269, 4827, -183, 2750])
julia> ptable
5x7 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   |
|-----|--------|------------|--------------|--------|------------|------|------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 |

To illustrate how to create a new column based on values in other columns, we'll add a column called Liquid showing over how many degrees Celsius an element remains liquid (i.e. BP - MP):

julia> ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])
5-element DataArray{Int64,1}:
 6
 3
 1327
   35
 1215
julia> ptable
5x8 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid |
|-----|--------|------------|--------------|--------|------------|------|------|--------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   |

To add or replace a column of a DataFrame with another column of data (i.e. an array of the right length), use:

julia> ptable[:Temp] = 1:size(ptable,1)
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

Do it for real:

julia> ptable[:Temp] = map((x,y) -> y * x, ptable[:Liquid], ptable[:AtomicWeight])
5-element DataArray{Float64,1}:
  6.0474
 12.0078
 15938.2
   559.979
 67851.7
julia> ptable
5x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid | Temp    |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 15938.2 |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 67851.7 |

First we added a new :Temp column with integers from 1 to n, then the values were replaced with the result of multiplying the atomic weight by the Liquid range. (It could have been done with just a single step, of course.)
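
That single step might look like this, using broadcast multiplication instead of map() (an equivalent sketch):

ptable[:Temp] = ptable[:Liquid] .* ptable[:AtomicWeight]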

You might want to add a column that shows the Melting Point in Fahrenheit:

julia> ptable[:MP_in_F] = map(deg -> 32 + (deg * 1.8), ptable[:MP])
5-element DataArray{Any,1}:
 -434.2
 -457.6
 6332.0
 -360.4
 2795.0
julia> ptable
5x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid | Temp    | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 15938.2 | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 67851.7 | 2795.0  |

It's easy to rename columns:

julia> rename!(ptable, :Temp, :Junk)

and

julia> rename!(ptable, [:MP, :BP], [:Melt, :Boil])
5x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | Junk    | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 15938.2 | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 67851.7 | 2795.0  |

(There's also a rename() function, without the exclamation mark, which doesn't change the original DataFrame.)

And to delete a column, use delete!():

julia> delete!(ptable, :Junk)

deletes the named column (the ! indicates that ptable itself is modified in place). You can also delete a set of columns.
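
For example, working on a copy so that ptable itself is untouched (this assumes delete!() accepts a vector of column names):

temp = deepcopy(ptable)
delete!(temp, [:Liquid, :MP_in_F])   # removes both columns from the copy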

Adding and deleting rows

It's easy to add rows. Use push! and the @data macro, supplying one suitable value for each of the nine columns:

julia> push!(ptable, @data([29, "Copper", 63.546, "Cu", NA, 1083, 2567, NA, NA]))
julia> ptable
6x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  |
| 6   | 29     | "Copper"   | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      |

Those NAs should be filled in soon, using the techniques from earlier. We can locate the new element by name and change its value for :Liquid like this:

julia> ptable[[ismatch(r"Copper", elementname) for elementname in ptable[:Name]], :][:Liquid] = 2567 - 1083

or we could use the atomic number to obtain access to the right row and do it that way:

ptable[ptable[:, :Number] .== 29, :][:Liquid] = 2567 - 1083

To delete rows, use the deleterows!() function, with one or more row specifiers:

julia> temp = deepcopy(ptable) # so as not to affect ptable 
julia> deleterows!(temp, 3:5)
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 29     | "Copper"   | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      |

or:

julia> temp = deepcopy(ptable)
julia> deleterows!(temp, [3,4,6])
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  |

Alternatively, you could delete rows by specifying a condition. For example, to drop the rows where the boiling point is 100 C or less, you can keep just the rows where it is greater than 100 C, assigning the result to a new variable:

julia> ptable1 = ptable[ptable[:Boil] .> 100, :]
3x9 DataFrame
| Row | Number | Name     | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|----------|--------------|--------|------------|------|------|--------|---------|
| 1   | 6      | "Carbon" | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 6332.0  |
| 2   | 26     | "Iron"   | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  |
| 3   | 29     | "Copper" | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      |

Finding values in DataFrames

To find values, the basic idea is to use an elementwise operator or function that examines every row and returns an array of Boolean values indicating whether each row meets the criterion:

julia> ptable[:, :Melt] .< 100
6-element DataArray{Bool,1}:
  true
  true
 false
  true
 false
 false

then use this Boolean array to select the rows:

julia> ptable[ptable[:, :Melt] .< 100, :]
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  |

You could use something like this to return rows where a column matches a regular expression:

julia> ptable[[ismatch(r"Copper", elementname) for elementname in ptable[:Name]], :]
1x9 DataFrame
| Row | Number | Name     | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|----------|--------------|--------|------------|------|------|--------|---------|
| 1   | 29     | "Copper" | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      |

and you can edit elements in the same way:

julia> ptable[[ismatch(r"Copper", elementname) for elementname in ptable[:Name]], :][:Liquid] = 2567 - 1083
1484

Here's a way to look for values of a particular type (here the NA type, NAtype) in a specific column:

julia> ptable[[(typeof(element) == NAtype) for element in ptable[:, :Liquid]], :]
1x9 DataFrame
| Row | Number | Name     | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|----------|--------------|--------|------------|------|------|--------|---------|
| 1   | 29     | "Copper" | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      |

Subsets and groups

To investigate subsets and groupings, we'll add another column:

julia> ptable[:Room] = @data([:Gas, :Gas, :Solid, :Gas, :Solid, :Solid])
6x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F | Room  |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|-------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  | Gas   |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  | Gas   |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 6332.0  | Solid |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  | Gas   |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  | Solid |
| 6   | 29     | "Copper"   | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      | Solid |

This column gives the state of each element at room temperature.

It's now possible to collect the elements according to their group membership. The groupby() function splits the original dataframe into a group of sub-dataframes according to the values in the named column. For example, with three elements that are gases at room temperature and three that are solids, we obtain two sub-dataframes:

julia> groupby(ptable, [:Room])
 GroupedDataFrame  2 groups with keys: [:Room]
First Group:
3x10 SubDataFrame{Array{Int64,1}}
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F | Room |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  | Gas  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  | Gas  |
| 3   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  | Gas  |
⋮
Last Group:
3x10 SubDataFrame{Array{Int64,1}}
| Row | Number | Name     | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F | Room  |
|-----|--------|----------|--------------|--------|------------|------|------|--------|---------|-------|
| 1   | 6      | "Carbon" | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 6332.0  | Solid |
| 2   | 26     | "Iron"   | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  | Solid |
| 3   | 29     | "Copper" | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      | Solid |

Another function, by(), lets you group the rows and then apply a function to each resulting sub-dataframe. In this next example, the anonymous function finds the mean of the Melt column for each group in turn. The result is a single value for each group: the mean melting point of all the gases, and the mean melting point of all the solid elements:

julia> by(ptable, [:Room], df -> mean(df[:Melt]))
2x2 DataFrame
| Row | Room  | x1       |
|-----|-------|----------|
| 1   | Gas   | -249.667 |
| 2   | Solid | 2039.33  |
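
The function passed to by() can also return a whole DataFrame, which lets you compute several summaries per group at once (a sketch; the column names meanmelt and maxboil are made up here):

julia> by(ptable, :Room, df -> DataFrame(meanmelt = mean(df[:Melt]), maxboil = maximum(df[:Boil])))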

Sorting

The by() function can also be used for counting rows by group. The nrow() function returns the number of rows in a dataframe, so applying it to each group gives the group sizes:

julia> by(ptable, :Room, nrow)
2x2 DataFrame
| Row | Room  | x1 |
|-----|-------|----|
| 1   | Gas   | 3  |
| 2   | Solid | 3  |

The sort!() function works with dataframes as well. You can specify the columns to sort on, using the following syntax:

 julia> sort!(ptable, cols = [order(:Room), order(:Melt)])
6x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F | Room  |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|-------|
| 1   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  | Gas   |
| 2   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  | Gas   |
| 3   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  | Gas   |
| 4   | 29     | "Copper"   | 63.546       | "Cu"   | NA         | 1083 | 2567 | NA     | NA      | Solid |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | NA         | 1535 | 2750 | 1215   | 2795.0  | Solid |
| 6   | 6      | "Carbon"   | 12.0107      | "C"    | NA         | 3500 | 4827 | 1327   | 6332.0  | Solid |

The resulting dataframe is sorted first by state at room temperature, then by melting point.
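
To reverse the direction of one of the sort keys, order() accepts a rev keyword. Here the non-mutating sort() is used so that ptable itself is left alone (a sketch under the same sorting API assumptions as above):

julia> sort(ptable, cols = [order(:Room), order(:Melt, rev = true)])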