Introducing Julia/DataFrames


DataFrames

This chapter is a brief introduction to Julia's DataFrames package.

A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.

To explore the use of DataFrames, we'll start by examining a well-known statistics dataset called Anscombe's Quartet.

Start by downloading the DataFrames and CSV packages, if you've not used them before:

julia> ]
(v1.0) pkg> add DataFrames
...messages
(v1.0) pkg>

You have to do this just once. Also, you can add CSV.jl while you're there:

(v1.0) pkg> add CSV

To use DataFrames:

julia> using DataFrames

Loading data into DataFrames

There are a few different ways to create new DataFrames. For this introduction, the quickest way to load the Anscombe dataset and assign it to the variable anscombe is to copy and paste the values, convert the array to a DataFrame, and then rename the columns, like this:

julia>  anscombe = convert(DataFrame,                
 [10  10  10  8   8.04   9.14  7.46   6.58;    
  8   8   8   8   6.95   8.14  6.77   5.76;   
  13  13  13  8   7.58   8.74  12.74  7.71;   
  9   9   9   8   8.81   8.77  7.11   8.84;   
  11  11  11  8   8.33   9.26  7.81   8.47;   
  14  14  14  8   9.96   8.1   8.84   7.04;   
  6   6   6   8   7.24   6.13  6.08   5.25;   
  4   4   4   19  4.26   3.1   5.39   12.5;   
  12  12  12  8   10.84  9.13  8.15   5.56;   
  7   7   7   8   4.82   7.26  6.42   7.91;   
  5   5   5   8   5.68   4.74  5.73   6.89]); 
julia> rename!(anscombe, f => t for (f, t) =        
   zip([:x1, :x2, :x3, :x4, :x5, :x6, :x7, :x8],     
       [:X1, :X2, :X3, :X4, :Y1, :Y2, :Y3, :Y4]))   
11×8 DataFrames.DataFrame
│ Row │ X1   │ X2   │ X3   │ X4   │ Y1    │ Y2   │ Y3    │ Y4   │
├─────┼──────┼──────┼──────┼──────┼───────┼──────┼───────┼──────┤
│ 1   │ 10.0 │ 10.0 │ 10.0 │ 8.0  │ 8.04  │ 9.14 │ 7.46  │ 6.58 │
│ 2   │ 8.0  │ 8.0  │ 8.0  │ 8.0  │ 6.95  │ 8.14 │ 6.77  │ 5.76 │
│ 3   │ 13.0 │ 13.0 │ 13.0 │ 8.0  │ 7.58  │ 8.74 │ 12.74 │ 7.71 │
│ 4   │ 9.0  │ 9.0  │ 9.0  │ 8.0  │ 8.81  │ 8.77 │ 7.11  │ 8.84 │
│ 5   │ 11.0 │ 11.0 │ 11.0 │ 8.0  │ 8.33  │ 9.26 │ 7.81  │ 8.47 │
│ 6   │ 14.0 │ 14.0 │ 14.0 │ 8.0  │ 9.96  │ 8.1  │ 8.84  │ 7.04 │
│ 7   │ 6.0  │ 6.0  │ 6.0  │ 8.0  │ 7.24  │ 6.13 │ 6.08  │ 5.25 │
│ 8   │ 4.0  │ 4.0  │ 4.0  │ 19.0 │ 4.26  │ 3.1  │ 5.39  │ 12.5 │
│ 9   │ 12.0 │ 12.0 │ 12.0 │ 8.0  │ 10.84 │ 9.13 │ 8.15  │ 5.56 │
│ 10  │ 7.0  │ 7.0  │ 7.0  │ 8.0  │ 4.82  │ 7.26 │ 6.42  │ 7.91 │
│ 11  │ 5.0  │ 5.0  │ 5.0  │ 8.0  │ 5.68  │ 4.74 │ 5.73  │ 6.89 │ 

Collected datasets

Alternatively you could download and install the RDatasets package, which contains a number of famous datasets.

julia> ]
(v1.0) pkg> add RDatasets
julia> using RDatasets
julia> anscombe = dataset("datasets","anscombe")

There are other ways to create DataFrames, including reading data from files (using CSV.jl).
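
For example, reading a comma-separated file into a DataFrame might look like this (a minimal sketch, assuming a file called mydata.csv exists and a reasonably recent version of CSV.jl; the exact call has changed between versions):

julia> using CSV, DataFrames
julia> df = CSV.read("mydata.csv", DataFrame)    # "mydata.csv" is a hypothetical file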

Empty DataFrames

You can create simple DataFrames by supplying the columns, and their names, as arrays:

julia> df = DataFrame([collect(1:3), collect(4:6)], [:A, :B])

3×2 DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 4 │
│ 2   │ 2 │ 5 │
│ 3   │ 3 │ 6 │

To create a completely empty DataFrame, you supply the column names (Julia symbols) and define their types (remembering that the columns are arrays):

df = DataFrame(Name=String[], 
    Width=Float64[], 
    Height=Float64[], 
    Mass=Float64[], 
    Volume=Float64[])
0×5 DataFrames.DataFrame
df = vcat(df, DataFrame(Name="Test", Width=1.0, Height=10.0, Mass=3.0, Volume=5.0))
1×5 DataFrames.DataFrame
│ Row │ Name │ Width │ Height │ Mass │ Volume │
├─────┼──────┼───────┼────────┼──────┼────────┤
│ 1   │ Test │ 1.0   │ 10.0   │ 3.0  │ 5.0    │

Basics

Once the data has been loaded, you should see the DataFrame; its appearance varies if you're working in a terminal or using an IJulia notebook. But in either case you can see that you have a table of data, with 8 named columns (X1 to Y4), and 11 rows (1 to 11). The first 'dataset' is X1/Y1, the second X2/Y2, and so on. Because the columns are named, it's easy to refer to specific columns when processing or analyzing the data.

julia> typeof(anscombe)
DataFrame

To obtain a list of column names, use the names() function:

julia> names(anscombe)
8-element Array{Symbol,1}:
 :X1
 :X2
 :X3
 :X4
 :Y1
 :Y2
 :Y3
 :Y4

Other useful functions:

julia> size(anscombe) # number of rows, number of columns
(11, 8)

julia> length(anscombe) # number of columns
8
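
DataFrames also provides nrow() and ncol(), which say explicitly what they return:

julia> nrow(anscombe)
11

julia> ncol(anscombe)
8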

The head() function (and the matching tail()) is useful for larger datasets because it quickly shows the first (or last) half dozen rows, good for seeing what you've loaded:

julia> head(anscombe)
6x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3    | Y4   |
|-----|----|----|----|----|------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04 | 9.14 | 7.46  | 6.58 |
| 2   | 8  | 8  | 8  | 8  | 6.95 | 8.14 | 6.77  | 5.76 |
| 3   | 13 | 13 | 13 | 8  | 7.58 | 8.74 | 12.74 | 7.71 |
| 4   | 9  | 9  | 9  | 8  | 8.81 | 8.77 | 7.11  | 8.84 |
| 5   | 11 | 11 | 11 | 8  | 8.33 | 9.26 | 7.81  | 8.47 |
| 6   | 14 | 14 | 14 | 8  | 9.96 | 8.1  | 8.84  | 7.04 |
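
In more recent versions of DataFrames, head() and tail() have been replaced by first() and last(), which take the number of rows to show:

julia> first(anscombe, 6)    # newer equivalent of head(anscombe)
julia> last(anscombe, 3)     # newer equivalent of tail(anscombe, 3)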

The showcols() function is useful for showing the type of each column.

julia> showcols(anscombe)
11×8 DataFrames.DataFrame
│ Col # │ Name │ Eltype                           │ Missing │ Values        │
├───────┼──────┼──────────────────────────────────┼─────────┼───────────────┤
│ 1     │ X1   │ Union{Int64, Missings.Missing}   │ 0       │ 10  …  5      │
│ 2     │ X2   │ Union{Int64, Missings.Missing}   │ 0       │ 10  …  5      │
│ 3     │ X3   │ Union{Int64, Missings.Missing}   │ 0       │ 10  …  5      │
│ 4     │ X4   │ Union{Int64, Missings.Missing}   │ 0       │ 8  …  8       │
│ 5     │ Y1   │ Union{Float64, Missings.Missing} │ 0       │ 8.04  …  5.68 │
│ 6     │ Y2   │ Union{Float64, Missings.Missing} │ 0       │ 9.14  …  4.74 │
│ 7     │ Y3   │ Union{Float64, Missings.Missing} │ 0       │ 7.46  …  5.73 │
│ 8     │ Y4   │ Union{Float64, Missings.Missing} │ 0       │ 6.58  …  6.89 │

Notice that some of the columns (all the X columns) contain integer values, while others (all the Y columns) contain floating-point numbers. Every element in a column of a DataFrame has the same data type, but different columns can have different types. This makes the DataFrame ideal for storing tabular data: strings in one column, numeric values in another, and so on.

You can refer to columns by number or name. Following the general Julia convention for symbols, you precede a column name with a colon (:), so :Y2 refers to the column called Y2, which is also column number 6. Here's how to refer to column Y2 by name:

julia> anscombe[:Y2]
11-element DataArray{Float64,1}:
 9.14
 8.14
 8.74
 8.77
 9.26
 8.1 
 6.13
 3.1 
 9.13
 7.26
 4.74

and here's how to refer to the same column by its number:

julia> anscombe[6]
11-element DataArray{Float64,1}:
 9.14
 8.14
 8.74
 8.77
 9.26
 8.1 
 6.13
 3.1 
 9.13
 7.26
 4.74

Referring to specific columns

If you want to access specific columns of a DataFrame, you can do it in at least three ways. To access the 3rd and 5th columns of the anscombe dataset, you can use any of:

anscombe[:,[:X3, :Y1]]                      (a)

anscombe[:,  [3, 5]]                        (b)

anscombe[:, [Symbol("X3"), Symbol("Y1")]]   (c)
11×2 DataFrames.DataFrame
│ Row │ X3   │ Y1    │
├─────┼──────┼───────┤
│ 1   │ 10.0 │ 8.04  │
│ 2   │ 8.0  │ 6.95  │
│ 3   │ 13.0 │ 7.58  │
│ 4   │ 9.0  │ 8.81  │
│ 5   │ 11.0 │ 8.33  │
│ 6   │ 14.0 │ 9.96  │
│ 7   │ 6.0  │ 7.24  │
│ 8   │ 4.0  │ 4.26  │
│ 9   │ 12.0 │ 10.84 │
│ 10  │ 7.0  │ 4.82  │
│ 11  │ 5.0  │ 5.68  │

Here, (b) and (c) are quite useful, especially if you want to access the columns using variables. For example, suppose a = 3 and b = 5. Then anscombe[:, [a, b]] gives the result above. Similarly, if a = "X3" and b = "Y1", then anscombe[:, [Symbol(a), Symbol(b)]] gives the same result.
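
For example:

julia> a = 3; b = 5;

julia> anscombe[:, [a, b]]    # same result as anscombe[:, [3, 5]] above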

Rectangular "slices"

You can look at any rectangular slice or subset of the dataset, using either index numbers or names.

julia> anscombe[4:6,[:X2,:X4]]
3x2 DataFrame
| Row | X2 | X4 |
|-----|----|----|
| 1   | 9  | 8  |
| 2   | 11 | 8  |
| 3   | 14 | 8  |

which shows rows 4, 5, and 6, and columns X2 and X4. Instead of a range of rows, you can specify individual rows using commas.

julia> anscombe[[4,6, 9],[:X2,:X4]]
3x2 DataFrame
| Row | X2 | X4 |
|-----|----|----|
| 1   | 9  | 8  |
| 2   | 14 | 8  |
| 3   | 12 | 8  |

which shows rows 4, 6, and 9, and columns X2 and X4.

Or, using index numbers:

julia> anscombe[[4,6,8],[2, 6, 8]]
3x3 DataFrame
| Row | X2 | Y2   | Y4   |
|-----|----|------|------|
| 1   | 9  | 8.77 | 8.84 |
| 2   | 14 | 8.1  | 7.04 |
| 3   | 4  | 3.1  | 12.5 |

To specify a range of columns, use index numbers:

julia> anscombe[4:6,3:5]
3x3 DataFrame
| Row | X3 | X4 | Y1   |
|-----|----|----|------|
| 1   | 9  | 8  | 8.81 |
| 2   | 11 | 8  | 8.33 |
| 3   | 14 | 8  | 9.96 |

Notice that the row numbering for the returned DataFrames is different — in the last example, rows 4, 5, and 6 became rows 1, 2, and 3 in the new DataFrame.

As with arrays, use the colon on its own to specify 'all' columns or rows. Rows 4, 6, 8, and 11, showing all columns:

julia> anscombe[[4,6,8,11],:]
4x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3   | Y4   |
|-----|----|----|----|----|------|------|------|------|
| 1   | 9  | 9  | 9  | 8  | 8.81 | 8.77 | 7.11 | 8.84 |
| 2   | 14 | 14 | 14 | 8  | 9.96 | 8.1  | 8.84 | 7.04 |
| 3   | 4  | 4  | 4  | 19 | 4.26 | 3.1  | 5.39 | 12.5 |
| 4   | 5  | 5  | 5  | 8  | 5.68 | 4.74 | 5.73 | 6.89 |

All rows, columns X1 and Y1:

julia> anscombe[:, [:X1, :Y1]]
11x2 DataFrame
| Row | X1 | Y1    |
|-----|----|-------|
| 1   | 10 | 8.04  |
| 2   | 8  | 6.95  |
| 3   | 13 | 7.58  |
| 4   | 9  | 8.81  |
| 5   | 11 | 8.33  |
| 6   | 14 | 9.96  |
| 7   | 6  | 7.24  |
| 8   | 4  | 4.26  |
| 9   | 12 | 10.84 |
| 10  | 7  | 4.82  |
| 11  | 5  | 5.68  |

Selecting rows with conditions

You can select the rows of a DataFrame where the elements satisfy one or more conditions.

Here's how to select rows where the value of the element in column Y1 is greater than 7.0:

julia> anscombe[anscombe[:Y1] .> 7.0, :]
7x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3    | Y4   |
|-----|----|----|----|----|-------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04  | 9.14 | 7.46  | 6.58 |
| 2   | 13 | 13 | 13 | 8  | 7.58  | 8.74 | 12.74 | 7.71 |
| 3   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11  | 8.84 |
| 4   | 11 | 11 | 11 | 8  | 8.33  | 9.26 | 7.81  | 8.47 |
| 5   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84  | 7.04 |
| 6   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08  | 5.25 |
| 7   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15  | 5.56 |

The 'inner' expression anscombe[:Y1] .> 7.0 carries out an element-wise comparison of the values in column Y1 and returns an array of Boolean true or false values, one for each row. These are then used to select rows from the DataFrame. It's as if you'd entered:

julia> anscombe[[true, false, true, true, true, true, true, false, true, false, false], :]
7x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3    | Y4   |
|-----|----|----|----|----|-------|------|-------|------|
| 1   | 10 | 10 | 10 | 8  | 8.04  | 9.14 | 7.46  | 6.58 |
| 2   | 13 | 13 | 13 | 8  | 7.58  | 8.74 | 12.74 | 7.71 |
| 3   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11  | 8.84 |
| 4   | 11 | 11 | 11 | 8  | 8.33  | 9.26 | 7.81  | 8.47 |
| 5   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84  | 7.04 |
| 6   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08  | 5.25 |
| 7   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15  | 5.56 |

In a similar way, here's a result that contains every row where the value of the number in column Y1 is greater than that in column Y2:

julia> anscombe[anscombe[:Y1] .> anscombe[:Y2], :]
6x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1    | Y2   | Y3   | Y4   |
|-----|----|----|----|----|-------|------|------|------|
| 1   | 9  | 9  | 9  | 8  | 8.81  | 8.77 | 7.11 | 8.84 |
| 2   | 14 | 14 | 14 | 8  | 9.96  | 8.1  | 8.84 | 7.04 |
| 3   | 6  | 6  | 6  | 8  | 7.24  | 6.13 | 6.08 | 5.25 |
| 4   | 4  | 4  | 4  | 19 | 4.26  | 3.1  | 5.39 | 12.5 |
| 5   | 12 | 12 | 12 | 8  | 10.84 | 9.13 | 8.15 | 5.56 |
| 6   | 5  | 5  | 5  | 8  | 5.68  | 4.74 | 5.73 | 6.89 |

Combining two or more conditions is also possible. Here's a result consisting of rows where the value of Y1 is greater than 5.0 and that of Y2 is less than 7.0:

julia> anscombe[(anscombe[:Y1] .> 5.0) .& (anscombe[:Y2] .< 7.0), :]
2x8 DataFrame
| Row | X1 | X2 | X3 | X4 | Y1   | Y2   | Y3   | Y4   |
|-----|----|----|----|----|------|------|------|------|
| 1   | 6  | 6  | 6  | 8  | 7.24 | 6.13 | 6.08 | 5.25 |
| 2   | 5  | 5  | 5  | 8  | 5.68 | 4.74 | 5.73 | 6.89 |

Notice the dotted broadcasting operators: .>, .<, and .&.
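
You can combine conditions with .| (element-wise 'or') in the same way. For example, here's a small sketch that selects the rows where X4 is 19 or Y3 is greater than 12 (it returns two rows of the quartet):

julia> anscombe[(anscombe[:X4] .== 19) .| (anscombe[:Y3] .> 12.0), :]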

Applying functions to columns and rows

You can apply a function to a column. (From Julia 0.7 onwards, mean() and the other statistical functions live in the Statistics standard library, so run using Statistics first.) To find the mean of the values in the column named X2:

julia> mean(anscombe[:X2])
9.0

You can apply a function to a group of columns at the same time by using colwise(). For example, to apply the mean() function to every column of the DataFrame:

julia> colwise(mean, anscombe)
8-element Array{Any,1}:
 [9.0]    
 [9.0]    
 [9.0]    
 [9.0]    
 [7.50091]
 [7.50091]
 [7.5]    
 [7.50091]

which returns a new array containing the mean values for each column.

To apply a function to one or more specific columns, use the following syntax:

julia> colwise(mean, anscombe[[:X1, :X2, :Y3, :Y4]])
4-element Array{Any,1}:
 [9.0]    
 [9.0]    
 [7.5]    
 [7.50091]

The DataFrames package provides two convenient utilities, eachcol() and eachrow(), for iterating through every column or every row. In the version used here, each value produced by eachcol() is a tuple of a Symbol (the column heading) and the column's data array.

for col in eachcol(anscombe)
    println(col)
end
(:X1,[10,8,13,9,11,14,6,4,12,7,5])
(:X2,[10,8,13,9,11,14,6,4,12,7,5])
(:X3,[10,8,13,9,11,14,6,4,12,7,5])
(:X4,[8,8,8,8,8,8,8,19,8,8,8])
(:Y1,[8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68])
(:Y2,[9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74])
(:Y3,[7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73])
(:Y4,[6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89])

Here's the mean of each column, looking at the second part of each tuple provided by col:

for col in eachcol(anscombe)
    println(mean(col[2]))
end
9.0
9.0
9.0
9.0
7.500909090909093
7.500909090909091
7.500000000000001
7.50090909090909

The eachrow() function provides an iterator for rows:

for r in eachrow(anscombe)
    println(r)
end
DataFrameRow (row 1)
X1  10
X2  10
X3  10
X4  8
Y1  8.04
Y2  9.14
Y3  7.46
Y4  6.58

...

DataFrameRow (row 11)
X1  5
X2  5
X3  5
X4  8
Y1  5.68
Y2  4.74
Y3  5.73
Y4  6.89

Since each element of each row is a number, we could in this case use eachrow() to find the (fairly meaningless) mean of each row:

for row in eachrow(anscombe)
    println(mean(convert(Array, row)))
end
8.6525
7.4525
10.47125
8.56625
9.358749999999999
10.492500000000001
6.3375
7.03125
9.71
6.92625
5.755000000000001

Plotting Anscombe's Quartet

Now let’s shift focus to statistics and compute the mean and standard deviation of every column, and the correlation coefficient (the Pearson product-moment correlation coefficient, or r) between x and y for each dataset. The functions we will use are mean(), median(), std(), and cor().

A first rough version of the code uses an anonymous function with two arguments, xcol and ycol, which is mapped over the X and Y columns of the DataFrame:

using Statistics, StatsBase
println("Column\tMeanX\tMedianX\tStdDev X\tMeanY\t\t\tStdDev Y\t\tCorr\t")

map((xcol,ycol) -> println(
    xcol,                   "\t",
    mean(anscombe[xcol]),   "\t", 
    median(anscombe[xcol]), "\t", 
    std(anscombe[xcol]),    "\t", 
    mean(anscombe[ycol]),   "\t", 
    std(anscombe[ycol]),    "\t", 
    cor(anscombe[xcol], anscombe[ycol])), 
    
    [:X1, :X2, :X3, :X4], 
    [:Y1, :Y2, :Y3, :Y4]);
Column	MeanX	MedianX	StdDev X	MeanY			StdDev Y		Corr	
X1	9.0	9.0	3.3166247903554	7.500909090909093	2.031568135925815	0.81642051634484
X2	9.0	9.0	3.3166247903554	7.500909090909091	2.0316567355016177	0.8162365060002428
X3	9.0	9.0	3.3166247903554	7.500000000000001	2.0304236011236667	0.8162867394895981
X4	9.0	8.0	3.3166247903554	7.50090909090909	2.0305785113876023	0.8165214368885028
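
If you'd rather keep these statistics in a DataFrame than print them, one possible approach (using the same column pairs as above) is to build a new DataFrame from comprehensions:

julia> xs = [:X1, :X2, :X3, :X4]; ys = [:Y1, :Y2, :Y3, :Y4];

julia> DataFrame(Set   = xs,
                 MeanX = [mean(anscombe[x]) for x in xs],
                 StdX  = [std(anscombe[x])  for x in xs],
                 MeanY = [mean(anscombe[y]) for y in ys],
                 StdY  = [std(anscombe[y])  for y in ys],
                 Corr  = [cor(anscombe[x], anscombe[y]) for (x, y) in zip(xs, ys)])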

Anscombe's Quartet comprises four datasets that have nearly identical simple statistical properties but look very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate the importance of plotting data before relying on summary statistics, and to show the effect outliers can have on those statistics.

As you can see from the output, each of the four datasets has the same mean, median, standard deviation, and correlation coefficient between x and y. If you just used the simple summary statistics to evaluate the four sets, you would assume that they were pretty similar.

By the way, there's a quicker and easier way to do this, if you don't want to write specific code and don't mind the formatting. Use the built-in describe() function:

julia> describe(anscombe)
X1
Summary Stats:
Mean:           9.000000
Minimum:        4.000000
1st Quartile:   6.500000
Median:         9.000000
3rd Quartile:   11.500000
Maximum:        14.000000
Length:         11
Type:           Int64
Number Missing: 0
% Missing:      0.000000

X2
Summary Stats:
Mean:           9.000000
Minimum:        4.000000
1st Quartile:   6.500000
Median:         9.000000
3rd Quartile:   11.500000
Maximum:        14.000000
Length:         11
Type:           Int64
Number Missing: 0
% Missing:      0.000000

X3
Summary Stats:
Mean:           9.000000
Minimum:        4.000000
1st Quartile:   6.500000
Median:         9.000000
3rd Quartile:   11.500000
Maximum:        14.000000
Length:         11
Type:           Int64
Number Missing: 0
% Missing:      0.000000

X4
Summary Stats:
Mean:           9.000000
Minimum:        8.000000
1st Quartile:   8.000000
Median:         8.000000
3rd Quartile:   8.000000
Maximum:        19.000000
Length:         11
Type:           Int64
Number Missing: 0
% Missing:      0.000000

Y1
Summary Stats:
Mean:           7.500909
Minimum:        4.260000
1st Quartile:   6.315000
Median:         7.580000
3rd Quartile:   8.570000
Maximum:        10.840000
Length:         11
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Y2
Summary Stats:
Mean:           7.500909
Minimum:        3.100000
1st Quartile:   6.695000
Median:         8.140000
3rd Quartile:   8.950000
Maximum:        9.260000
Length:         11
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Y3
Summary Stats:
Mean:           7.500000
Minimum:        5.390000
1st Quartile:   6.250000
Median:         7.110000
3rd Quartile:   7.980000
Maximum:        12.740000
Length:         11
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Y4
Summary Stats:
Mean:           7.500909
Minimum:        5.250000
1st Quartile:   6.170000
Median:         7.040000
3rd Quartile:   8.190000
Maximum:        12.500000
Length:         11
Type:           Float64
Number Missing: 0
% Missing:      0.000000

Having looked at the summary statistics, it's time to reveal Anscombe's true purpose, by plotting the four sets of his quartet.

The following code uses the StatPlots package (renamed StatsPlots in more recent versions) to plot all four datasets on one figure; it may need minor updating for current versions of Julia and the plotting packages:
using StatPlots
@df anscombe scatter([:X1 :X2 :X3 :X4], [:Y1 :Y2 :Y3 :Y4],
           smooth=true,
           line = :red,
           linewidth = 2,
           title= ["X$i vs Y$i" for i in (1:4)'],
           legend = false,
           layout = 4,
           xlimits = (2, 20),
           ylimits = (2, 14))

[Figure: plotting the Anscombe relationships]

Now we can see what Anscombe is saying. The simple summary statistics for the four datasets — mean, median, standard deviation, and correlation — are the same, and yet it's clear that they're quite different when plotted.

The first appears to show a simple linear relationship between two correlated variables, consistent with an assumption of normality.

The second set of points is not distributed normally; there is an obvious relationship between the two variables, but it isn't linear, and the Pearson correlation coefficient is not really relevant.

In the third set, the relationship is linear, but the single outlier exerts enough influence to shift the regression line and to lower the correlation coefficient from 1 to 0.816.

Finally, the fourth set shows an example in which one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.

Regression and Models

If you want to find a linear regression line for the datasets, you can turn to the GLM (Generalized Linear Models) package.

using GLM, StatsModels

To create a linear model, you specify a formula using the @formula macro, supplying the column names and the name of the DataFrame. The result is a regression model.

julia> linearmodel = fit(LinearModel, @formula(Y1 ~ X1), anscombe)
 # v0.7 updates required
 StatsModels.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},
 GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
 
 Formula: Y1 ~ 1 + X1
 
 Coefficients:
              Estimate Std.Error t value Pr(>|t|)
 (Intercept)   3.00009   1.12475 2.66735   0.0257
 X1           0.500091  0.117906 4.24146   0.0022


Useful functions in the GLM package for working with linear models include summary() and coef().

julia> summary(linearmodel)
"StatsModels.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},
 GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}, Array{Float64,2}}"

The coef() function returns the two useful coefficients that define the regression line: the estimated intercept and the estimated slope:

julia> coef(linearmodel)
2-element Array{Float64,1}:
 3.00009 
 0.500091
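
The fitted model also supports other accessor functions from the StatsBase API, for example stderror() for the standard errors of the coefficients, r2() for the coefficient of determination, and predict() for the fitted values (a quick sketch; output not shown):

julia> stderror(linearmodel)
julia> r2(linearmodel)
julia> predict(linearmodel)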

It's now easy to produce a function for the regression line in the form y = a x + c:

julia> f(x) = coef(linearmodel)[2] * x + coef(linearmodel)[1]
f (generic function with 1 method)

Now that we have f() as a function describing the regression line, it can be drawn in a plot. Here we plot the first series, and add a plot of the function f(x) with x running from 2 to 20, and see how it compares with the smoothing line we used earlier.

# Needs checking/fixing for v0.7
p1 = plot(anscombe[:X1], anscombe[:Y1], 
    smooth=true, 
    seriestype=:scatter, 
    title = "X1 vs Y1", 
    linewidth=8,
    linealpha=0.5,
    label="data")

plot!(f, 2, 20, label="correlation")

[Figure: plotting the Anscombe data with the regression line]

Working with DataFrames

Not all datasets are as consistent and tidy as the examples in RDatasets. In the real world, it's possible that you'll read some data into a DataFrame, only to discover that it's got a few problems, with inconsistently formatted and/or missing elements.

For this section, we'll create a simple test DataFrame by hand, defining the columns one by one. It's a short extract from what might be a periodic table.

ptable = DataFrame(  Number       =   [1,   2,    6,    8,    26    ],
                     Name         =   ["Hydrogen",   "Helium",   "Carbon",   "Oxygen",   "Iron"   ],
                     AtomicWeight =   [1.0079,    4.0026,  12.0107, 15.9994, 55.845   ],
                     Symbol       =   ["H",    "He",    "C",    "O",  "Fe"  ],
                     Discovered   =   [1776,   1895,    0,    1774,    missing    ])
5x5 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered |
|-----|--------|------------|--------------|--------|------------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | 0          |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | Missing    |

The case of the missing value

A first look with showcols() reveals that one of the columns has been marked as containing a "missing" value. It's the column that contains the discovered year for Iron, which was marked as missing.

julia> showcols(ptable)
5×5 DataFrame
│ Col # │ Name         │ Eltype                │ Missing │ Values            │
├───────┼──────────────┼───────────────────────┼─────────┼───────────────────┤
│ 1     │ Number       │ Int64                 │ 0       │ 1  …  26          │
│ 2     │ Name         │ String                │ 0       │ Hydrogen  …  Iron │
│ 3     │ AtomicWeight │ Float64               │ 0       │ 1.0079  …  55.845 │
│ 4     │ Symbol       │ String                │ 0       │ H  …  Fe          │
│ 5     │ Discovered   │ Union{Missing, Int64} │ 1       │ 1776  …  missing  │

The problem of missing fields is one of the important issues you have to confront when working with tabular data. Sometimes, not every field in a table has data. For any number of reasons, there might be missing observations or data points. Unless you're aware of these missing values, they can cause problems. For example, mean and other statistical calculations will be incorrect if you don't notice and account for those strange numbers halfway down the table that someone used to indicate missing values. (Values of -99 in temperature datasets have been known to cause problems...) Also, it can be difficult to apply numeric formulas to a mixture of numeric and string values.

You'll come across various ways in which the compilers of the data have indicated that values are missing. Sometimes the values are represented by 0 or some other 'impossible' number, or by a text string such as "n/a". Sometimes, particularly with tab- or comma-separated files, they're just left empty.

To address this issue, there's a special data type, Missing, which indicates that there isn't a usable value at this location. If used consistently, the DataFrames package and its support for Missing fields will allow you to get the most out of your datasets, while making sure your calculations are accurate and not 'poisoned' by missing values. The missing in the Iron/Discovered cell allows the DataFrames package to 'tag' that cell as being Missing.
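
The key property of missing is that it propagates through most operations, which is what stops it from silently corrupting results:

julia> 1 + missing
missing

julia> ismissing(missing)
true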

But there's another problem that's not revealed here. The Discovered year of Carbon was set to 0, chosen to mean 'a long time ago', as it's not a valid year. (After 1 BC comes 1 AD, so Year 0 doesn't exist.) This 0 value should also be marked as being Missing in the DataFrame, before it causes havoc.

There are three approaches to making a DataFrame use missing values consistently:

- Edit the file/data outside Julia before importing, using a text editor or similar. This is sometimes the quickest and easiest way. In our simple example, we supplied the literal value missing when building the DataFrame by hand; that was enough to have the location marked as a Missing value.

- Use the options provided by the CSV package when importing the data. These let you specify rules for identifying certain values as Missing.

- Repair the DataFrame after you've imported the data, using Julia code (the approach shown in the Repairing DataFrames section below).

How to fix missing values with readtable()

There's a lot of flexibility built into the old readtable() function (and into CSV.read(), which has largely replaced it).

By default, any missing values (empty fields in the table) are replaced by missing. With the nastrings option you can specify a group of strings that will all be converted to missing values if encountered in the source text:

pt1 = readtable("file.tsv", nastrings=["NA", "na", "n/a", "missing"])

This addresses most of the problems with the varying methods used to indicate unavailable values in the original file. It's also possible to add "0" (0 in string form) to the list of nastrings, if 0 isn't used for any legitimate values.
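
If you're loading the data with CSV.jl instead, it has an equivalent option (a sketch, assuming a recent version of CSV.jl; the keyword has been spelled missingstring or missingstrings depending on the version, so check the documentation for your installed version):

julia> using CSV, DataFrames
julia> pt1 = CSV.read("file.tsv", DataFrame; delim='\t', missingstring=["NA", "na", "n/a", "missing"])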

Using DataFrames with missing values

If a column contains one or more missing values, you'll find that some calculations don't work.

For example, you can apply a function to ordinary columns easily enough. So it's easy to calculate the mean of the AtomicWeight column:

julia> mean(ptable[:AtomicWeight])
17.77312

But, because there is a missing value in the Discovered column, you can't apply a function to it:

julia> mean(ptable[:Discovered])
missing

Because just one of the fields contains a missing value, the result of the whole calculation is missing: the missing value propagates up through the arithmetic.
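
If you just want to ignore the missing entries rather than repair them, wrap the column in skipmissing():

julia> mean(skipmissing(ptable[:Discovered]))
1361.25

Note that the bogus 0 for Carbon is still included in that average, which is exactly the Year 0 problem mentioned below.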

There are two ways to fix this: edit the table so that the missing value is converted to a real value, or, when running calculations, make sure that the relevant field isn't included.

We also have that Year 0 problem to address...

Looking for missing values and others in DataFrames

To look for missing values in a DataFrame, you can try writing code using the ismissing() function. This lets you test a DataFrame cell for missing values:

nrows, ncols = size(ptable)
for row in 1:nrows
      for col in 1:ncols
        if ismissing(ptable[row,col])
         println("$(names(ptable)[col]) value for $(ptable[row,:Name]) is missing!")
        end
      end
end
Discovered value for Iron is missing!

You might think to use similar code to check for the 0 values in Discovered and replace them with missing (on the grounds that missing is a better way than 0 to mark an element whose discovery year is unknown). But this doesn't work:

for row in 1:nrows
     if ptable[row, :Discovered] == 0
        println("it's zero")
     end
end

because:

TypeError: non-boolean (Missings.Missing) used in boolean context
 
Stacktrace:
 [1] macro expansion at ./In[129]:2 [inlined]
 [2] anonymous at ./<missing>:?
 [3] include_string(::String, ::String) at ./loading.jl:515

The problem here is that comparing a missing value with a number doesn't return true or false: ptable[row, :Discovered] == 0 evaluates to missing when the cell is missing, and a missing value can't be used where Julia expects a Boolean.
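
You can see this at the REPL: == propagates missing, while isequal() and coalesce() give definite answers:

julia> missing == 0
missing

julia> isequal(missing, 0)
false

julia> coalesce(missing, 0)
0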

Instead, you can write the loop like this:

for row in 1:nrows
    if ismissing(ptable[row, :Discovered])
        println("skipping missing values")
    else
        println("the value is $(ptable[row, :Discovered])")
    end
end
the value is 1776
the value is 1895
the value is 0
the value is 1774
skipping missing values

The ismissing() function is also useful in other contexts. This is a quick way of locating the index numbers of missing values in columns (notice the . broadcast operator):

julia> findall(ismissing.(ptable[:, :Discovered])) 
1-element Array{Int64,1}:
5

and this next line returns a new DataFrame of all rows where the Discovered column contains missing values:

julia> ptable[findall(ismissing.(ptable[:,:Discovered])), :] 
1x5 DataFrame
| Row | Number | Name   | AtomicWeight | Symbol | Discovered |
|-----|--------|--------|--------------|--------|------------|
| 1   | 26     | "Iron" | 55.845       | "Fe"   | missing    |

By negating the test (!ismissing) you can instead return a DataFrame that contains no rows with missing values.
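
For example (note the broadcast negation .!):

julia> ptable[.!ismissing.(ptable[:, :Discovered]), :]

which at this point returns the four rows with a known discovery year.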

You can use ismissing() to select rows with missing values in specific columns and set them all to a new value. For example, this code finds missing discovery years and sets them to 0:

julia> ptable[ismissing.(ptable[:Discovered]), :Discovered] = 0
0
julia> ptable
5x5 DataFrames.DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered |
|-----|--------|------------|--------------|--------|------------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | 0          |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | 0          |

Repairing DataFrames

To clean your data, you can write a short fragment of code to change values that aren't acceptable. This code looks at every cell and changes any "n/a", "0", or 0 values to missing. Notice that the first test is ismissing() — that takes care of cases where the element is already a missing value; these are skipped (otherwise the comparisons that follow might fail).

for row in 1:size(ptable)[1]
   for col in 1:size(ptable)[2]
       println("processing row $row column $col ")
       temp = ptable[row,col]
       if ismissing(temp)
          println("skipping missing")
       elseif temp == "n/a" || temp == "0" || temp == 0
          ptable[row, col] = missing
          println("changed row $row column $col ")
       end
    end
end
processing row 1 column 1 
processing row 1 column 2 
processing row 1 column 3 
processing row 1 column 4 
processing row 1 column 5 
processing row 2 column 1 
processing row 2 column 2 
processing row 2 column 3 
processing row 2 column 4 
processing row 2 column 5 
processing row 3 column 1 
processing row 3 column 2 
processing row 3 column 3 
processing row 3 column 4 
processing row 3 column 5 
changed row 3 column 5
processing row 4 column 1 
processing row 4 column 2 
processing row 4 column 3 
processing row 4 column 4 
processing row 4 column 5 
processing row 5 column 1 
processing row 5 column 2 
processing row 5 column 3 
processing row 5 column 4 
processing row 5 column 5 
changed row 5 column 5
julia> ptable

5×5 DataFrames.DataFrame
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │ Discovered │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┤
│ 1   │ 1      │ Hydrogen │ 1.0079       │ H      │ 1776       │
│ 2   │ 2      │ Helium   │ 4.0026       │ He     │ 1895       │
│ 3   │ 6      │ Carbon   │ 12.0107      │ C      │ missing    │
│ 4   │ 8      │ Oxygen   │ 15.9994      │ O      │ 1774       │
│ 5   │ 26     │ Iron     │ 55.845       │ Fe     │ missing    │

Now the Discovered column has two missing values, as the discovery date of Carbon is also now considered to be unknown.

Working with missing values: completecases() and dropmissing()

To find, say, the maximum value of the Discovered column (which we know contains missing values), you can use the completecases() function. This takes a DataFrame and returns a Boolean flag for each row to indicate whether the row is free of missing values. These flags can then be used to select rows that are guaranteed not to contain missing values:

julia> maximum(ptable[completecases(ptable), :][:Discovered])
1895

This allows you to write code that should work as expected, because rows with one or more missing values are excluded from consideration.

The dropmissing() function returns a copy of the DataFrame with the rows that contain missing values removed.

julia> dropmissing(ptable)
3×5 DataFrames.DataFrame
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │ Discovered │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┤
│ 1   │ 1      │ Hydrogen │ 1.0079       │ H      │ 1776       │
│ 2   │ 2      │ Helium   │ 4.0026       │ He     │ 1895       │
│ 3   │ 8      │ Oxygen   │ 15.9994      │ O      │ 1774       │

So an alternative to the completecases() approach is:

julia> maximum(dropmissing(ptable)[:Discovered])
1895
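
Or, if you only care about one column, skipmissing() avoids building a new DataFrame at all:

julia> maximum(skipmissing(ptable[:Discovered]))
1895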

Modifying DataFrames

Adding, deleting, and renaming columns

To add a column, you could do this:

hcat(ptable, collect(1:size(ptable,1)))

which adds another column of integers (which will be called :x1) from 1 to n. (This creates a copy of the DataFrame; we haven't changed ptable or assigned the new DataFrame to a variable.)

Instead, let's add Melting and Boiling points of our chosen elements:

julia> ptable[:MP] = [-259, -272, 3500, -218, 1535]
julia> ptable[:BP] = [-253, -269, 4827, -183, 2750]
julia> ptable
5x7 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   |
|-----|--------|------------|--------------|--------|------------|------|------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 |

To illustrate how to create a new column based on things in the other columns, we'll add a column called Liquid showing for how many degrees C an element remains liquid (i.e. BP - MP):

julia> ptable[:Liquid] = map((x,y) -> y - x, ptable[:MP], ptable[:BP])
5-element DataArray{Int64,1}:
 6
 3
 1327
   35
 1215
julia> ptable
5x8 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid |
|-----|--------|------------|--------------|--------|------------|------|------|--------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   |
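
Since the columns are ordinary arrays, the same column could have been built more directly with broadcasting:

julia> ptable[:Liquid] = ptable[:BP] .- ptable[:MP]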

To add or replace a column of a DataFrame with another column of data (i.e. an array of the right length), use:

julia> ptable[:Temp] = 1:size(ptable, 1)
1:5

Do it for real:

julia> ptable[:Temp] = map((x,y) -> y * x, ptable[:Liquid], ptable[:AtomicWeight])
5-element Array{Float64,1}:
  6.0474
 12.0078
 15938.2
   559.979
 67851.7
julia> ptable
5x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid | Temp    |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 15938.2 |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 67851.7 |

First we added a new :Temp column with integers from 1 to n, then the values were replaced with the result of multiplying the atomic weight by the Liquid range. (It could have been done with just a single step, of course.)

You might want to add a column that shows the Melting Point in Fahrenheit:

julia> ptable[:MP_in_F] = map(deg -> 32 + (deg * 1.8), ptable[:MP])
5-element Array{Float64,1}:
 -434.2
 -457.6
 6332.0
 -360.4
 2795.0

julia> ptable
5x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | MP   | BP   | Liquid | Temp    | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 15938.2 | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 67851.7 | 2795.0  |

It's easy to rename columns:

julia> rename!(ptable, :Temp => :Junk)

and

julia> rename!(ptable, f => t for (f, t) = zip([:MP, :BP], [:Melt, :Boil]))
5x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | Junk    | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | 6.0474  | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | 12.0078 | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 15938.2 | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | 559.979 | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 67851.7 | 2795.0  |

(There's also a rename() function, without the exclamation mark, which doesn't change the original DataFrame.)

And to delete a column, use delete!():

julia> delete!(ptable, :Junk)

deletes the named column in place (note the !). You can also delete a set of columns at once.

Adding and deleting rows

It's easy to add rows. Use push! with some suitable data of the right length and type:

julia> push!(ptable, [29, "Copper", 63.546, "Cu", missing, 1083, 2567, missing, missing])
julia> ptable
6x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 6332.0  |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 2795.0  |
| 6   | 29     | "Copper"   | 63.546       | "Cu"   | missing    | 1083 | 2567 | missing| missing |

Those missing values should be replaced soon using the functions from earlier. We can locate the new element by name and change the value for :Liquid like this:

julia> ptable[[occursin(r"Copper", elementname) for elementname in ptable[:Name]], :Liquid] = 2567 - 1083
1484

julia> ptable[ptable[:Number] .== 29, :]
1×9 DataFrames.DataFrame
│ Row │ Number │ Name   │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │ Liquid │ MP_in_F │
├─────┼────────┼────────┼──────────────┼────────┼────────────┼──────┼──────┼────────┼─────────┤
│ 1   │ 29     │ Copper │ 63.546       │ Cu     │ missing    │ 1083 │ 2567 │ 1484   │ missing │

or we could use the atomic number to obtain access to the right row and do it that way:

julia> ptable[ptable[:, :Number] .== 6, :Liquid] = 4827 - 3500
1327

To delete rows, use the deleterows!() function, with one or more row specifiers:

julia> temp = deepcopy(ptable)  # so as not to change ptable 
julia> deleterows!(temp, 3:5)
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 29     | "Copper"   | 63.546       | "Cu"   | missing    | 1083 | 2567 | 1484   | missing |

or:

julia> temp = deepcopy(ptable)
julia> deleterows!(temp, [3,4,6])
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 2795.0  |

Alternatively, you could delete rows by specifying a condition. For example, to delete rows where the Boiling Point is less than 100 C, you could just find the rows that are greater than or equal to 100 C, then assign a variable to the result:

julia> ptable1 = ptable[ptable[:Boil] .> 100, :]
3x9 DataFrame
| Row | Number | Name     | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|----------|--------------|--------|------------|------|------|--------|---------|
| 1   | 6      | "Carbon" | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 6332.0  |
| 2   | 26     | "Iron"   | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 2795.0  |
| 3   | 29     | "Copper" | 63.546       | "Cu"   | missing    | 1083 | 2567 | 1484   | missing |

Finding values in DataFrames

To find values, the basic idea is to use an elementwise operator or function that examines all rows and returns an array of Boolean values to indicate whether each cell meets the criteria for each row:

julia> ptable[:, :Melt] .< 100
6-element BitArray{1}:
  true
  true
 false
  true
 false
 false

then use this Boolean array to select the rows:

julia> ptable[ptable[:, :Melt] .< 100, :]
3x9 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  |
| 3   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  |

You could use something like this to return rows where a column matches a regular expression:

julia> ptable[[occursin(r"Co", elementname) for elementname in ptable[:Name]], :]
1×9 DataFrames.DataFrame
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │ Liquid │ MP_in_F │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┼──────┼──────┼────────┼─────────┤
│ 1   │ 29     │ Copper   │ 63.546       │ Cu     │ missing    │ 1083 │ 2567 │ 1484   │ missing │

and you can edit elements in the same way; here, for example, we fill in the missing Fahrenheit melting point for the new row:

julia> ptable[[occursin(r"Copper", elementname) for elementname in ptable[:Name]], :MP_in_F] = 32 + 1083 * 1.8
1981.4

Here's a function to look for types in a specific column:

julia> ptable[[(typeof(element) == Missing) for element in ptable[:, :Discovered]], :]
3×9 DataFrames.DataFrame
│ Row │ Number │ Name   │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │ Liquid │ MP_in_F │
├─────┼────────┼────────┼──────────────┼────────┼────────────┼──────┼──────┼────────┼─────────┤
│ 1   │ 6      │ Carbon │ 12.0107      │ C      │ missing    │ 3500 │ 4827 │ 1327   │ 6332.0  │
│ 2   │ 26     │ Iron   │ 55.845       │ Fe     │ missing    │ 1535 │ 2750 │ 1215   │ 2795.0  │
│ 3   │ 29     │ Copper │ 63.546       │ Cu     │ missing    │ 1083 │ 2567 │ 1484   │ 1981.4  │

To find matching entries:

julia> ptable[occursin.("Copper", ptable[:Name]), :]
julia> ptable[occursin.(r"C.*", ptable[:Name]), :]

or:

julia> ptable[occursin.(r"Co", ptable[:Name]), :]

Subsets and groups

To investigate subsets and groupings, we'll add another column:

julia> ptable[:Room] = [:Gas, :Gas, :Solid, :Gas, :Solid, :Solid]
6-element Array{Symbol,1}:
 :Gas  
 :Gas  
 :Solid
 :Gas
 :Solid
 :Solid
julia> ptable
6x10 DataFrame
| Row | Number | Name       | AtomicWeight | Symbol | Discovered | Melt | Boil | Liquid | MP_in_F | Room  |
|-----|--------|------------|--------------|--------|------------|------|------|--------|---------|-------|
| 1   | 1      | "Hydrogen" | 1.0079       | "H"    | 1776       | -259 | -253 | 6      | -434.2  | Gas   |
| 2   | 2      | "Helium"   | 4.0026       | "He"   | 1895       | -272 | -269 | 3      | -457.6  | Gas   |
| 3   | 6      | "Carbon"   | 12.0107      | "C"    | missing    | 3500 | 4827 | 1327   | 6332.0  | Solid |
| 4   | 8      | "Oxygen"   | 15.9994      | "O"    | 1774       | -218 | -183 | 35     | -360.4  | Gas   |
| 5   | 26     | "Iron"     | 55.845       | "Fe"   | missing    | 1535 | 2750 | 1215   | 2795.0  | Solid |
| 6   | 29     | "Copper"   | 63.546       | "Cu"   | missing    | 1083 | 2567 | 1484   | 1981.4  | Solid |

This column gives the state of each element at room temperature.

It's now possible to collect up the elements according to their group membership. The groupby() function splits the original DataFrame into a group of SubDataFrames according to the values in the named column. Here, with three elements that are gases at room temperature and three that are solid, we obtain two SubDataFrames:

julia> groupby(ptable, [:Room])
DataFrames.GroupedDataFrame  2 groups with keys: Symbol[:Room]
First Group:
3×10 DataFrames.SubDataFrame{Array{Int64,1}}. Omitted printing of 3 columns
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┼──────┼──────┤
│ 1   │ 1      │ Hydrogen │ 1.0079       │ H      │ 1776       │ -259 │ -253 │
│ 2   │ 2      │ Helium   │ 4.0026       │ He     │ 1895       │ -272 │ -269 │
│ 3   │ 8      │ Oxygen   │ 15.9994      │ O      │ 1774       │ -218 │ -183 │
⋮
Last Group:
3×10 DataFrames.SubDataFrame{Array{Int64,1}}. Omitted printing of 3 columns
│ Row │ Number │ Name   │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │
├─────┼────────┼────────┼──────────────┼────────┼────────────┼──────┼──────┤
│ 1   │ 6      │ Carbon │ 12.0107      │ C      │ missing    │ 3500 │ 4827 │
│ 2   │ 26     │ Iron   │ 55.845       │ Fe     │ missing    │ 1535 │ 2750 │
│ 3   │ 29     │ Copper │ 63.546       │ Cu     │ missing    │ 1083 │ 2567 │

Another function, by(), lets you group the rows and then apply a function to each group. In this next example, the anonymous function finds the mean of all the values in the Melt column, for each group in turn. The result is a single value for each group — the mean melting point of the gases, and the mean melting point of the solid elements:

julia> by(ptable, [:Room], df -> mean(df[:Melt]))
2×2 DataFrames.DataFrame
│ Row │ Room  │ x1       │
├─────┼───────┼──────────┤
│ 1   │ Gas   │ -249.667 │
│ 2   │ Solid │ 2039.33  │
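
by() also accepts a do-block, which is handy when you want several summary columns per group. Here's a sketch (assuming the same DataFrames version used throughout this chapter) that computes the mean melting and boiling points for each group:

julia> by(ptable, :Room) do df
           DataFrame(MeanMelt = mean(df[:Melt]), MeanBoil = mean(df[:Boil]))
       end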

Sorting

The by() function can also be used for counting rows by group. The nrow() function returns the number of rows in a DataFrame, so applying it to each group counts the group sizes:

julia> by(ptable, :Room, nrow)
2x2 DataFrame
| Row | Room  | x1 |
|-----|-------|----|
| 1   | Gas   | 3  |
| 2   | Solid | 3  |

The sort!() function works with DataFrames as well. You supply the columns to sort on, using the following syntax:

 julia> sort!(ptable, [order(:Room), order(:AtomicWeight)])
6×10 DataFrames.DataFrame
│ Row │ Number │ Name     │ AtomicWeight │ Symbol │ Discovered │ Melt │ Boil │ Liquid │ MP_in_F │ Room  │
├─────┼────────┼──────────┼──────────────┼────────┼────────────┼──────┼──────┼────────┼─────────┼───────┤
│ 1   │ 1      │ Hydrogen │ 1.0079       │ H      │ 1776       │ -259 │ -253 │ 6      │ -434.2  │ Gas   │
│ 2   │ 2      │ Helium   │ 4.0026       │ He     │ 1895       │ -272 │ -269 │ 3      │ -457.6  │ Gas   │
│ 3   │ 8      │ Oxygen   │ 15.9994      │ O      │ 1774       │ -218 │ -183 │ 35     │ -360.4  │ Gas   │
│ 4   │ 6      │ Carbon   │ 12.0107      │ C      │ missing    │ 3500 │ 4827 │ 1327   │ 6332.0  │ Solid │
│ 5   │ 26     │ Iron     │ 55.845       │ Fe     │ missing    │ 1535 │ 2750 │ 1215   │ 2795.0  │ Solid │
│ 6   │ 29     │ Copper   │ 63.546       │ Cu     │ missing    │ 1083 │ 2567 │ 1484   │ 1981.4  │ Solid │

The resulting DataFrame is sorted first by state at room temperature (so Gas before Solid), then by atomic weight within each group (so Iron before Copper).
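
To reverse the direction of one of the sort keys, order() takes a rev keyword. For example, to put the heaviest elements first within each group (a non-mutating sketch):

julia> sort(ptable, [order(:Room), order(:AtomicWeight, rev=true)])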