# Data Mining Algorithms In R/Classification/Naïve Bayes

## Introduction

This chapter introduces the Naïve Bayes algorithm for classification. Naïve Bayes (NB) is based on applying Bayes' theorem (from probability theory) with strong (naïve) independence assumptions. It is particularly suited to problems where the dimensionality of the input is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods.

## Naïve Bayes

Naïve Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical. Given a set of variables ${\displaystyle X}$ = {${\displaystyle x_{1},x_{2},x_{3},\dots ,x_{d}}$}, we want to construct the posterior probability of the event ${\displaystyle C_{j}}$ among a set of possible outcomes ${\displaystyle C}$ = {${\displaystyle c_{1},c_{2},c_{3},\dots ,c_{n}}$}. In more familiar language, ${\displaystyle X}$ is the set of predictors and ${\displaystyle C}$ is the set of categorical levels present in the dependent variable. Using Bayes' rule:

${\displaystyle p(C\vert x_{1},\dots ,x_{d})={\frac {p(C)\ p(x_{1},\dots ,x_{d}\vert C)}{p(x_{1},\dots ,x_{d})}}.\,}$

where ${\displaystyle p(C_{j}\vert x_{1},\dots ,x_{d})}$ is the posterior probability of class membership, i.e., the probability that ${\displaystyle X}$ belongs to ${\displaystyle C_{j}}$.

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on ${\displaystyle C}$ and the values of the features ${\displaystyle x_{i}}$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability:

${\displaystyle p(C,x_{1},\dots ,x_{d})=p(C)\ p(x_{1}\vert C)\ p(x_{2}\vert C,x_{1})\ p(x_{3}\vert C,x_{1},x_{2})\ \dots p(x_{d}\vert C,x_{1},x_{2},x_{3},\dots ,x_{d-1}).}$

This is where the "naïve" conditional independence assumptions come into play: assume that each feature ${\displaystyle x_{i}}$ is conditionally independent of every other feature ${\displaystyle x_{j}}$ for ${\displaystyle j\neq i}$, given the class ${\displaystyle C}$. This means that

${\displaystyle p(x_{i}\vert C,x_{j})=p(x_{i}\vert C)\,}$

for ${\displaystyle i\neq j}$, and so the joint model can be expressed as

${\displaystyle p(C,x_{1},\dots ,x_{d})=p(C)\ p(x_{1}\vert C)\ p(x_{2}\vert C)\ p(x_{3}\vert C)\ \cdots \,}$
${\displaystyle =p(C)\prod _{i=1}^{d}p(x_{i}\vert C).\,}$

This means that under the above independence assumptions, the conditional distribution over the class variable ${\displaystyle C}$ can be expressed like this:

${\displaystyle p(C\vert x_{1},\dots ,x_{d})={\frac {1}{Z}}p(C)\prod _{i=1}^{d}p(x_{i}\vert C)}$

where ${\displaystyle Z}$ (the evidence) is a scaling factor dependent only on ${\displaystyle x_{1},\dots ,x_{d}}$, i.e., a constant if the values of the feature variables are known.
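Written out, this normalizing constant is simply the total probability of the observed feature values, obtained by summing the numerator over all possible classes:

${\displaystyle Z=p(x_{1},\dots ,x_{d})=\sum _{c}p(C=c)\prod _{i=1}^{d}p(x_{i}\vert C=c).}$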

Finally, we can label a new case with observed feature values ${\displaystyle F_{1},\dots ,F_{d}}$ with the class level ${\displaystyle C_{j}}$ that achieves the highest posterior probability:

${\displaystyle \mathrm {classify} (F_{1},\dots ,F_{d})={\underset {c}{\operatorname {argmax} }}\ p(C=c)\displaystyle \prod _{i=1}^{d}p(x_{i}=F_{i}\vert C=c).}$
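The decision rule above can be sketched in a few lines of plain R. The tiny weather-style dataset and all variable names below are hypothetical, used purely to illustrate how priors and per-feature likelihoods combine; this is not the e1071 implementation.

```r
# Toy training data with two categorical features and a binary class.
train <- data.frame(
  outlook = c("sunny", "sunny", "rainy", "rainy", "overcast", "overcast"),
  windy   = c("yes",   "no",    "yes",   "no",    "yes",      "no"),
  play    = c("no",    "yes",   "no",    "yes",   "yes",      "yes")
)

# Prior p(C = c): relative class frequencies.
prior <- prop.table(table(train$play))

# Likelihood tables p(x_i | C): per-class relative frequencies of each feature.
lik_outlook <- prop.table(table(train$play, train$outlook), margin = 1)
lik_windy   <- prop.table(table(train$play, train$windy),   margin = 1)

# classify(F) = argmax_c  p(C = c) * prod_i p(x_i = F_i | C = c)
new_case <- list(outlook = "sunny", windy = "no")
scores <- sapply(names(prior), function(c) {
  prior[c] * lik_outlook[c, new_case$outlook] * lik_windy[c, new_case$windy]
})
names(which.max(scores))  # predicted class for the new case
```

Note that the unnormalized scores suffice for classification, since dividing by ${\displaystyle Z}$ does not change which class attains the maximum.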

### Available Implementations

There are at least two R implementations of Naïve Bayes classification available on CRAN; this chapter uses the `naiveBayes` function from the e1071 package.

### Installing and Running the Naïve Bayes Classifier

e1071 is a CRAN package, so it can be installed from within R:

> install.packages('e1071', dependencies = TRUE)


Once installed, e1071 can be loaded as a library:

> library(e1071)


R ships with several well-known datasets in its built-in datasets package. We now load a sample dataset, the famous Iris dataset [1], and learn a Naïve Bayes classifier for it using default parameters. First, let us take a look at the Iris dataset.

### Dataset

The Iris dataset contains 150 instances, corresponding to three equally frequent species of iris plant (Iris setosa, Iris versicolor, and Iris virginica). An Iris versicolor is shown below, courtesy of Wikimedia Commons.

Iris versicolor

Each instance contains four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The next picture shows each attribute plotted against the others, with the different classes in color.

> pairs(iris[1:4], main = "Iris Data (red=setosa,green=versicolor,blue=virginica)",
+       pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Plotting the Iris attributes

### Execution and Results

First of all, we need to specify which dataset we are going to use:

> data(iris)
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

After that, we are ready to fit a Naïve Bayes model to the dataset, using the first four columns to predict the fifth. (If the target column is not already a factor, convert it first: dataset$col <- factor(dataset$col).)


> classifier<-naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
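As a quick check on this result, the overall accuracy can be recomputed from the diagonal of the printed confusion matrix. The values below are copied by hand from the table above, so no packages are required:

```r
# Confusion matrix as printed above: rows = predicted class, columns = actual class.
tab <- matrix(c(50,  0,  0,
                 0, 47,  3,
                 0,  3, 47),
              nrow = 3, byrow = TRUE,
              dimnames = list(predicted = c("setosa", "versicolor", "virginica"),
                              actual    = c("setosa", "versicolor", "virginica")))

# Accuracy = correctly classified instances (the diagonal) / total instances.
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 144 / 150 = 0.96
```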



### Analysis

This simple case study shows that a Naïve Bayes classifier makes few mistakes on a dataset that, although simple, is not linearly separable, as the scatterplots show. A look at the confusion matrix confirms that all misclassifications are between Iris versicolor and Iris virginica instances.

### References

1. Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics, 7, Part II, 179–188.