Data Mining Algorithms In R/Classification/kNN
This chapter introduces the k-Nearest Neighbors (kNN) algorithm for classification. kNN, originally proposed by Fix and Hodges  is a very simple 'instance-based' learning algorithm. Despite its simplicity, it can offer very good performance on some problems. We present a high level overview of the algorithm, explaining the relevant parameters and implementation choices, followed by a step by step case study.
The kNN algorithm, like other instance-based algorithms, is unusual from a classification perspective in its lack of explicit model training. While a training dataset is required, it is used solely to populate a sample of the search space with instances whose class is known. No actual model or learning is performed during this phase; for this reason, these algorithms are also known as lazy learning algorithms. Different distance metrics can be used, depending on the nature of the data. Euclidean distance is typical for continuous variables, but other metrics can be used for categorical data. Specialized metrics are often useful for specific problems, such as text classification. When an instance whose class is unknown is presented for evaluation, the algorithm computes its k closest neighbors, and the class is assigned by voting among those neighbors. To prevent ties, one typically uses an odd choice of k for binary classification. For multiple classes, one can use plurality voting or majority voting. The latter can sometimes result in no class being assigned to an instance, while the former can result in classifications being made with very low support from the neighborhood. One can also weight each neighbor by an inverse function of its distance to the instance being classified. The main advantages of kNN for classification are:
- Very simple implementation.
- Robust with regard to the search space; for instance, classes don't have to be linearly separable.
- Classifier can be updated online at very little cost as new instances with known classes are presented.
- Few parameters to tune: distance metric and k.
The main disadvantages of the algorithm are:
- Expensive testing of each instance, as we need to compute its distance to all known instances. Specialized algorithms and heuristics exist for specific problems and distance functions, which can mitigate this issue. This is problematic for datasets with a large number of attributes. When the number of instances is much larger than the number of attributes, a R-tree or a kd-tree can be used to store instances, allowing for fast exact neighbor identification.
- Sensitiveness to noisy or irrelevant attributes, which can result in less meaningful distance numbers. Scaling and/or feature selection are typically used in combination with kNN to mitigate this issue.
- Sensitiveness to very unbalanced datasets, where most entities belong to one or a few classes, and infrequent classes are therefore often dominated in most neighborhoods. This can be alleviated through balanced sampling of the more popular classes in the training stage, possibly coupled with ensembles.
The training phase for kNN consists of simply storing all known instances and their class labels. A tabular representation can be used, or a specialized structure such as a kd-tree. If we want to tune the value of 'k' and/or perform feature selection, n-fold cross-validation can be used on the training dataset. The testing phase for a new instance 't', given a known set 'I' is as follows:
- Compute the distance between 't' and each instance in 'I'
- Sort the distances in increasing numerical order and pick the first 'k' elements
- Compute and return the most frequent class in the 'k' nearest neighbors, optionally weighting each instance's class by the inverse of its distance to 't'
There are at least three implementations of kNN classification for R, all available on CRAN:
- RWeka, which is a bridge to the popular WEKA  machine and datamining toolkit, and provides a kNN implementation as well as dozens of algorithms for classification, clustering, regression, and data engineering.
We choose RWeka for this tutorial, as it provides a lot more than simply kNN classification, and the sample session shown below can be used to generate other classifiers with little modification.
Installing and Running RWeka
RWeka is a CRAN package, so it can be installed from within R:
> install.packages("RWeka", dependencies = TRUE)
Most R graphical user interfaces also provide package management through their UIs. Once installed, RWeka can be loaded in as a library:
It comes with several well-known datasets, which can be loaded in as ARFF files (Weka's default file format). We now load a sample dataset, the famous Iris dataset , and learn a kNN classifier for it, using default parameters:
> iris <- read.arff(system.file("arff", "iris.arff", package = "RWeka")) > classifier <- IBk(class ~., data = iris) > summary(classifier) === Summary === Correctly Classified Instances 150 100 % Incorrectly Classified Instances 0 0 % Kappa statistic 1 Mean absolute error 0.0085 Root mean squared error 0.0091 Relative absolute error 1.9219 % Root relative squared error 1.9335 % Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 50 0 0 | a = Iris-setosa 0 50 0 | b = Iris-versicolor 0 0 50 | c = Iris-virginica
While in the above session we have only used default parameters, RWeka allows us to customize the KNN classifier in several ways, aside from setting the value of k:
- We can weight neighbors by the inverse of their distance to the target instance, or by 1 - distance.
- We can use leave-one-out cross-validation to choose the optimal value for k in the training data.
- We can use a sliding window of training instances, so once the classifier knows about W instances, older instances are dropped as new ones are added.
- We can customize the way the algorithm stores the known instances, which allows us to use kd-trees and similar data structures for faster neighbor search in large datasets.
We will now perform a more detailed exploration of the Iris dataset, using cross-validation for real test statistics, and also performing some parameter tuning.
The Iris dataset contains 150 instances, corresponding to three equally-frequent species of iris plant (Iris setosa, Iris versicolour, and Iris virginica). An Iris versicolor is shown below, courtesy of Wikimedia Commons.
Each instance contains four attributes:sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The next picture shows each attribute plotted against the others, with the different classes in color.
Execution and Results
We will generate a kNN classifier, but we'll let RWeka automatically find the best value for k, between 1 and 20. We'll also use 10-fold cross validation to evaluate our classifier:
> classifier <- IBk(class ~ ., data = iris, control = Weka_control(K = 20, X = TRUE)) > evaluate_Weka_classifier(classifier, numFolds = 10) === 10 Fold Cross Validation === === Summary === Correctly Classified Instances 142 94.6667 % Incorrectly Classified Instances 8 5.3333 % Kappa statistic 0.92 Mean absolute error 0.041 Root mean squared error 0.1414 Relative absolute error 9.2339 % Root relative squared error 29.9987 % Total Number of Instances 150 === Confusion Matrix === a b c <-- classified as 50 0 0 | a = Iris-setosa 0 46 4 | b = Iris-versicolor 0 4 46 | c = Iris-virginica > classifier IB1 instance-based classifier using 6 nearest neighbour(s) for classification
As cross-validation generates random partitions of the dataset, your results may vary. Typically, however, the model will make fewer than 10 mistakes.
This simple case study shows that a kNN classifier makes few mistakes in a dataset that, although simple, is not linearly separable, as shown in the scatterplots and by a look at the confusion matrix, where all misclassifications are between Iris Versicolor and Iris Virginica instances. The case study also shows how RWeka makes it trivially easy to learn classifiers (and predictors, and clustering models as well), and to experiment with their parameters. In this classic dataset, a kNN classifier makes very few mistakes.
- ^ Fix, E., Hodges, J.L. (1951); Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas.
- ^ Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
- ^ Fisher,R.A. (1936); The use of multiple measurements in taxonomic problems. Annual Eugenics, 7, Part II, 179-188.