Data Mining Algorithms In R/Classification/Outliers
Outlier detection is one of the most important tasks in data analysis. In the rule-based approach described here, an expert explores a set of association rules to see how far the interestingness measures of these rules lie from their average values in different subsets of the database. The threshold that numerically separates abnormal from normal data is often the basis for important decisions. Most methods for univariate outlier detection are based on (robust) estimation of location and scatter, or on quantiles of the data. A major disadvantage is that these rules do not depend on the sample size, although such a dependence is desirable so that the threshold can be adjusted to the number of observations. Moreover, outliers are flagged even in “clean” data, or at least no distinction is made between outliers and the extremes of a distribution.
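The two families of univariate rules mentioned above can be sketched in base R. This is only an illustration: the function names `mad_outliers` and `iqr_outliers`, the cutoffs `k = 3` and `coef = 1.5`, and the sample vector are assumptions made for this example, not part of any package discussed here.

```r
# Two common univariate outlier rules (illustrative sketch).
# x: numeric vector; k and coef are conventional, adjustable cutoffs.

# Rule 1: robust z-score using the median (location) and the MAD (scatter).
mad_outliers <- function(x, k = 3) {
  center  <- median(x, na.rm = TRUE)
  scatter <- mad(x, na.rm = TRUE)  # mad() rescales by 1.4826 by default
  abs(x - center) / scatter > k
}

# Rule 2: quantile (boxplot) rule based on the interquartile range.
iqr_outliers <- function(x, coef = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- coef * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up values for illustration; the last one is a planted extreme.
x <- c(10, 12, 11, 13, 12, 11, 10, 50)
which(mad_outliers(x))
which(iqr_outliers(x))
```

Neither cutoff changes with the length of `x`, which is the sample-size independence criticized above. Applying `iqr_outliers` to a large sample drawn with `rnorm` will typically still flag a small fraction of points, even though such data contain no real outliers.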
Discovery methods for interesting exception rules can be divided into two approaches from the viewpoint of background knowledge:
- In a directed approach, a method is first provided with background knowledge typically in the form of rules, then the method obtains exception rules each of which deviates from these rules;
- In an undirected approach, on the other hand, no background knowledge is provided.
The problem can be summarized as finding a set of rule pairs, each consisting of an exception rule associated with a strong rule. Suppose a strong rule is represented by "if Y then x", where Y = y1 ^ y2 ^ ... ^ yn is a conjunction of atoms and x is a single atom. Let Z = z1 ^ z2 ^ ... ^ zm be a conjunction of atoms and x' a single atom that has the same attribute as x but a different value; the exception rule is then represented by "if Y and Z then x'". For instance, the rule "using a seat belt is risky for a child" is an exception to the well-known fact "using a seat belt is safe".
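A rule pair of this kind can be illustrated in R with a tiny example. The records below are invented purely for illustration and are not real accident data. The helper `rule_confidence` is a hypothetical name, and using confidence as the interestingness measure is an assumption of this sketch.

```r
# Toy records, invented for illustration only (not real data).
records <- data.frame(
  seat_belt = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  child     = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  outcome   = c("safe", "safe", "safe", "safe", "safe", "safe",
                "risky", "risky", "risky", "risky")
)

# Confidence of "if antecedent then outcome == value":
# the share of records matching the antecedent that also show that outcome.
rule_confidence <- function(antecedent, outcome, value) {
  mean(outcome[antecedent] == value)
}

# Strong rule: if seat_belt then safe (Y = seat_belt, x = safe).
strong <- with(records, rule_confidence(seat_belt, outcome, "safe"))

# Exception rule: if seat_belt and child then risky (Z = child, x' = risky).
exception <- with(records, rule_confidence(seat_belt & child, outcome, "risky"))

c(strong = strong, exception = exception)
```

In this toy table the strong rule holds for most records with a seat belt, while the narrower antecedent "seat belt and child" points to the opposite outcome. That combination is what makes the second rule an exception to the first.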
- Step 1: If R is already installed on your system, skip to Step 2. On a system that provides apt-get, R can be installed by typing the following command:
$ sudo apt-get install r-base-core
If your system does not have apt-get capabilities, don't give up! You can download the package by visiting the Outlier Package website.
- Step 2: Install the mvoutlier package. To do so, first start R with the command:
$ R
Then type the following command in the R environment:
> install.packages("mvoutlier")
The installation is done.
Now load the package:
> library(mvoutlier)
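For scripts that are run repeatedly, the install and load steps can be combined so that the package is only downloaded when it is missing. This is a minimal sketch; `load_or_install` is a hypothetical helper name, and installation assumes internet access and a configured CRAN mirror.

```r
# Install a package only when it is not already available, then load it.
# `load_or_install` is a hypothetical helper name used for this sketch.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    utils::install.packages(pkg)  # needs internet access and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Usage (commented out so that nothing is installed by accident):
# load_or_install("mvoutlier")
```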
In order to show how we can visualize the results of the mvoutlier package, we will use a practical example. The data set and its use in mvoutlier are described below.
Swiss Fertility and Socioeconomic Indicators (1888) Data
swiss is a data set that contains a standardized fertility measure and socio-economic indicators for each of the 47 French-speaking provinces of Switzerland in about 1888.
A data frame with 47 observations on 6 variables, each of which is in percent.
|Fertility||Ig, ‘common standardized fertility measure’|
|Agriculture||% of males involved in agriculture as occupation|
|Examination||% draftees receiving highest mark on army examination|
|Education||% education beyond primary school for draftees.|
|Catholic||% "catholic" (as opposed to "protestant").|
|Infant.Mortality||live births who live less than 1 year.|
All variables but "Fertility" give proportions of the population.
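Before looking for outliers, it can help to check that the data set matches the description above. The following lines use only base R and the `datasets` package that ships with it.

```r
data(swiss)      # ships with R in the 'datasets' package
dim(swiss)       # number of provinces (rows) and variables (columns)
str(swiss)       # variable names and types
summary(swiss)   # per-variable minimum, quartiles, mean and maximum
```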
To find the outliers of this dataset just type these two commands below:
> data(swiss)
> uni.plot(swiss)
The above commands will generate the following figure:
The exceptions are found by analyzing the correlations among the features represented by the columns. For example, the red points next to the value 0 would not be outliers if each column were analyzed separately, but they are outliers once the correlations are taken into account. Note also that the observations flagged as outliers are the same in every column.
Within each column, the vertical axis shows the robustly centered and scaled values of that variable, so the zero point corresponds to the robust center of the variable. Observations are marked as outliers according to their robust Mahalanobis distance based on the MCD estimator.
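To make the idea of a robust Mahalanobis distance more concrete, the distances can be computed by hand. The sketch below uses `cov.rob` from the MASS package with `method = "mcd"` and a fixed chi-square cutoff. Both choices are assumptions of this example. `uni.plot` may rely on a different MCD implementation and an adjusted cutoff, so the flagged set need not match its output exactly.

```r
library(MASS)   # recommended package distributed with R; provides cov.rob()

data(swiss)
x <- as.matrix(swiss)
set.seed(1)     # the MCD search may use random subsets, so fix the seed

mcd <- cov.rob(x, method = "mcd")            # robust center and covariance
d2  <- mahalanobis(x, mcd$center, mcd$cov)   # squared robust distances
rd  <- sqrt(d2)

# Fixed cutoff: 97.5% quantile of the chi-square distribution
# with as many degrees of freedom as there are variables.
cutoff  <- sqrt(qchisq(0.975, df = ncol(x)))
flagged <- rownames(x)[rd > cutoff]
flagged
```

Because the distance combines all six variables through the robust covariance matrix, a province can be flagged even when none of its individual values looks extreme.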
A more detailed output can be reached with the following commands:
> data(swiss)
> uni.plot(swiss, symb=TRUE)
In the first picture there are only two colors and no special symbols are used. Outliers are marked red.
In the second picture the argument symb is set to TRUE. In this case, different symbols (a cross means a large value, a circle means a small value) are used according to the robust Mahalanobis distance based on the MCD estimator, and different colors (red means a large value, blue means a small value) according to the Euclidean distances of the observations.
Besides highlighting the outliers in the figure, the command generates a table that identifies which elements correspond to the highlighted outliers. In this table, the elements marked as TRUE are the outliers. An example of this table is shown in the case study.
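The TRUE/FALSE table can also be stored and queried instead of being read off the console. The sketch below assumes that the list returned by `uni.plot` has a logical component named `outliers`, as described in the mvoutlier manual. `flagged_names` is a hypothetical helper, and the `uni.plot` call is skipped when mvoutlier is not installed.

```r
data(swiss)

# Hypothetical helper: row names of the observations whose flag is TRUE.
flagged_names <- function(flags, data) rownames(data)[which(flags)]

if (requireNamespace("mvoutlier", quietly = TRUE)) {
  res <- mvoutlier::uni.plot(swiss)          # draws the figure, returns a list
  print(res$outliers)                        # TRUE marks an outlier
  print(flagged_names(res$outliers, swiss))  # names of the flagged provinces
}
```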
Suppose you want to buy an antique car, because you're a famous collector. You have a list with many characteristics of each car.
A car that stands out would be a good idea, but a car that "stands out" can be very good or very bad. So which car should you buy?
The dataset that we are going to use in this case study, called mtcars, was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 variables.
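|mpg||Miles/(US) gallon|
|cyl||Number of cylinders|
|disp||Displacement (cu.in.)|
|hp||Gross horsepower|
|drat||Rear axle ratio|
|wt||Weight (1000 lbs)|
|qsec||1/4 mile time|
|vs||Engine (0 = V-shaped, 1 = straight)|
|am||Transmission (0 = automatic, 1 = manual)|
|gear||Number of forward gears|
|carb||Number of carburetors|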
As a collector, you are only interested in three characteristics: mpg, qsec and hp, so the dataset needs to be filtered. Moreover, you are only interested in the first 15 cars of the list, because you already have the others :)
Loading the dataset:
> data(mtcars)
Filtering the dataset:
> cars = mtcars[1:15, c("mpg", "qsec", "hp")]
To see what we have filtered:
> cars
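A quick check of the filtered subset can be sketched as follows. Sorting by `mpg` is only a convenience added for this example, so that the most and least economical cars appear at the ends of the listing.

```r
data(mtcars)
cars <- mtcars[1:15, c("mpg", "qsec", "hp")]

dim(cars)                 # 15 cars, 3 characteristics
cars[order(cars$mpg), ]   # the same rows, sorted from thirstiest to most economical
```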
Finding the cars that stand out:
The "log" in the above command is used to put the y-axis on a logarithmic scale.
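One way such a call could look is sketched below, under the assumption that the logarithm is applied to the data before plotting. In that case `log(cars)` transforms the values themselves rather than only relabeling an axis. The `uni.plot` call is skipped when mvoutlier is not installed.

```r
data(mtcars)
cars <- mtcars[1:15, c("mpg", "qsec", "hp")]

# Natural logarithm of every column; valid because all values are positive.
log_cars <- log(cars)

if (requireNamespace("mvoutlier", quietly = TRUE)) {
  # Assumption of this sketch: the transform is applied to the data itself.
  res <- mvoutlier::uni.plot(log_cars)
  print(res$outliers)   # TRUE marks a car that stands out
}
```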
The first figure shows us that we have two outliers, i.e., two cars that stand out. In the figure, these cars are represented by the red points. The cars flagged as outliers are the same in every column.
It is worth noting that the exceptions are found by analyzing the correlations among the features of the cars. For example, the red dot next to the value 0 in the qsec column would not be an outlier if that column were analyzed separately, but it is an outlier once the correlations are taken into account.
To find out which cars stand out, we analyze the second figure, where the outliers are clearly marked as TRUE. The outliers are the Merc 230 and the Cadillac Fleetwood. Now the collector's life is easier: it only remains to decide which kind of "stand out" is of interest. Looking at the filtered dataset, we can see that the Merc 230 is very economical but slower and less powerful, whereas the Cadillac Fleetwood is faster and much more powerful but consumes far more fuel. A good collector will likely choose the Cadillac Fleetwood, because it was probably the most desired car.
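The comparison between the two flagged cars can be checked against the rest of the group with a few lines of base R. Using the column medians as the reference point is an assumption of this sketch.

```r
data(mtcars)
cars  <- mtcars[1:15, c("mpg", "qsec", "hp")]
picks <- c("Merc 230", "Cadillac Fleetwood")

# The two flagged cars next to the median of each characteristic.
comparison <- rbind(cars[picks, ], median = sapply(cars, median))
comparison
```

When reading the result, keep in mind that qsec is a quarter-mile time, so a lower value means a faster car, while a higher mpg means lower fuel consumption.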
Congratulations!!! You have just bought this amazing car!!!
- P. Filzmoser. A multivariate outlier detection method. In S. Aivazian, P. Filzmoser, and Yu. Kharin, editors, Proceedings of the Seventh International Conference on Computer Data Analysis and Modeling, volume 1, pp. 18-22, Belarusian State University, Minsk, 2004.
- E. C. Gonçalves, C. V. N. Albuquerque, and A. Plastino. Mineração de Exceções Aplicada aos Sistemas para Detecção de Intrusões. Revista Eletrônica de Sistemas de Informação, v. 8, pp. 1-9, 2006.
- Outlier Package
- Outlier Manual
- Swiss
- Mahalanobis Distance
- mtcars dataset