Data Mining Algorithms In R/Clustering/Density-Based Clustering

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Clustering techniques have an important role in class identification of records on a database, therefore it’s been established as one of the main topics of research in data mining. Spatial clustering techniques are a subset of clustering techniques applied on databases whose records have attributes intrinsically related to some spatial semantics. Most of the traditional clustering techniques described in the previous chapters can be applied to spatial databases. However, when applied to tasks like class identification on a spatial context, the traditional techniques might be unable to achieve good results, e.g. elements in the same cluster might not share enough similarity or the performance may be prohibitively poor.

The class identification task assisted by spatial clustering algorithms has a wide range of applications as finding relevant information on increasingly large spatial databases have recently become a highly demanded task. Examples include geographic data from satellite images, medical x-ray image processing or pattern recognition in machine learning samples.

Although there have been proposed an extensive number of techniques for clustering space-related data, many of the traditional clustering algorithm specified by them suffer from a number of drawbacks. Firstly, techniques based on k-partitioning such as those based on k-means are restricted to clusters structured on a convex-shaped fashion. Many databases have clusters with a broad variety of shapes (Figure 1), hence the traditional k-partitioning algorithms will fail to produce satisfactory results. Secondly, most techniques require previous knowledge of the database (domain knowledge) to determine the best input parameters. For example, k-means takes as input the number of expected clusters, k, which must be previously known for whichever database it’s applied on. In many real-life databases there is not an a priori domain knowledge and therefore choosing parameters values based on guesses will probably lead to incomplete and undesirable results. Finally, most techniques are not scalable, which means they can not be used for large databases, such as those made up by hundreds of thousands of elements. Although most techniques derives algorithms belonging to a polynomial run time complexity class, the cost can become prohibitively large when they are applied on huge databases, e.g. millions or billions of elements.

The restrictions mentioned above can be overcome by using a new approach, which is based on density for deciding which clusters each element will be in. The next session will introduce this new approach, DBSCAN, which stands for density-based algorithm for discovering clusters in large spatial databases with noise.

DBSCAN: Density Based Spatial Clustering of Applications with Noise[edit | edit source]

The idea behind constructing clusters based on the density properties of the database is derived from a human natural clustering approach. By looking at the two-dimensional database showed in figure 1, one can almost immediately identify three clusters along with several points of noise. The clusters and consequently the classes are easily and readily identifiable because they have an increased density with respect to the points they possess. On the other hand, the single points scattered around the database are outliers, which means they do not belong to any clusters as a result of being in an area with relatively low concentration.

File:Two-Dimensional Database With Three Clusters.PNG
Figure 1: A two-dimensional spatial database with three non-convex shaped clusters and several outliers.

Furthermore, as will be explained in the following sections, the DBSCAN algorithm requires at most two parameters: a density metric and the minimum size of a cluster. As a result, estimating the number of clusters a priori is not a need, as opposed to other techniques, namely k-means. Finally, as will be demonstrated later, the DBSCAN is efficient even when applied on large databases.

Important Concepts[edit | edit source]

Prior to describing the DBSCAN algorithm, some concepts must be explained to its fully understanding. The elements of the database can be classified in two different types: the border points, the points located on the extremities of the cluster, and the core points, which are located on its inner region. A neighborhood of a point p is a set of all points that have a distance measure less than a predetermined value, called Eps. Therefore, the neighborhood size of the core points is generally bigger than that of the border points. A point p is directly density-reachable from another point q if it belongs to the neighborhood of q and q's neighborhood size is greater than a given constant MinPts. By deriving canonically the previous concept, one can define a generic density-reachability: a point p is density-reachable from q if there exists a chain of points p1,..., pn, where p1 = q, pn = p, and pi+1 is directly density-reachable from pi. The former concepts are then combined to the definition of a cluster D:

  • q,p : if q D and if q is density-reachable from p and p's neighborhood is greater than the MinPts threshold, then p belongs to D.
  • q,p : if q D and p D then there must be a point t such that q and p are directly reachable from t.

According to the above definition, there might be some points not belonging to any of the generated clusters, those points are outliers (noise). In other words, a cluster D is formed by a set of points that respects a certain degree of concentration, which is set by the MinPts and Eps constraints. By adjusting those values, one can find clusters of varying shapes and densities.

In other words, a cluster D is formed by a set of points that respects a certain degree of concentration, which is set by the MinPts and Eps constraints. By adjusting those values, one can and clusters of varying shapes and densities.

Algorithm[edit | edit source]

The DBSCAN algorithm (Algorithm 1) starts by randomly selecting an element P from the database. If P is not a core point, i.e. P has fewer than MinPts neighbors, it will be marked as noise. Otherwise it will be marked as being in the current cluster and the ExpandCluster (Algorithm 2) function will be called. Its purpose is to find all points that are density-reachable from P and are currently being marked as unclassified or noise. Despite being a recursive function, ExpandCluster is implemented without using recursion explicitly. The recursive behaviour is accomplished by using a set whose size varies as new density-reachable points are found. The algorithm ends when all points have been properly classified.

File:DBSCAN Algorithm.png

Finally, after identifying all clusters, one might wonder that a border point might belong to two clusters at the same time. For this matter, the currently implement algorithm will consider the ambiguous points as being part of the cluster which aggregated them firstly.

File:DBSCAN ExpandCluster Algorithm.PNG

DBSCAN on R[edit | edit source]

R is a programming language and software environment for statistical computing. Besides being a widely used tool for statistical analysis, R aggregates several data mining techniques as well. Therefore, it has become a major tool for simple tasks aiming to discover knowledge on databases. R's source code is freely available under the GNU General Public License and has been ported for several platforms other than Unix varieties, like Windows and MacOS. Although R uses primarily a command line interface, several GUIs are available, which increases its user-friendliness. The DBSCAN technique is available on R's fpc package, by Christian Hennig, which implements clustering tasks for fixed point clusters. The DBSCAN implementation offers high-configurability, as it allows choosing several parameters and options values. Moreover, fpc's DBSCAN has a visualization interface, which make it possible to visualize the clustering process iteratively.

The fpc package is, unfortunately, rather slow because it is written mostly in R. A much faster version of DBSCAN exists in the package dbscan.

Installation[edit | edit source]

The fpc package can be installed by using the following command on R's command-line:

install.packages("fpc", dependencies = TRUE)

The above command shall recursively download and install all packages that fpc depend to along with fpc itself.

Use[edit | edit source]

To start using the fpc package, the following command must be invoked:

library('fpc')

DBSCAN Procedure[edit | edit source]

The fpc package allows the user to use the following procedure for applying dbscan technique directly:

dbscan(data, eps, MinPts, scale, method, seeds, showplot, countmode)

Parameters[edit | edit source]

The DBSCAN procedure takes the following parameters:

  • data: The data that will be clustered. It can be a data matrix, a data.frame, dissimilarity matrix or dist-object.
  • eps: Reachability distance (discussed before).
  • MinPts: Reachability minimum number of points (discussed before).
  • scale: Used for scaling the data.
  • method: Confifigures memory usage by constraining performance, there are three options:
    • "raw": treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow).
    • "dist": treats data as distance matrix (relatively fast but memory expensive).
    • "hybrid": expects also raw data, but calculates partial distance matrices (very fast with moderate memory requirements.
  • seeds: Configuration regarding to the inclusion of the isseed-vector in the dbscan-object, can be TRUE or FALSE.
  • showplot: Plotting configuration:
    • 0: no plot
    • 1: plot per iteration
    • 2: plot per subiteration
  • countmode: NULL or vector of point numbers for reporting progress.

The first three parameters are, by far, the most important ones: data is the input data used for clustering; eps and MinPts variables retains the same meaning as before, that is, they are the minimum distance between elements and the necessary quantity of points to form a cluster respectively. The parameter showplot is related to the visualization of the results. Parameter method impacts directly the efficiency of the algorithm and can be used for calibrating memory-performance tradeoff on memory-constrained environments.

Example[edit | edit source]

The example in this section will illustrate the fpc's DBSCAN usage on the database depicted in figure 1. It will also show the clustering mechanism as an iterative visualization process. Firstly, the data to be clustered must be created:

x <- matrix(scan("file.dat",1926), nrow=1926, ncol=2, byrow=TRUE); # Read 1926 points on file "file.dat".
par(bg="grey40");                                                   # Set background to gray.
plot(x);                                                           # Plot original database.
d <- dbscan(x,10,showplot = 2);                                    # Calls DBSCAN routine (eps = 10 and MinPts is set to its default, which is 5).
d;                                                                 # Shows DBSCAN result.
# Notice that setting showplot to 2 will make dbscan show the result iteratively by its sub-iterations.

Output[edit | edit source]

The DBSCAN's output shows information about the clusters that were just created:

dbscan Pts=1926 MinPts=5 eps=20
       0 1 2  3 4   5 6   7 8   9 10 11 12 13 14 15 16 17 18 19
seed   0 8 8 12 8 844 8 312 8 616  8 18  8  8 10 10  8  8 12  8
border 4 0 0  0 0   0 0   0 0   0  0  0  0  0  0  0  0  0  0  0
total  4 8 8 12 8 844 8 312 8 616  8 18  8  8 10 10  8  8 12  8

Each cluster is represented by a single column and the rows shows how many of seed (core), border and total points it has. Notice that the tiny clusters scattered throughout the database are, in fact, formed by more than a single point. Thus, they are considered valid clusters because the MinPts parameter was not set and its default value is too low.

Visualization[edit | edit source]

Points that belong to a cluster are delimited by triangles of different colors. All points belonging to the same cluster are delimited by triangles of the same color. Points that do not belong to any cluster retain the same color as before and, as will be shown, are usually represented by black circles. Moreover, setting showplot parameter to 2 implies that on each subiteration of the DBSCAN algorithm, the routine will show the partial clusters on the screen. Following are five of these moments:

File:DBSCAN Iteration1.PNG File:DBSCAN Iteration2.PNG File:DBSCAN Iteration3.PNG

File:DBSCAN Iteration4.PNG File:DBSCAN Iteration5.PNG

First image shows the points before the DBSCAN routine is called (command plot(x)). The second one shows the beginning of the clustering algorithm, the first major cluster is being formed in the upper center of the database. The fifth image shows the 19 clusters which were obtained when the algorithm had ended.

Case Study: Identifying Celestial Entities[edit | edit source]

The following sections will discuss a common class identification task assisted by DBSCAN and applied on spatial databases: identification of celestial entities on astronomy.

Scenario[edit | edit source]

Identifying celestial objects by capturing the radiation they emit is recurring task on astronomy. An astronomical entity might be itself the source of electromagnetic radiation (i.e. a star) or might reflect it from other sources (i.e. a planet). Typically, an entity will emit radiation on different wavelengths, which, together, will help identifying its class: it might be a planet, an star of any kind or age, or even a galaxy or other exotic entity previously unidentified. The intensity collected from each range of the electromagnetic spectrum is stored on an individual two-dimensional grid (e.g. one for ultraviolet and other for gamma rays). Modern research groups on astronomy are capable of collecting and storing thousands of large-dimension grids, each of them representing a different view (or image) from the sky and using up to several terabytes of storage space.

Besides capturing the electromagnetic intensity emitted from an entity at a given range of the spectrum, the sensors might capture noise caused by the sensors as well as diffuse emission from the atmosphere and the space itself. Traditionally, a method for eliminating diffuse emission and some of the noise is to constrain the relevant intensity by a known threshold. For example, the intensity will only be considered relevant when surpassing 5 rms. A greater problem arises due to the fact that electromagnets waves at different spectrum ranges will yield different interference when traversing a medium. This phenomenon is called Rayleigh Scattering and will ultimately cause the wave to deflect upon being captured by the sensor. Therefore, when considering several images taken from the same region of space, but from different spectrum ranges, the following steps that must done for correctly identifying the celestial entities.

Methodology and Execution[edit | edit source]

Figure 2: The original image. At a wavelength of 4,865 MHz, it shows the center of a galaxy.
Figure 3: Source emissions after pre-processing.
Figure 4: Clustering results for MinPts = 5 and eps = 5.

As an example, consider figure 2, which shows an image taken from the center of a galaxy. This image consists of a pathologic example of noise on an astronomical picture. Following is the methodology for extracting the celestial objects on this image.

Step 1: Pre-Processing[edit | edit source]

Firstly, a pre-processing step must be applied to the removal of noise and diffuse emission. As stated before, this might be accomplished by using a threshold. This step it is shown at figure 3, whereas the original image can be viewed at figure 2 and represents the original image depicting the center of a galaxy. The threshold was set to 50 which means that only pixels whose intensity are less than 50 (and consequently darker) are being considered.

Step 2: DBSCAN Clustering[edit | edit source]

Secondly, the DBSCAN algorithm can be applied on individual pixels to link together a complete emission area at the images for each channel of the electromagnetic spectrum. This is done by setting the eps parameter to some value that will define the minimum area required for a source to be considered. The eps parameter will define the distance metric in terms of pixels. Each of the generated cluster will define a celestial entity. Figure 4 shows the result of this step with eps and MinPts parameter set to 5.

x <- matrix(scan("file.dat",1214), nrow=1214, ncol=2, byrow=TRUE);
dbscan(x,5,showplot = 2);

Results:

dbscan Pts=1214 MinPts=5 eps=5
         0  1  2  3  4 5 6  7 8  9 10 11 12 13 14 15 16
seed     0 28 26 26 26 6 6 18 2 10 18  8 16 16  8 28 20
border 226  0  0  0  0 0 0  0 4  0  0  0  0  0  0  0  0
total  226 28 26 26 26 6 6 18 6 10 18  8 16 16  8 28 20
        17 18 19 20 21 22 23 24 25 26 27 28 29  30 31 32
seed    14 14  8 18  6  6  6 14  6  6  6 14  8 112  6 18
border   0  0  0  0  0  0  0  0  0  0  0  0  0   0  0  0
total   14 14  8 18  6  6  6 14  6  6  6 14  8 112  6 18
        33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
seed     6 20 20 20 16 10  6  6  8 14  8 10 14 10  6 36
border   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2
total    6 20 20 20 16 10  6  6  8 14  8 10 14 10  6 38
       49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
seed   24  6  6 14  6 24 18  6 16  6 14 28 26 26 10 10  8
border  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
total  24  6  6 14  6 24 18  6 16  6 14 28 26 26 10 10  8

Step 3: Multi-spectral Correlation[edit | edit source]

After identifying all clusters, one can apply a multi-espectral correlation process in order to consider the results (generated clusters) from every electromagnetic wavelength. It will not be detailed here, but a common approach would be only considering clusters which has one or more counterparts close enough with respect to some threshold on the other channels of the electrogmagnetic spectrum.

Analysis[edit | edit source]

Figure 4 shows the results after the clustering has been performed by the DBSCAN algorithm with MinPts and eps set to 5. By looking at the results, we can see that many isolated points have not been clustered because MinPts parameter restrict the size of a cluster by a minimum value of elements. This restriction will remove clusters that are too small to be classified as a relevant emission source and consequently are classified as noise. By restricting the MinPts parameter, celestial objects whose intensity is weak (e.g. they might be too far or do not emit strong radiation.) are eliminated from the results. This is generally desirable but can become a problem when emission sources defined by smaller areas must be analysed. Therefore, MinPts value must be carefully chosen since it may drastically modify the results.

Nevertheless, 64 clusters and 224 outliers were found. Most clusters do not have a large number of points, with some exceptions, like the large cluster located at the center of the database, which represents galaxy core which is an area with strong emission. The eps parameter takes an important role as it will define the minimum radius of pixels that represents the same emission source. Being set to 5 implies that points whose euclidian distance is greater than 5 do not belong to the same emission source. Although five is relatively large for the eps parameter, some clusters like those on (300,0) were not clustered together. This is because despite being close to each other they do not represent the same celestial object according to the eps value. Therefore, setting eps plays a role similar to that of MinPts, since it will define noise objects with respect to the distance measure, the same was done by the count measure of the MinPts.

Conclusion[edit | edit source]

The need to automatize the process of class identification of celestial entities is becoming increasingly important as astronomy and astrophysics become prominent areas of research. There is a growing demand for this task as technology evolves ands yields more and larger data samples to be analysed. Otherwise impossible without computational aid, this task have become easier with the assistance of the DBSCAN technique, which allowed larger samples to be analysed along with more precise results.

References[edit | edit source]

  1. Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
  2. FPC Package on CRAN: http://cran.r-project.org/web/packages/fpc/index.html
  3. Manual for the DBSCAN algorithm implemented on FPC at http://bm2.genes.nig.ac.jp/RGM2/R_current/library/fpc/man/dbscan.html
  4. Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications