Data Mining Algorithms In R/Clustering/Hybrid Hierarchical Clustering



Hybrid Hierarchical Clustering is a clustering technique that tries to combine the best characteristics of both types of hierarchical techniques (agglomerative and divisive).


Introduction[edit | edit source]

Clustering can be considered an important unsupervised learning problem, which tries to find similar structures within an unlabeled data collection (JAIN, MURTY, FLYNN, 1999) [1].

These similar structures are data groups, better known as clusters. The data items inside a cluster are similar (or close) to one another and dissimilar to (or far from) items that belong to other clusters. The goal of clustering techniques is to determine the intrinsic grouping in a data set (JAIN, MURTY, FLYNN, 1999) [1].

Clustering Techniques[edit | edit source]

There are several clustering techniques, but none can be considered the absolute best method. Each technique has its own merits and flaws, so it is usually left to the user to determine which clustering method better satisfies their needs.

Hierarchical Clustering[edit | edit source]

Hierarchical clustering basically works by joining the closest clusters until the desired number of clusters is achieved. This kind of hierarchical clustering is named agglomerative because it joins clusters iteratively. There is also divisive hierarchical clustering, which does the reverse process: every data item starts in a single cluster, which is then divided into smaller groups (JAIN, MURTY, FLYNN, 1999)[1].

The distance between clusters can be measured in several ways, and this is what distinguishes the single, average and complete linkage variants of hierarchical clustering.

In single-link clustering, also known as the minimum method, the distance between two clusters is the minimum distance over all pairs of data items, one from each cluster. In complete-link clustering, also known as the maximum method, it is the maximum distance over all such pairs. The clusters found by the complete-link algorithm are usually more compact than the ones found by the single-link algorithm; however, the single-link algorithm is more versatile (JAIN, MURTY, FLYNN, 1999)[1].

In average-link clustering, the distance between two clusters is the average distance over all pairs of data items, one from each cluster. A variation of this method uses the median distance, which is less sensitive to large variations in the data than the average distance.

Many hierarchical clustering algorithms have the appealing property that the nested sequence of clusters can be graphically represented with a tree, called a 'dendrogram' (CHIPMAN, TIBSHIRANI, 2006)[2]. Figure 1 shows a dendrogram example.

Figure 1: A dendrogram obtained using a single-link agglomerative clustering algorithm. Source: Jain, Murty, Flynn (1999)[1].
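
To see the effect of the linkage choice in practice, the following minimal sketch uses base R's dist and hclust functions on a small, randomly generated data set (an assumed example, separate from the data used later in this chapter) and plots the resulting dendrograms side by side:

# A small randomly generated data set (assumed example) clustered with three linkage methods
set.seed(42)
y <- matrix(rnorm(20), ncol = 2)               # 10 points in 2 dimensions
d <- dist(y)                                   # Euclidean distances between all pairs of points
hc_single   <- hclust(d, method = "single")    # minimum (single-link) method
hc_complete <- hclust(d, method = "complete")  # maximum (complete-link) method
hc_average  <- hclust(d, method = "average")   # average-link method
par(mfrow = c(1, 3))                           # show the three dendrograms side by side
plot(hc_single,   main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
plot(hc_average,  main = "Average linkage")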

The greatest disadvantage of hierarchical clustering is its high time complexity of O(n^2 log n) (JAIN, MURTY, FLYNN, 1999)[1]. The hierarchical approach, however, has the advantage of handling clusters with different densities much better than other clustering techniques.

Hybrid Hierarchical Clustering[edit | edit source]

Agglomerative algorithms group data into many small clusters and only a few large ones, which usually makes them good at identifying small clusters but not large ones. Divisive algorithms, however, have the reverse characteristics, making them good at identifying large clusters in general (CHIPMAN, TIBSHIRANI, 2006)[2].

Hybrid hierarchical clustering techniques try to combine the advantages of both agglomerative and divisive techniques (LAAN, POLLARD, 2002; CHIPMAN, TIBSHIRANI, 2006)[3][2]. The way this combination is implemented depends on the chosen algorithm.

This section presents a hybrid hierarchical algorithm that uses the concept of a 'mutual cluster' to combine divisive procedures with information gained from a preliminary agglomerative clustering.

Mutual Cluster[edit | edit source]

A mutual cluster can be defined as a group of data items that are collectively closer to each other than they are to any other data item. The data contained in a mutual cluster should never be separated (CHIPMAN, TIBSHIRANI, 2006)[2].

Basically, a mutual cluster requires that the largest distance between data elements in S be smaller than the smallest distance from an element in S to any data element not in S:

max d(x, x') for x, x' ∈ S  <  min d(x, y) for x ∈ S, y ∉ S

where S is a subset of the data, d is a distance function between two data items, x and x' are elements belonging to S, and y is an element that does not belong to S.
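
As an illustration of this definition, the condition can be checked directly for a candidate subset of rows. This is a minimal sketch, not the hybridHclust package's own code; the helper function is_mutual_cluster is hypothetical:

# Hypothetical helper: does the subset S of row indices satisfy the mutual-cluster condition?
is_mutual_cluster <- function(x, S) {
  D <- as.matrix(dist(x))                              # pairwise distances between rows of x
  inside  <- max(D[S, S])                              # largest distance within the candidate subset
  outside <- min(D[S, setdiff(seq_len(nrow(x)), S)])   # smallest distance from S to any other row
  inside < outside
}

# Usage on a small assumed data set
set.seed(1)
z <- matrix(rnorm(14), ncol = 2)
is_mutual_cluster(z, S = c(1, 2))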

Mutual clusters have an interesting property: a mutual cluster is never broken by agglomerative clustering with any of the single, average or complete linkage approaches (CHIPMAN, TIBSHIRANI, 2006)[2]. For more information on these linkage approaches, see the agglomerative techniques in the hierarchical clustering section above.

This property has several implications. The most obvious is that mutual clusters contain strong clustering information, no matter which linkage approach is used. The property also aids in the interpretation of agglomerative methods: the additional information can help in interpreting mutual clusters or in deciding which clusters to divide.

Another implication is that, since mutual clusters cannot be broken by agglomerative methods, all mutual clusters can be identified by examining the nested clusters produced by such a method.

Algorithm[edit | edit source]

The hybrid hierarchical algorithm using mutual clusters can be described in three steps:

  1. Compute the mutual clusters using an agglomerative technique.
  2. Perform a constrained divisive technique in which each mutual cluster must stay intact. This is accomplished by temporarily replacing a mutual cluster of data elements by their centroid.
  3. Once the divisive clustering is complete, divide each mutual cluster by performing another divisive clustering 'within' each mutual cluster.

The mutual cluster property and its implications described earlier indicate how agglomerative techniques can be used to easily find the mutual clusters in Step 1. For Step 2, any top-down method could be employed; it doesn't even need to be a divisive hierarchical algorithm. In Step 3, an agglomerative method could be used instead. Since mutual clusters are usually small, either a top-down or a bottom-up approach in Step 3 will give similar results; using a top-down approach in both Steps 2 and 3 is simply more convenient.
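
For intuition about Step 2's temporary centroid replacement, here is a minimal sketch. The helper collapse_mutual_clusters is hypothetical and is not the package's implementation; it assumes mutual_clusters is a list of row-index vectors, one per maximal mutual cluster:

# Hypothetical sketch of Step 2's centroid replacement (the real work is done inside hybridHclust)
collapse_mutual_clusters <- function(x, mutual_clusters) {
  singletons <- setdiff(seq_len(nrow(x)), unlist(mutual_clusters))      # rows outside every mutual cluster
  centroids  <- t(sapply(mutual_clusters,
                         function(S) colMeans(x[S, , drop = FALSE])))   # one centroid row per mutual cluster
  rbind(x[singletons, , drop = FALSE], centroids)                       # each mutual cluster becomes a single point
}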

Implementation in R[edit | edit source]

Package: hybridHclust
Description: hybrid hierarchical clustering via mutual clusters
Version: 1.0-3
Depends: cluster
Published: 2008-04-08
Author: Hugh Chipman, Rob Tibshirani, with tsvq code originally from Trevor Hastie
Maintainer: Hugh Chipman <hugh.chipman@acadiau.ca>
License: GPL-2
URL: http://ace.acadiau.ca/math/chipmanh/hybridHclust
Repository: CRAN
In views: Cluster, Multivariate

To download this package, visit the CRAN Mirrors page and select the mirror closest to your region. Once there, select 'Packages' and search for 'hybridHclust'. Alternatively, since the package is on CRAN, it can be installed directly from within R with install.packages("hybridHclust").

Details:

The hybridHclust package uses the mutual cluster concept to construct a clustering in which mutual clusters are never broken. This is achieved by temporarily "fusing" together all points in a mutual cluster so that they have equal coordinates. The resultant top-down clusterings are then "stitched" together to form a single top-down clustering (CHIPMAN, TIBSHIRANI, 2006; 2008)[2][4].

"Only maximal mutual clusters are constrained to not be broken. Thus if points A, B, C, D are a mutual cluster and points A, B are also a mutual cluster, only the four points will be forbidden from being broken" (CHIPMAN, TIBSHIRANI, 2008)[4].

This package uses squared Euclidean distance between rows of x as a distance measurement. In some instances, a desirable distance measure is d(x1,x2)=1-cor(x1,x2), if x1 and x2 are 2 rows of the matrix x. This correlation-based distance is equivalent to squared Euclidean distance once rows have been scaled to have mean 0 and standard deviation 1. This can be accomplished by pre-processing the data matrix before calling the hybrid clustering algorithm (CHIPMAN, TIBSHIRANI, 2008)[4].
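
A minimal sketch of that pre-processing, assuming x is the data matrix to be clustered: each row is standardized to mean 0 and standard deviation 1 before calling hybridHclust, so that the squared Euclidean distance between rows corresponds to the correlation-based distance.

# Standardize every row to mean 0 and standard deviation 1, then cluster
# (scale() operates on columns, so the matrix is transposed before and after)
x_scaled <- t(scale(t(x)))
hyb_cor  <- hybridHclust(x_scaled)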

The main function for performing hybrid clustering is called hybridHclust.

Usage
hybridHclust(x, themc=NULL, trace=FALSE)

Arguments
x - A data matrix whose rows are to be clustered
themc - An object representing the mutual clusters in x, typically generated by the mutualCluster function. If it is not provided, it will be calculated.
trace - Indicates whether internal steps should be printed as they execute.

Return
A dendrogram in hclust format

Visualization[edit | edit source]

The hybridHclust function returns a dendrogram representing the computed clusters. The dendrogram can be viewed using the 'plot' function.

The mutual clusters can be calculated using the mutualCluster function. The function has a 'plot' parameter that automatically plots the mutual clusters' dendrogram. To print the mutual clusters, it is necessary to use the 'print.mutualCluster' function.

Please remember to load the hybridHclust R package before trying any of the example codes.

Example[edit | edit source]

# Load the hybridHclust package
library(hybridHclust)
# Arrange the plotting device to show 3 graphs in one row
par(mfrow=c(1,3))
# Example data: 7 points in 2 dimensions
x <- cbind(c(-1.4806,1.5772,-0.9567,-0.92,-1.9976,-0.2723,-0.3153),
           c(-0.6283,-0.1065,0.428,-0.7777,-1.2939,-0.7796,0.012))
# Plot the example data
plot(x, pch = as.character(1:nrow(x)), asp = 1)
# Calculate the mutual clusters (plot=TRUE also draws their dendrogram)
mc1 <- mutualCluster(x, plot=TRUE)
print.mutualCluster(mc1) # print the mutual clusters
dist(x) # distances between the data items, so you can verify that the mutual clusters are correct
# Calculate the hybrid hierarchical clustering
hyb1 <- hybridHclust(x)
# Plot the hybrid clustering dendrogram
plot(hyb1)

Case Study[edit | edit source]

To better illustrate the clustering technique, a simple case study is presented below.

The "Big Five", in contemporary psychology, consists of five important personality factors. These five factors have been found to contain and subsume more-or-less all known personality traits within their five domains and to represent the basic structure behind all personality traits.

The Big Five factors and their constituent traits can be summarized as follows:

  • Openness - appreciation for art, emotion, adventure, unusual ideas, curiosity, and variety of experience.
  • Conscientiousness - a tendency to show self-discipline, act dutifully, and aim for achievement; planned rather than spontaneous behavior.
  • Extraversion - energy, positive emotions, surgency, and the tendency to seek stimulation in the company of others.
  • Agreeableness - a tendency to be compassionate and cooperative rather than suspicious and antagonistic towards others.
  • Neuroticism - a tendency to experience unpleasant emotions easily, such as anger, anxiety, depression, or vulnerability.

The most common way for a psychologist to measure these factors in someone's personality is to apply a 'test' with descriptive sentences. The scores for these traits are frequently presented as percentiles.

Scenario/Situation/Development[edit | edit source]

Let's assume the Big Five test is applied to a population and, in this case, comes with a socio-economic questionnaire for research purposes. The results could then be clustered into groups based on the percentile scores of each factor and the socio-economic variable, so that psychologists may better analyze the results and their possible effects on the population.

Input Data[edit | edit source]

The input data is six-dimensional numerical data consisting of the five factors' percentile scores and the person's family income. The input data was randomly generated using a normal distribution on each dimension. Randomly generated data saves time and avoids the legal issues of using a person's potentially privileged information (even anonymously).

The columns O, C, E, A, N represent respectively the Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism factors. The values in all of these columns are percentile results from the Big Five test. The percentiles for each factor vary from a minimum of 5 up to a maximum of 95 in steps of 5.

The column I represents the person's family income; the value is the number of minimum wages that the family earns.
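
For reference, data with the same structure could be generated along the following lines. This is a hypothetical sketch with assumed distribution parameters, not the exact code used to produce the file linked below:

# Hypothetical generation of similar input data (means and standard deviations are assumptions)
set.seed(123)
n <- 100
draw_percentiles <- function(n) {
  raw <- rnorm(n, mean = 50, sd = 25)        # normal draw for one factor
  pmax(5, pmin(95, round(raw / 5) * 5))      # round to steps of 5 and clamp to [5, 95]
}
big5   <- replicate(5, draw_percentiles(n))             # the five factor columns
income <- pmax(1, round(rnorm(n, mean = 8, sd = 4)))    # family income in minimum wages
dat    <- data.frame(big5, income)
colnames(dat) <- c("O", "C", "E", "A", "N", "I")
write.csv(dat, "hybridHclust_case_study_input.csv", row.names = FALSE)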

A csv file with the input data can be found here: HybridHClust Case Study CSV Input File

  O C E A N I
1. 10 30 40 55 45 12
2. 50 85 25 30 40 6
3. 95 90 80 50 5 10
4. 20 65 20 15 50 5
5. 50 25 65 10 95 13
6. 40 85 25 10 65 10
7. 55 50 90 90 75 9
8. 35 80 5 40 35 10
9. 30 65 85 35 80 13
10. 20 45 30 50 25 13
11. 40 30 65 10 25 14
12. 50 75 85 50 10 6
13. 65 75 50 5 50 9
14. 90 30 50 35 95 10
15. 60 80 10 75 50 6
16. 25 85 25 50 20 4
17. 10 30 90 50 35 13
18. 75 10 85 55 5 10
19. 65 65 20 50 15 16
20. 60 70 60 60 25 9
21. 35 70 30 40 45 6
22. 55 5 90 70 70 13
23. 15 20 60 40 60 10
24. 20 40 75 70 15 9
25. 30 95 25 65 20 7
26. 90 75 30 20 70 2
27. 95 20 65 80 45 15
28. 45 50 85 70 65 4
29. 60 90 70 25 5 14
30. 70 45 65 50 40 9
31. 50 50 65 65 45 7
32. 50 95 15 35 60 5
33. 70 15 90 80 50 9
34. 85 30 20 30 80 2
35. 70 40 45 85 30 2
36. 75 55 80 85 25 4
37. 20 55 25 35 90 4
38. 85 10 5 55 80 15
39. 5 15 75 35 25 10
40. 75 50 45 60 40 16
41. 65 85 35 90 10 3
42. 25 50 20 15 65 12
43. 15 15 50 75 80 3
44. 30 95 30 45 75 14
45. 80 50 85 20 25 5
46. 35 35 60 25 35 8
47. 90 55 50 15 35 5
48. 35 65 95 35 20 7
49. 30 70 60 25 45 15
50. 30 55 50 30 65 9
51. 35 90 70 20 20 6
52. 60 35 95 10 15 9
53. 25 60 35 90 25 10
54. 10 5 10 45 20 6
55. 80 5 15 75 90 14
56. 20 40 80 35 15 5
57. 80 60 95 65 70 4
58. 65 30 75 30 65 15
59. 15 45 55 50 70 6
60. 20 5 55 55 35 1
61. 40 10 75 70 30 3
62. 20 90 65 80 75 10
63. 95 40 40 20 5 4
64. 45 45 75 25 45 11
65. 80 95 50 45 10 14
66. 60 25 50 70 80 6
67. 85 60 45 95 55 9
68. 95 10 70 60 20 12
69. 65 75 25 50 40 15
70. 10 50 35 10 50 10
71. 50 85 85 40 65 8
72. 45 55 10 10 70 1
73. 5 50 95 55 90 3
74. 35 15 35 15 50 15
75. 95 40 75 50 50 4
76. 50 40 95 65 5 4
77. 40 80 50 95 5 14
78. 55 50 70 50 15 9
79. 90 45 55 30 65 1
80. 20 60 95 95 50 15
81. 85 50 95 95 90 11
82. 5 55 55 25 95 10
83. 15 15 40 20 15 9
84. 80 50 50 50 35 2
85. 65 90 65 50 35 3
86. 75 70 85 20 30 7
87. 65 10 85 10 50 1
88. 75 10 85 15 5 10
89. 50 25 15 20 30 6
90. 15 30 60 10 75 7
91. 30 15 45 25 25 11
92. 75 95 5 95 30 10
93. 55 15 70 30 40 12
94. 10 15 50 30 65 5
95. 5 50 50 60 25 7
96. 75 5 75 65 90 15
97. 80 25 90 90 50 6
98. 60 15 10 75 80 9
99. 95 15 95 95 5 7
100. 55 30 70 85 90 5

Execution[edit | edit source]

# Load the hybridHclust package
library(hybridHclust)
# Read the data file into a numeric matrix
x <- as.matrix(read.csv("hybridHclust_case_study_input.csv"))
# Calculate the mutual clusters
mc <- mutualCluster(x)
# Calculate the hybrid hierarchical clustering
hyb <- hybridHclust(x, mc)
# Plot the hybrid clustering dendrogram
plot(hyb)

Output[edit | edit source]

Plotting the clustering result shows a dendrogram of the clustered data, from which it is possible to decide how many clusters to use. The red line in the figure below demonstrates this choice: in this case, the clusters below the red line will be considered.


Figure 2: Hybrid clustering dendrogram of the above case study data
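
A minimal sketch of that choice in code; the number of clusters k below is an assumption read off the dendrogram, not a value fixed by the package:

# Cut the dendrogram at a chosen number of clusters and highlight them on the plot
plot(hyb)
k <- 8                                   # assumed number of clusters chosen from the dendrogram
groups <- cutree(hyb, k = k)             # cluster membership for every row of x
rect.hclust(hyb, k = k, border = "red")  # draw the chosen clusters on the dendrogram
table(groups)                            # size of each cluster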

Analysis[edit | edit source]

Because the case study uses multi-dimensional data, it is harder to visualize what the clusters have in common. However, when a good number of clusters is chosen, it is possible to see their similarities and better analyze the results.

Consider the cluster formed by observations 50, 59, 23, 94, 43, 82, 90, 5, 4, 42, 70, 72 and 37. It has mid to high values of neuroticism [50, 95], mid to low values of openness [5, 50], and low to mid-high values of conscientiousness [15, 65]; these ranges can be computed directly from the cluster membership, as shown in the sketch after the table below.

  O C E A N I
4. 20 65 20 15 50 5
5. 50 25 65 10 95 13
23. 15 20 60 40 60 10
37. 20 55 25 35 90 4
42. 25 50 20 15 65 12
43. 15 15 50 75 80 3
50. 30 55 50 30 65 9
59. 15 45 55 50 70 6
70. 10 50 35 10 50 10
72. 45 55 10 10 70 1
82. 5 55 55 25 95 10
90. 15 30 60 10 75 7
94. 10 15 50 30 65 5
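
A minimal sketch of computing these per-factor ranges, assuming x is the matrix read in the Execution section and taking the member list from the cluster above:

# Value range of each factor within the first cluster discussed above
members <- c(50, 59, 23, 94, 43, 82, 90, 5, 4, 42, 70, 72, 37)
apply(x[members, ], 2, range)   # minimum and maximum of O, C, E, A, N, I in this cluster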


Another interesting cluster is the one formed by observations 74, 89, 83, 91, 54, 10, 95 and 1. It has mid to low values of openness [5, 50], mid to low values of conscientiousness [5, 50], mid to low values of extraversion [10, 50], and mid to low values of neuroticism [15, 50].

  O C E A N I
1. 10 30 40 55 45 12
10. 20 45 30 50 25 13
54. 10 5 10 45 20 6
74. 35 15 35 15 50 15
83. 15 15 40 20 15 9
89. 50 25 15 20 30 6
91. 30 15 45 25 25 11
95. 5 50 50 60 25 7

If the value ranges in a cluster indicated a risk behavior or pathology, the psychologist would be able, for example, to pay more attention to anyone in that group. Or, with more accurate socio-economic variables, it would be possible to look for correlations between the Big Five factors and other socio-economic factors.

References[edit | edit source]

  1. Jain, A. K., Murty, M. N., Flynn, P. J. (1999) "Data Clustering: A Review". ACM Computing Surveys (CSUR), 31(3), p. 264-323.
  2. Chipman, H., Tibshirani, R. (2006) "Hybrid hierarchical clustering with applications to microarray data". Biostatistics, 7(2), p. 286-301, Apr. 2006.
  3. Laan, M., Pollard, K. (2002) "A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap". Journal of Statistical Planning and Inference, 117(2), p. 275-303, Dec. 2002.
  4. Chipman, H., Tibshirani, R. (2008) "Package 'hybridHclust'", Apr. 2008.