# Introduction

Clustering is an unsupervised machine learning algorithm that groups entities, from a dataset, that have high degree of similarity in the same cluster. Nowadays lots of areas are using these kinds of algorithms to separate datasets into groups in an automated way, and still have a good quality result.

The clustering process is not a universal process because there are many groups of datasets, for some of them the kind of metric used is relevant, for others the entities that represent each cluster are more interesting. Like dataset groups there are many clustering algorithms each one tries to take advantage of data type, so this way each one of them is more suited for a specific kind of data.

This section will explain a little more about the Partitioning Around Medoids (PAM) Algorithm, showing how the algorithm works, what are its parameters and what they mean, an example of a dataset, how to execute the algorithm, and the result of that execution with the dataset as input.

# The Partitioning Around Medoids (PAM) Algorithm

## Algorithm

The PAM algorithm was developed by Leonard Kaufman and Peter J. Rousseeuw, and this algorithm is very similar to K-means, mostly because both are partitional algorithms, in other words, both break the dataset into groups (clusters), and both work by trying to minimize the error, but PAM works with Medoids, that are an entity of the dataset that represent the group in which it is inserted, and K-means works with Centroids, that are artificially created entity that represent its cluster.

The PAM algorithm partitions the dataset of n objects into k clusters, where both the dataset and the number k is an input of the algorithm. This algorithm works with a matrix of dissimilarity, whose goal is to minimize the overall dissimilarity between the representants of each cluster and its members. The algorithm uses the following model to solve the problem: ${\displaystyle F(x)={\text{minimize}}\sum _{i=1}^{n}\sum _{j=1}^{n}d(i,j)z_{ij}}$

Subject to:

${\displaystyle {\begin{array}{lll}\sum _{i=1}^{n}z_{ij}=1,j=1,2,..,n\\z_{ij}\leq y_{i},i,j=1,2,..,n\\\sum _{i=1}^{n}y_{i}=k,k={\text{number of clusters}}\\y_{i},z_{ij}\in \{0,1\},i,j=1,2,..,n\end{array}}}$

Where F(x) is the main function to minimize, d(i,j) is the dissimilarity measurement between the entities i and j, and zij is a variable that ensures that only the dissimilarity between entities from the same cluster will be computed in the main function. The others expressions are constraints that have the following functions: (1.) ensures that every single entity is assigned to one cluster and only one cluster, (2.) ensures that the entity is assigned to its medoid that represents the cluster, (3.) ensures that there are exactly k clusters and (4.) lets the decision variables assume just the values of 0 or 1.

The PAM algorithm can work over two kinds of input, the first is the matrix representing every entity and the values of its variables, and the second is the dissimilarity matrix directly, in the latter the user can provide the dissimilarity directly as an input to the algorithm, instead of the data matrix representing the entities. Either way the algorithm reaches a solution to the problem, in a general analysis the algorithm proceeds this way:

Build phase:

1. Choose k entities to become the medoids, or in case these entities were provided use them as the medoids;
2. Calculate the dissimilarity matrix if it was not informed;
3. Assign every entity to its closest medoid;

Swap phase:

4. For each cluster search if any of the entities of the cluster lower the average dissimilarity coefficient, if it does select the entity that lowers this coefficient the most as the medoid for this cluster;
5. If at least one medoid has changed go to (3), else end the algorithm.

## Implementation

THIS IS NOT DESCRIBING THE "PAM" ALGORITHM. This is a k-means variant using medoids, but not PAM with its charateristic BUILD and SWAP phases.

The pseudocode of PAM algorithm is shown below:

Algorithm 1: PAM Algorithm Input: E = {e1,e2,...en} (dataset to be clustered or matrix of dissimilarity)

k (number of clusters)
metric (kind of metric to use on dissimilarity matrix)
diss (flag indicating that E is the matrix of dissimilarity or not)

Output: M = {m1,m2,...,mk} (vector of clusters medoids)

L = {l(e) | e = 1,2,...,n} (set of cluster labels of E)

foreach mi € M do

mi ← ej € E; (e.g. random selection)

end if diss ≠ true

Dissimilarity ← CalculateDissimilarityMatrix(E, metric);

else

Dissimilarity ← E;

end repeat

foreach ei € E do
l(ei) ← argminDissimilarity(ei, Dissimilarity, M);
end
changed ← false;
foreach mi € M do
Mtmp ← SelectBestClusterMedoids(E, Dissimilarity, L);
end
if Mtmp ≠ M
M ← Mtmp;
changed ← true;
end

until changed = true;

In the R programming language, the PAM algorithm is available in the cluster package and can be called by the following command:

pam(x, k, diss, metric, medoids, stand, cluster.only, do.swap, keep.diss, keep.data,
trace.lev)

Where the parameters are:

x: numerical data matrix representing the dataset entities, or can be the dissimilarity matrix, it depends on the value of the diss parameter. In case x is a data matrix each row is an entity and each column is an variable, and in this case missing values are allowed as long as every pair of entities has at least one case not missing. In case x is a dissimilarity matrix it is not allowed to have missing values.

k: number of clusters that the dataset will be partitioned where 0 < k < n, where n is the number of entities.

diss: logical flag, if it is TRUE x is used as the dissimilarity matrix, if it is FALSE, then x will be considered as a data matrix.

metric: an string specifying each of the two metrics will be used to calculate the dissimilarity matrix, the metric variable can be “euclidean” to use the Euclidean distance, or can be “manhattan” to use the Manhattan distance.

stand: logical flag, if it is TRUE then the measurements in x will be standardized before calculating the dissimilarities. Measurements are standardized for each column, by subtracting the column's mean value and dividing by the variable's mean absolute deviation. If x is a dissimilarity matrix then this parameter is ignored.

cluster.only: logical flag, if it is TRUE, only the clustering will be computed and returned.

do.swap: logical flag, indicates if the swap phase should happen (TRUE) or not (FALSE).

keep.diss: logical flag indicating if the dissimilarities should (TRUE) or not (FALSE) be kept in the result.

keep.data: logical flag indicating if the input data x should (TRUE) or not (FALSE) be kept in the result.

trace.lev: an numeric parameters specifying a trace level for printing diagnostics during the build and swap phase of the algorithm. Default 0 does not print anything.

The PAM algorithm return a pam object that contains the information about the result of the execution of the algorithm.

## Visualization

In R there are two ways of seeing the result of the PAM algorithm, the first one is to print the object that the algorithm returns, and the second one is to plot the data from the object creating a graphic of the result. The first way of visualizing the information is a bit more complicated to understand but it gives a more complete and accurate information, but the second way is a lot more easy to understand and lets the user have a better view of the information and to add information that would be relevant for him.

To view the data of the result of the execution of PAM algorithm in a textual way there are two ways one more simple that gives a more summarized information about the object, and another one that gives you a more complete information about it. In the two commands listed below the first one prints the information in a summarized way, while the second one prints it in a more complete way.

print (result)
summary (result)

The other way of visualizing the data from the result of the execution of the algorithm is using graphics and that can be done by using the following command:

plot (result)

Example: To show an example of use of the algorithm and a result from its execution a simple dataset with few entities and few dimension was used, as it is shown in the following table:

Table 1: Simple dataset

Object Attribute x Attribute y
1 1 1
2 2 3
3 1 2
4 2 2
5 10 4
6 11 5
7 10 6
8 12 5
9 11 6

As we can see the data is separated in two clusters, so we will use an k = 2. The PAM algorithm can be executed as follows:

#load the table from a file

#execute the pam algorithm with the dataset created for the example
result <- pam(x, 2, FALSE, "euclidean")

#print the results data in the screen
summary(result)

#plot a graphic showing the clusters and the medoids of each cluster
plot(result$data, col = result$clustering)
points(result$medoids, col = 1:2, pch = 4) Printing the result form the execution gives you: Medoids: ID x y 4 4 2 2 6 6 11 5 Clustering vector: 1 2 3 4 5 6 7 8 9 1 1 1 1 2 2 2 2 2 Objective function: build swap 1.255618 0.915849 Numerical information per cluster: size max_diss av_diss diameter separation [1,] 4 1.414214 0.8535534 2.236068 8.062258 [2,] 5 1.414214 0.9656854 2.236068 8.062258 Isolated clusters: L-clusters: character(0) L*-clusters: [1] 1 2 Silhouette plot information: cluster neighbor sil_width 3 1 2 0.8898942 4 1 2 0.8788422 1 1 2 0.8549629 2 1 2 0.8297000 6 2 1 0.8790384 9 2 1 0.8631441 8 2 1 0.8425790 7 2 1 0.8232848 5 2 1 0.7747713 Average silhouette width per cluster: [1] 0.8633498 0.8365635 Average silhouette width of total data set: [1] 0.8484685 36 dissimilarities, summarized : Min. 1st Qu. Median Mean 3rd Qu. Max. 1.0000 1.4142 8.3951 6.1559 9.9362 11.7050 Metric : euclidean Number of objects : 9 Available components: [1] "medoids" "id.med" "clustering" "objective" "isolation" "clusinfo" "silinfo" "diss" "call" [10] "data" While plotting gives you: ## Case Study In this section we will see a case study using PAM. ### Scenario In this case study a part of the database iris available in the R package datasets was used. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Using the data that this dataset provides us, it is natural to think of verifying if the flowers of each one of three species of iris are really similar to the others from the same species, so in this case study length and width of both petal and sepal will be used to cluster the dataset into 3 groups and then verify if the clusters really match with the flowers species. The dataset that was used into this case study consist of the following columns: • Flower: An id of the flower; • Sepal.Length: A numeric value of the length of the sepal in centimeters; • Sepal.Width: A numeric value of the width of the sepal in centimeters; • Petal.Length: A numeric value of the length of the petal in centimeters; • Petal.Width: A numeric value of the width of the petal in centimeters; • Species: A text identifying the species of the flower. ### Input Data The input data is a table consisting of 50% (75 entities) of the original iris dataset that have 150 flowers and 5 attributes each. So the dataset used in this case study is represented by the following table: Table 2: Sample from iris dataset Flower Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5.0 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa 11 5.4 3.7 1.5 0.2 setosa 12 4.8 3.4 1.6 0.2 setosa 13 4.8 3.0 1.4 0.1 setosa 14 4.3 3.0 1.1 0.1 setosa 15 5.8 4.0 1.2 0.2 setosa 16 5.7 4.4 1.5 0.4 setosa 17 5.4 3.9 1.3 0.4 setosa 18 5.1 3.5 1.4 0.3 setosa 19 5.7 3.8 1.7 0.3 setosa 20 5.1 3.8 1.5 0.3 setosa 21 5.4 3.4 1.7 0.2 setosa 22 5.1 3.7 1.5 0.4 setosa 23 4.6 3.6 1.0 0.2 setosa 24 5.1 3.3 1.7 0.5 setosa 25 4.8 3.4 1.9 0.2 setosa 51 7.0 3.2 4.7 1.4 versicolor 52 6.4 3.2 4.5 1.5 versicolor 53 6.9 3.1 4.9 1.5 versicolor 54 5.5 2.3 4.0 1.3 versicolor 55 6.5 2.8 4.6 1.5 versicolor 56 5.7 2.8 4.5 1.3 versicolor 57 6.3 3.3 4.7 1.6 versicolor 58 4.9 2.4 3.3 1.0 versicolor 59 6.6 2.9 4.6 1.3 versicolor 60 5.2 2.7 3.9 1.4 versicolor 61 5.0 2.0 3.5 1.0 versicolor 62 5.9 3.0 4.2 1.5 versicolor 63 6.0 2.2 4.0 1.0 versicolor 64 6.1 2.9 4.7 1.4 versicolor 65 5.6 2.9 3.6 1.3 versicolor 66 6.7 3.1 4.4 1.4 versicolor 67 5.6 3.0 4.5 1.5 versicolor 68 5.8 2.7 4.1 1.0 versicolor 69 6.2 2.2 4.5 1.5 versicolor 70 5.6 2.5 3.9 1.1 versicolor 71 5.9 3.2 4.8 1.8 versicolor 72 6.1 2.8 4.0 1.3 versicolor 73 6.3 2.5 4.9 1.5 versicolor 74 6.1 2.8 4.7 1.2 versicolor 75 6.4 2.9 4.3 1.3 versicolor 101 6.3 3.3 6.0 2.5 virginica 102 5.8 2.7 5.1 1.9 virginica 103 7.1 3.0 5.9 2.1 virginica 104 6.3 2.9 5.6 1.8 virginica 105 6.5 3.0 5.8 2.2 virginica 106 7.6 3.0 6.6 2.1 virginica 107 4.9 2.5 4.5 1.7 virginica 108 7.3 2.9 6.3 1.8 virginica 109 6.7 2.5 5.8 1.8 virginica 110 7.2 3.6 6.1 2.5 virginica 111 6.5 3.2 5.1 2.0 virginica 112 6.4 2.7 5.3 1.9 virginica 113 6.8 3.0 5.5 2.1 virginica 114 5.7 2.5 5.0 2.0 virginica 115 5.8 2.8 5.1 2.4 virginica 116 6.4 3.2 5.3 2.3 virginica 117 6.5 3.0 5.5 1.8 virginica 118 7.7 3.8 6.7 2.2 virginica 119 7.7 2.6 6.9 2.3 virginica 120 6.0 2.2 5.0 1.5 virginica 121 6.9 3.2 5.7 2.3 virginica 122 5.6 2.8 4.9 2.0 virginica 123 7.7 2.8 6.7 2.0 virginica 124 6.3 2.7 4.9 1.8 virginica 125 6.7 3.3 5.7 2.1 virginica ### Execution The process was done as follows: #import data data <- read.table(“sampleiris.txt”) #execution result <- pam(data[1:4], 3, FALSE, “euclidean”) #print results summary(result) #plot clusters plot (data, col = result$clustering)
#add the medoids to the plot
points(result\$medoids, col = 1:3, pch = 4)

### Output

The following data was printed as result of the execution:

Medoids:
ID Sepal.Length Sepal.Width Petal.Length Petal.Width
8    8          5.0         3.4          1.5         0.2
64  39          6.1         2.9          4.7         1.4
103 53          7.1         3.0          5.9         2.1
Clustering vector:
1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
23  24  25  51  52  53  54  55  56
1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
1   1   1   2   2   2   2   2   2
57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75 101 102 103
104 105 106 107 108 109 110 111 112
2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2   3
3   3   3   2   3   3   3   2   2
113 114 115 116 117 118 119 120 121 122 123 124 125
3   2   2   3   3   3   3   2   3   2   3   2   3
Objective function:
build      swap
0.7148339 0.6990539

Numerical information per cluster:
size max_diss   av_diss diameter separation
[1,]   25 1.236932 0.5137400 2.042058  1.9000000
[2,]   34 1.951922 0.8085343 2.727636  0.3741657
[3,]   16 1.284523 0.7559609 2.147091  0.3741657

Isolated clusters:
L-clusters: [1] 1
L*-clusters: character(0)

Silhouette plot information:
cluster neighbor   sil_width
1         1        2  0.84941732
5         1        2  0.84830238
8         1        2  0.84812593
18        1        2  0.84784555
12        1        2  0.83221128
22        1        2  0.82890349
20        1        2  0.82456328
3         1        2  0.82337894
7         1        2  0.81910409
10        1        2  0.81662688
11        1        2  0.80769429
2         1        2  0.80592613
13        1        2  0.80278163
4         1        2  0.79810574
23        1        2  0.79482977
24        1        2  0.78999596
17        1        2  0.78539723
21        1        2  0.78454015
25        1        2  0.77452963
6         1        2  0.75995941
9         1        2  0.74605493
14        1        2  0.74277337
19        1        2  0.72082914
15        1        2  0.71581750
16        1        2  0.66155611
68        2        3  0.60036142
56        2        3  0.59753885
62        2        3  0.59698924
72        2        3  0.59691421
70        2        3  0.59514179
54        2        3  0.58507022
67        2        3  0.56989428
60        2        1  0.56350914
63        2        3  0.55592514
75        2        3  0.54720666
74        2        3  0.53971473
64        2        3  0.53757677
69        2        3  0.51098390
65        2        1  0.50762488
107       2        3  0.48295375
55        2        3  0.46851074
52        2        3  0.46827948
59        2        3  0.44164146
66        2        3  0.42147865
71        2        3  0.41421605
73        2        3  0.41282512
122       2        3  0.40891392
120       2        3  0.40207904
57        2        3  0.39510378
114       2        3  0.37176468
124       2        3  0.34854822
102       2        3  0.33532624
61        2        1  0.32662688
58        2        1  0.20142024
51        2        3  0.19024422
115       2        3  0.16320750
53        2        3  0.11554863
112       2        3 -0.07433144
111       2        3 -0.07748205
103       3        2  0.59622203
106       3        2  0.59241159
108       3        2  0.58027197
110       3        2  0.56716967
123       3        2  0.56182697
121       3        2  0.55568135
119       3        2  0.53242285
118       3        2  0.52551154
125       3        2  0.51206488
105       3        2  0.49243542
101       3        2  0.45749953
113       3        2  0.44409513
109       3        2  0.37181492
117       3        2  0.26375026
116       3        2  0.21777715
104       3        2  0.21412781
Average silhouette width per cluster:
[1] 0.7931708 0.4153331 0.4678177
Average silhouette width of total data set:
[1] 0.5524757

2775 dissimilarities, summarized :
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.1000  1.1136  2.5080  2.6329  3.9006  7.0852
Metric :  euclidean
Number of objects : 75

Available components:
[1] "medoids"    "id.med"     "clustering" "objective"  "isolation"  "clusinfo"
"silinfo"    "diss"       "call"
[10] "data"

And the following graphic was generated as well:

### Analysis

After the execution of the algorithm and the analysis of the data it was possible to tell that the clusters were well grouped and correlated with the species of each flower. In the data there was a total of 75 elements, 25 from Setosa species, 25 from Versicolor species and 25 from Virginica species, and the algorithm clustered the elements from Setosa as cluster 1, the ones from Versicolor as cluster 2 and the ones from Virginica as cluster 3. After verifying the results we find that out of 75 elements, 66 were correctly clustered, giving an error margin of 12%—which is a very good result.

# References

1. The R Development Core Team, R: A Language and Environment for Statistical Computing.
2. Kaufman, L., Rousseeuw, P. J., Clustering by Means of Medoids.