Data Mining Algorithms In R/Clustering/K-Cores

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Introduction

[edit | edit source]

In social networks analysis one of the major concerns is identification of cohesive subgroups os actores[check spelling] within a network. Friendship relation, publications citation, and many other more. Many studies and researches are focused on social network analysis, including in data mining. It is really important to find patterns in behavior of large online social networks, so the firms behind are able to create better mechanism to handle all that information with lower cost.

Online services such as Orkut, Facebook, Twitter and so on, have millions of users using their services simultaneously and interacting with others. Even in different services, the behavior of the network is similar.

People tend to interact in the same way as they do in real life, in a structure called “small world”, where people in a social network can reach any other person with less than seven steps. Such behavior can be studied to prevent disease propagation or to predict how fast an information can flow in society.

Several notions were introduced to formally describe cohesive groups: cliques, n–cliques, n–clans, n–clubs, k–plexes, k–cores, lambda sets, . . . For most of them it turns out that they are algorithmically difficult, classified as NP hard. However, for cores very efficient algorithm exists.

The Technique

[edit | edit source]

There are several tasks when handling with social networks. There are some of them listed above.

Link-based object classification – We can classify the object based on its links.

Object type prediction – We can predict the object type based on the objects linked to it.

Link type prediction – Here we want to predict the link type based on the objects linked by it.

Predicting link existence – Now we don't want to know nothing about the link, but it's existence. How to predict when a link exists between two objects? That's the point here.

Link cardinality estimation – There are several ways to estimate the cardinality of a link, such as counting the number of links of an object, or counting the numbers of smallest paths that pass through an object.

Object reconciliation – The task is to check whether two objects are the same, based on the links and attributes. Remove duplicate instances are useful in many applications.

Group detection – A clustering task. Here we want to know when a group of objects belong to the same group.

Subgraph detection - Subgraph identification is to find characteristic subgraphs in a network. This is a form of search in graphs.

Metadata mining – Metadata is data about data.

The technique that will be explained in this text is about group detection in social networks, based on the degree of each node in the network.

The Algorithm

[edit | edit source]

K-cores in graph theory were introduced by Seidman in 1983 and by Bollobas in 1984 as a method of (destructively) simplifying graph topology to aid in analysis and visualization. They have been more recently defined as the following by Batagelj et al.: given a graph G = {V,E}> with vertices set V and edges set E, the k-core is computed by pruning all the vertices (with their respective edges) with degree less than k. That means that if a vertex u has degree du, and it has n neighbors with degree less than k, then u's degree becomes du − n, and it will be also pruned if k > du − n.

This operation can be useful to filter or to study some properties of the graphs. For instance, when you compute the 2-core of graph G, you are cutting all the vertices which are in a tree part of graph. (A tree is a graph with no loops).

Note that, there is a refinement (possibly empty) of the k-core of a graph G, for which there exists at least k paths between any two pairs of vertices of G. This concept is called structural cohesion in sociology[1] and vertex-connectivity in graph theory, and is equivalent via the Menger theorem to a k-component, a maximal graph that cannot be disconnected by fewer than k nodes.

Cores

[edit | edit source]

The notion of core is presented in Butts (2010) as following:

Let G = (V, E) be a graph, and let f (v, S, G) for v ∈ V, S ⊆ V be a real-valued vertex property function (in the language of Batagelj and Zaversnik). Then some set H ⊆ V is a generalized k-core for f if H is a maximal set such that f (v, H, G) ≥ k for all v ∈ H. Typically, f is chosen to be a degree measure with respect to S (e.g., the number of ties to vertices in S). In this case, the resulting k-cores have the intuitive property of being maximal sets such that every set member is tied (in the appropriate manner) to at least k others within the set.

Degree-based k-cores are a simple tool for identifying well-connected structures within large graphs. Let the core number of vertex v be the value of the highest-value core containing v. Then, intuitively, vertices with high core numbers belong to relatively well-connected sets (in the sense of sets with high minimum internal degree). It is important to note that, while a given k-core need not be connected, it is composed of subsets which are themselves well-connected; thus, the k-cores can be thought of as unions of relatively cohesive subgroups. As k-cores are nested, it is also natural to think of each k-core as representing a “slice” through a hypothetical “cohesion surface” on G. (Indeed, k-cores are often visualized in exactly this manner.)

The kcores function produces degree-based k-cores, for various degree measures (with or without edge values). The return value is the vector of core numbers for V, based on the selected degree measure. Missing (i.e., NA) edge are removed for purposes of the degree calculation.


Implementation in R

[edit | edit source]

Package sna

[edit | edit source]

Butts (2010) presents a series of implementations in R about algorithms to analysis in social networks. The implementations uses the package "sna" as follow:

  • Package: ‘sna’
  • Version: 2.0-1
  • Date: 2009-06-07
  • Title: Tools for Social Network Analysis
  • Author: Carter T. Butts <buttsc@uci.edu>
  • Maintainer: Carter T. Butts <buttsc@uci.edu>
  • Depends: R (>= 2.0.0), utils
  • Suggests: network, rgl, numDeriv, SparseM, statnet
  • Description: A range of tools for social network analysis, including node and graph-level indices, structural distance and covariance methods, structural equivalence detection, p* modeling, random graph generation, and 2D/3D network visualization.
  • License: GPL (>= 2)
  • URL: http://erzuli.ss.uci.edu/R.stuff
  • Repository: CRAN
  • Date/Publication: 2009-06-08 07:08:51

To install the package "sna" see http://erzuli.ss.uci.edu/R.stuff

Description

[edit | edit source]

kcores calculates the k-core structure of the input network, using the centrality measure indicated in cmode.

Usage

[edit | edit source]
   kcores(dat, mode = "digraph", diag = FALSE, cmode = "freeman", ignore.eval = FALSE)

Arguments

[edit | edit source]
                        one or more (possibly valued) graphs.
   dat
                        "digraph" for directed data, otherwise "graph".
   mode
                        logical; should self-ties be included in the degree calculations?
   diag
                        the degree centrality mode to use when constructing the cores.
   cmode
                        logical; should edge values be ignored when computing degree?
   ignore.eval

Value

[edit | edit source]

A vector containing the maximum core membership for each vertex.

function (dat, mode = "digraph", diag = FALSE, cmode = "freeman", 
    ignore.eval = FALSE) 
{
    dat <- as.edgelist.sna(dat, as.digraph = TRUE, suppress.diag = TRUE)
    if (is.list(dat)) 
        return(lapply(dat, kcores, dat = dat, mode = mode, diag = diag, 
            cmode = cmode, ignore.eval = ignore.eval))
    if (mode == "graph") 
        cmode <- "indegree"
    n <- attr(dat, "n")
    m <- NROW(dat)
    corevec <- 1:n
    dtype <- switch(cmode, indegree = 0, outdegree = 1, freeman = 2)
    if (!(cmode %in% c("indegree", "outdegree", "freeman"))) 
        stop("Illegal cmode in kcores.\n")
    solve <- .C("kcores_R", as.double(dat), as.integer(n), as.integer(m), 
        cv = as.double(corevec), as.integer(dtype), as.integer(diag), 
        as.integer(ignore.eval), NAOK = TRUE, PACKAGE = "sna")
    if (is.null(attr(dat, "vnames"))) 
        names(solve$cv) <- 1:n
    else names(solve$cv) <- attr(dat, "vnames")
    solve$cv
}

Case Study

[edit | edit source]

To illustrate the k-cores algorithm a simple case study will be shown.

Scenario

[edit | edit source]

Imagine a group of students dining. Each of them may choose another one to sit with him. Each student is a node and each possibility of sitting together is an edge. We want to group them as friends they are separately.

Input Data

[edit | edit source]

The input data is an adjacency matrix representing the network with students and friendship.

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1   0  1  2  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
2   1  0  0  2  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
3   0  0  0  0  0  0  0  0  1   0   2   0   0   0   0   0   0   0   0   0   0
4   0  0  0  0  1  0  0  2  0   0   0   0   0   0   0   0   0   0   0   0   0
5   0  0  0  1  0  0  0  0  0   0   0   0   0   0   2   0   0   0   0   0   0
6   0  0  0  0  0  0  0  0  2   0   0   0   0   0   0   0   0   0   0   1   0
7   0  0  0  0  0  2  0  0  0   0   0   0   0   0   1   0   0   0   0   0   0
8   0  0  0  0  2  0  0  0  0   0   0   0   0   0   1   0   0   0   0   0   0
9   0  0  0  0  0  1  0  0  0   0   0   0   0   2   0   0   0   0   0   0   0
10  0  0  0  0  0  0  0  0  0   0   0   0   0   0   2   0   0   1   0   0   0
11  0  0  2  0  0  0  0  0  1   0   0   0   0   0   0   0   0   0   0   0   0
12  0  0  0  0  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   2   0
13  0  0  0  0  0  0  0  0  0   0   0   2   0   0   0   0   0   0   0   0   0
14  0  0  0  0  0  0  0  0  2   0   0   0   0   0   1   0   0   0   0   0   0
15  0  0  0  0  0  0  0  0  2   1   0   0   0   0   0   0   0   0   0   0   0
16  0  0  0  0  0  0  0  0  0   0   0   0   2   0   0   0   0   0   1   0   0
17  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   2   0   0   1
18  0  0  0  0  0  0  0  0  2   0   0   0   0   1   0   0   0   0   0   0   0
19  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   1   0   0   2
20  0  0  0  0  0  0  0  0  0   1   2   0   0   0   0   0   0   0   0   0   0
21  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   1   0   2   0   0
22  0  0  0  0  0  0  0  0  0   0   0   0   2   0   0   0   1   0   0   0   0
23  0  0  0  0  2  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0
24  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   2   0   0   1   0
25  0  0  0  0  0  0  0  0  0   0   0   0   0   0   1   0   2   0   0   0   0
26  0  0  0  0  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   0   0
   V22 V23 V24 V25 V26
1    0   0   0   0   0
2    0   0   0   0   0
3    0   0   0   0   0
4    0   0   0   0   0
5    0   0   0   0   0
6    0   0   0   0   0
7    0   0   0   0   0
8    0   0   0   0   0
9    0   0   0   0   0
10   0   0   0   0   0
11   0   0   0   0   0
12   0   0   0   0   0
13   1   0   0   0   0
14   0   0   0   0   0
15   0   0   0   0   0
16   0   0   0   0   0
17   0   0   0   0   0
18   0   0   0   0   0
19   0   0   0   0   0
20   0   0   0   0   0
21   0   0   0   0   0
22   0   0   0   0   0
23   0   0   1   0   0
24   0   0   0   0   0
25   0   0   0   0   0
26   0   0   2   0   0

When there's value 1 it means the first choice and 2 the second one.


Execution

[edit | edit source]

To determine the maximum k-core for each vertex from the network, we have:

kcores(people)

Where "people" is the matrix.

Output

[edit | edit source]

A vector containing the maximum core membership for each vertex.

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 

Analysis

[edit | edit source]

As it is a small network, the maximum k-core is for k=2, so there aren't clear group definition.

Package RBGL

[edit | edit source]
  • Package: ‘RBGL’
  • Version: 1.26.0
  • Date: 2010-10-26
  • Title: An interface to the BOOST graph library
  • Author: Vince Carey, Li Long, R. Gentleman
  • Maintainer: Li Long <li.long@isb-sib.ch>
  • Depends: graph, methods
  • Suggests: Rgraphviz
  • Description: A fairly extensive and comprehensive interface to the graph algorithms contained in the

BOOST library.

Usage

[edit | edit source]

kCores(g, EdgeType=c("in", "out"))

Arguments

[edit | edit source]
  • g: an instance of the graph class’
  • EdgeType: what types of edges to be considered when g is directed

Details

[edit | edit source]

The implementation is based on the algorithm by V. Batagelj and M. Zaversnik, 2002. The example snacoreex.gxl is in the paper by V. Batagelj and M. Zaversnik, 2002.

Value

[edit | edit source]

A vector of the core numbers for all the nodes in g.

Examples

[edit | edit source]
library(RBGL)
con1 <- file(system.file("XML/snacoreex.gxl",package="RBGL"))
kcoex <- fromGXL(con1)
close(con1)
kCores(kcoex)
A C B E F D G H J K I L M N O P Q R S T U 
1 2 1 2 3 3 3 3 3 3 3 3 2 2 1 1 2 2 2 2 0 
con2 <- file(system.file("XML/conn2.gxl",package="RBGL"))
kcoex2 <- fromGXL(con2)
close(con2)
kCores(kcoex2)
A B C D E G H F 
3 3 3 3 3 3 3 3 
kCores(kcoex2, "in")
A B C D E G H F 
0 0 0 0 0 0 0 0 
kCores(kcoex2, "out")
A B C D E G H F 
0 0 0 0 0 0 0 0 

References

[edit | edit source]

BATAGELJ, V. & MRVAR, A. (2002). Generalized Cores. Journal of ACM, v. 5, 2002.

BATAGELJ, V. & MRVAR, A. (2003). An O(m) Algorithm for Cores Decomposition of Networks.

CAREY, V. & LONG, L. & GENTLEMAN, R. (2010). An Interface to the BOOST graph library. http://www.bioconductor.org/packages/release/bioc/html/RBGL.html.

BUTTS, C. T. (2010). Tools for Social Networks Analysis. http://erzuli.ss.uci.edu/R.stuff/.