Data Mining Algorithms In R/Frequent Pattern Mining/The Eclat Algorithm

From Wikibooks, open books for an open world
< Data Mining Algorithms In R‎ | Frequent Pattern Mining
Jump to: navigation, search

Introduction[edit]

The Eclat Algorithm[edit]

The Eclat algorithm is used to perform itemset mining. Itemset mining let us find frequent patterns in data like if a consumer buys milk, he also buys bread. This type of pattern is called association rules and is used in many application domains.

The basic idea for the eclat algorithm is use tidset intersections to compute the support of a candidate itemset avoiding the generation of subsets that does not exist in the prefix tree.

Algorithm[edit]

The Eclat algorithm is defined recursively. The initial call uses all the single items with their tidsets. In each recursive call, the function IntersectTidsets verifies each itemset-tidset pair \left\langle {X,t(X)} \right\rangle with all the others pairs \left\langle {Y,t(Y)} \right\rangle to generate new candidates N_{XY}. If the new candidate is frequent, it is added to the set P_{X}. Then, recursively, it finds all the frequent itemsets in the X branch. The algorithm searches in a DFS manner to find all the frequent sets.

Implementation[edit]

The eclat algorithm can be found in the arule package of R system.

   * R package: arules
   * Method: eclat
   * Documentation : arules package


Usage

eclat(data, parameter = NULL, control = NULL)

Arguments

data 
object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame).
parameter 
object of class ECparameter or named list (default values are: support 0.1 and maxlen 5)
control 
object of class ECcontrol or named list for algorithmic controls.

Value

Returns an object of class itemsets


Example

data("Adult")
 
## Mine itemsets with minimum support of 0.1.
itemsets <- eclat(Adult, parameter = list(supp = 0.1, maxlen = 15))

Visualization[edit]

The arules package implements some visualization methods for itemsets, which are the return type for the eclat algorithm. Here are some examples:

Example 1

data("Adult") 
 
## Mine frequent itemsets with Eclat.
fsets <- eclat(Adult, parameter = list(supp = 0.5))
 
## Display the 5 itemsets with the highest support.
fsets.top5 <- SORT(fsets)1:5?
inspect(fsets.top5)
 
## Get the itemsets as a list
as(items(fsets.top5), "list")
 
## Get the itemsets as a binary matrix
as(items(fsets.top5), "matrix")
 
## Get the itemsets as a sparse matrix, a ngCMatrix from package Matrix.
## Warning: for efficiency reasons, the ngCMatrix you get is transposed
as(items(fsets.top5), "ngCMatrix")

Example 2

## Create transaction data set.
data <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("b","e"),
  c("b","c","e"),
  c("a","d","e"),
  c("a","c"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)
 
t <- as(data, "transactions")
 
## Mine itemsets with tidLists.
f <- eclat(data, parameter = list(support = 0, tidLists = TRUE))
 
## Get dimensions of the tidLists.
dim(tidLists(f))
 
## Coerce tidLists to list.
as(tidLists(f), "list")
 
## Inspect visually.
image(tidLists(f))
 
##Show the Frequent itemsets and respectives supports
inspect(f)
# items support
1 {b, c, e} 0.1
2 {a, b, c} 0.1
3 {a, c} 0.2
4 {b, c} 0.2
5 {c, e} 0.2
6 {a, b,d, e} 0.1
7 {a, d, e} 0.2
8 {b, d, e} 0.1
9 {a, b, d} 0.3
10 {a, d} 0.4
11 {b, d} 0.3
12 {d, e} 0.2
13 {a, b, e} 0.1
14 {a, e} 0.2
15 {b, e} 0.3
16 {a, b} 0.5
17 {a} 0.7
18 {b} 0.7
19 {e} 0.5
20 {d} 0.4
21 {c} 0.4

Use Case[edit]

To see some real example of the use of the Eclat algorithm it will be used some data from the northwind database. The northwind database is freely available for download and represents data from an enterprise. In this example it will be used the table order details from the database. The order details table is used to relate the orders with products (in a n to n relationship). The Eclat algorithm will be used to find frequent patterns from this data to see if there are any products that are bought together.

Scenario[edit]

Given the data from the order details table from the northwind database, find all the frequent itemsets with support = 0.1 and length of at least 2.

Input Data[edit]

The order details table has the fields:

ID
primary key
Order ID
foreign key from table Orders
Product ID
foreign key from table Products
Quantity
the quantity bought
Discount
the discount offered
Unit Price
the unit price of the product


To use the data, some pre-processing is necessary. The table may have many rows that belongs to the same order, so the table was converted in a way that all the rows for one order became only one row in the new table containing the product id's of the products belonging to that order. The fields ID, order id, quantity, discount and unit price was discarded. The data was saved in a txt file called northwind-orders.txt. The file was scripted in a way ready to be loaded as a list object in R.input data

Implementation[edit]

To run the example the package arules need to be loaded in R.

First, the data is loaded in a list object in R

## 1041 transactions is loaded, the listing below was shortened
## some duplicates transactions was introduced to produce some results for the algorithm
data = list(
 c("2","3","4","6","7","8","10","12","13","14","16","20","23","32","39","41","46","52","55","60","64","66","73","75","77"),
 c("11","42","72"),
 c("14","51"),
 c("41","51","65"),
 c("22","57","65"),
 ...)


Second, the eclat algorithm is used.

itemsets <- eclat(data, parameter = list(support = 0.1, minlen=2, tidLists = TRUE, target="frequent itemsets"))
parameter specification:
 tidLists support minlen maxlen            target   ext
     TRUE     0.1      2      5 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

eclat - find frequent item sets with the eclat algorithm
version 2.6 (2004.08.16)         (c) 2002-2004   Christian Borgelt
create itemset ... 
set transactions ...[78 item(s), 1041 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating bit matrix ... [3 row(s), 1041 column(s)] done [0.00s].
writing  ... [4 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].


Output Data[edit]

The itemsets object holds the output of the execution of the eclat algorithm. As can be seen above, 4 sets was generated. To see the results it can be used:

inspect(itemsets)
  items   support
1 {11,           
   42,           
   72}  0.1940442
2 {42,           
   72}  0.1940442
3 {11,           
   42}  0.1940442
4 {11,           
   72}  0.1959654


Analysis[edit]

As can be seen above, there are 4 frequent itemsets as result of the eclat algorithm. This output was induced by the replication of the transaction {11, 42, 72} many times in the data. This result shows that the tuples {11,42,72},{42,72} and {11,42} has a support of 19,40%; and the tuple {11,72} has a support of 19,60%.

The product id's 11, 42 and 72 represents the products Queso Cabrales, Singaporean Hokkien Fried Mee and Mozzarella di Giovanni, respectively. So, the output of the eclat algorithm suggests a strong frequent shop pattern of buying this items together.

Variations[edit]

PPV, PrePost, and FIN Algorithm[edit]

These three algorithms were propsed by Deng et al [1] [2] [3], and are based on three novel data structures called Node-list [1], N-list [2], and Nodeset [3] respectively for facilitating the mining process of frequent itemsets. They are sets of nodes in a FP-tree with each node encoding with pre-order traversal and post-order traversal. Compared with Node-lists, N-lists and Nodesets are more efficient. This causes the efficiency of PrePost [2] and FIN [3] is higher than that of PPV [1]. See [1] [2] [3] for more details.

References[edit]

  • Hahsler, M.; Buchta, C.; Gruen, B. & Hornik, K. arules: Mining Association Rules and Frequent Itemsets. 2009.
  • Hahsler, M.; Gruen, B. & Hornik, K. arules -- A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software, 2005, 14, 1-25
  • Zaki, M. J.;Meira Jr., W. Fundamentals of Data Mining Algorithms. Cambridge University Press, 2009
  • Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627. 1997.
  • Christian Borgelt (2003) Efficient Implementations of Apriori and Eclat. Workshop of Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).
  • [1] Deng, Z. & Wang, Z. A New Fast Vertical Method for Mining Frequent Patterns [1]. International Journal of Computational Intelligence Systems, 3(6): 733 - 744, 2010.
  • [2] Deng, Z.; Wang, Z. & Jiang, J. A New Algorithm for Fast Mining Frequent Itemsets Using N-Lists [2]. SCIENCE CHINA Information Sciences, 55 (9): 2008 - 2030, 2012.
  • [3] Deng, Z. & Lv, S. Fast mining frequent itemsets using Nodesets [3]. Expert Systems with Applications, 41(10): 4505–4512, 2014.
    • Invalid <ref> tag; no text was provided for refs named Deng2010
    • Invalid <ref> tag; no text was provided for refs named Deng2012
    • Invalid <ref> tag; no text was provided for refs named Deng2014