Data Mining Algorithms In R/Frequent Pattern Mining/The Eclat Algorithm

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Introduction

The Eclat Algorithm[edit | edit source]

The Eclat algorithm is used to perform itemset mining. Itemset mining let us find frequent patterns in data like if a consumer buys milk, he also buys bread. This type of pattern is called association rules and is used in many application domains.

The basic idea for the eclat algorithm is use tidset intersections to compute the support of a candidate itemset avoiding the generation of subsets that does not exist in the prefix tree. It was originally proposed by Zaki, Parthasarathy et al.

Algorithm[edit | edit source]

The Eclat algorithm is defined recursively. The initial call uses all the single items with their tidsets. In each recursive call, the function IntersectTidsets verifies each itemset-tidset pair with all the others pairs to generate new candidates . If the new candidate is frequent, it is added to the set . Then, recursively, it finds all the frequent itemsets in the branch. The algorithm searches in a DFS manner to find all the frequent sets.

Implementation[edit | edit source]

The eclat algorithm can be found in the arule package of R system.

   * R package: arules
   * Method: eclat
   * Documentation : arules package


Usage

eclat(data, parameter = NULL, control = NULL)

Arguments

data
object of class transactions or any data structure which can be coerced into transactions (e.g., binary matrix, data.frame).
parameter
object of class ECparameter or named list (default values are: support 0.1 and maxlen 5)
control
object of class ECcontrol or named list for algorithmic controls.

Value

Returns an object of class itemsets


Example

data("Adult")

## Mine itemsets with minimum support of 0.1.
itemsets <- eclat(Adult, parameter = list(supp = 0.1, maxlen = 15))

Visualization[edit | edit source]

The arules package implements some visualization methods for itemsets, which are the return type for the eclat algorithm. Here are some examples:

Example 1

data("Adult") 

## Mine frequent itemsets with Eclat.
fsets <- eclat(Adult, parameter = list(supp = 0.5))

## Display the 5 itemsets with the highest support.
fsets.top5 <- sort(fsets)
inspect(fsets.top5)

## Get the itemsets as a list
as(items(fsets.top5), "list")

## Get the itemsets as a binary matrix
as(items(fsets.top5), "matrix")

## Get the itemsets as a sparse matrix, a ngCMatrix from package Matrix.
## Warning: for efficiency reasons, the ngCMatrix you get is transposed
as(items(fsets.top5), "ngCMatrix")

Example 2

## Create transaction data set.
data <- list(
  c("a","b","c"),
  c("a","b"),
  c("a","b","d"),
  c("b","e"),
  c("b","c","e"),
  c("a","d","e"),
  c("a","c"),
  c("a","b","d"),
  c("c","e"),
  c("a","b","d","e")
)

t <- as(data, "transactions")

## Mine itemsets with tidLists.
f <- eclat(data, parameter = list(support = 0, tidLists = TRUE))

## Get dimensions of the tidLists.
dim(tidLists(f))

## Coerce tidLists to list.
as(tidLists(f), "list")

## Inspect visually.
image(tidLists(f))

##Show the Frequent itemsets and respectives supports
inspect(f)
# items support
1 {b, c, e} 0.1
2 {a, b, c} 0.1
3 {a, c} 0.2
4 {b, c} 0.2
5 {c, e} 0.2
6 {a, b,d, e} 0.1
7 {a, d, e} 0.2
8 {b, d, e} 0.1
9 {a, b, d} 0.3
10 {a, d} 0.4
11 {b, d} 0.3
12 {d, e} 0.2
13 {a, b, e} 0.1
14 {a, e} 0.2
15 {b, e} 0.3
16 {a, b} 0.5
17 {a} 0.7
18 {b} 0.7
19 {e} 0.5
20 {d} 0.4
21 {c} 0.4

Use Case[edit | edit source]

To see some real example of the use of the Eclat algorithm it will be used some data from the northwind database. The northwind database is freely available for download and represents data from an enterprise. In this example it will be used the table order details from the database. The order details table is used to relate the orders with products (in a n to n relationship). The Eclat algorithm will be used to find frequent patterns from this data to see if there are any products that are bought together.

Scenario[edit | edit source]

Given the data from the order details table from the northwind database, find all the frequent itemsets with support = 0.1 and length of at least 2.

Input Data[edit | edit source]

The order details table has the fields:

ID
primary key
Order ID
foreign key from table Orders
Product ID
foreign key from table Products
Quantity
the quantity bought
Discount
the discount offered
Unit Price
the unit price of the product


To use the data, some pre-processing is necessary. The table may have many rows that belongs to the same order, so the table was converted in a way that all the rows for one order became only one row in the new table containing the product id's of the products belonging to that order. The fields ID, order id, quantity, discount and unit price was discarded. The data was saved in a txt file called northwind-orders.txt. The file was scripted in a way ready to be loaded as a list object in R.

Implementation[edit | edit source]

To run the example the package arules need to be loaded in R.

First, the data is loaded in a list object in R

## 1041 transactions is loaded, the listing below was shortened
## some duplicates transactions was introduced to produce some results for the algorithm
data = list(
 c("2","3","4","6","7","8","10","12","13","14","16","20","23","32","39","41","46","52","55","60","64","66","73","75","77"),
 c("11","42","72"),
 c("14","51"),
 c("41","51","65"),
 c("22","57","65"),
 ...)


Second, the eclat algorithm is used.

itemsets <- eclat(data, parameter = list(support = 0.1, minlen=2, tidLists = TRUE, target="frequent itemsets"))
parameter specification:
 tidLists support minlen maxlen            target   ext
     TRUE     0.1      2      5 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

eclat - find frequent item sets with the eclat algorithm
version 2.6 (2004.08.16)         (c) 2002-2004   Christian Borgelt
create itemset ... 
set transactions ...[78 item(s), 1041 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating bit matrix ... [3 row(s), 1041 column(s)] done [0.00s].
writing  ... [4 set(s)] done [0.00s].
Creating S4 object  ... done [0.00s].


Output Data[edit | edit source]

The itemsets object holds the output of the execution of the eclat algorithm. As can be seen above, 4 sets was generated. To see the results it can be used:

inspect(itemsets)
  items   support
1 {11,           
   42,           
   72}  0.1940442
2 {42,           
   72}  0.1940442
3 {11,           
   42}  0.1940442
4 {11,           
   72}  0.1959654


Analysis[edit | edit source]

As can be seen above, there are 4 frequent itemsets as result of the eclat algorithm. This output was induced by the replication of the transaction {11, 42, 72} many times in the data. This result shows that the tuples {11,42,72},{42,72} and {11,42} has a support of 19,40%; and the tuple {11,72} has a support of 19,60%.

The product id's 11, 42 and 72 represents the products Queso Cabrales, Singaporean Hokkien Fried Mee and Mozzarella di Giovanni, respectively. So, the output of the eclat algorithm suggests a strong frequent shop pattern of buying this items together.

Variations[edit | edit source]

PPV, PrePost, and FIN Algorithm[edit | edit source]

These three algorithms were proposed by Deng et al [1] [2] [3], and are based on three novel data structures called Node-list [1], N-list [2], and Nodeset [3] respectively for facilitating the mining process of frequent itemsets. They are sets of nodes in a FP-tree with each node encoding with pre-order traversal and post-order traversal. Compared with Node-lists, N-lists and Nodesets are more efficient. This causes the efficiency of PrePost [2] and FIN [3] is higher than that of PPV [1]. See [1] [2] [3] for more details.

References[edit | edit source]

  • Hahsler, M.; Buchta, C.; Gruen, B. & Hornik, K. arules: Mining Association Rules and Frequent Itemsets. 2009.
  • Hahsler, M.; Gruen, B. & Hornik, K. arules -- A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software, 2005, 14, 1-25
  • Mohammed Javeed Zaki, Srinivasan Parthasarathy, Wei Li. A Localized Algorithm for Parallel Association Mining. SPAA 1997: 321-330.
  • Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li. New Algorithms for Fast Discovery of Association Rules. KDD 1997: 283-286.
  • Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li. Parallel Algorithms for Discovery of Association Rules. Data Min. Knowl. Discov. 1(4): 343-373 (1997).
  • Christian Borgelt (2003) Efficient Implementations of Apriori and Eclat. Workshop of Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL, USA).
  1. a b c d Deng, Z. & Wang, Z. A New Fast Vertical Method for Mining Frequent Patterns [1]. International Journal of Computational Intelligence Systems, 3(6): 733 - 744, 2010.
  2. a b c d Deng, Z.; Wang, Z. & Jiang, J. A New Algorithm for Fast Mining Frequent Itemsets Using N-Lists [2]. SCIENCE CHINA Information Sciences, 55 (9): 2008 - 2030, 2012.
  3. a b c d Deng, Z. & Lv, S. Fast mining frequent itemsets using Nodesets [3]. Expert Systems with Applications, 41(10): 4505–4512, 2014.