Data Mining Algorithms In R/Frequent Pattern Mining/arulesNBMiner

From Wikibooks, open books for an open world
Jump to: navigation, search

Introduction[edit]

The technique to be discussed in this chapter is used in frequent itemset mining. There are several situations which people are interested on co-occurrence of two or more item of a set. It is important to establish which items co-occur, since based on them, association rules can be extracted between itemsets [1]. A typical example of an application is about a supermarket when one discovery customers who buy meat and beer also tend to buy coal. Thus, a frequent item set would be meat-beer-coal and an association rule would be customers, in general, who buy meat and beer, have more chances to buy coal.

Many works have dealt with the problem of frequent itemset mining. Most of them show the necessity of a min_support threshold, which is an itemset minimum frequency in the data and in general defined by the miner user. Besides, these studies have as goal to mine a complete set of frequent itemsets that satisfy min_support [2]. The application of a minimum support results in several assumptions which are rarely discussed or verified. One assumptions is items occur in the database following a, possibly unknown, but stable process and that the items occur in the database with roughly similar frequencies. Nevertheless, in the real world, transactions data have a frequency distribution highly skewed with almost items occurring in an infrequent way while just some of them occur with high frequency. In database where this phenomenon happens, interesting patterns are not found since some of the associated items are too infrequent to satisfy the user-specified minimum support [3].

Some algorithms such as TFP were developed in a way that a user do not need determine a min_support, however, he needs to inform the minimum size of itemset (min_l) and the number of itemset which he desires mining (k). Furthermore, the TFP algorithms just mine frequent closed itemset.[2]. Again, the user has a parameter (min_l) which he should specify before mining the data, what is a subtle decision.

Therefore, this chapter presents an algorithm, which is implemented at R-package and uses a simple stochastic model (Negative Binomal model or NB-model) to estimate a minimum support utilizing knowledge of the process which generates transaction data and allows for highly skewed frequency distributions. The name of package in R program is arulesNBMiner that is the Java implementation of a depth first search algorithm to mine NB-frequent itemsets of NB-precise rules [4]. Beside the algorithm utilize the information contained in own data structure to estimate the minimum support, it uses a precision limit to estimate min_support and for each k-itemset plus 1 extension it calculates a different minimum support.

Technique to be discussed[edit]

Algorithm[edit]

Function NBSelect\left(l,\ L,\ n,\ \tilde{k},\ \tilde{a}',\ \pi\right)
1. rmax = max(c.count) in L
2. rrescale = sum(c.count) in L
3. for each tuple [c,c.count] ∈ L do nobs [c.count]++
4. do
5. \text{precision}=\frac{(1-\vert l\vert)P[R_1\ge p\;\vert\;k=\tilde k,\;a=r_\text{rescale}\cdot\tilde a']}{\sum_{r=p}^{r_\text{max} } n_{ob} s[r]}
6. while (precision ≥ π ∧ (p−−) > 0)
7. return {c| [c,c.count] ∈ L ∧ c.count > p}
where
l = the pattern for which candidate items are selected
L = a data structure which holds count information for items which co-occur with pattern l; 
we use tuples c, c.count , where c represents a candidate  item and c.count the count.
n = the total number of available items
\tilde{k}\text{ and }\tilde{a}' = estimated parameters for the data set
π = user-specified precision threshold

Implementation (description of used modules and packages)[edit]

The first thing one has to do to use NBMiner is installing the NBMiner package. This package has three dependencies packages: arules (http://cran.r-project.org/web/packages/arules/index.html) which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), Matrix (http://cran.r-project.org/web/packages/Matrix/index.html) which is classes and methods for dense and sparse matrices and operations on them using Lapack and SuiteSparse, and lettice (http://cran.r-project.org/web/packages/lattice/index.html) which is a framework for data visualization developed at the Bell Labs. Besides, the user has to install on computer the Sun Java(TM) Development Kit (JDK) 6, since the NBMiner uses a package called rJava which is part of JDK. This must be installed in the user operational system.

Installation of the dependencies packages can be performed within the R environment using the function "install.packages(“package name”)". The name of NBMiner package is arulesNBMiner.

install.packages(“arulesNBMiner”)  

After the user installed the necessaries packages, he must load them. This can be done using the function "library(package name)".

library(arulesNBMiner)     

However, before we show how to use the NBMiner package, it is necessary to show how to load the data. Here, we are using a data on csv format. In this format each line is a transaction like the example below:

Employment Education Marital Occupation Sex Accounts
Private College Separated Service Female USA
Private Associate Unmarried Transport Male Jamaica
Private HSgrad Divorced Clerical Male USA
Private Bachelor Civil Repair Male USA
Private College Civil Executive Male USA
Private HSgrad Civil Service Male USA

Using the function "read.table()" on R environment, we are able to load the data. This function creates a "data.frame" object. Its syntax is:

object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”)

The argument header indicates the data has or does not a header while the sep indicates which character is being used to separate the elements. Later, the user has to transform this data frame object in a transaction object. For that, he can use the function "as()". Its syntax is:

object name<-as(data frame object,”transactions”)

It is important to mention the user has to load the NBMiner package before he uses the function as with parameter transaction. Furthermore, there is another data format that some R functions can read and transform it in a transaction object. The user can utilize a binary format like the example below:

1 1 0 0 0 1 0 0 0
1 0 1 0 0 0 1 0 0
1 0 0 1 0 0 0 1 0
1 0 0 0 1 0 0 0 1
1 1 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 1
1 0 0 0 1 0 0 0 1

Here, each line represents a transaction and each column represents a term of the database. The number 1 means presence of that term in that transaction, while 0 means no presence. To read this data, user can first utilize the function "scan()". It creates a vector object, and its syntax is:

object name<-scan(“file name”,sep=”,”)

After that, it is necessary to transform the vector object in a matrix object. This is performed using the function "dim()". Its syntax is:

dim(vector object)<-c(dim1,dim2) 

Dim1 and Dim2 are the dimensions of the matrix which the user want creating. If the user wants, he can name each row and column of the matrix using the function "dimnames()". Its syntax is:

dimnames(matrix object)<-list(pridim=c(names),segdim=c(names))

Where pridim is a vector with first dimension names, and segdim is a vector with second dimension names. Finally, this object can be transformed in transaction object using the function "as()". Its syntax is:

object name<-as(matrix object,”ItemMatrix”)

Soon after the user loaded the package and the data, it is necessary to create the parameters object using the function of NBMiner called "NBMinerParameters()". This function reads the data and extracts from them some parameters which will be used in the next step. The syntax of this function is:

object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE)
where: 
# = numbers
data object = the data as an object of class transaction
pi = precision threshold
theta = pruning parameter
maxlen = maximal number of items in found itemsets (default = 5)
minlen = minimum number of items in found itemsets (default = 1)
trim = fraction of incidences to trim off the tail of the frequency distribution of the data
verbose = use verbose output for the estimation procedure
plot = plot the model
rules = mine NB-precise rules instead of NB-frequent itemsets

After the user perform this procedure, he is able to run the algorithm to mine his data. For that, one will use the function called "NBMiner()". Its syntax is:

object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE))

If the user uses the option rules=TRUE when he created the object parameter, the algorithm will mine the NB-precise rules. Otherwise, the same algorithm will mine the NB-frequent itemsets.

Finally, we summarized the necessaries commands to use the NBMiner package for mining tasks using the first data structure mentioned above, these are below:

library(package name)
object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”)
object name<-as(data frame object,”transaction”)
object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE)
object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE

Visualization[edit]

To visualize data information contained in a “data.frame” object just use the object name. Here, we use table as object name:

table = read.table("data",sep=",",header=TRUE)
table
# Employment Education Marital Occupation Sex Accounts
1 Private College Separated Service Female USA
2 Private Associate Unmarried Transport Male Jamaica
3 Private HSgrad Divorced Clerical Male USA
4 Private Bachelor Civil Repair Male USA
5 Private College Civil Executive Male USA
6 Private Hsgrad Civil Service USA

However, when the same principle is used in “transactions” format, we have just the number of transactions and items generated by the "as()" function:

tableT = as(table,"transactions")
tableT
 transactions in sparse format with
 6 transactions (rows) and
 18 items (columns)

Use “inspect()” function to visualize the result of the conversion to the “transactions” format:

inspect(tableT)
# items transactionID
1 {Employment=Private,Education=College,Marital=Separated,Occupation=Service,Sex=Female,Accounts=USA} 1
2 {Employment=Private,Education=Associate,Marital=Unmarried,Occupation=Transport,Sex=Male,Accounts=Jamaica} 2
(...)

To summarize the data information in "transaction" format use "summary()".

summary(tableT)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
18 columns (items) and a density of 0.3333333 
most frequent items:
Employment=Private    Sex=Male    Accounts=USA 
        6                5            5 
Marital=Civil    Education=College  (Other) 
        3                2            15 
element (itemset/transaction) length distribution:
sizes
6 
6 
  Min.    1st Qu.  Median  Mean   3rd Qu.   Max. 
     6       6       6       6       6       6 
includes extended item information - examples:
       labels                   variables       levels
1       Employment=Private      Employment      Private
2       Education=Associate     Education       Associate
3       Education=Bachelor      Education       Bachelor
includes extended transaction information - examples:
transaction             ID
1                       1
2                       2
3                       3

To see labels of the items generated in the conversion to transaction format use “itemInfo()”:

itemInfo(tableT)
        labels           variables                levels
1       Employment=Private      Employment              Private
2       Education=Associate     Education               Associate
3       Education=Bachelor      Education               Bachelor
4       Education=College       Education               College
5       Education=HSgrad        Education               HSgrad
6       Marital=Civil           Marital                 Civil
7       Marital=Divorced        Marital                 Divorced
8       Marital=Separated       Marital                 Separated
9       Marital=Unmarried       Marital                 Unmarried
10      Occupation=Clerical     Occupation              Clerical
11      Occupation=Executive    Occupation              Executive
12      Occupation=Repair       Occupation              Repair
13      Occupation=Service      Occupation              Service
14      Occupation=Transport    Occupation              Transport
15      Sex=Female              Sex                     Female
16      Sex=Male                Sex                     Male
17      Accounts=Jamaica        Accounts                Jamaica
18      Accounts=USA            Accounts                USA

Use “labels()” to see only the labels of the itens and transactions:

labels(tableT)
$items
[1]     "Employment=Private"    "Education=Associate"           "Education=Bachelor"   
[4]     "Education=College"     "Education=HSgrad"              "Marital=Civil"        
[7]     "Marital=Divorced"      "Marital=Separated"             "Marital=Unmarried"    
[10]    "Occupation=Clerical"   "Occupation=Executive"          "Occupation=Repair"    
[13]    "Occupation=Service"    "Occupation=Transport"          "Sex=Female"           
[16]    "Sex=Male"              "Accounts=Jamaica"              "Accounts=USA"
$transactionID
[1] "1" "2" "3" "4" "5" "6"

To see NB mined results use the same commands above mentioned to “transactions” data:

paramA <- NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5)
tableNB <- NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))
inspect(head(tableNB))
        lhs                                     rhs                             precision
1       {Education=HSgrad, Sex=Male}            => {Employment=Private}      0.9991467
2       {Sex=Male, Accounts=USA}                => {Marital=Civil}           0.9999421
3       {Sex=Male, Accounts=USA}                => {Education=HSgrad}        0.9977636
4       {Education=College, Accounts=USA}       => {Employment=Private}              0.9982934
5       {Marital=Civil, Accounts=USA}           => {Sex=Male}                0.9999752
6       {Employment=Private, Sex=Male}          => {Accounts=USA}            0.9999948

Where "lhs" is the antecedent of the rules and de "rhs" is the consequent of the rules.

We can also show the data distribution in the space with “image()” function:

image(tableNB)

File:ImageFunction.png

Case study[edit]

Scenario[edit]

It contains artificial data created by Graham Williams (developer of Rattle) and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."

Input data[edit]

Available on http://cs.anu.edu.au/Student/comp3420/mining/audit.csv.

Execution[edit]

table = read.table("audit.csv",sep=",",header=TRUE)
trans = as(table,"transactions")
paramA = NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5)
transNB = NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))

Output[edit]

transNB
set of 18158 rules 
tableNB = as(trans,"data.frame")
write.table(tableNB,file="auditNB.csv",sep=",")

Analysis[edit]

rules consequent precedent A precedent B precedent C precedent D
92 Accounts=China Employment=PSState Education=Master Occupation=Professional
4721 Accounts=China Employment=PSState Education=Master
4762 Accounts=China Employment=PSState Occupation=Professional
4857 Accounts=China Education=Master Marital=Civil Occupation=Professional Sex=Male
5871 Accounts=China Education=Master Occupation=Professional
6131 Accounts=China Marital=Civil Occupation=Professional
10269 Accounts=China Education=Master Occupation=Professional Sex=Male
10386 Accounts=China Education=Master Marital=Civil Occupation=Professional
14791 Accounts=China Marital=Civil Occupation=Professional Sex=Male

Based on the data above we can note that when the Marital state is “Civil” and the "Occupation" is "Professional" we have an "Chinese" account.

References[edit]

  • [1]Mohammed j. Zaki and Wagner Meira Jr. Fundamentals of Data Mining Algorithms. Chaper 11 – Itemset Mining, 74-93
  • [2]Jianyong Wang, Jiawei Han, Ying Lu and Petre Tzvetkov. TFP: An efficient algorithm for mining Top-K frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17(5):652-665, May 2005
  • [3]Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137-166, September 2006.
  • [4]http://cran.fiocruz.br/web/packages/arulesNBMiner/index.html