Data Mining Algorithms In R/Sequence Mining/SPADE

From Wikibooks, open books for an open world
Jump to: navigation, search

Frequent Sequence Mining is used to discover a set of patterns shared among objects which have between them a specific order. For instance, a retail shop may possess a transaction database which specifies which products were acquired by each customer over time. In this case, the store may use Frequent Sequence Mining to find that 40% of the people who bought the first volume of Lord of the Rings came back to buy the second volume a month later. This kind of information may be used to support directed advertising campaigns or recommendation systems.

Another application domain where Frequent Sequence Mining may be used is Web click log analysis in Information Retrieval systems, in which case the system performance may be refined by analyzing the sequence of interactions that the user exposed while searching or browsing for a specific information. This kind of usage becomes especially clear when we consider the huge amount of data obtained by industrial search engines in the form of query logs. Google alone was reported to answer 5.42 billion queries during December 2008 (Telecom Paper, 2009)

In biology Frequent Sequence Mining may be used to extract information hidden in DNA sequences. Sequence databases in biology are often huge and complex due to variations from genetic mutations and evolution (Li et al., 2007). For example, Frequent Sequence Mining can be used to extract patterns which may be determinant to the development of genetic conditions.

A sequence α is an ordered list of events <a1,a2,...,am>. An event is a non-empty unordered set of items ai ⊆ i1,i2,...,ik. A sequence α = <a1,a2,...,am> is a subsequence of β = < b1, b2,...,bn > if and only if exists i1,i2,...,im such that 1 ≤ i1 < i2 < ... < im ≤ n and a1 ⊆ bi1, a2 ⊆ bi2 and am ⊆ bim. Given a sequence database D = s1,s2,...,sn, the support of a sequence α is the number of sequences of D which contains α as a subsequence. If the support of α is bigger than a threshold maxsup, then α is a frequent sequence (Peng and Liao, 2009).

Algorithm[edit]

An algorithm to Frequent Sequence Mining is the SPADE (Sequential PAttern Discovery using Equivalence classes) algorithm. It uses a vertical id-list database format, where we associate to each sequence a list of objects in which it occurs. Then, frequent sequences can be found efficiently using intersections on id-lists. The method also reduces the number of databases scans, and therefore also reduces the execution time.

The first step of SPADE is to compute the frequencies of 1-sequences, which are sequences with only one item. This is done in a single database scan. The second step consists of counting 2-sequences. This is done by transforming the vertical representation into an horizontal representation in memory, and counting the number of sequences for each pair of items using a bidimensional matrix. Therefore, this step can also be executed in only one scan.

Subsequent n-sequences can than be formed by joining (n-1)-sequences using their id-lists. The size of the id-lists is the number of sequences in which an item appears. If this number is greater than minsup, the sequence is a frequent one. The algorithm stops when no frequent sequences can be found anymore. The algorithm can use a breadth-first or a depth-first search method for finding new sequences (Zaki, 2001)

Implementation[edit]

The first step is to install the arulesSequences package (Buchta and Hahsler, 2010).

> install.packages("arules")
> install.packages("arulesSequences")

To illustrate the use of CSPADE, we are going to use an example file that can be found inside the package arulesSequence. This file is listed bellow:

$ cat /usr/local/lib/R/site-library/arulesSequences/misc/zaki.txt
1 10 2 C D
1 15 3 A B C
1 20 3 A B F
1 25 4 A C D F
2 15 3 A B F
2 20 1 E
3 10 3 A B F
4 10 3 D G H
4 20 2 B F
4 25 3 A G H

Each line in the zaki.txt file is an event. The first column is the sequence id, that is, the sequence to which this event belongs. The second column is the event timestamp, which is the moment in time when the event has occurred. The third column is the number n of items in the event, and it is followed by n additional columns, one for each item.

First, we need to load the necessary packages:

> library(Matrix)
> library(arules)
> library(arulesSequences)

To read the zaki.txt file, issue the following commands:

> x <- read_baskets(con = system.file("misc", "zaki.txt", package = "arulesSequences"), info = c("sequenceID","eventID","SIZE"))
> as(x, "data.frame")

       items sequenceID eventID SIZE
1      {C,D}          1      10    2
2    {A,B,C}          1      15    3
3    {A,B,F}          1      20    3
4  {A,C,D,F}          1      25    4
5    {A,B,F}          2      15    3
6        {E}          2      20    1
7    {A,B,F}          3      10    3
8    {D,G,H}          4      10    3
9      {B,F}          4      20    2
10   {A,G,H}          4      25    3

Next, we execute the CSPADE algorithm:

> s1 <- cspade(x, parameter = list(support = 0.4), control = list(verbose = TRUE))

Note that we executed the cpade algorithm with the data contained in zaki object. We have set the support parameter to 0.4, and also have instructed the algorithm to show a verbose output.

The algorithm output will be the following:

preprocessing ... 1 partition(s), 0 MB [0.046s]
mining transactions ... 0 MB [0.022s]
reading sequences ... [0.07s]

total elapsed time: 0.138s

Visualization[edit]

We can use the command summary and as to see the results:

cspade> summary(s1)
set of 18 sequences with

most frequent items:
      A       B       F       D (Other) 
     11      10      10       8      28 

most frequent elements:
    {A}     {D}     {B}     {F}   {B,F} (Other) 
      8       8       4       4       4       3

element (sequence) size distribution:
sizes
1 2 3
8 7 3

sequence length distribution:
lengths
1 2 3 4
4 8 5 1

summary of quality measures:
    support
 Min.   :0.5000
 1st Qu.:0.5000
 Median :0.5000
 Mean   :0.6528
 3rd Qu.:0.7500
 Max.   :1.0000

mining info:
 data ntransactions nsequences support
    x            10          4     0.4

cspade> as(s1, "data.frame")
          sequence support
1            <{A}>    1.00
2            <{B}>    1.00
3            <{D}>    0.50
4            <{F}>    1.00
5          <{A,F}>    0.75
6          <{B,F}>    1.00
7        <{D},{F}>    0.50
8      <{D},{B,F}>    0.50
9        <{A,B,F}>    0.75
10         <{A,B}>    0.75
11       <{D},{B}>    0.50
12       <{B},{A}>    0.50
13       <{D},{A}>    0.50
14       <{F},{A}>    0.50
15   <{D},{F},{A}>    0.50
16     <{B,F},{A}>    0.50
17 <{D},{B,F},{A}>    0.50
18   <{D},{B},{A}>    0.50

This output shows (1) the list of the most frequent isolated items (A,B, ..), (2) the list of the most frequent set of items that occur in events (referred to as elements), (3) the distribution of the sizes of the set of items, (4) the distribution of the number of events in a sequence (referred to as sequence length), (5) the minimum, maximum, mean and median support values, and (6) the set of frequent sequences mined ordered by its support value.

Case Study[edit]

Scenario[edit]

In this case study we analyse the application of the CSPADE algorithm to a Tag Recommendation problem. Tags are keywords assigned by users to items in the context of the Web 2.0. A Tag Recommendation System is used to suggest new tags to users with the objective of enhancing their browsing experience and enrich item description. The dataset used in this case study was obtained from Delicious using its public time-line, which shows bookmarks from all the system users during a given period of time. The SPADE algorithms were executed and some results are presented in the discussion bellow.

Datasets[edit]

The dataset used in this case study was collected from Delicious in October 2009. We collected periodically the Delicious public time-line which shows the bookmarks from all the system users. Each bookmark consists of a user, the URL which was bookmarked, and a set of tags that user chose to describe the URL.

We show some bookmark examples:

gmenezes http://www.traderslog.com/forum/ 5 education investment forex CFD trading
gmenezes http://bikebins.com/index.html 6 pannier bike bicycle cycling bikes commuting
osvaldo http://www.noycefdn.org/ecrwresources.php 6 literacy math foundation education webdesign professionaldevelopment

In this example, the user gmenezes has bookmarked the URL http://www.traderslog.com/forum/ and used 5 tags to describe its content: education investment forex CFD trading.

The public time-line shows the system bookmarks in a time sequence, so that we can obtain sequences of bookmarks for specific users. We can use these sequences as input for the CSPADE algorithm. In this case, each bookmark is an event, and each tag is an item. In other words, we will mine temporal patterns with the objective of generating rules that predict useful tags for a specific user by using the wisdom of crowds.

Before appling the algorithm to the data we had to perform some pre-processing in our raw data. First, we extracted a sample, which consisted of 31,289 sequential bookmarks. Next, we performed data cleansing and duplicate removal. We ended up with 7,559 sequential bookmarks. We grouped each user`s bookmarks as sequences and each individual bookmark as an event. For instance, the example above yields:

1 1 5 education investment forex CFD trading
1 2 6 pannier bike bicycle cycling bikes commuting
2 1 6 literacy math foundation education webdesign professionaldevelopment

The dataset used can be downloaded from here.


Execution[edit]

We used the dataset described above in the experiments. To execute the algorithm we first execute read_baskets() to load the dataset file from disk as temporal transaction data. Note that we need to load the required libraries (as above).

n <- read_baskets(con = system.file("misc", "delicious.sequence", package = "arulesSequences"), info = c("sequenceID","eventID","SIZE"))

We can see data statistics with the command summary():

> summary(n)
transactions as itemMatrix in sparse format with
 7559 rows (elements/itemsets/transactions) and
 7496 columns (items) and a density of 0.0004482878 

most frequent items:
     design       tools        blog   webdesign inspiration     (Other) 
        469         301         233         229         220       23949 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2283 1432 1172  825  560  343  230  273  171  100   60   34   25   14    5    5 
  17   18   19   20 
   5    7    8    7 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    3.00    3.36    4.00   20.00 

includes extended item information - examples:
  labels
1      |
2      -
3      ,

includes extended transaction information - examples:
  sequenceID eventID SIZE
1          1       1    2
2          2       1    5
3          3       1    3

Next, we execute cspade() to generate the rules. We have set a support of 0.002 to obtain a larger number of patterns.

s1 <- cspade(n, parameter = list(support = 0.002), control = list(verbose = TRUE))


Caveats[edit]

The use of higher support levels in this dataset does not generate any rule and can cause non-intuitive errors. For example, if the support is set to 0.1, the system outputs:

> s1 <- cspade(n, parameter = list(support = 0.1), control = list(verbose = TRUE))
parameter specification:
support : 0.1
maxsize :  10
maxlen  :  10

algorithmic control:
bfstype : FALSE
verbose :  TRUE
summary : FALSE

preprocessing ... 1 partition(s), 0.18 MB [0.05s]
mining transactions ... can't open data file: No such file or directory
Error in cspade(n, parameter = list(support = 0.1), control = list(verbose = TRUE)) : 
  system invocation failed

Output[edit]

We can see the generated rules by issuing the command as().

as(s1, "data.frame")
(...)
845                        <{webdesign},{design}> 0.004675877
846            <{inspiration,webdesign},{design}> 0.001912859
847                 <{design,webdesign},{design}> 0.004250797
848                <{design,typography},{design}> 0.002337938
849                     <{design,tools},{design}> 0.002337938
850        <{inspiration},{inspiration},{design}> 0.001912859
851             <{inspiration},{design},{design}> 0.002125399
852               <{design,inspiration},{design}> 0.004675877
853             <{design},{inspiration},{design}> 0.001912859
854                  <{inspiration,art},{design}> 0.001912859
855 <{design,inspiration},{inspiration},{design}> 0.001912859
856      <{design},{design,inspiration},{design}> 0.001912859
857                  <{design},{design},{design}> 0.004250797
858                      <{design,blog},{design}> 0.002337938
859                       <{design,art},{design}> 0.002550478
860         <{design},{design},{design},{design}> 0.002337938
861                      <{design},{design,blog}> 0.001912859
862                           <{design,art,blog}> 0.001912859
863                       <{design},{design,art}> 0.002763018
864                          <{art},{design,art}> 0.002125399
865                               <{culture,art}> 0.001912859
866                                 <{css},{css}> 0.001912859
867                              <{design},{css}> 0.002125399
868                                <{blog,blogs}> 0.003400638
869                             <{blog,blogging}> 0.001912859
(...)

To see the complete set of rules, download it from here.


Analysis[edit]

We have observed that CSPADE found many trivial sequences from user behaviour. For example, it has found many unitary sequences, such as <{design}>, <{ajax}>, <{css}>, among others. These unitary sequences are really frequently used, but they may not be useful in the particular application, which is Tag Recommendation.

Furthermore, other trivial sequences were found, such as <{design}.{design}> and <{webdesign},{design}>. These sequences indicates that the same users tend to bookmarks pages in the same subject subsequently. However, some interesting patters were also found. We can cite <{library},{books}>, <{javascript},{ajax}> and <{video},{youtube}>.

We can also observe that many frequent patterns are related to design, art and web_development. These tags are also the most popular tags in the whole Delicious system, as can be seen here.

References[edit]

  1. ^ Buchta, C., Hahsler, M., 2010. "arulesSequences: Mining frequent sequences". R package version 0.1-11. http://CRAN.R-project.org/package=arulesSequences
  2. ^ Li, K.-C., Chang, D.-J., Rouchka, E. C., Chen, Y. Y., 2007. "Biological sequence mining using plausible neural network and its application to exon/intron boundaries prediction". In: CIBCB. IEEE, pp. 165–169.
  3. ^ Peng, W.-C., Liao, Z.-X., 2009. "Mining sequential patterns across multiple sequence databases". Data Knowl. Eng. 68 (10), 1014–1033.
  4. ^ Telecom Paper, January 2009. "Google query volume".
  5. ^ Zaki, M. J., 2001. "Spade: An efficient algorithm for mining frequent sequences". In: Machine Learning. Vol. 42. pp. 31–60.